
Andrej Karpathy’s Deep Dive into LLMs - Visual Summary

Introduction

I recently watched Andrej Karpathy’s Deep Dive into LLMs like ChatGPT — an excellent hands-on walkthrough with concrete examples. Here’s my visual summary for future reference.

Visual Summary

[Visual summary image: Andrej Karpathy’s Deep Dive into LLMs]

Examples for Each Stage

PRETRAINING

Tokenization (Byte Pair Encoding)

The tokenization process can be explored interactively with the Tiktokenizer app.

"Hello world" → [15496, 995]     (2 tokens)
"hello world" → [31373, 995]     (2 tokens - different IDs due to case!)
"HelloWorld"  → [15496, 10603]   (2 tokens - camelCase split)

The Inference Loop (Autoregressive Generation)

How LLMs generate text (token by token):

Input: "The cat sat on the"

1. Model predicts probability distribution for next token
   → "mat": 15%, "floor": 12%, "roof": 8%, "dog": 0.1%...

2. Sample from distribution (stochastic!)
   → Selected: "mat"

3. Append to context: "The cat sat on the mat"

4. REPEAT until <end> token or max length
   → "The cat sat on the mat and purred softly..."

→ Each run produces slightly different output (stochastic)
→ Context window limits how much the model can "remember"
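
The same loop can be sketched in a few lines of Python, with a hypothetical next_token_distribution(context) standing in for a real forward pass:

import random

def next_token_distribution(context):
    # Hypothetical stand-in for the model's forward pass over the context window.
    if context.endswith("on the"):
        return {"mat": 0.15, "floor": 0.12, "roof": 0.08, "dog": 0.001}
    return {"and": 0.3, "purred": 0.2, "<end>": 0.5}

def generate(context, max_tokens=20):
    for _ in range(max_tokens):
        dist = next_token_distribution(context)             # 1. predict distribution
        tokens, weights = zip(*dist.items())
        token = random.choices(tokens, weights=weights)[0]  # 2. sample (stochastic!)
        if token == "<end>":                                # 4. stop at <end>
            break
        context += " " + token                              # 3. append to context
    return context

print(generate("The cat sat on the"))   # different output on different runs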

POST-TRAINING

Supervised Fine-Tuning - Chat Template (ChatML)

<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|>
<|im_start|>user<|im_sep|>What is 4 + 4?<|im_end|>
<|im_start|>assistant<|im_sep|>4 + 4 = 8<|im_end|>
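
Rendering a conversation into this format is just string concatenation around special tokens. A minimal sketch (the exact special tokens vary by model; these follow the ChatML example above):

def render_chatml(messages):
    # Each message becomes <|im_start|>role<|im_sep|>content<|im_end|>
    return "\n".join(
        f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
        for m in messages
    )

print(render_chatml([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 4 + 4?"},
    {"role": "assistant", "content": "4 + 4 = 8"},
]))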

Tool Use Training (The Search Loop):

Model learns to use tools when uncertain:

User: "Who is Orson Kovacs?"

1. Model recognizes it doesn't know
   → Outputs: <SEARCH_START>Who is Orson Kovacs?<SEARCH_END>

2. System executes search, injects results
   → [Wikipedia article, news articles, etc.]

3. Model generates answer using search results
   → "Orson Kovacs is a fictional character..."

4. If still unsure → REPEAT search with refined query

→ Uses working memory (context) instead of vague recall (params)
→ Dramatically reduces hallucinations for factual queries
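
The orchestration around the model can be sketched roughly like this, assuming hypothetical model_generate() and run_search() helpers; the <SEARCH_START>/<SEARCH_END> tokens follow the example above, and the result-injection tokens are placeholders:

import re

SEARCH_PATTERN = re.compile(r"<SEARCH_START>(.*?)<SEARCH_END>", re.DOTALL)

def answer_with_search(model_generate, run_search, prompt, max_rounds=3):
    context = prompt
    for _ in range(max_rounds):
        output = model_generate(context)
        match = SEARCH_PATTERN.search(output)
        if match is None:
            return output                          # model answered directly
        query = match.group(1)                     # 1. model asked to search
        results = run_search(query)                # 2. system executes the search
        # 3. inject results into the context -- the model's working memory
        context += output + "\n<RESULT_START>" + results + "<RESULT_END>\n"
    return model_generate(context)                 # final answer after the last round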

Factuality Training Loop (Meta’s Approach):

Teaching models to recognize their knowledge limits:

1. EXTRACT: Take a snippet from training data
   → "The Eiffel Tower weighs approximately 10,100 tons"

2. GENERATE: Create a factual question about it
   → "What is the mass of the Eiffel Tower?"

3. ANSWER: Have the model answer the question
   → Model says: "The Eiffel Tower weighs 7,300 tons"

4. SCORE: Compare answer against original source
   → ❌ Wrong! Model hallucinated.

5. TRAIN: Teach model to refuse or use tools when unsure
   → "I don't have reliable information about..."

6. REPEAT → Model learns its knowledge boundaries

→ Model learns WHEN to say "I don't know"
→ Model learns WHEN to use search instead of guessing
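
The data-generation side of this loop can be sketched as follows, assuming hypothetical make_question(), model_answer(), and answers_match() helpers:

def build_factuality_examples(snippets, make_question, model_answer, answers_match):
    examples = []
    for snippet in snippets:
        question, reference = make_question(snippet)  # 1-2. EXTRACT + GENERATE
        answer = model_answer(question)               # 3. ANSWER with the current model
        if answers_match(answer, reference):          # 4. SCORE against the source
            target = answer                           # keep the correct behavior
        else:
            # 5. TRAIN the model to refuse instead of hallucinating
            target = "I don't have reliable information about that."
        examples.append({"prompt": question, "response": target})
    return examples                                   # fine-tune on these, then REPEAT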

REINFORCEMENT LEARNING

Verifiable RL - Math Domain (The Training Loop)

Problem: Emily buys 3 apples and 2 oranges. Each orange costs $2.
         Total cost is $13. What is the cost of each apple?

🔄 THE RL LOOP (repeated millions of times):

1. GENERATE: Model produces many solutions
   → Solution A: "The answer is $3."
   → Solution B: "Oranges: 2×$2=$4, Apples: $13-$4=$9..."
   → Solution C: "Each apple costs $5." (wrong)

2. SCORE: Check which solutions are correct
   → A: ✅ correct but no reasoning
   → B: ✅ correct WITH step-by-step reasoning
   → C: ❌ wrong

3. TRAIN: Update model weights on winning solutions
   → Model learns to prefer B's reasoning style

4. REPEAT → Model gets better at reasoning over time

→ No human labels needed - correctness is automatically verified
→ This is how models develop "Aha moments" (DeepSeek paper)
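
One RL step in this verifiable setting can be sketched as follows, assuming hypothetical sample_solutions(), extract_final_answer(), and train_on() helpers:

def rl_step(problem, correct_answer, sample_solutions, extract_final_answer, train_on,
            num_samples=16):
    solutions = sample_solutions(problem, n=num_samples)      # 1. GENERATE many attempts
    winners = [s for s in solutions
               if extract_final_answer(s) == correct_answer]  # 2. SCORE automatically
    if winners:
        train_on(problem, winners)                            # 3. TRAIN on correct solutions
    return len(winners) / num_samples                         # pass rate, for monitoring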

RL from Human Feedback Loop (RLHF)

For unverifiable domains (jokes, writing, style):

PHASE 1: Train the Reward Model
─────────────────────────────────────────────────────────
1. Task: "Write a joke about pelicans"
2. Model generates 5 different jokes
3. Humans rank them: B > D > A > E > C
4. Train reward model on these preferences
5. REPEAT until reward model is accurate

PHASE 2: Use Reward Model at Scale
─────────────────────────────────────────────────────────
1. GENERATE: Model produces many responses
2. SCORE: Reward model rates each one
3. TRAIN: Update on highest-scored responses
4. REPEAT (but CAPPED to ~100s of iterations)

⚠️ WHY CAPPED?
After too many iterations, model finds adversarial exploits
→ Might output "the the the the" to game reward model
→ This is called "reward hacking"

→ Humans can't label millions of examples, but a reward model can
→ Exploits the "discriminator-generator gap" (judging outputs is easier than creating them)
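
Both phases can be sketched roughly as follows, assuming hypothetical sample_responses(), human_rank(), train_reward_model(), reward_model(), and policy_update() helpers:

import random

def rlhf(prompts, sample_responses, human_rank, train_reward_model,
         reward_model, policy_update, max_iters=200):
    # PHASE 1: turn human rankings into pairwise preferences, fit the reward model
    preferences = []
    for prompt in prompts:
        responses = sample_responses(prompt, n=5)
        ranked = human_rank(prompt, responses)              # e.g. B > D > A > E > C
        for better, worse in zip(ranked, ranked[1:]):
            preferences.append((prompt, better, worse))
    train_reward_model(reward_model, preferences)

    # PHASE 2: optimize against the reward model, but stop early --
    # run too long and the policy finds exploits ("reward hacking")
    for _ in range(max_iters):
        prompt = random.choice(prompts)
        responses = sample_responses(prompt, n=8)
        scores = [reward_model(prompt, r) for r in responses]
        policy_update(prompt, responses, scores)            # train toward high scores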