I recently watched Andrej Karpathy’s Deep Dive into LLMs like ChatGPT — an excellent hands-on walkthrough with concrete examples. Here’s my visual summary for future reference.
The process of tokenization can be explored interactively with the Tiktokenizer app:
"Hello world" → [15496, 995] (2 tokens)
"hello world" → [31373, 995] (2 tokens - different IDs due to case!)
"HelloWorld" → [15496, 10603] (2 tokens - camelCase split)
How LLMs generate text (token by token):
Input: "The cat sat on the"
1. Model predicts probability distribution for next token
→ "mat": 15%, "floor": 12%, "roof": 8%, "dog": 0.1%...
2. Sample from distribution (stochastic!)
→ Selected: "mat"
3. Append to context: "The cat sat on the mat"
4. REPEAT until <end> token or max length
→ "The cat sat on the mat and purred softly..."
→ Each run produces slightly different output (stochastic)
→ Context window limits how much the model can "remember"
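A minimal sketch of that loop, assuming a Hugging Face causal LM (GPT-2 here purely as a small stand-in; real chat models are far larger, but the loop is the same):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = tok("The cat sat on the", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                     # 4. repeat until max length
        logits = model(context).logits[0, -1]               # 1. distribution over next token
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # 2. sample (stochastic!)
        context = torch.cat([context, next_id[None]], dim=-1)  # 3. append to context
        if next_id.item() == tok.eos_token_id:              # stop at the end-of-text token
            break

print(tok.decode(context[0]))  # different every run, because sampling is stochastic
```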
Conversations are encoded as one long token stream, with special tokens marking each turn (the chat format):
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|>
<|im_start|>user<|im_sep|>What is 4 + 4?<|im_end|>
<|im_start|>assistant<|im_sep|>4 + 4 = 8<|im_end|>
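A sketch of how a conversation might be flattened into that format. The exact special tokens differ between models, and in practice `tokenizer.apply_chat_template` in Hugging Face transformers does this for you; the markers below are just the ones from the example above:

```python
# Sketch only: <|im_start|>/<|im_sep|>/<|im_end|> are the markers from the
# example above; other models use different special tokens.
def render_chat(messages):
    parts = [f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
             for m in messages]
    # The assistant's reply is then sampled token by token after this prefix.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

print(render_chat([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 4 + 4?"},
]))
```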
Model learns to use tools when uncertain:
User: "Who is Orson Kovacs?"
1. Model recognizes it doesn't know
→ Outputs: <SEARCH_START>Who is Orson Kovacs?<SEARCH_END>
2. System executes search, injects results
→ [Wikipedia article, news articles, etc.]
3. Model generates answer using search results
→ "Orson Kovacs is a fictional character..."
4. If still unsure → REPEAT search with refined query
→ Uses working memory (context) instead of vague recall (params)
→ Dramatically reduces hallucinations for factual queries
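A sketch of the orchestration loop around the model. `generate` and `run_search` are hypothetical callables standing in for the model and the search backend, and the `<SEARCH_RESULTS>` wrapper is made up for illustration; only the `<SEARCH_START>`/`<SEARCH_END>` markers come from the example above:

```python
import re

def answer_with_search(question, generate, run_search, max_rounds=3):
    context = f"User: {question}\nAssistant: "
    for _ in range(max_rounds):
        output = generate(context)                                # model writes freely
        match = re.search(r"<SEARCH_START>(.*?)<SEARCH_END>", output, re.S)
        if match is None:
            return output                                         # no tool call: final answer
        results = run_search(match.group(1))                      # system executes the search
        # Inject results into working memory (the context window), then continue.
        context += output + f"\n<SEARCH_RESULTS>{results}<SEARCH_RESULTS_END>\n"
    return generate(context)                                      # stop searching, answer anyway
```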
Teaching models to recognize their knowledge limits:
1. EXTRACT: Take a snippet from training data
→ "The Eiffel Tower weighs approximately 10,100 tons"
2. GENERATE: Create a factual question about it
→ "What is the mass of the Eiffel Tower?"
3. ANSWER: Have the model answer the question
→ Model says: "The Eiffel Tower weighs 7,300 tons"
4. SCORE: Compare answer against original source
→ ❌ Wrong! Model hallucinated.
5. TRAIN: Teach model to refuse or use tools when unsure
→ "I don't have reliable information about..."
6. REPEAT → Model learns its knowledge boundaries
→ Model learns WHEN to say "I don't know"
→ Model learns WHEN to use search instead of guessing
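A sketch of that pipeline as a data-generation loop. All four helpers (`sample_snippet`, `make_question`, `model_answer`, `same_fact`) are hypothetical; in practice steps 2 and 4 are usually carried out by another LLM acting as question-writer and judge:

```python
def build_refusal_examples(corpus, sample_snippet, make_question,
                           model_answer, same_fact, n=1000):
    examples = []
    for _ in range(n):
        snippet = sample_snippet(corpus)           # 1. EXTRACT a factual snippet
        question, truth = make_question(snippet)   # 2. GENERATE a question + reference answer
        answer = model_answer(question)            # 3. ANSWER with the current model
        if not same_fact(answer, truth):           # 4. SCORE against the source
            # 5. TRAIN on refusals: the desired behavior where the model gets it wrong
            examples.append((question, "I don't have reliable information about that."))
    return examples                                # 6. fine-tune on these, then REPEAT
```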
For verifiable domains (math, code), correctness can be checked automatically:
Problem: Emily buys 3 apples and 2 oranges. Each orange costs $2.
Total cost is $13. What is the cost of each apple?
🔄 THE RL LOOP (repeated millions of times):
1. GENERATE: Model produces many solutions
→ Solution A: "The answer is $3."
→ Solution B: "Oranges: 2×$2=$4, Apples: $13-$4=$9..."
→ Solution C: "Each apple costs $5." (wrong)
2. SCORE: Check which solutions are correct
→ A: ✅ correct but no reasoning
→ B: ✅ correct WITH step-by-step reasoning
→ C: ❌ wrong
3. TRAIN: Update model weights on winning solutions
→ Model learns to prefer B's reasoning style
4. REPEAT → Model gets better at reasoning over time
→ No human labels needed - correctness is automatically verified
→ This is how models develop "Aha moments" (the DeepSeek-R1 paper)
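A conceptual sketch of that loop. The helpers (`sample_solutions`, `extract_answer`, `finetune_on`) are hypothetical, and real systems update the policy with methods like PPO or GRPO rather than plain fine-tuning on winners, but the generate/score/train shape is the same:

```python
def rl_on_verifiable_problems(model, problems, sample_solutions,
                              extract_answer, finetune_on, rounds=10):
    for _ in range(rounds):                                        # 4. REPEAT
        winners = []
        for prompt, correct_answer in problems:
            solutions = sample_solutions(model, prompt, n=16)      # 1. GENERATE many attempts
            winners += [(prompt, s) for s in solutions
                        if extract_answer(s) == correct_answer]    # 2. SCORE automatically
        model = finetune_on(model, winners)                        # 3. TRAIN on winning solutions
    return model
```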
For unverifiable domains (jokes, creative writing, style), a learned reward model stands in for human judgment (RLHF):
PHASE 1: Train the Reward Model
─────────────────────────────────────────────────────────
1. Task: "Write a joke about pelicans"
2. Model generates 5 different jokes
3. Humans rank them: B > D > A > E > C
4. Train reward model on these preferences
5. REPEAT until reward model is accurate
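A sketch of how one human ranking (B > D > A > E > C) can be turned into a training signal, using a Bradley-Terry style pairwise loss. `reward_model` here is a hypothetical network that maps (prompt, response) to a scalar score tensor; the exact loss varies across papers:

```python
import itertools
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, ranked_responses):
    # ranked_responses is ordered best -> worst, e.g. [B, D, A, E, C]
    scores = [reward_model(prompt, r) for r in ranked_responses]   # scalar tensor per response
    loss = 0.0
    for better, worse in itertools.combinations(range(len(scores)), 2):
        # Every higher-ranked response should score above every lower-ranked one.
        loss = loss - F.logsigmoid(scores[better] - scores[worse])
    return loss
```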
PHASE 2: Use Reward Model at Scale
─────────────────────────────────────────────────────────
1. GENERATE: Model produces many responses
2. SCORE: Reward model rates each one
3. TRAIN: Update on highest-scored responses
4. REPEAT (but CAPPED to ~100s of iterations)
⚠️ WHY CAPPED?
After too many iterations, model finds adversarial exploits
→ Might output "the the the the" to game reward model
→ This is called "reward hacking"
💡 WHY A REWARD MODEL AT ALL?
→ Humans can't label millions of examples, but a reward model can
→ It exploits the "discriminator-generator gap": judging outputs is easier than creating them