I recently watched Andrej Karpathy’s Deep Dive into LLMs like ChatGPT — an excellent hands-on walkthrough with concrete examples. Here’s my visual summary for future reference.
The process of tokenization can be explored interactively with the Tiktokenizer app:
"Hello world" → [15496, 995] (2 tokens)
"hello world" → [31373, 995] (2 tokens - different IDs due to case!)
"HelloWorld" → [15496, 10603] (2 tokens - camelCase split)
How LLMs generate text (token by token):
Input: "The cat sat on the"
1. Model predicts probability distribution for next token
→ "mat": 15%, "floor": 12%, "roof": 8%, "dog": 0.1%...
2. Sample from distribution (stochastic!)
→ Selected: "mat"
3. Append to context: "The cat sat on the mat"
4. REPEAT until <end> token or max length
→ "The cat sat on the mat and purred softly..."
→ Each run produces slightly different output (stochastic)
→ Context window limits how much the model can "remember"
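A minimal sketch of that loop, assuming a Hugging Face causal LM (GPT-2 here purely as a small stand-in; real chat models are far larger, but the loop is the same):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = tok("The cat sat on the", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                     # 4. repeat until max length
        logits = model(context).logits[0, -1]               # 1. distribution over next token
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # 2. sample (stochastic!)
        context = torch.cat([context, next_id[None]], dim=-1)  # 3. append to context
        if next_id.item() == tok.eos_token_id:              # stop at the end-of-text token
            break

print(tok.decode(context[0]))  # different every run, because sampling is stochastic
```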
Conversations are encoded as one long token stream, with special tokens marking each turn (the chat format):
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|>
<|im_start|>user<|im_sep|>What is 4 + 4?<|im_end|>
<|im_start|>assistant<|im_sep|>4 + 4 = 8<|im_end|>
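A sketch of how a conversation might be flattened into that format. The exact special tokens differ between models, and in practice `tokenizer.apply_chat_template` in Hugging Face transformers does this for you; the markers below are just the ones from the example above:

```python
# Sketch only: <|im_start|>/<|im_sep|>/<|im_end|> are the markers from the
# example above; other models use different special tokens.
def render_chat(messages):
    parts = [f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
             for m in messages]
    # The assistant's reply is then sampled token by token after this prefix.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

print(render_chat([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 4 + 4?"},
]))
```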
Model learns to use tools when uncertain:
User: "Who is Orson Kovacs?"
1. Model recognizes it doesn't know
→ Outputs: <SEARCH_START>Who is Orson Kovacs?<SEARCH_END>
2. System executes search, injects results
→ [Wikipedia article, news articles, etc.]
3. Model generates answer using search results
→ "Orson Kovacs is a fictional character..."
4. If still unsure → REPEAT search with refined query
→ Uses working memory (context) instead of vague recall (params)
→ Dramatically reduces hallucinations for factual queries
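A sketch of the orchestration loop around the model. `generate` and `run_search` are hypothetical callables standing in for the model and the search backend, and the `<SEARCH_RESULTS>` wrapper is made up for illustration; only the `<SEARCH_START>`/`<SEARCH_END>` markers come from the example above:

```python
import re

def answer_with_search(question, generate, run_search, max_rounds=3):
    context = f"User: {question}\nAssistant: "
    for _ in range(max_rounds):
        output = generate(context)                                # model writes freely
        match = re.search(r"<SEARCH_START>(.*?)<SEARCH_END>", output, re.S)
        if match is None:
            return output                                         # no tool call: final answer
        results = run_search(match.group(1))                      # system executes the search
        # Inject results into working memory (the context window), then continue.
        context += output + f"\n<SEARCH_RESULTS>{results}<SEARCH_RESULTS_END>\n"
    return generate(context)                                      # stop searching, answer anyway
```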
Teaching models to recognize their knowledge limits:
1. EXTRACT: Take a snippet from training data
→ "The Eiffel Tower weighs approximately 10,100 tons"
2. GENERATE: Create a factual question about it
→ "What is the mass of the Eiffel Tower?"
3. ANSWER: Have the model answer the question
→ Model says: "The Eiffel Tower weighs 7,300 tons"
4. SCORE: Compare answer against original source
→ ❌ Wrong! Model hallucinated.
5. TRAIN: Teach model to refuse or use tools when unsure
→ "I don't have reliable information about..."
6. REPEAT → Model learns its knowledge boundaries
→ Model learns WHEN to say "I don't know"
→ Model learns WHEN to use search instead of guessing
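A sketch of that pipeline as a data-generation loop. All four helpers (`sample_snippet`, `make_question`, `model_answer`, `same_fact`) are hypothetical; in practice steps 2 and 4 are usually carried out by another LLM acting as question-writer and judge:

```python
def build_refusal_examples(corpus, sample_snippet, make_question,
                           model_answer, same_fact, n=1000):
    examples = []
    for _ in range(n):
        snippet = sample_snippet(corpus)           # 1. EXTRACT a factual snippet
        question, truth = make_question(snippet)   # 2. GENERATE a question + reference answer
        answer = model_answer(question)            # 3. ANSWER with the current model
        if not same_fact(answer, truth):           # 4. SCORE against the source
            # 5. TRAIN on refusals: the desired behavior where the model gets it wrong
            examples.append((question, "I don't have reliable information about that."))
    return examples                                # 6. fine-tune on these, then REPEAT
```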
For verifiable domains (math, code), correctness can be checked automatically:
Problem: Emily buys 3 apples and 2 oranges. Each orange costs $2.
Total cost is $13. What is the cost of each apple?
🔄 THE RL LOOP (repeated millions of times):
1. GENERATE: Model produces many solutions
→ Solution A: "The answer is $3."
→ Solution B: "Oranges: 2×$2=$4, Apples: $13-$4=$9..."
→ Solution C: "Each apple costs $5." (wrong)
2. SCORE: Check which solutions are correct
→ A: ✅ correct but no reasoning
→ B: ✅ correct WITH step-by-step reasoning
→ C: ❌ wrong
3. TRAIN: Update model weights on winning solutions
→ Model learns to prefer B's reasoning style
4. REPEAT → Model gets better at reasoning over time
→ No human labels needed - correctness is automatically verified
→ This is how models develop "Aha moments" (the DeepSeek-R1 paper)
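A conceptual sketch of that loop. The helpers (`sample_solutions`, `extract_answer`, `finetune_on`) are hypothetical, and real systems update the policy with methods like PPO or GRPO rather than plain fine-tuning on winners, but the generate/score/train shape is the same:

```python
def rl_on_verifiable_problems(model, problems, sample_solutions,
                              extract_answer, finetune_on, rounds=10):
    for _ in range(rounds):                                        # 4. REPEAT
        winners = []
        for prompt, correct_answer in problems:
            solutions = sample_solutions(model, prompt, n=16)      # 1. GENERATE many attempts
            winners += [(prompt, s) for s in solutions
                        if extract_answer(s) == correct_answer]    # 2. SCORE automatically
        model = finetune_on(model, winners)                        # 3. TRAIN on winning solutions
    return model
```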
For unverifiable domains (jokes, creative writing, style), a learned reward model stands in for human judgment (RLHF):
PHASE 1: Train the Reward Model
─────────────────────────────────────────────────────────
1. Task: "Write a joke about pelicans"
2. Model generates 5 different jokes
3. Humans rank them: B > D > A > E > C
4. Train reward model on these preferences
5. REPEAT until reward model is accurate
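A sketch of how one human ranking (B > D > A > E > C) can be turned into a training signal, using a Bradley-Terry style pairwise loss. `reward_model` here is a hypothetical network that maps (prompt, response) to a scalar score tensor; the exact loss varies across papers:

```python
import itertools
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, ranked_responses):
    # ranked_responses is ordered best -> worst, e.g. [B, D, A, E, C]
    scores = [reward_model(prompt, r) for r in ranked_responses]   # scalar tensor per response
    loss = 0.0
    for better, worse in itertools.combinations(range(len(scores)), 2):
        # Every higher-ranked response should score above every lower-ranked one.
        loss = loss - F.logsigmoid(scores[better] - scores[worse])
    return loss
```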
PHASE 2: Use Reward Model at Scale
─────────────────────────────────────────────────────────
1. GENERATE: Model produces many responses
2. SCORE: Reward model rates each one
3. TRAIN: Update on highest-scored responses
4. REPEAT (but CAPPED to ~100s of iterations)
⚠️ WHY CAPPED?
After too many iterations, model finds adversarial exploits
→ Might output "the the the the" to game reward model
→ This is called "reward hacking"
💡 WHY A REWARD MODEL AT ALL?
→ Humans can't label millions of examples, but a reward model can
→ It exploits the "discriminator-generator gap": judging outputs is easier than creating them