2025-03-30
Section III: Foundations for Modern Language Modeling
Focus:
Chapter 13: Tokenization
Chapter 14: Positional Encoding
Goal: Understand key concepts for training modern LLMs
“Tokenization is at the heart of much weirdness of LLMs” - Karpathy
Word-level tokenization:
Pros: Intuitive, shorter sequences
Cons: Fixed vocabulary → unseen words and misspellings fall out of vocabulary
Solution: Split words into subword units.
Example: “playing” → “play” + “##ing”
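To make the idea concrete, below is a toy greedy longest-match splitter over a tiny hand-picked subword vocabulary (the "##" prefix marks word-internal pieces, as in the example above); the vocabulary and function name are invented for this sketch and are not any real tokenizer's API.

```python
# Toy illustration: with a fixed subword vocabulary, a word like "playing"
# can still be encoded by greedy longest-match splitting.
# VOCAB and split_into_subwords are made up for this sketch.
VOCAB = {"play", "##ing", "##ed", "jump", "##s"}

def split_into_subwords(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # word-internal pieces carry the "##" prefix
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece matches; give up on this word
    return pieces

print(split_into_subwords("playing"))  # ['play', '##ing']
print(split_into_subwords("jumped"))   # ['jump', '##ed']
```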
Algorithm: Byte-Pair Encoding (BPE)
Starts with character-level tokens
Iteratively merges the most frequent adjacent pair into a new token (see the sketch below)
Analogy: Huffman coding (but dynamic)
Result: Compact, adaptive vocabulary
See Andrej Karpathy’s video for intuition
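A minimal Python sketch of this merge loop, assuming the corpus is already split into words and that ties between equally frequent pairs are broken by first occurrence; the function names are illustrative only, not any library's implementation.

```python
# Minimal BPE training sketch, for intuition only.
from collections import Counter

def get_pair_counts(words):
    # words maps a tuple of symbols (one word) to its corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from character-level tokens, then greedily merge the most
    # frequent adjacent pair, num_merges times
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words
```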
Worked example: corpus = low, lower, lowest
Start (character level): l o w , l o w e r , l o w e s t
Merge l + o → lo: lo w , lo w e r , lo w e s t
Merge lo + w → low: low , low e r , low e s t
Merge low + e → lowe: low , lowe r , lowe s t
Final tokens: low , lowe , r , s , t
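Assuming the train_bpe sketch above is in scope, running it on this toy corpus reproduces the merge sequence and final tokens shown:

```python
corpus = ["low", "lower", "lowest"]
merges, words = train_bpe(corpus, num_merges=3)
print(merges)       # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(list(words))  # [('low',), ('lowe', 'r'), ('lowe', 's', 't')]
```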