2025-03-30
Section III: Foundations for Modern Language Modeling
Focus:
Chapter 13: Tokenization
Chapter 14: Positional Encoding
Goal: Understand key concepts for training modern LLMs
“Tokenization is at the heart of much weirdness of LLMs” - Karpathy
Word-level tokenization:
Pros: Intuitive, shorter sequences than character-level
Cons: Fixed vocabulary → can’t handle unseen words or misspellings

Solution: Split words into subword units.
Example: “playing” → “play” + “##ing” (the “##” prefix marks a word-internal piece, WordPiece-style)
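For illustration, a minimal sketch of how a WordPiece-style splitter produces pieces like “play” + “##ing” (assumption: the toy vocabulary and the greedy longest-match rule here are illustrative only, not any particular library’s implementation; a real tokenizer learns its vocabulary from data):

```python
# WordPiece-style splitting sketch: greedy longest-match against a known vocabulary.
def wordpiece_split(word, vocab):
    """Greedily match the longest known prefix, marking word-internal pieces with '##'."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                      # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate    # continuation pieces carry the '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                    # no known piece fits -> unknown token
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "un", "##known"}   # made-up toy vocabulary
print(wordpiece_split("playing", toy_vocab))   # -> ['play', '##ing']
print(wordpiece_split("unknown", toy_vocab))   # -> ['un', '##known']
```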
Algorithm: Byte-Pair Encoding (BPE)
Starts with character-level tokens
Iteratively merges the most frequent adjacent pair into a new token
Analogy: Huffman coding (but dynamic)
Result: Compact, adaptive vocabulary
See Andrej Karpathy’s video “Let’s build the GPT Tokenizer” for intuition
Worked example:
Corpus: “l o w”, “l o w e r”, “l o w e s t” (each word split into characters)
Merges: l + o → lo, lo + w → low, low + e → lowe
Resulting tokens: low, lowe, r, s, t
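To make the merge loop concrete, here is a minimal BPE sketch (a simplification of Sennrich et al.’s algorithm, not any particular library’s implementation; the helpers get_pair_counts, merge_pair, and encode are made-up names). It reproduces the l/o/w toy corpus above and then segments a new word with the learned merges:

```python
# Minimal BPE training sketch; real tokenizers such as GPT-2's byte-level BPE
# add many practical details (byte fallback, pre-tokenization, special tokens).
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation
    (naive string replace; fine for this toy corpus)."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

def encode(word, merges):
    """Segment a new word: start from characters and apply the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy corpus from the notes: each word starts as space-separated characters.
vocab = {"l o w": 5, "l o w e r": 2, "l o w e s t": 2}

merges = []
for _ in range(3):                       # three merge steps, as in the example above
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)                    # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(list(vocab))               # ['low', 'lowe r', 'lowe s t']
print(encode("lowest", merges))  # ['lowe', 's', 't']
```

Note that “lowest” never becomes a single token, yet it still decomposes into known pieces; this is how subword vocabularies sidestep the out-of-vocabulary problem noted above.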