14. Positional Encoding
Bryan Tegomoh
2025-03-30
Why Positional Encoding?
- Transformers lack recurrence (unlike RNNs)
- Attention alone has no inherent sense of word order
- Solution: add positional information to the token embeddings

The Power of Positional Encoding
Naive Approaches
- One-hot encoding: a unique vector per position
  - Issue: doesn't scale with sequence length, and carries no notion of ordinality
- Linear indices: 1, 2, 3, …
  - Issue: unbounded values, poor generalization beyond the training length
- Need: an encoding that is unique per position, bounded, and aware of relative distance (see the snippet below)
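To make the contrast concrete, here is a small illustrative NumPy snippet (the sequence length and printed values are just for demonstration):

```python
import numpy as np

seq_len = 100  # illustrative sequence length

# One-hot: one dimension per position, so the encoding width grows with the
# sequence length, and all distinct positions are equally far apart.
one_hot = np.eye(seq_len)
print(one_hot.shape)                                   # (100, 100): does not scale
print(np.linalg.norm(one_hot[3] - one_hot[4]),
      np.linalg.norm(one_hot[3] - one_hot[90]))        # both ~1.414: no ordinality

# Linear indices: a single unbounded scalar per position that dwarfs typical
# embedding values and generalizes poorly beyond the training length.
linear_index = np.arange(seq_len, dtype=float)
print(linear_index[-1])                                # 99.0, and it keeps growing
```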
Sinusoidal Positional Encoding
- Proposed in “Attention Is All You Need” (Vaswani et al., 2017)
- For position \( k \), dimension-pair index \( i \), and model dimension \( d \):
  - \( P(k, 2i) = \sin\big(k / 10000^{2i/d}\big) \)
  - \( P(k, 2i+1) = \cos\big(k / 10000^{2i/d}\big) \)
- The resulting vector for each position is added to its token embedding (a minimal sketch follows below)
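A minimal NumPy sketch of these formulas (the function name and the assumption that \( d \) is even are mine, not from the paper):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d: int) -> np.ndarray:
    """Return a (seq_len, d) matrix whose row k encodes position k (d assumed even)."""
    k = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1
    i = np.arange(d // 2)[None, :]           # dimension-pair index
    angles = k / (10000 ** (2 * i / d))      # k / 10000^(2i/d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings, e.g.:
# x = token_embeddings + sinusoidal_encoding(seq_len, d_model)
```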
How It Works
- Sine/cosine pairs at varying frequencies
- Wavelengths: \( 2\pi \) to \( 10000 \cdot 2\pi \), a geometric progression across dimension pairs
- Properties (checked numerically below):
  - Unique encoding per position
  - Consistent relative distances: the similarity between two encodings depends only on their offset
  - Generalizes to sequences longer than those seen in training
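These properties can be checked numerically. The sketch below rebuilds the encoding matrix with NumPy and verifies that positions are distinct and that the dot product between two encodings depends only on their offset:

```python
import numpy as np

seq_len, d = 200, 64
k = np.arange(seq_len)[:, None]
i = np.arange(d // 2)[None, :]
angles = k / (10000 ** (2 * i / d))
pe = np.zeros((seq_len, d))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

# Uniqueness: every position gets its own vector.
assert len(np.unique(pe.round(6), axis=0)) == seq_len

# Consistent relative distances: the dot product between positions p and p + offset
# works out to sum_i cos(offset * w_i), which does not depend on p at all.
for offset in (1, 5, 20):
    sims = np.array([pe[p] @ pe[p + offset] for p in range(seq_len - offset)])
    print(offset, sims.std())    # ~0 up to floating-point error
```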
Visualization
- Imagine a 512-dim encoding for a 100-token sequence
- See Mehreen Saeed’s blog for a plot
- Each row = position, each column = dimension
- Gradual frequency shift across dimensions encodes order (a plotting sketch follows below)
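To reproduce such a plot yourself, a minimal sketch using NumPy and matplotlib (not Saeed's original code) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

seq_len, d = 100, 512                         # 100 tokens, 512-dimensional encoding
k = np.arange(seq_len)[:, None]
i = np.arange(d // 2)[None, :]
angles = k / (10000 ** (2 * i / d))
pe = np.zeros((seq_len, d))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

plt.figure(figsize=(10, 4))
plt.imshow(pe, aspect="auto", cmap="RdBu")    # rows = positions, columns = dimensions
plt.xlabel("Encoding dimension")
plt.ylabel("Token position")
plt.colorbar(label="Encoding value")
plt.title("Sinusoidal positional encoding (100 tokens x 512 dims)")
plt.show()
```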
Modern Twist: RoPE
- Rotary Position Embedding (RoPE):
  - Rotates query/key vectors by position-dependent angles inside attention, so scores depend on relative position
  - Tends to be more stable and faster to learn than absolute encodings
  - Widely used in frontier LLMs (e.g., LLaMA)
- See EleutherAI’s blog post on rotary embeddings for details (a minimal rotation sketch follows below)
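A stripped-down NumPy sketch of the rotation at the heart of RoPE (illustrative only; it is not the exact LLaMA implementation, which applies the rotation to queries and keys inside each attention layer and pairs dimensions slightly differently):

```python
import numpy as np

def rope_rotate(x: np.ndarray, position: int) -> np.ndarray:
    """Rotate consecutive (even, odd) dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]                                  # d assumed even
    i = np.arange(d // 2)
    theta = position / (10000 ** (2 * i / d))        # same frequency schedule as sinusoidal PE
    cos, sin = np.cos(theta), np.sin(theta)
    rotated = np.empty_like(x)
    rotated[0::2] = x[0::2] * cos - x[1::2] * sin    # 2-D rotation applied pair by pair
    rotated[1::2] = x[0::2] * sin + x[1::2] * cos
    return rotated

# Because each pair is rotated by an angle proportional to its position, the dot
# product of a rotated query and key depends only on their relative offset:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
print(rope_rotate(q, 10) @ rope_rotate(k, 14))       # offset 4 ...
print(rope_rotate(q, 100) @ rope_rotate(k, 104))     # ... same value (up to float error)
```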