14. Positional Encoding

Bryan Tegomoh

2025-03-30

Why Positional Encoding?

  • Transformers lack recurrence (unlike RNNs)

  • No inherent sense of word order

  • Solution: Add positional info to token embeddings

The Power of Positional Encoding

Transformer Architecture

Naive Approaches

  • One-hot encoding: Unique vector per position
    • Issue: Doesn’t scale, no ordinality
  • Linear indices: 1, 2, 3, …
    • Issue: Large values, poor generalization
  • Need: Unique, bounded, relative-aware encoding
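
To make these shortcomings concrete, here is a small NumPy sketch (my own illustration, not from the slides) of the two naive schemes:

    import numpy as np

    seq_len = 6

    # One-hot: a unique basis vector per position. The width must match the
    # maximum sequence length, and every pair of positions is equally distant,
    # so the encoding carries no notion that position 3 is close to position 4.
    one_hot = np.eye(seq_len)

    # Linear index: a single unbounded scalar per position. Values grow with
    # sequence length, which skews magnitudes and generalizes poorly to
    # sequences longer than those seen during training.
    linear = np.arange(seq_len, dtype=np.float32)

    print(one_hot)
    print(linear)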

Sinusoidal Positional Encoding

  • Proposed in “Attention Is All You Need” (Vaswani et al., 2017)
  • For position k and model dimension d:
    • P(k, 2i) = sin(k / 10000^{2i/d})
    • P(k, 2i+1) = cos(k / 10000^{2i/d})
  • Vector per position, added to token embedding
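
A minimal NumPy sketch of these formulas (the function name and array shapes are my own; it assumes an even model dimension d):

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
        k = np.arange(seq_len)[:, None]              # positions 0 .. seq_len-1
        i2 = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
        angles = k / np.power(10000.0, i2 / d_model)

        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                 # P(k, 2i)   = sin(...)
        pe[:, 1::2] = np.cos(angles)                 # P(k, 2i+1) = cos(...)
        return pe

    # The encoding is added to the token embeddings before the first layer:
    # x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)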

How It Works

  • Sine/cosine pairs at varying frequencies
  • Wavelengths form a geometric progression from 2π to 10000 · 2π
  • Properties:
    • Unique per position
    • Distances between encodings depend only on the position offset
    • Generalizes to longer sequences
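
A quick numerical check of these properties, reusing the sinusoidal_positional_encoding sketch above:

    import numpy as np

    pe = sinusoidal_positional_encoding(200, 64)

    # Unique per position: no two rows are identical.
    print(len(np.unique(pe.round(6), axis=0)))       # 200

    # Consistent distances: the distance between two encodings depends only on
    # the offset between the positions, not on where in the sequence they sit.
    print(np.linalg.norm(pe[10] - pe[15]))           # offset 5, early in the sequence
    print(np.linalg.norm(pe[150] - pe[155]))         # offset 5, later -- same value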

Visualization

  • Imagine a 512-dimensional encoding for a 100-token sequence
  • See Mehreen Saeed’s blog for a plot
  • Each row = position, each column = dimension
  • Gradual frequency shift encodes order
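
To reproduce a plot like the one described here, a short matplotlib sketch (again reusing the sinusoidal_positional_encoding function from above):

    import matplotlib.pyplot as plt

    pe = sinusoidal_positional_encoding(100, 512)    # 100 tokens, 512 dimensions

    plt.figure(figsize=(10, 4))
    plt.imshow(pe, aspect="auto", cmap="viridis")    # rows = positions, columns = dimensions
    plt.xlabel("Embedding dimension")
    plt.ylabel("Token position")
    plt.colorbar(label="Encoding value")
    plt.title("Sinusoidal positional encoding")
    plt.show()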

Modern Twist: RoPE

  • Rotary Position Embedding (RoPE):
    • Rotates query and key vectors by position-dependent angles, so attention depends on relative positions
    • More stable, faster to learn
  • Widely used in frontier LLMs (e.g., LLaMA)
  • Check EleutherAI’s blog for details
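
A minimal NumPy sketch of the core RoPE rotation (my own illustration, assuming an even head dimension; in practice it is applied to the query and key vectors inside attention rather than added to token embeddings):

    import numpy as np

    def rope_rotate(x, positions):
        """Rotate each (2i, 2i+1) pair of x (shape: seq_len x d) by a position-dependent angle."""
        seq_len, d = x.shape
        theta = 10000.0 ** (-np.arange(0, d, 2) / d)     # one frequency per dimension pair
        angles = positions[:, None] * theta[None, :]     # (seq_len, d/2)
        cos, sin = np.cos(angles), np.sin(angles)

        x_even, x_odd = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x_even * cos - x_odd * sin        # standard 2-D rotation per pair
        out[:, 1::2] = x_even * sin + x_odd * cos
        return out

    # Usage: rotate queries and keys before the attention dot product, so the
    # score between positions m and n depends on the offset m - n rather than
    # on absolute positions.
    # q_rot = rope_rotate(q, np.arange(seq_len))
    # k_rot = rope_rotate(k, np.arange(seq_len))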