23. Context Scaling
Context Scaling
Learning objectives
Adapt models to long-context inputs
Review RoPE
Sources
Following the Gen AI Handbook, we looked at:
blog post by Belandros Pan (Hugging Face)
blog post by Honglu Fan, et al. (Eleuther AI)
blog post by Gradient
The Problem
Issues with Long Contexts
batch alignment
memory usage
attention space: O(N²) (see the sketch below)
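To make the quadratic attention cost concrete, here is a minimal PyTorch sketch of naive attention (shapes and names are illustrative, not taken from the sources):

import torch

def naive_attention(q, k, v):
    # Illustrative sketch. q, k, v: (batch, heads, N, d_head).
    # The score matrix is (batch, heads, N, N): memory grows as O(N^2) in sequence length.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = scores.softmax(dim=-1)
    return weights @ v

# Doubling N quadruples the score-matrix size:
# N = 4k  ->  4k x  4k  =  16M entries per head
# N = 32k -> 32k x 32k  ~   1B entries per head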
Fixes
Grouped Query Attention (GQA): groups of query heads share the same key and value heads, shrinking the KV cache (see the sketch after this list)
Gradient Checkpointing: save activations only every √N layers and recompute the rest during the backward pass
LoRA
Distributed Training
Sample packing: multiple shorter sequences are packed together into one long training sequence
Flash Attention: O(N) memory (attention computed in tiles, never materializing the full N×N score matrix)
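A minimal PyTorch sketch of the GQA idea, assuming 8 query heads sharing 2 key/value heads (names and shapes are illustrative, not any particular model's implementation):

import torch

def grouped_query_attention(q, k, v, n_query_heads, n_kv_heads):
    # Illustrative sketch. q: (batch, n_query_heads, N, d); k, v: (batch, n_kv_heads, N, d).
    # Each group of query heads shares one key/value head, shrinking the
    # KV cache by a factor of n_query_heads / n_kv_heads.
    group = n_query_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # expand KV heads to match query heads
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v

b, N, d = 2, 1024, 64
q = torch.randn(b, 8, N, d)   # 8 query heads
k = torch.randn(b, 2, N, d)   # only 2 key/value heads
v = torch.randn(b, 2, N, d)
out = grouped_query_attention(q, k, v, n_query_heads=8, n_kv_heads=2)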
Rotary Position Embedding
RoPE
RoPE matrix
PE vectors maintain magnitude (rotation preserves the norm)
resilient when test sequences are longer than training sequences (see the sketch below)
image credit: Eleuther AI
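A minimal PyTorch sketch of the standard RoPE formulation (base theta = 10000; function and variable names are illustrative, not taken from the blog posts):

import torch

def rope(x, theta=10000.0):
    # Illustrative sketch. x: (batch, heads, N, d) with d even.
    # Each channel pair is rotated by an angle that depends on the position
    # and on the pair's frequency; the rotation preserves the vector norm.
    b, h, N, d = x.shape
    freqs = theta ** (-torch.arange(0, d, 2).float() / d)         # (d/2,)
    angles = torch.arange(N).float()[:, None] * freqs[None, :]    # (N, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)                                    # back to (..., d)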
Experiments
Gradient
H. Liu et al.: “1M-32K, 10M-131k, 10M-262k, 25M-524k, 50M-1048k (theta-context length) schedule” (see the sketch below)
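The intuition behind a (theta, context length) schedule: raising the RoPE base theta slows every rotation frequency, so positions far beyond the training length land on angles similar to those seen in training. A minimal sketch with illustrative values (not the actual schedule hyperparameters):

import torch

def rope_angles(positions, d=128, theta=10000.0):
    # Illustrative sketch: rotation angle of each channel pair at each position.
    # A larger base theta means slower rotation for the low-frequency pairs.
    freqs = theta ** (-torch.arange(0, d, 2).float() / d)
    return positions[:, None] * freqs[None, :]

pos = torch.arange(0, 32768, 4096).float()
base = rope_angles(pos, theta=10_000.0)        # original base
scaled = rope_angles(pos, theta=1_000_000.0)   # larger base: slower rotation
# The lowest-frequency pair rotates far more slowly with the larger base,
# so position 32k looks to it like a much shorter position did during training.
print(base[-1, -1] / scaled[-1, -1])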
Llama with YaRN
image credit: Eleuther AI