23. Context Scaling

Learning objectives

  • Adapt to long context inputs
  • Review RoPE

Sources

Following the Gen AI Handbook, we looked at techniques for scaling context length.

The Problem

Issues with Long Contexts

  • batch alignment: sequences in a batch must be padded to a common length
  • memory usage grows with context length
  • attention cost: \(O(N^{2})\) in sequence length (made concrete below)
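
To make the quadratic term concrete, a back-of-the-envelope calculation (the sequence length and dtype are illustrative assumptions):

```python
# Rough cost of materializing one attention score matrix.
n = 32_768                       # tokens in the sequence (illustrative)
bytes_fp16 = 2                   # bytes per score at fp16
score_matrix = n * n * bytes_fp16
print(f"{score_matrix / 2**30:.1f} GiB per head, per layer")  # -> 2.0 GiB
```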

Fixes

  • Grouped Query Attention (GQA): multiple query heads share each key/value head (see the sketch after this list)
  • Gradient checkpointing: saves activations only every \(\sqrt{N}\) layers, recomputing the rest in the backward pass
  • LoRA
  • Distributed Training
  • Sample packing: short sequences are concatenated into one long sequence, keeping batches dense
  • Flash Attention: \(O(N)\) memory, computing attention blockwise so the full score matrix is never materialized
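
A minimal sketch of the GQA idea in PyTorch (tensor shapes are assumptions, not any particular model's layout): each group of query heads attends against one shared key/value head, shrinking the KV cache by the group size.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention sketch.
    Assumed shapes: q (batch, n_q_heads, seq, head_dim),
    k and v (batch, n_kv_heads, seq, head_dim),
    with n_q_heads an integer multiple of n_kv_heads."""
    group = q.shape[1] // k.shape[1]
    # Duplicate each KV head across its group of query heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# 8 query heads sharing 2 KV heads: the KV cache shrinks 4x
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = gqa_attention(q, k, v)   # shape (1, 8, 16, 64)
```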

Rotary Position Embedding

[figure: the RoPE rotation matrix; image credit: EleutherAI]

  • rotation preserves the magnitude of the PE vectors
  • resilient when test sequences are longer than those seen in training
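
A minimal RoPE sketch in PyTorch (the base of 10000 is the original paper's value; shapes are assumptions). Each consecutive pair of dimensions is rotated by an angle proportional to the token's position, which is why magnitudes are preserved:

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (..., seq, dim)."""
    n, d = x.shape[-2], x.shape[-1]
    # Per-pair rotation rates: base^(-2i/d), fast for early dims, slow for late
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float) / d)
    angles = torch.arange(n, dtype=torch.float)[:, None] * inv_freq  # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    # A 2D rotation per pair: the norm is unchanged, only the angle
    # encodes the position
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```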

Experiments

Gradient AI

  • following H. Liu et al., a progressive \(\theta\)–context-length schedule: “1M-32K, 10M-131k, 10M-262k, 25M-524k, 50M-1048k” (illustrated below)
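
A sketch of why \(\theta\) grows along with the target context: a common rule of thumb is that the slowest-rotating RoPE pair should complete less than one full cycle over the window, i.e. its wavelength should exceed the context length (the head dimension of 128 is an assumption here):

```python
import math

def max_wavelength(theta, head_dim=128):
    # Slowest RoPE pair has inv_freq = theta^(-(d-2)/d),
    # i.e. a wavelength of 2*pi*theta^((d-2)/d) positions
    return 2 * math.pi * theta ** ((head_dim - 2) / head_dim)

# Each (theta, context) step of the schedule keeps the slowest
# wavelength above the context length
for theta, ctx in [(1e6, 32_768), (10e6, 131_072), (10e6, 262_144),
                   (25e6, 524_288), (50e6, 1_048_576)]:
    print(f"theta={theta:.0e}  ctx={ctx:>9}  "
          f"max wavelength ~ {max_wavelength(theta):,.0f}")
```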

Llama with YaRN

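YaRN stretches the slow RoPE frequencies to cover the longer window while leaving the fast, local ones untouched. A minimal sketch of this NTK-by-parts blending, assuming the paper's default hyperparameters (\(\beta_{fast}=32\), \(\beta_{slow}=1\)) and a hypothetical 8x extension of a 4096-token window:

```python
import math
import torch

def yarn_inv_freq(head_dim=128, base=10000.0, scale=8.0,
                  orig_ctx=4096, beta_fast=32.0, beta_slow=1.0):
    """Blend original and interpolated RoPE frequencies per dimension.
    Hyperparameters here are the YaRN paper's defaults, used as
    assumptions for illustration."""
    inv_freq = base ** (-torch.arange(0, head_dim, 2,
                                      dtype=torch.float) / head_dim)
    # Full rotations each dimension pair completes over the original window
    rotations = orig_ctx * inv_freq / (2 * math.pi)
    # Interpolation weight: 0 for fast dims (>= beta_fast rotations,
    # left untouched), 1 for slow dims (<= beta_slow, fully interpolated)
    w = ((beta_fast - rotations) / (beta_fast - beta_slow)).clamp(0.0, 1.0)
    return (1 - w) * inv_freq + w * (inv_freq / scale)
```

YaRN additionally applies a small attention-temperature correction as the scale factor grows; that part is omitted from this sketch.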