2025-04-13
Goal: Understand how scale predicts LLM performance
Builds on: Pretraining, Distributed Training
We’ll cover:
What are scaling laws?
Power laws: the foundation
Scaling laws in deep learning
The Chinchilla scaling law
Implications and limitations
The “bitter lesson” of AI research
Power laws: mathematical equations of the form y = bx^a
Describe how one quantity varies as a power of another
On a log-log plot, power laws are straight lines: taking logs gives log y = a·log x + log b, a line with slope a and intercept log b (see the fit sketch below)
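A minimal sketch of that straight-line behavior, using synthetic (parameter count, loss) pairs rather than real measurements: fitting a line in log-log space recovers the exponent a as the slope.

```python
# Minimal sketch: recover a power-law exponent by fitting a line in log-log space.
# The data here are synthetic; with real (params, loss) pairs the same fit applies.
import numpy as np

x = np.array([1e6, 1e7, 1e8, 1e9])   # hypothetical parameter counts
y = 3.0 * x ** -0.08                  # loss following y = b * x**a (a < 0: loss falls as x grows)

# log y = a*log x + log b, so a linear fit in log space gives slope a and intercept log b
a, log_b = np.polyfit(np.log(x), np.log(y), 1)
print(f"exponent a ~ {a:.3f}, prefactor b ~ {np.exp(log_b):.3f}")  # ~ -0.080 and ~ 3.000
```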
Model Size (N):
Number of parameters (weights) in the model
Measures model capacity
Dataset Size (D):
Number of tokens the model is trained on
Tokens can be words, subwords, pixels, etc.
Compute (C):
Floating point operations (FLOPs) used during training
Enables both larger models and more training data
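A common rule of thumb (used in the Kaplan and Chinchilla scaling-law papers) is that training a dense transformer costs roughly 6 FLOPs per parameter per training token, i.e. C ≈ 6ND. A minimal sketch of that estimate; the function name and example numbers are illustrative assumptions:

```python
# Rough training-compute estimate via the C ~ 6 * N * D rule of thumb
# (~6 FLOPs per parameter per training token, covering forward + backward passes).
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

# Example: a 70B-parameter model trained on 1.4T tokens (Chinchilla-like scale)
print(f"{training_flops(70e9, 1.4e12):.2e} FLOPs")  # ~5.9e+23
```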
Small models with more training data can outperform larger models
Chinchilla (70B params) outperformed the much larger Gopher (280B params) by training on far more tokens (~1.4T vs. ~300B) at a similar compute budget (see the allocation sketch below)
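A sketch of how a fixed compute budget might be split between model size and data, assuming the C ≈ 6ND rule above plus the roughly 20-tokens-per-parameter ratio often quoted for Chinchilla; the actual paper fits measured exponents rather than this fixed ratio, so treat the numbers as illustrative.

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Roughly compute-optimal (n_params, n_tokens) for a given FLOP budget."""
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

n, d = chinchilla_allocation(5.9e23)  # roughly Chinchilla's training budget
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7e10 params, ~1.4e12 tokens
```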
Key observation: Scaling beats intricate, expert-designed systems.
Simpler systems trained on more data outperform complex, human-designed approaches.
Lesson: Invest in compute and data rather than manual feature engineering.
Scaling often beats complex designs
Resource allocation: Optimize the ratio of model size to dataset size
Investment focus: Compute is the fundamental bottleneck
Research priorities: Algorithm improvements can shift the scaling curve
Future capabilities: May be predictable through extrapolation
AI timelines: Scaling laws inform predictions about AI progress
True or False: Scaling laws apply to all ML models
False. Generative models (e.g., LLMs) follow clear scaling laws, but discriminative models (e.g., image classifiers) often don’t.
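As a concrete example of such a clear scaling law for LLMs, the Chinchilla paper (Hoffmann et al., 2022) fits training loss with additive power-law terms in model size and data; the fitted constants are omitted here, so treat this as the general form rather than the specific fit:
L(N, D) = E + A/N^α + B/D^β
Here E is the irreducible loss, and the A/N^α and B/D^β terms fall off as power laws in parameters and tokens, which is why both quantities matter for the final loss.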
How does compute relate to scaling?
Compute drives scaling by enabling larger models and datasets. More FLOPs = more parameters or tokens.
What’s a power law?
A power law is y = bx^a, where y (e.g., loss) scales with x (e.g., parameters) raised to an exponent a. It's linear on a log-log plot.