2025-04-13
Goal: Understand how scale predicts LLM performance
Builds on: Pretraining, Distributed Training
We’ll cover:
What are scaling laws?
Power laws: the foundation
Scaling laws in deep learning
The Chinchilla scaling law
Implications and limitations
The “bitter lesson” of AI research
Power laws: mathematical equations of the form y = bx^a
Describe how one quantity varies as a power of another
On a log-log plot, power laws are straight lines: taking logs gives log y = a·log x + log b, a line with slope a and intercept log b (see the fit sketch below)
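A minimal sketch of that straight-line behavior, using synthetic (parameter count, loss) pairs rather than real measurements: fitting a line in log-log space recovers the exponent a as the slope.

```python
# Minimal sketch: recover a power-law exponent by fitting a line in log-log space.
# The data here are synthetic; with real (params, loss) pairs the same fit applies.
import numpy as np

x = np.array([1e6, 1e7, 1e8, 1e9])   # hypothetical parameter counts
y = 3.0 * x ** -0.08                  # loss following y = b * x**a (a < 0: loss falls as x grows)

# log y = a*log x + log b, so a linear fit in log space gives slope a and intercept log b
a, log_b = np.polyfit(np.log(x), np.log(y), 1)
print(f"exponent a ~ {a:.3f}, prefactor b ~ {np.exp(log_b):.3f}")  # ~ -0.080 and ~ 3.000
```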
Model Size (N):
Number of parameters (weights) in the model
Measures model capacity
Dataset Size (D):
Number of tokens the model is trained on
Tokens can be words, subwords, pixels, etc.
Compute (C):
Floating point operations (FLOPs) used during training
Enables both larger models and more training data
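A common rule of thumb (used in the Kaplan and Chinchilla scaling-law papers) is that training a dense transformer costs roughly 6 FLOPs per parameter per training token, i.e. C ≈ 6ND. A minimal sketch of that estimate; the function name and example numbers are illustrative assumptions:

```python
# Rough training-compute estimate via the C ~ 6 * N * D rule of thumb
# (~6 FLOPs per parameter per training token, covering forward + backward passes).
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

# Example: a 70B-parameter model trained on 1.4T tokens (Chinchilla-like scale)
print(f"{training_flops(70e9, 1.4e12):.2e} FLOPs")  # ~5.9e+23
```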
Small models with more training data can outperform larger models
Chinchilla (70B params) outperformed the much larger Gopher (280B params) by training on far more tokens (~1.4T vs. ~300B) at a similar compute budget (see the allocation sketch below)
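A sketch of how a fixed compute budget might be split between model size and data, assuming the C ≈ 6ND rule above plus the roughly 20-tokens-per-parameter ratio often quoted for Chinchilla; the actual paper fits measured exponents rather than this fixed ratio, so treat the numbers as illustrative.

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Roughly compute-optimal (n_params, n_tokens) for a given FLOP budget."""
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

n, d = chinchilla_allocation(5.9e23)  # roughly Chinchilla's training budget
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7e10 params, ~1.4e12 tokens
```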
Key observation: Scaling beats intricate, expert-designed systems.
Simpler systems trained on more data outperform complex, human-designed approaches.
Lesson: Invest in compute and data rather than manual feature engineering.
Scaling often beats complex designs
Resource allocation: Optimize the ratio of model size to dataset size
Investment focus: Compute is the fundamental bottleneck
Research priorities: Algorithm improvements can shift the scaling curve
Future capabilities: May be predictable through extrapolation
AI timelines: Scaling laws inform predictions about AI progress
True or False: Scaling laws apply to all ML models
False. Generative models (e.g., LLMs) follow clear scaling laws, but discriminative models (e.g., image classifiers) often don’t.
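As a concrete example of such a clear scaling law for LLMs, the Chinchilla paper (Hoffmann et al., 2022) fits training loss with additive power-law terms in model size and data; the fitted constants are omitted here, so treat this as the general form rather than the specific fit:
L(N, D) = E + A/N^α + B/D^β
Here E is the irreducible loss, and the A/N^α and B/D^β terms fall off as power laws in parameters and tokens, which is why both quantities matter for the final loss.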
How does compute relate to scaling?
Compute drives scaling by enabling larger models and datasets. More FLOPs = more parameters or tokens.
What’s a power law?
A power law is y = bx^a, where y (e.g., loss) scales with x (e.g., parameters) raised to an exponent a. It's linear on a log-log plot.