25. Benchmarking

Learning objectives

  • Get to know some of the common LLM evals

Evals

From: Successful language model evals by Jason Wei

Examples of evaluation benchmarks

Hugging Face Open LLM Leaderboard

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy): Traditional metrics to evaluate the quality of text generated by LLMs (see the metric sketch after this list).
  • GLUE/SuperGLUE (General Language Understanding Evaluation). Evaluation on tasks that require a deeper understanding of language: text similarity, natural language inference, and sentiment analysis.
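As a quick illustration of these overlap metrics, here is a minimal sketch that scores a generated sentence against a reference with ROUGE and BLEU. It assumes the Hugging Face `evaluate` package (plus `rouge_score`) is installed; the example strings are made up.

```python
# Minimal sketch: scoring generated text with ROUGE and BLEU,
# assuming `pip install evaluate rouge_score` has been run.
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# ROUGE emphasizes n-gram recall overlap; BLEU emphasizes n-gram precision overlap.
rouge_scores = rouge.compute(predictions=predictions, references=references)
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(rouge_scores)          # e.g. rouge1, rouge2, rougeL, rougeLsum
print(bleu_scores["bleu"])   # single BLEU score between 0 and 1
```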

  • MMLU (Massive Multi-task Language Understanding). Favorite eval of DeepMind and Google.
    • It tests the model’s ability to generalize across different fields: maths, science, history, etc.
    • It measures how well the model can transfer knowledge across different domains.
  • GSM8K (Grade School Math 8K). Used to evaluate reasoning combined with chain-of-thought prompting.
    • Dataset of 8,500 high-quality, linguistically diverse grade school math word problems.
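GSM8K is typically scored with exact match on the final number: the gold solutions end in a line of the form "#### <answer>", and the model's chain of thought is parsed for its last number. Below is a minimal sketch of that scoring; the regex and the example strings are illustrative assumptions, not the official grader.

```python
# Minimal sketch of GSM8K-style exact-match scoring, assuming gold answers
# end with "#### <number>" and the model ends its reasoning with a final number.
import re

NUMBER_RE = re.compile(r"-?\$?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number in the text, stripping commas and dollar signs."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "").replace("$", "")

def gsm8k_exact_match(model_output, gold_answer):
    """Compare the model's final number against the number after '####' in the gold answer."""
    gold = gold_answer.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and pred == gold

# Hypothetical example:
gold = "She sold 48 clips in April and 24 in May, so 48 + 24 = 72. #### 72"
output = "In April she sold 48 clips and in May 24, for a total of 48 + 24 = 72. The answer is 72."
print(gsm8k_exact_match(output, gold))  # True
```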

  • MATH. Used in most LLM papers.
    • 12,500 problems sourced from high school math competitions.
  • HumanEval. For coding. Developed by OpenAI.
    • It focuses on the models’ ability to comprehend language, reason, and solve problems related to algorithms and simple mathematics.
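HumanEval is usually reported as pass@k: sample n completions per problem, run the unit tests, and estimate the probability that at least one of k samples passes. The sketch below shows the standard unbiased estimator popularized by the HumanEval paper; the sample counts in the usage line are made up.

```python
# Minimal sketch of the pass@k estimator for coding evals:
# n = completions sampled per problem, c = completions that pass the unit tests.
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 30 passing, estimate pass@10
print(round(pass_at_k(200, 30, 10), 4))
```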

What makes a good eval?

  • It should measure something central to intelligence and be based on a meaningful task.
  • It should be easy to understand.
  • It shouldn’t fluctuate based on the prompt.
  • It shouldn’t make mistakes.
  • It shouldn’t take too much work to run.
  • The grading should be correct.
  • It should provide examples.