25. Benchmarking

Learning objectives

  • Get to know some of the common LLM evals

Evals

From: Successful language model evals by Jason Wei

Examples of evaluation benchmarks

Hugging Face Open LLM Leaderboard

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy): Traditional metrics to evaluate the quality of text generated by LLMs (see the metric sketch after this list).
  • GLUE/SuperGLUE (General Language Understanding Evaluation). Evaluation on tasks that require a deeper understanding of language: text similarity, natural language inference, and sentiment analysis.
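As a quick illustration of these overlap metrics, here is a minimal sketch that scores a generated sentence against a reference with ROUGE and BLEU. It assumes the Hugging Face `evaluate` package (plus `rouge_score`) is installed; the example strings are made up.

```python
# Minimal sketch: scoring generated text with ROUGE and BLEU,
# assuming `pip install evaluate rouge_score` has been run.
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# ROUGE emphasizes n-gram recall overlap; BLEU emphasizes n-gram precision overlap.
rouge_scores = rouge.compute(predictions=predictions, references=references)
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(rouge_scores)          # e.g. rouge1, rouge2, rougeL, rougeLsum
print(bleu_scores["bleu"])   # single BLEU score between 0 and 1
```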

  • MMLU (Massive Multi-task Language Understanding). Favorite eval of DeepMind and Google.
    • It tests the model’s ability to generalize across different fields: maths, science, history, etc.
    • It measures how well the model can transfer knowledge across different domains.
  • GSM8K (Grade School Math 8K). Used to evaluate reasoning combined with chain-of-thought prompting.
    • Dataset of 8,500 high-quality, linguistically diverse grade school math word problems.
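GSM8K is typically scored with exact match on the final number: the gold solutions end in a line of the form "#### <answer>", and the model's chain of thought is parsed for its last number. Below is a minimal sketch of that scoring; the regex and the example strings are illustrative assumptions, not the official grader.

```python
# Minimal sketch of GSM8K-style exact-match scoring, assuming gold answers
# end with "#### <number>" and the model ends its reasoning with a final number.
import re

NUMBER_RE = re.compile(r"-?\$?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number in the text, stripping commas and dollar signs."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "").replace("$", "")

def gsm8k_exact_match(model_output, gold_answer):
    """Compare the model's final number against the number after '####' in the gold answer."""
    gold = gold_answer.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and pred == gold

# Hypothetical example:
gold = "She sold 48 clips in April and 24 in May, so 48 + 24 = 72. #### 72"
output = "In April she sold 48 clips and in May 24, for a total of 48 + 24 = 72. The answer is 72."
print(gsm8k_exact_match(output, gold))  # True
```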

  • MATH. Used in most LLM papers.
    • 12,500 problems sourced from high school math competitions.
  • HumanEval. For coding. Developed by OpenAI.
    • It focuses on the models’ ability to comprehend language, reason, and solve problems related to algorithms and simple mathematics.
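HumanEval is usually reported as pass@k: sample n completions per problem, run the unit tests, and estimate the probability that at least one of k samples passes. The sketch below shows the standard unbiased estimator popularized by the HumanEval paper; the sample counts in the usage line are made up.

```python
# Minimal sketch of the pass@k estimator for coding evals:
# n = completions sampled per problem, c = completions that pass the unit tests.
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 30 passing, estimate pass@10
print(round(pass_at_k(200, 30, 10), 4))
```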

What makes a good eval?

  • It should measure something central to intelligence and be based on a meaningful task.
  • It should be easy to understand.
  • It shouldn’t fluctuate based on the prompt.
  • It shouldn’t make mistakes.
  • It shouldn’t take too much work to run.
  • The grading should be correct.
  • It should provide examples.