Examples of evaluation benchmarks
Hugging Face Open LLM Leaderboard
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy): Traditional n-gram overlap metrics to evaluate the quality of text generated by LLMs (see the overlap sketch after this list).
- GLUE/SuperGLUE (General Language Understanding Evaluation). Evaluates tasks that require deeper understanding of language: text similarity, natural language inference, and sentiment analysis.
- MMLU (Massive Multitask Language Understanding). A favorite eval of DeepMind and Google.
- It tests the model’s ability to generalize across different fields: maths, science, history, etc.
- How well the model can transfer knowledge across different domains.
- GSM8K (Grade School Math 8K). Used to evaluate multi-step reasoning, typically combined with chain-of-thought prompting (see the answer-extraction sketch after this list).
- Dataset of 8,500 high-quality, linguistically diverse grade school math word problems.
- MATH. Used in most LLM papers.
- 12,500 problems sourced from high school math competitions.
- HumanEval. A coding benchmark developed by OpenAI, reported as pass@k (see the sketch after this list).
- It focuses on the models’ ability to comprehend language, reason, and solve problems related to algorithms and simple mathematics.
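Both ROUGE and BLEU are n-gram overlap metrics. Below is a minimal sketch of unigram-level ROUGE-1 recall and clipped BLEU-style precision, written from scratch so the arithmetic is visible; full BLEU also combines higher-order n-grams with a brevity penalty, and libraries such as sacrebleu and rouge-score implement the complete definitions.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams that the candidate covers."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

def bleu1_precision(reference: str, candidate: str) -> float:
    """Clipped unigram precision, the building block of BLEU."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(f"ROUGE-1 recall:   {rouge1_recall(reference, candidate):.2f}")    # 0.83
print(f"BLEU-1 precision: {bleu1_precision(reference, candidate):.2f}")  # 0.83
```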
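For GSM8K, grading usually reduces to comparing final numeric answers. A minimal sketch, assuming the dataset's "#### <answer>" convention for reference solutions and the common heuristic of taking the last number in the model's chain-of-thought output; real harnesses add more normalization (units, fractions, thousands separators).

```python
import re

def reference_answer(solution: str) -> str:
    """GSM8K reference solutions end with '#### <answer>'; pull out that number."""
    match = re.search(r"####\s*(-?[\d,.]+)", solution)
    return match.group(1).replace(",", "") if match else ""

def model_answer(generation: str) -> str:
    """Heuristic: the last number in the chain-of-thought is treated as the answer."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", generation)
    return numbers[-1].replace(",", "") if numbers else ""

solution = "She sold 48 clips in April and 48 / 2 = 24 in May, so 48 + 24 = 72 clips.\n#### 72"
generation = "In April she sold 48 clips and in May half of that, 24. In total 48 + 24 = 72."
print(model_answer(generation) == reference_answer(solution))  # True
```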
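HumanEval is scored as pass@k: the probability that at least one of k sampled completions passes the unit tests. The sketch below implements the unbiased estimator from the paper that introduced the benchmark, given n samples per problem of which c pass; actually obtaining c requires executing the generated code against the tests in a sandbox, which is omitted here.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn per problem and c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 200 completions sampled for a problem, 30 of them pass the tests.
print(f"pass@1  = {pass_at_k(200, 30, 1):.3f}")   # 0.150
print(f"pass@10 = {pass_at_k(200, 30, 10):.3f}")
```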
What makes a good eval?
- It should measure something central to intelligence and be grounded in a meaningful task.
- It should be easy to understand.
- It shouldn’t fluctuate based on the prompt.
- It shouldn’t make mistakes.
- It shouldn’t take too much work to run.
- The grading should be correct.
- It should include enough examples for the score to be stable.