18. Mixture-of-Experts

Bryan Tegomoh

2025-04-13

Mixture-of-Experts: The Basics

  • Chapter 18 of the GenAI Handbook
  • Goal: understand how MoE makes LLMs more efficient
  • Key idea: activate only a subset of parameters per token (see the formula after this list)
  • Builds on: scaling laws, pretraining
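
As a minimal sketch of the standard top-k MoE formulation (notation assumed here, not taken from the chapter), an MoE layer computes a gated sum over only the selected experts:

    y = \sum_{i \in \mathrm{TopK}(x)} w_i(x)\, E_i(x)

where E_i are the expert networks, the router scores every expert for token x, w_i are the softmax-normalized scores of the k selected experts, and the unselected experts are skipped entirely, so their parameters cost no compute for that token.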

How MoE Works

  • Unlike dense models (e.g., Llama 3), an MoE model is sparse: only part of it runs per token
  • Experts: specialized feed-forward sub-networks within each MoE layer
  • Router: a small gating network that picks a few experts per token (see the code sketch after this list)
  • Result: large total parameter count, modest per-token compute
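
A minimal PyTorch sketch of this routing pattern, assuming a Mixtral-style top-2-of-8 feed-forward layer (class and parameter names are illustrative, not from the chapter):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        """Sparse MoE feed-forward layer: each token runs through top_k of num_experts."""
        def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts)   # gating network
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):                                      # x: (tokens, d_model)
            logits = self.router(x)                                # (tokens, num_experts)
            weights, idx = torch.topk(logits, self.top_k, dim=-1)  # pick top-k experts per token
            weights = F.softmax(weights, dim=-1)                   # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
                if token_ids.numel() == 0:
                    continue
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
            return out

    x = torch.randn(4, 512)        # 4 tokens
    print(MoELayer()(x).shape)     # torch.Size([4, 512])

Production implementations dispatch tokens to experts in parallel rather than looping over experts, but the loop makes the routing logic explicit: each token touches only its two selected experts.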

[Figure: Mixture of Experts]

Why It Matters

  • Efficiency: less compute per token than a dense model with the same total parameters (though all experts must still be held in memory)
  • Scale: grow capacity (“knowledge”) without a proportional increase in per-token cost
  • Examples: Mixtral (8x7B, 8x22B); GPT-4 is widely rumored to use MoE (see the parameter arithmetic below)
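
Rough parameter arithmetic for a Mixtral-8x7B-style configuration (the split below is an assumption for illustration, not taken from the chapter): with 8 experts and top-2 routing, roughly 47B parameters are stored, but only about 13B are active for any given token.

    # Approximate parameter accounting for a Mixtral-8x7B-like config (illustrative numbers).
    SHARED_PARAMS_B = 1.3   # attention, embeddings, norms (assumed split)
    EXPERT_PARAMS_B = 5.7   # one expert's feed-forward parameters (assumed split)
    NUM_EXPERTS, TOP_K = 8, 2

    total  = SHARED_PARAMS_B + NUM_EXPERTS * EXPERT_PARAMS_B   # stored in memory
    active = SHARED_PARAMS_B + TOP_K * EXPERT_PARAMS_B         # used per token
    print(f"total ~ {total:.0f}B, active per token ~ {active:.0f}B")
    # total ~ 47B, active per token ~ 13B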