22. Direct Preference Optimization Methods

DPO

Learning objectives

  • Continue discussing fine-tuning
  • Introduce Direct Preference Optimization (DPO)
  • Move beyond RLHF

Sources

Following the Gen AI Handbook, we looked at:

Policy Iteration
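Assuming "policy iteration" here refers to the classical tabular algorithm, the following is a minimal sketch on a hypothetical two-state MDP (the transition table below is invented purely for illustration): alternate policy evaluation and greedy policy improvement until the policy stops changing.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s][a] -> list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma, n_states, n_actions = 0.9, 2, 2

def evaluate(policy, tol=1e-8):
    """Policy evaluation: apply Bellman expectation backups until V converges."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def improve(V):
    """Policy improvement: act greedily with respect to the current value function."""
    return [
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ]

policy = [0, 0]
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:  # stable policy => optimal
        break
    policy = new_policy

print("optimal policy:", policy, "values:", V.round(3))
```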

Before: RLHF

RLHF
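For contrast with DPO below, recall the KL-regularized RLHF objective: fine-tune the policy π_θ against a learned reward model r_φ while penalizing divergence from a frozen reference policy π_ref (notation follows Rafailov et al., 2024).

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```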

Now: DPO

DPO

Policy Iteration

DPO math
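The DPO objective from Rafailov et al. (2024) optimizes the same KL-regularized goal directly on preference pairs (x, y_w, y_l), where y_w is the preferred and y_l the dispreferred completion, turning RLHF into a single classification-style loss with no explicit reward model or RL loop:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Here σ is the logistic function and β controls how far the policy may drift from the reference model.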

Benefits

Motivations

DPO on the IMDb dataset

  • Quality data: no need to train a separate reward model (see the loss sketch after this list)
  • Dynamic: can be updated as new preference data arrives
  • Precise: can steer the model away from certain topics
  • Image source: Rafailov et al., 2024
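A minimal PyTorch sketch of the DPO loss above, assuming the per-sequence log-probabilities of the chosen and rejected completions have already been computed under the policy and the frozen reference model (all tensor values below are made up for illustration):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_logratio - ref_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for a batch of 3 preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5, -11.0]),
    policy_rejected_logps=torch.tensor([-13.0, -10.0, -10.5]),
    ref_chosen_logps=torch.tensor([-12.5, -9.8, -11.2]),
    ref_rejected_logps=torch.tensor([-12.8, -9.9, -10.9]),
)
print(loss.item())
```

In practice, libraries such as Hugging Face TRL wrap this objective in a trainer; the sketch only shows the core loss underlying experiments like the IMDb sentiment study in Rafailov et al., 2024.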

Newer Designs

Zephyr models