22. Direct Preference Optimization Methods
Learning objectives
- Continue the discussion of fine-tuning
- Introduce Direct Preference Optimization (DPO)
- Move beyond RLHF
Before: RLHF

RLHF
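As a reference point, the KL-regularized objective that RLHF optimizes with a learned reward model $r_\phi$ (notation follows Rafailov et al., 2024):

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]$$

Here $\pi_{\mathrm{ref}}$ is the supervised fine-tuned reference policy and $\beta$ controls how far the trained policy may drift from it. RLHF typically optimizes this objective with an on-policy RL algorithm such as PPO.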
Now: DPO

DPO
Policy Iteration
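Rather than iterating a policy against this objective, the KL-constrained problem above has a closed-form optimum (a standard step in the DPO derivation):

$$\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big), \qquad Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big).$$

Inverting this expresses the reward through the policy, $r(x, y) = \beta \log \tfrac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$, and the intractable $Z(x)$ cancels whenever rewards are compared pairwise.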

DPO math
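Substituting that reparameterized reward into the Bradley-Terry preference model gives the DPO loss over preference pairs $(x, y_w, y_l)$, where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

This is a simple classification-style loss on log-probability ratios: no reward model is trained and no sampling from the policy is needed during optimization.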
Motivations
DPO on the IMDb dataset
- Quality data: no separate reward model is needed (see the PyTorch sketch below)
- Dynamic: can be updated with new preference data
- Precise: can steer the model away from certain topics
- Image source: Rafailov et al., 2024
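A minimal PyTorch sketch of the DPO loss, assuming per-sequence log-probabilities have already been computed by summing token log-probs of each completion under the trained policy and a frozen reference model (all names here are illustrative, not taken from a particular library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from summed per-sequence log-probabilities.

    Each argument is a 1-D tensor of shape (batch,) holding
    log pi(y | x) for the chosen or rejected completion.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin (Bradley-Terry preference model).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss)
```

The reference log-probabilities enter only as fixed offsets, so the reference model can be evaluated once per batch with gradients disabled.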
Newer Designs
Zephyr models: chat models aligned with distilled DPO (dDPO) on UltraFeedback preference data rather than RLHF