21. Reward Models and RLHF
Learning objectives
- Continue discussing fine-tuning
- Motivate reward models
- Elaborate on RLHF’s role in AI history
Continued Pre-Training

[Figure: continued pre-training]
RL
[Figure: reinforcement learning]
Preference Classification
[Figure: preference classification]
Loss Function
- $s_w = r_\theta(x, y_w)$: reward for the winning response
- $s_\ell = r_\theta(x, y_\ell)$: reward for the losing response
- Goal: minimize the expected loss
$$-\mathbb{E}_x\left[\log \sigma(s_w - s_\ell)\right]$$
That is, the reward model should not assign $s_w \ll s_\ell$; the winning response should receive the higher reward.
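To make the objective concrete, here is a minimal sketch of the pairwise loss above, assuming PyTorch and that `rewards_w` / `rewards_l` are the scalar reward-model outputs for the winning and losing responses in a batch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(rewards_w: torch.Tensor, rewards_l: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -E[log sigmoid(s_w - s_l)].

    rewards_w, rewards_l: shape (batch,), scalar rewards r_theta(x, y_w)
    and r_theta(x, y_l) for the winning and losing responses.
    """
    # logsigmoid is numerically stabler than log(sigmoid(...))
    return -F.logsigmoid(rewards_w - rewards_l).mean()

# Hypothetical usage: the loss shrinks as the winning reward exceeds the losing one.
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([-0.5, 0.8]))
```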
Reinforcement Learning with Human Feedback
RLHF
[Figure: RLHF workflow]
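The workflow figure summarizes the usual pipeline: a reward model scores sampled responses, and the policy is updated with RL (typically PPO) against that score minus a KL penalty that keeps it close to the reference SFT model. A rough sketch of the per-sequence reward, with `kl_coef` and the tensor shapes as assumptions:

```python
import torch

def rlhf_sequence_reward(rm_score: torch.Tensor,
                         policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         kl_coef: float = 0.1) -> torch.Tensor:
    """Scalar reward per sequence used by the RL step.

    rm_score: (batch,) reward-model score for each sampled response.
    policy_logprobs, ref_logprobs: (batch, seq_len) per-token log-probs
    under the current policy and the frozen reference (SFT) model.
    """
    # Penalize divergence from the reference model so the policy does not
    # drift into degenerate outputs that merely exploit the reward model.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return rm_score - kl_coef * kl
```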
Toward DPO
[Figure: toward DPO]
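DPO replaces the explicit reward model and RL loop by optimizing the policy directly on preference pairs. A sketch of the standard DPO objective, assuming the summed per-response log-probabilities have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective on a batch of preference pairs.

    Each argument is the summed log-probability of the winning (w) or losing (l)
    response under the policy or the frozen reference model, shape (batch,).
    """
    # beta-scaled implicit reward margin between winning and losing responses
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```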
Human Feedback (moderation)
[Figure: human feedback for moderation]
Quality of Model Outputs
[Figure: Bai et al., 2022]
Selling Points
[Figure: Schulman, 2023]