Following the Gen AI Handbook, we looked at:
continued pre-training
reinforcement learning
preference classification
\[-\mathbb{E}_{x}\left[\log \sigma\left(s_{w} - s_{\ell}\right)\right]\]
That is, the reward model should not assign \(s_{w} \ll s_{\ell}\); minimizing this loss pushes the score of the preferred response, \(s_{w}\), above that of the rejected one, \(s_{\ell}\).
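A minimal sketch of this pairwise objective in PyTorch, assuming \(s_{w}\) and \(s_{\ell}\) are scalar scores produced by a reward model for the preferred and rejected responses in a batch (the function and variable names here are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(s_w: torch.Tensor, s_l: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -E[log sigmoid(s_w - s_l)].

    s_w: reward-model scores for the preferred (winning) responses
    s_l: reward-model scores for the rejected (losing) responses
    """
    # logsigmoid is numerically stabler than log(sigmoid(.))
    return -F.logsigmoid(s_w - s_l).mean()

# toy usage: random scores stand in for reward-model outputs
s_w = torch.randn(8, requires_grad=True)
s_l = torch.randn(8, requires_grad=True)
loss = reward_pair_loss(s_w, s_l)
loss.backward()  # gradients push s_w up and s_l down
```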
RLHF workflow
toward DPO (a loss sketch follows at the end of this section)
human feedback for moderation
Bai et al., 2022
Schulman, 2023
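Returning to "toward DPO" above: a hedged sketch of the standard DPO objective, which keeps the same \(-\log\sigma(\cdot)\) form as the reward-model loss but replaces the scores with \(\beta\)-scaled log-probability ratios of the policy against a frozen reference model. The argument names are illustrative, and each log-probability is assumed to already be summed over the tokens of its response:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -E[log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l)))].

    Each argument is the summed log-probability of the preferred (w) or
    rejected (l) response under the trained policy or the frozen reference.
    """
    # implicit "rewards" are beta-scaled log-ratios against the reference model
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```

Because the implicit reward is \(\beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\), no separate reward model or RL loop is needed; the preference pairs train the policy directly.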