24. Distillation and Merging
Learning objectives
- Finish the module on fine-tuning
- Learn to train a "student" model from a "teacher" model
Temperature
Another hyperparameter is the softmax temperature $T$, which controls how soft the output distribution is:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
- $T \to 0$: the output approaches a one-hot target vector
- $T \to \infty$: the output approaches a uniform distribution (random guessing)
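As a quick illustration of these limits, here is a minimal numpy sketch of the temperature-scaled softmax above; the logits and temperature values are illustrative, not from the lecture.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, T: float) -> np.ndarray:
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = logits / T
    scaled -= scaled.max()          # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, T=0.1))    # near one-hot
print(softmax_with_temperature(logits, T=1.0))    # standard softmax
print(softmax_with_temperature(logits, T=100.0))  # near uniform
```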
KL Loss
The Kullback–Leibler (KL) divergence loss is defined as

$$\mathrm{KL}(p \,\|\, q) = \mathbb{E}_p\!\left[\log \frac{p}{q}\right] = \sum_i p_i \log(p_i) - \sum_i p_i \log(q_i)$$
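A minimal sketch of this formula in code, treating $p$ as the teacher's distribution and $q$ as the student's; the probability vectors are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) = sum_i p_i * (log p_i - log q_i)."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.7, 0.2, 0.1])   # teacher probabilities
q = np.array([0.6, 0.3, 0.1])   # student probabilities
print(kl_divergence(p, q))       # >= 0, and 0 only when p == q
```

Note that KL divergence is asymmetric: $\mathrm{KL}(p \,\|\, q) \neq \mathrm{KL}(q \,\|\, p)$ in general.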
Encoders vs Decoders
![BERT vs GPT]()
Applications
BERT
- text classification
- data labeling
- recommender systems
- sentiment analysis
GPT
- content generation
- conversational chatbots
Fine Tuning
![Distillation motivation]()
Teacher and Student
![Teacher and student]()
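Putting the temperature and KL loss together, below is a minimal PyTorch sketch of one teacher–student distillation step. The model objects, dataloader inputs, and the `alpha`/`T` values are illustrative assumptions, not a specific published recipe.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, x, labels, T=2.0, alpha=0.5):
    # Teacher provides soft targets; its weights are frozen for this step.
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)

    student_logits = student(x)

    # Soft-target loss: KL between temperature-scaled teacher and student
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard-target loss: standard cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```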
Chain of Thought
![Chain of thought]()