Following the Gen AI Handbook, we looked at knowledge distillation and the BERT and GPT model families.
Another hyperparameter is the softmax temperature $T$, which controls how sharp or flat the output distribution is:
\[p_{i} = \frac{\exp(z_{i}/T)}{\sum_{j} \exp(z_{j}/T)}\]
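A minimal NumPy sketch of this formula (the logit values are made up for illustration): raising $T$ flattens the distribution, lowering it sharpens it.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [3.0, 1.0, 0.2]
print(softmax_with_temperature(logits, T=1.0))  # peaked distribution
print(softmax_with_temperature(logits, T=5.0))  # softer, closer to uniform
```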
The Kullback-Leibler (KL) divergence loss between distributions $p$ and $q$ is defined as
\[\begin{aligned} \mathrm{KL}(p \,\|\, q) &= \mathrm{E}_{p}\!\left[\log \frac{p}{q}\right] \\ &= \sum_{i} p_{i} \log p_{i} - \sum_{i} p_{i} \log q_{i} \end{aligned}\]
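Translating the definition directly into code, assuming $p$ and $q$ are already valid probability vectors (the small `eps` only guards against taking the log of zero; the example values are illustrative):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * (log p_i - log q_i)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

p = [0.7, 0.2, 0.1]  # e.g. softened teacher probabilities (illustrative)
q = [0.5, 0.3, 0.2]  # e.g. student probabilities (illustrative)
print(kl_divergence(p, q))  # always >= 0, and 0 only when p == q
```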
BERT vs GPT: BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model that sees context in both directions, while GPT (Generative Pre-trained Transformer) is a decoder-only model that generates text left to right.
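A quick way to see the difference is to run both model families through the Hugging Face transformers pipeline API (assumed here purely for illustration; it downloads the public bert-base-uncased and gpt2 checkpoints): BERT fills in a masked token using context on both sides, while GPT continues a prompt from left to right.

```python
from transformers import pipeline

# BERT-style (encoder-only): predict a masked token using bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])

# GPT-style (decoder-only): generate text autoregressively, left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```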
Motivation for distillation: a large, accurate teacher model is expensive to run, so its knowledge is transferred to a smaller student model. The student is trained to match the teacher's temperature-softened output distribution, which carries more information than the hard labels alone.
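A sketch of how the teacher-student objective is commonly implemented (the standard Hinton-style recipe, not code from the session): the softened KL term is blended with ordinary cross-entropy on the true labels. The temperature `T`, the weight `alpha`, and the random tensors are illustrative choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the soft-target KL term with hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student) on the softened distributions; the T**2 factor
    # keeps gradient magnitudes comparable across temperatures.
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce

# Illustrative shapes: a batch of 4 examples with 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

DistilBERT's actual training objective also adds a masked-language-modelling term and a cosine loss between teacher and student hidden states, which this sketch omits.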
Chain-of-thought reasoning, where the model produces intermediate reasoning steps before its final answer.
DistilBERT performance: DistilBERT is reported to be about 40% smaller and 60% faster than BERT-base while retaining roughly 97% of its language-understanding performance.
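To make the size difference concrete, a small sketch that loads both checkpoints with the Hugging Face transformers library (an assumption; the session may have used different tooling) and compares parameter counts:

```python
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

def n_params(model):
    """Total number of trainable and frozen parameters."""
    return sum(p.numel() for p in model.parameters())

print(f"BERT parameters:       {n_params(bert):,}")
print(f"DistilBERT parameters: {n_params(distilbert):,}")
```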