Ch 34: Linear Representation Hypotheses (10 min)

Original Google Slides (well-formatted) Quarto version may have some slides missing or poorly formatted

The “linear representation hypothesis” in language models suggests that high-level semantic concepts are represented as directions in a model’s representation space. This implies that understanding and manipulating these concepts might be achieved by moving along these directions. For example, if a model understands the concept of “gender,” it might represent “king” and “queen” as points in a space where the difference between them is a linear direction corresponding to the concept of “gender”.

34. Linear Representation Hypotheses