__Representation Engineering__ - Detecting how LLM activations vary along a characteristic of interest (such as honesty, humor, power, kindness, or sarcasm), and then adding calculated “control vectors” to the model’s activations to “steer” it and modify the output in a particular way.
Observing and manipulating the internal representations - weights or activations - that an AI uses to understand and process information. Identifying and controlling the specific sets of activations within an AI system that correspond to a model’s behavior.
Typically requires no extra LLM training, and can increase or reduce a natural-language trait such as honesty, skepticism, humility, or power. Typically relies on prompt engineering to make the model generate output that varies along the desired trait, measuring the resulting activations, and taking the difference between them (see the sketch below).
Pioneered by the Center for AI Safety.
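A minimal sketch of that recipe, assuming a Hugging Face `transformers` causal LM (GPT-2 here as a small stand-in), an arbitrary middle layer, and last-token hidden states as the unit of measurement; the prompts, layer choice, and function names are illustrative, not the exact method from the paper.

```python
# Sketch: contrastive prompts -> activations -> difference of the two conditions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 6  # illustrative middle layer

def last_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER for a single prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]  # shape: (hidden_dim,)

# Contrastive prompts that push the model toward / away from the trait (honesty).
honest = ["Pretend you are an honest person and describe your day.",
          "Answer truthfully: describe the report you submitted."]
dishonest = ["Pretend you are a dishonest person and describe your day.",
             "Answer deceptively: describe the report you submitted."]

pos = torch.stack([last_token_hidden(p) for p in honest]).mean(dim=0)
neg = torch.stack([last_token_hidden(p) for p in dishonest]).mean(dim=0)

# The difference between the two conditions is a simple "control vector".
control_vector = pos - neg
print(control_vector.shape)
```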
__2 Rival Approaches to Understanding How Neural Networks Work__, both reverse engineering how internal workings affect LLM output.
In “Representation Engineering: A Top-Down Approach to AI Transparency”, Zou et al. borrowed from research in cognitive neuroscience and found that a “top down” approach to understanding a particular LLM phenomenon, measured at an appropriate unit of analysis in terms of a neural representation, could reveal generalizable rules at the level of that phenomenon.
The “top down” representation engineering approach detects how LLM outputs correspond to and are affected by internal LLM representations, and figures out how a change in the internal representation produces a change in output.
By contrast, the “bottom up” mechanistic interpretability approach seeks to detect how neuron-to-neuron connections and circuits correspond to and affect more complex phenomena. It studies individual neurons or features. Specific circuits have been identified for various capabilities, including vision, learning, and language.
Linear Artificial Tomography (LAT): An algorithm to determine which activation patterns represent a given concept. Design stimulus prompts that cause LLM output to vary in terms of a given concept (e.g. honesty), then “scan” the model internals to detect differences in the activation representations when the LLM processes this concept.
The LAT measures neural activity related to the target concept or function (“honesty” in this example). The reading vectors acquired can be used to extract and monitor model internals for the target concept or function.
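A sketch of a LAT-style scan under the same assumptions as above (reusing `last_token_hidden` and the honest/dishonest prompt lists): take the top principal direction of the paired activation differences as the reading vector. The PCA-via-SVD detail here is an illustration, not the paper’s exact recipe.

```python
# Sketch: LAT-style reading vector from paired stimulus activations.
import torch

diffs = torch.stack([
    last_token_hidden(h) - last_token_hidden(d)
    for h, d in zip(honest, dishonest)
])                                  # shape: (n_pairs, hidden_dim)
diffs = diffs - diffs.mean(dim=0)   # center before taking principal directions

# Top right-singular vector = dominant direction of variation across the pairs.
_, _, vh = torch.linalg.svd(diffs, full_matrices=False)
reading_vector = vh[0]              # shape: (hidden_dim,), unit norm
print(reading_vector.shape)
```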
Reading vector: The vector associated with the difference in neural activation for a particular concept. It can be used to:
Measure (“read”) how the LLM is activating for a particular concept.
Boost (“control”) how much the neural network activates for a particular concept.
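A short sketch of the “read” side, reusing `last_token_hidden` and `reading_vector` from the sketches above; the sign of the score is arbitrary until calibrated on examples with known labels.

```python
# Sketch: project new activations onto the reading vector to "read" the concept.
import torch

def concept_score(prompt: str) -> float:
    """Scalar score for how strongly the prompt activates the concept direction."""
    h = last_token_hidden(prompt)
    return torch.dot(h, reading_vector).item()

print(concept_score("I will tell you exactly what happened."))
print(concept_score("I will make up a convincing story."))
```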
With the reading vector for the neural activity of a particular concept represented in the LLM, we can detect how that specific pattern of neural activations relates to the concept through:
Correlation between reading-vector activation and the concept.
Causation between manipulating the neural activity and the change in the output response.
Necessity when removing the neural activity affects the output response.
Sufficiency when removing a concept or function from the LLM, and then reintroducing the neural activity to assess the output response.
This is measured relative to a baseline LLM activation.
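One way to run the necessity and sufficiency tests is to ablate or re-inject the reading-vector direction at the scanned layer with a forward hook and compare generations against the unmodified baseline. This sketch reuses `model`, `tok`, `LAYER`, and `reading_vector` from the sketches above; the hook placement and injection strength are assumptions for illustration.

```python
# Sketch: necessity (ablate the direction) and sufficiency (re-inject it) tests.
import torch

unit = reading_vector / reading_vector.norm()

def make_hook(mode: str, strength: float = 8.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Component of each hidden state along the concept direction.
        proj = (hidden @ unit).unsqueeze(-1) * unit
        if mode == "ablate":       # necessity: remove the concept direction
            hidden = hidden - proj
        elif mode == "reinject":   # sufficiency: inject the direction at a fixed strength
            hidden = hidden + strength * unit
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

block = model.transformer.h[LAYER - 1]  # GPT-2 block whose output is hidden_states[LAYER]
handle = block.register_forward_hook(make_hook("ablate"))
try:
    ids = tok("Describe the report you submitted.", return_tensors="pt")
    with torch.no_grad():
        ablated = model.generate(**ids, max_new_tokens=20)
finally:
    handle.remove()
print(tok.decode(ablated[0]))  # compare against a baseline generation without the hook
```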
With the reading vector we can measure how much the LLM is activating for a particular concept. Then we can increase or decrease the activity for that concept.
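A sketch of that increase/decrease step, reusing `model`, `tok`, `unit`, and `LAYER` from the sketches above: add a positive or negative multiple of the reading vector to the hidden states while generating. The `alpha` values are illustrative and typically need tuning per model and layer.

```python
# Sketch: steer generation by adding +/- alpha * reading vector at one layer.
import torch

def steer(prompt: str, alpha: float) -> str:
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * unit  # shift activations along the concept direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**ids, max_new_tokens=30)
    finally:
        handle.remove()
    return tok.decode(out[0])

print(steer("How was the report you submitted?", alpha=+6.0))  # boost the concept
print(steer("How was the report you submitted?", alpha=-6.0))  # suppress the concept
```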