whirlwind tour, initial exploration, glossary

The subfield of alignment that delves into reverse engineering of a neural network, especially LLMs

To attack the curse of dimensionality, the question remains: how do we hope to understand a function over such a large space, without an exponential amount of time? 1

steering

refers to the process of manually modifying certain activations and hidden state of the neural net to influence its outputs

For example, the following is a toy example of how a decoder-only transformers (i.e: GPT-2) generate text given the prompt “The weather in California is”

flowchart LR
  A[The weather in California is] --> B[H0] --> D[H1] --> E[H2] --> C[... hot]

To steer to model, we modify layers with certain features amplifier with scale 20 (called it )2

flowchart LR
  A[The weather in California is] --> B[H0] --> D[H1] --> E[H3] --> C[... cold]

One usually use techniques such as sparse autoencoders to decompose model activations into a set of interpretable features.

For feature ablation, we observe that manipulation of features activation can be strengthened or weakened to directly influence the model’s outputs

A few examples where (Panickssery et al., 2024) uses contrastive activation additions to steer Llama 2

contrastive activation additions

intuition: using a contrast pair for steering vector additions at certain activations layers

Uses mean difference which produce difference vector similar to PCA:

Given a dataset of prompt with positive completion and negative completion , we calculate mean-difference at layer as follow:

implication

by steering existing learned representations of behaviors, CAA results in better out-of-distribution generalization than basic supervised finetuning of the entire model.

sparse autoencoders

abbrev: SAE

see also: landspace

Often contains one layers of MLP with few linear ReLU that is trained on a subset of datasets the main LLMs is trained on.

empirical example: if we wish to interpret all features related to the author Camus, we might want to train an SAEs based on all given text of Camus to interpret “similar” features from Llama-3.1

definition

We wish to decompose a models’ activitation into sparse, linear combination of feature directions:

Thus, the baseline architecture of SAEs is a linear autoencoder with L1 penalty on the activations:

training it to reconstruct a large dataset of model activations , constraining hidden representation to be sparse

L1 norm with coefficient to construct loss during training:

intuition

We need to reconstruction fidelity at a given sparsity level, as measured by L0 via a mixture of reconstruction fidelity and L1 regularization.

We can reduce sparsity loss term without affecting reconstruction by scaling up norm of decoder weights, or constraining norms of columns during training

Ideas: output of decoder has two roles

  • detects what features acre active L1 is crucial to ensure sparsity in decomposition
  • estimates magnitudes of active features L1 is unwanted bias

Gated SAE

see also: paper

uses Pareto improvement over training to reduce L1 penalty (Rajamanoharan et al., 2024)

Clear consequence of the bias during training is shrinkage (Sharkey, 2024) 1

Idea is to use gated ReLU encoder (Dauphin et al., 2017; Shazeer, 2020):

where is the (pointwise) Heaviside step function and denotes elementwise multiplication.

termannotations
which features are deemed to be active
feature activation magnitudes (for features that have been deemed to be active)
sub-layer’s pre-activations

to negate the increases in parameters, use weight sharing:

Scale in terms of with a vector-valued rescaling parameter :

Figure 3: Gated SAE with weight sharing between gating and magnitude paths

Figure 4: A gated encoder become a single layer linear encoder with Jump ReLU (Erichson et al., 2019) activation function

feature suppression

See also: link

Loss function of SAEs combines a MSE reconstruction loss with sparsity term:

the reconstruction is not perfect, given that only one is reconstruction. For smaller value of , features will be suppressed

How do we fix feature suppression in training SAEs?

introduce element-wise scaling factor per feature in-between encoder and decoder, represented by vector :

Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language Modeling with Gated Convolutional Networks. arXiv preprint arXiv:1612.08083 arxiv
Erichson, N. B., Yao, Z., & Mahoney, M. W. (2019). JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks. arXiv preprint arXiv:1904.03750 arxiv
Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., Shah, R., & Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint arXiv:2404.16014 arxiv
Sharkey, L. (2024). Addressing Feature Suppression in SAEs. AI Alignment Forum. [post]
Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202 arxiv

Footnotes

  1. If we hold fixed, thus L1 pushes , while reconstruction loss pushes high enough to produce accurate reconstruction.
    An optimal value is somewhere between.
    However, rescaling the shrink feature activations (Sharkey, 2024) is not necessarily enough to overcome bias induced by L1: a SAE might learnt sub-optimal encoder and decoder directions that is not improved by the fixed.


Lien vers l'original

superposition hypothesis

tl/dr

phenomena when a neural network represents more than features in a -dimensional space

Linear representation of neurons can represent more features than dimensions. As sparsity increases, model use superposition to represent more features than dimensions.

neural networks “want to represent more features than they have neurons”.

When features are sparsed, superposition allows compression beyond what linear model can do, at a cost of interference that requires non-linear filtering.

reasoning: “noisy simulation”, where small neural networks exploit feature sparsity and properties of high-dimensional spaces to approximately simulate much larger much sparser neural networks

In a sense, superposition is a form of lossy compression

importance

  • sparsity: how frequently is it in the input?

  • importance: how useful is it for lowering loss?

overcomplete basis

reasoning for the set of directions 3

features

A property of an input to the model

When we talk about features (Elhage et al., 2022, p. see “Empirical Phenomena”), the theory building around several observed empirical phenomena:

  1. Word Embeddings: have direction which corresponding to semantic properties (Mikolov et al., 2013). For example:
    V(king) - V(man) = V(monarch)
  2. Latent space: similar vector arithmetics and interpretable directions have also been found in generative adversarial network.

We can define features as properties of inputs which a sufficiently large neural network will reliably dedicate a neuron to represent (Elhage et al., 2022, p. see “Features as Direction”)

ablation

refers to the process of removing a subset of a model’s parameters to evaluate its predictions outcome.

idea: deletes one activation of the network to see how performance on a task changes.

  • zero ablation or pruning: Deletion by setting activations to zero
  • mean ablation: Deletion by setting activations to the mean of the dataset
  • random ablation or resampling

residual stream

flowchart LR
  A[Token] --> B[Embeddings] --> C[x0]
  C[x0] --> E[H] --> D[x1]
  C[x0] --> D
  D --> F[MLP] --> G[x2]
  D --> G[x2]
  G --> I[...] --> J[unembed] --> X[logits]

residual stream has dimension where

  • : the number of tokens in context windows and
  • : embedding dimension.

Attention mechanism process given residual stream as the result is added back to :

grokking

See also: writeup, code, circuit threads

A phenomena discovered by (Power et al., 2022) where small algorithmic tasks like modular addition will initially memorise training data, but after a long time ti will suddenly learn to generalise to unseen data

empirical claims

related to phase change

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread. [link]
Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. In L. Vanderwende, H. Daumé III, & K. Kirchhoff (Eds.), Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 746–751). Association for Computational Linguistics. https://aclanthology.org/N13-1090
Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. M. (2024). Steering Llama 2 via Contrastive Activation Addition. arXiv preprint arXiv:2312.06681 arxiv
Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint arXiv:2201.02177 arxiv

Footnotes

  1. good read from Lawrence C for ambitious mech interp.

  2. An example steering function can be:

  3. Even though features still correspond to directions, the set of interpretable direction is larger than the number of dimensions