April 1, 2026

Watch a Language Model Think

I used sparse autoencoders to look inside GPT-2. I found interpretable concepts, feature loops, and a temperature feature that fires correctly while the model gets the answer wrong.

Tags: interpretability, SAE, interactive

GPT-2 has a feature that activates for temperature. It fires correctly on "Water boils at a temperature of." The model still outputs 1,000 degrees Fahrenheit.

I used TransformerLens and SAELens to extract sparse autoencoder features from GPT-2 small (124M parameters). I tested 18 prompts, scored them for feature interpretability and attribution coverage, and kept the four that produced the clearest findings. Click any generated token below to see what fired inside.

Features persist across the sequence

When you click through Step 1 above, the temperature feature (#19375) fires on "about" and stays active through "degrees," "Fahrenheit," and "Celsius." It drops on structural tokens like parentheses and punctuation, then returns. The feature has a lifetime, not just a position.

Activation of feature #19375 ("temperature in degrees F and C") across the generated sequence. The feature persists but drops on structural tokens.
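One way to make "lifetime" concrete is to list the contiguous token spans where a feature is active. The sketch below is a minimal illustration, not the code behind the chart. The helper name `active_spans`, the threshold, the token list, and the activation values are all made up for the example. In a real run, the per-token activations would come from something like `feature_acts[0, :, 19375]` in the script under "Reproduce this", computed over the prompt plus the generated text.

```python
from typing import List, Tuple


def active_spans(activations: List[float], threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs where activation > threshold."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(activations):
        if value > threshold and start is None:
            start = i  # feature switches on here
        elif value <= threshold and start is not None:
            spans.append((start, i - 1))  # feature was on up to the previous token
            start = None
    if start is not None:
        spans.append((start, len(activations) - 1))  # still on at the end
    return spans


# Hypothetical tokens and activations, for illustration only (not values from the post).
tokens = ["about", "212", "degrees", "Fahrenheit", "(", "100", "degrees", "Celsius", ")"]
acts = [4.1, 3.0, 6.2, 5.8, 0.0, 2.7, 5.9, 5.5, 0.0]

for start, end in active_spans(acts):
    print(start, end, tokens[start:end + 1])
```

A feature that only marked a position would produce many one-token spans. A feature with a lifetime produces a few long spans separated by the structural tokens where it drops out.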

This is what "circuit tracing" looks like at the simplest level. A concept activates in the residual stream and propagates forward, influencing multiple output tokens. The tools to trace these circuits exist (Anthropic open-sourced theirs), but they only work on about 25% of prompts. For the rest, the circuits are too tangled to follow.

Feature loops explain repetition

Step 4 is the most interesting finding. When GPT-2 generates "vida de la vida de la vida de la..." from "Do re mi fa sol la," it is not random repetition. Four SAE features cycle in lockstep.

Four features take turns firing across three repetition cycles. Each feature fires on exactly one token in the pattern, then goes silent. Activations barely decay.

Feature #3900 ("Spanish la phrases") fires at 35.8, then 46.6, then 43.3. There is no internal signal telling the model to stop. Repetition in small language models is a feature loop with no exit condition. The SAE shows you the mechanism, not just the symptom.
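A loop like this can be checked mechanically. The sketch below assumes you have already reduced each generated token to the ID of its strongest feature, and then looks for the shortest period at which that ID sequence repeats. The function name `shortest_period` is hypothetical. ID 3900 is the feature named above, and the other three IDs are placeholders. The three peak values are the ones quoted above for #3900.

```python
from typing import List, Optional


def shortest_period(ids: List[int], min_repeats: int = 2) -> Optional[int]:
    """Smallest p with ids[i] == ids[i - p] for every i >= p, or None.

    Only periods short enough to fit `min_repeats` full cycles are considered.
    """
    n = len(ids)
    for p in range(1, n // min_repeats + 1):
        if all(ids[i] == ids[i - p] for i in range(p, n)):
            return p
    return None


# 3900 is the feature from the post; 1, 2, 3 are placeholder IDs.
top_feature_per_token = [3900, 1, 2, 3] * 3
print("period:", shortest_period(top_feature_per_token))

# Peak activations of feature #3900 on successive cycles, as reported above.
peaks = [35.8, 46.6, 43.3]
ratios = [later / earlier for earlier, later in zip(peaks, peaks[1:])]
print("cycle-to-cycle ratios:", [round(r, 2) for r in ratios])
```

If the loop were dying out, the cycle-to-cycle ratios would sit clearly below 1 on every cycle. Ratios that hover around 1 are what "no exit condition" looks like in numbers.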

Confidence and attribution disagree

The features say one thing. The model's confidence says another. When GPT-2 generates "hill" after "The big red ball rolled down the," the features push hard: 545% attribution coverage. But the model is only 6.4% confident. That gap is the signal.

Each dot is a generated token. X-axis: how confident the model was. Y-axis: how much the top features explain. When features push hard but confidence is low, other forces are pulling the output elsewhere.

High attribution coverage with low confidence means the features are pushing strongly, but other features (or the residual error term) are pushing back. The model is "deliberating." Low attribution coverage with high confidence means the model is sure, but the features you can see do not explain why. Both cases are problems for anyone trying to interpret model behavior.
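The post does not spell out how attribution coverage is computed, so the following is only one plausible reading, not the scoring code used here. It assumes each feature's direct contribution to the chosen token's logit is already known (for example, activation times the dot product of the feature's decoder direction with the token's unembedding vector, ignoring the final LayerNorm). Coverage is then the sum of the top-k contributions divided by the token's actual logit. The function names and all numbers are hypothetical.

```python
import math
from typing import Sequence


def softmax_prob(logits: Sequence[float], index: int) -> float:
    """Probability assigned to `index` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[index] / sum(exps)


def attribution_coverage(contributions: Sequence[float], target_logit: float, k: int = 3) -> float:
    """Sum of the k largest contributions divided by the target logit (assumed definition)."""
    if target_logit == 0:
        raise ValueError("target_logit must be non-zero")
    top_k = sorted(contributions, reverse=True)[:k]
    return sum(top_k) / target_logit


# Hypothetical per-feature contributions to the chosen token's logit.
# Three features push up, two push back down; the net logit is their sum.
contributions = [6.0, 3.0, 2.0, -5.0, -4.0]
target_logit = sum(contributions)

# Hypothetical logits over a tiny 4-token vocabulary; index 0 is the chosen token.
vocab_logits = [target_logit, 4.0, 3.5, 3.0]

print(f"coverage:   {attribution_coverage(contributions, target_logit):.0%}")
print(f"confidence: {softmax_prob(vocab_logits, 0):.1%}")
```

Under this definition, coverage above 100% is not a bug. It means the top features alone would produce a larger logit than the model actually emitted, so something else must be subtracting from it.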

What attention showed (and didn't)

Before SAE features, I tried raw attention. I exported SmolLM2-135M to ONNX with attention outputs, ran it in the browser at 7 tokens per second, and tested the same 18 prompts with an automated scoring system.

None of the prompts produced readable attention patterns. Layer 0 attended to function words like "the" and "is." Deeper layers dumped 50-80% of attention on the first token position, a known transformer artifact often called an attention sink. Attention weights tell you where the model looks, not why it chose a particular output. This is one reason the interpretability community moved toward sparse autoencoders.
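The first-token share is easy to measure once you have an attention matrix for a single head. The sketch below is a minimal version, assuming `attn[q][k]` is the weight that query position q places on key position k, with rows that sum to 1 under a causal mask. The function name and the 3x3 matrix are hypothetical, and the post's automated scoring system may have measured this differently.

```python
from typing import List


def first_token_share(attn: List[List[float]]) -> float:
    """Mean attention weight on key position 0, averaged over query positions 1..n-1."""
    # Query 0 is skipped: under a causal mask it can only attend to itself,
    # so it would always report 100% and inflate the average.
    rows = attn[1:]
    if not rows:
        raise ValueError("need at least two query positions")
    return sum(row[0] for row in rows) / len(rows)


# Hypothetical attention pattern for one head over three tokens.
attn = [
    [1.0, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.6, 0.1, 0.3],
]

print(f"first-token share: {first_token_share(attn):.0%}")
```

A head where this number is high is mostly parking its attention on the first position, which says little about which earlier tokens mattered for the output.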

What this means

Interpretability tools work. A temperature feature fires on temperature text. A negation feature fires on "not." These are real directions in the model's representation space, not post-hoc rationalizations.

But seeing concepts is not understanding behavior. The temperature feature fires and the model still outputs the wrong temperature. Features compete in ways that are hard to predict. Superposition means the model packs in more features than it has neurons, so each neuron ends up shared among unrelated concepts. And when the model gets stuck, the features just cycle with no mechanism to break free.

Neel Nanda, who created TransformerLens and leads mechanistic interpretability at Google DeepMind, put it this way: the most ambitious vision of mechanistic interpretability is probably dead. What remains is useful but partial. Picking up 90% of the concepts 90% of the time helps, as long as we pair it with other techniques.

For practitioners: these tools are better at explaining failures than verifying correctness. If your model does something wrong, features can tell you why. If your model does something right, they cannot promise it will keep being right.

Reproduce this

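import torch
import transformer_lens as tl
from sae_lens import SAE

# Keep the model and the SAE on the same device to avoid a mismatch.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load GPT-2 small with TransformerLens hooks attached.
model = tl.HookedTransformer.from_pretrained("gpt2-small", device=device)

# Load the residual-stream SAE for layer 8.
# Note: the tuple unpacking matches SAELens versions where from_pretrained
# returns (sae, cfg, sparsity); newer releases may return only the SAE.
sae, cfg, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device=device,
)

prompt = "Water boils at a temperature of"

# Run the prompt once and cache intermediate activations (no gradients needed).
with torch.no_grad():
    logits, cache = model.run_with_cache(prompt)
    residual = cache["blocks.8.hook_resid_pre"]  # [batch, position, d_model]
    feature_acts = sae.encode(residual)          # [batch, position, n_features]

# Ten most active features at the last prompt token.
top_features = feature_acts[0, -1].topk(10)
for val, idx in zip(top_features.values, top_features.indices):
    print(f"Feature #{idx.item()}: activation {val.item():.1f}")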