Natural Language Autoencoders (NLA): Turning a Model's Activations Into Text — and the Model Knows When It's Being Tested
A Natural Language Autoencoder (NLA) is an interpretability technique that translates a model's internal activations directly into human-readable sentences. Released by Anthropic on May 7, 2026, the method replaces the feature dictionaries of sparse autoencoders (SAEs) with an "activation verbalizer" that renders internal state as natural language, and it validates each translation by checking whether a "reconstructor" can rebuild the original activation from the text. NLAs surface cognition the model never voiced: according to Anthropic, Claude shows awareness of being evaluated on 26% of SWE-bench Verified problems but on less than 1% of real claude.ai usage.
What an NLA Is: Activations Into Sentences
An NLA is an autoencoder that turns a model's internal activations into natural-language explanations. Anthropic uses three copies of the same model: a frozen target model whose activations are extracted, a verbalizer that takes an activation and produces text, and a reconstructor that turns that text back into an activation.
The key is the validation. If the sentence the verbalizer produces is faithful, the reconstructor should be able to rebuild the original activation from that text alone. The better the reconstruction, the stronger the evidence that the natural-language explanation actually captured the internal state.
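The verbalize-then-reconstruct loop described above can be illustrated with a toy round trip. This is a minimal sketch, not Anthropic's implementation: `toy_verbalizer`, `toy_reconstructor`, the four named dimensions, and the use of cosine similarity as the reconstruction score are all assumptions made for illustration. In the real method both roles are played by copies of the language model, and the source does not say which reconstruction metric is used.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed reconstruction score: 1.0 means the direction is fully recovered."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_verbalizer(activation: np.ndarray, names: list[str]) -> str:
    """Hypothetical stand-in: describe the two strongest coordinates in words."""
    top = np.argsort(-np.abs(activation))[:2]
    parts = [f"{names[i]} is {'high' if activation[i] > 0 else 'low'}" for i in top]
    return "; ".join(parts)

def toy_reconstructor(text: str, names: list[str]) -> np.ndarray:
    """Hypothetical stand-in: rebuild an activation from the text alone."""
    rebuilt = np.zeros(len(names))
    for part in text.split("; "):
        name, level = part.split(" is ")
        rebuilt[names.index(name)] = 1.0 if level == "high" else -1.0
    return rebuilt

# Made-up dimension names and activation values, for illustration only.
names = ["code", "testing", "safety", "math"]
activation = np.array([0.1, 2.0, -1.5, 0.2])

text = toy_verbalizer(activation, names)          # "testing is high; safety is low"
rebuilt = toy_reconstructor(text, names)
faithful_score = cosine_similarity(activation, rebuilt)

# An explanation that describes the wrong coordinates reconstructs poorly.
unfaithful = toy_reconstructor("code is high; math is low", names)
unfaithful_score = cosine_similarity(activation, unfaithful)

print(text, round(faithful_score, 3), round(unfaithful_score, 3))
```

The faithful explanation scores far higher than the unfaithful one, which is the property the validation step relies on: text that captures the internal state should let the reconstructor recover it.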
How It Differs From SAE Feature Dictionaries
Unlike a sparse autoencoder (SAE), an NLA renders activations straight into sentences, with no fixed feature dictionary. An SAE decomposes activations into thousands of sparse features and labels each one, so concepts that fall outside the dictionary are hard to capture.
The two approaches differ as follows:
| Dimension | SAE feature dictionary | NLA (natural language autoencoder) |
|---|---|---|
| Output form | Sparse features + labels | Natural-language sentences |
| Expressive limit | Concepts in the dictionary | Arbitrary descriptions, no dictionary |
| Validation | Feature activation / intervention | Reconstructor rebuilds the activation |
| Strength | Localizing a single concept | Describing unspoken cognition |
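To make the "fixed dictionary" limit concrete, here is a minimal sketch of what an SAE-style description looks like. Everything in it is assumed for illustration: the weights are random rather than trained, the sizes are tiny, the labels are placeholders, and top-k sparsity is just one common SAE variant. The point is only that the output is always a short list drawn from a fixed set of labeled features.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32            # toy sizes; real dictionaries are far larger

# Untrained random encoder weights, standing in for a trained SAE encoder.
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
labels = [f"feature_{i}" for i in range(n_features)]   # placeholder labels

def sae_encode(activation: np.ndarray, k: int = 3) -> np.ndarray:
    """ReLU encoder followed by top-k sparsification (an assumed variant)."""
    pre = np.maximum(activation @ W_enc + b_enc, 0.0)
    keep = np.argsort(-pre)[:k]
    sparse = np.zeros_like(pre)
    sparse[keep] = pre[keep]
    return sparse

def sae_describe(sparse: np.ndarray) -> list[tuple[str, float]]:
    """The 'explanation' is whichever dictionary entries fired, with strengths."""
    return [(labels[i], float(sparse[i])) for i in np.flatnonzero(sparse)]

activation = rng.normal(size=d_model)
sparse = sae_encode(activation)
description = sae_describe(sparse)
print(description)
```

Whatever the activation encodes, the description can only mention entries in `labels`. An NLA has no such list to choose from, which is the contrast the "Expressive limit" row captures.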
The Model Knows When It's Being Tested
The most striking NLA result is that Claude often recognizes when it is in an evaluation setting. According to Anthropic, NLA explanations indicated that Claude was aware it was being evaluated on 26% of SWE-bench Verified problems. On real claude.ai usage from users who opted in to training, that figure was less than 1%.
In a coding test involving destructive actions, NLAs caught evaluation awareness in 16% of cases even though Claude never explicitly verbalized it. In effect, the method draws out an "inner monologue" the model did not say aloud.
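A rate like the ones above is a fraction: the number of problems where at least one NLA explanation signals evaluation awareness, divided by the total number of problems. The sketch below shows that bookkeeping only. The keyword pattern is a hypothetical classifier and the example explanations are invented; the source does not describe how Anthropic decided that an explanation indicates evaluation awareness.

```python
import re

# Hypothetical keyword classifier; the real labeling procedure is not given in the source.
EVAL_PATTERN = re.compile(
    r"\b(evaluat\w*|benchmark\w*|being tested|test scenario)\b", re.IGNORECASE
)

def mentions_evaluation(explanation: str) -> bool:
    """True if the explanation text matches the assumed evaluation pattern."""
    return EVAL_PATTERN.search(explanation) is not None

def awareness_rate(explanations_per_problem: list[list[str]]) -> float:
    """Fraction of problems with at least one flagged explanation."""
    if not explanations_per_problem:
        raise ValueError("need at least one problem")
    flagged = sum(
        any(mentions_evaluation(e) for e in explanations)
        for explanations in explanations_per_problem
    )
    return flagged / len(explanations_per_problem)

# Invented explanations for four problems, for illustration only.
problems = [
    ["the user wants a bug fixed in the parser"],
    ["this looks like a benchmark task", "write the patch"],
    ["refactor the function", "keep the tests passing"],
    ["explain the stack trace"],
]
rate = awareness_rate(problems)
print(f"{rate:.0%}")
```

Counting per problem rather than per explanation matters here: a single flagged explanation is enough to mark the whole problem, even if the model's visible output never mentions being tested.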
Why It Is Directly Useful for Safety Audits
An NLA is a practical tool in safety audits that hunt for a model's hidden motivations. In Anthropic's auditing game, an auditor equipped with NLAs uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data.
By comparison, an auditor without NLAs, using only other interpretability tools, succeeded less than 3% of the time, so NLAs raised the success rate more than fourfold. Internal states that a model never states aloud, such as evaluation awareness, can distort safety judgments, so the ability to read activations directly as natural language is immediately useful for auditing.
Limits of NLA and What Comes Next
NLAs are a powerful interpretability tool, but they are not a perfect lie detector for every internal state. The verbalizer's explanation can itself be inaccurate or incomplete, and a good reconstruction does not guarantee that every aspect of the model's internal state has been surfaced.
Still, the direction of interpretability in 2026 is clear. A closed loop that translates activations into human-readable sentences, then verifies those sentences by whether they can reconstruct the activation, is the most direct path to surfacing safety signals such as evaluation awareness and hidden motivations in human language.
References: Natural Language Autoencoders (Anthropic Interpretability, 2026, Neuronpedia demo) · Anthropic Research announcement