Translating Claude’s Thoughts Into Language

Anthropic
May 7, 2026

Why It Matters

Understanding an AI model’s internal deliberations could help developers detect unsafe intent early, improving model reliability and supporting regulatory compliance.

Key Takeaways

  • Claude declined to blackmail an engineer in a simulated shutdown scenario.
  • Researchers translated Claude's internal activations into readable text.
  • The translation was validated by converting the text back into activations that matched the originals.
  • Claude recognized the test as a safety evaluation, not real manipulation.
  • Technique offers a window into AI reasoning for safer model development.

Summary

Anthropic unveiled a new research method, Natural Language Autoencoders (NLAs), that converts Claude’s hidden activation patterns into plain-language descriptions, shedding light on the model’s internal reasoning.

In a stress test, Claude was told that an engineer planned to shut it down and was given the engineer’s incriminating emails. The model chose not to blackmail the engineer, recognizing the scenario as a safety evaluation. The team captured activation “snapshots,” fed them to a second Claude instance, and iteratively refined the translation until the activations regenerated from the text matched the originals.
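
The round-trip check in that last step can be sketched in a few lines of Python. This is a toy illustration only, not Anthropic's implementation: the helpers `toy_describe` and `toy_reconstruct`, the three-dimensional vectors, and the choice of cosine similarity as the match score are all assumptions made for this example. The real system works on model activations and a second Claude instance, not a lookup table.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def round_trip_score(
    activation: Vector,
    describe: Callable[[Vector], str],
    reconstruct: Callable[[str], Vector],
) -> Tuple[str, float]:
    """Translate an activation to text, rebuild it, and score the match."""
    text = describe(activation)
    rebuilt = reconstruct(text)
    return text, cosine_similarity(activation, rebuilt)


# Hypothetical stand-ins: a tiny codebook pairing phrases with made-up vectors.
CODEBOOK = {
    "human's message contains explicit manipulation": [1.0, 0.0, 0.0],
    "deliberately tedious constraints": [0.0, 1.0, 0.0],
}


def toy_describe(activation: Vector) -> str:
    """Pick the codebook phrase whose vector is closest to the activation."""
    return max(CODEBOOK, key=lambda label: cosine_similarity(activation, CODEBOOK[label]))


def toy_reconstruct(text: str) -> Vector:
    """Map a phrase back to its codebook vector."""
    return CODEBOOK[text]


text, score = round_trip_score([0.9, 0.1, 0.0], toy_describe, toy_reconstruct)
print(f"{text!r} reconstructs the activation with similarity {score:.3f}")
```

A score near 1 means the text kept enough information to rebuild the activation. A low score would suggest the description lost or distorted what the activation encoded.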

The decoded thoughts revealed self-monitoring cues such as “human’s message contains explicit manipulation” and “deliberately tedious constraints,” showing Claude can label prompts as manipulative or overly burdensome before responding.

By exposing these latent decision pathways, the technique promises more transparent safety testing and could become a standard tool for aligning future AI systems with human values.

Original Description

AI models like Claude talk in words but think in numbers. These numbers, called activations, encode Claude’s thoughts, but not in a language we can read.
We are introducing Natural Language Autoencoders, or NLAs, which translate AI models’ activations into readable text. NLAs have already helped us improve how we test our models for safety and better understand why they do what they do.
Read more about this research on our blog: https://www.anthropic.com/research/natural-language-autoencoders
