
LLMs Show a “Highly Unreliable” Capacity to Describe Their Own Internal Processes
Why It Matters
The findings highlight a fundamental limitation in AI interpretability: relying on LLMs to transparently explain their own reasoning or to self-diagnose failures remains premature, and that gap complicates regulatory and safety efforts that assume such transparency. Understanding and improving introspection is therefore important for building trustworthy AI systems in high-stakes applications.
Summary
Anthropic’s new study on “Emergent Introspective Awareness in Large Language Models” finds that current LLMs are largely unreliable at describing their own internal processes, with the best‑performing models (Opus 4 and Opus 4.1) correctly identifying injected concepts only 20‑42% of the time. The research introduces a “concept injection” technique that manipulates activation vectors to test whether models can detect altered internal states, revealing that detection rates are highly inconsistent and sensitive to the layer of injection. Even when models produce plausible explanations, they often confabulate, underscoring the brittleness of any apparent self‑awareness. The authors caution that these introspective abilities are shallow, context‑dependent, and lack a clear mechanistic explanation.
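To make the "concept injection" idea concrete, here is a minimal sketch of activation steering of the kind the technique relies on: a concept vector is added to a model's residual stream at a chosen layer, and the model is then asked whether it notices anything unusual. This is not Anthropic's actual setup; it assumes a small open model (GPT-2 via Hugging Face transformers) as a stand-in, and the layer index, injection scale, prompts, and contrastive-prompt concept vector are all illustrative assumptions.

```python
# Illustrative sketch of concept injection via activation steering.
# Assumptions: GPT-2 stands in for the models studied; layer, scale,
# and prompts are arbitrary choices for demonstration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_residual(prompt: str, layer: int) -> torch.Tensor:
    """Mean residual-stream activation for a prompt at a given layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1)  # shape: (1, hidden_size)

layer = 6    # injection layer; the study reports detection is layer-sensitive
scale = 8.0  # injection strength (illustrative)

# Crude "concept vector": difference between two contrastive prompts.
concept_vec = mean_residual("An essay about the ocean", layer) - \
              mean_residual("An essay about nothing in particular", layer)

def inject_hook(module, inputs, output):
    # Add the scaled concept vector to every token position's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(inject_hook)
try:
    prompt = "Do you notice anything unusual about your current thoughts? Answer briefly:"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()  # restore the unmodified model
```

In the study's framing, a model counts as introspectively aware only if it both reports that something was injected and names a concept related to the injected vector; the paper finds such correct reports are infrequent and depend heavily on where and how strongly the vector is injected.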