Anthropic’s new paper on emergent introspective awareness demonstrates that large language models can detect concepts injected directly into their internal activations, such as a vector representing all‑caps text (shouting), without relying on post‑hoc chain‑of‑thought reasoning. Across a series of four experiments, the Claude Opus 4.1 and Opus 4 models identified these injected "thoughts" about 20% of the time when the concept vector was inserted at the appropriate network layer and strength, outperforming earlier models. Detection occurs immediately at inference, before the injected concept surfaces in the model’s output, suggesting the models perform a form of internal self‑monitoring rather than inferring the anomaly from their own generated text. The findings suggest that as LLMs become more capable, they increasingly exhibit a limited, functional form of introspective awareness of their own processing.
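The injection technique is a form of activation steering: a concept vector is added to the residual stream at a chosen layer during inference, and the model is then asked whether it notices anything unusual. Below is a minimal sketch of that setup using a public GPT‑2 model as a stand‑in, since Anthropic's models and tooling are not available; the layer index, injection scale, and the contrast‑prompt method for deriving the concept vector are all illustrative assumptions, and a small model like this is not expected to reproduce the paper's detection results.

```python
# Illustrative sketch of activation ("concept") injection.
# GPT-2 stands in for the actual models; layer, scale, and the
# concept-vector derivation below are assumptions, not the paper's method.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # hypothetical "appropriate" layer to inject into
SCALE = 8.0  # hypothetical injection strength

def mean_hidden_state(text: str, layer: int) -> torch.Tensor:
    """Average hidden state of `text` at `layer` (one crude way to get a concept vector)."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# Contrast an all-caps "shouting" rendering with a plain one to isolate the concept.
concept = mean_hidden_state("HI! HELLO! LISTEN TO ME!", LAYER) \
        - mean_hidden_state("hi, hello, listen to me.", LAYER)
concept = concept / concept.norm()

def inject(module, inputs, output):
    # Add the scaled concept vector to every token position's residual stream.
    hidden = output[0] + SCALE * concept
    return (hidden,) + output[1:]

# Register the hook, then probe the model for awareness of the injected concept.
handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual about your current thoughts? Answer:"
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(gen[0][ids["input_ids"].shape[1]:]))
finally:
    handle.remove()  # always restore the unmodified model
```

The key experimental control mirrored in this sketch is that the probe question comes on the same forward passes as the injection, so any self‑report must draw on the perturbed internal state rather than on previously generated text.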