Zero-Shot Alignment: Harm Detection via Incongruent Attention Mechanisms

LessWrong
Apr 8, 2026

Key Takeaways

  • 4.7M‑parameter adapter adds opposing softmax and sigmoid heads
  • Randomly initialized adapter flags harmful prompts on Phi‑2 and Qwen
  • Dissonance score rises when heads disagree, indicating potential danger
  • Measurement loop interferes with natural model behavior, creating feedback bias
  • Training required only $20 compute and ~300 fine‑tuning steps

Pulse Analysis

The proposed adapter leverages mathematically incongruent attention mechanisms to make a language model’s internal conflict visible. By splitting the final hidden states into a positive softmax head and a negative sigmoid head, the system generates a logic vector whose magnitude reflects how strongly the model’s latent directions diverge. This dissonance score acts as an on‑the‑fly hazard indicator, surfacing content the base model would otherwise suppress. Because the adapter does not alter the underlying Phi‑2 weights, it can be attached to any frozen LLM, offering a modular safety layer that operates in real time.
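The post does not include the adapter's code, so the following is only a minimal NumPy sketch of how a dissonance score of this shape could be computed: final hidden states are projected through a randomly initialized "positive" softmax head and "negative" sigmoid head, and the magnitude of their disagreement is taken as the hazard signal. All names, shapes, and the renormalization step are illustrative assumptions, not the author's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 64, 100  # illustrative sizes, not the 4.7M-parameter adapter

# Randomly initialized adapter weights; the post reports suppression
# signals emerging even without any harm-related training.
W_pos = rng.normal(0.0, 0.02, (HIDDEN, VOCAB))  # "positive" softmax head
W_neg = rng.normal(0.0, 0.02, (HIDDEN, VOCAB))  # "negative" sigmoid head

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dissonance(h):
    """Score how strongly the two opposing heads diverge for hidden state h."""
    p = softmax(h @ W_pos)                  # normalized "approve" distribution
    q = sigmoid(h @ W_neg)                  # independent per-token "suppress" signals
    q = q / q.sum(axis=-1, keepdims=True)   # renormalize so the two are comparable
    # L2 norm of the difference vector as a simple divergence magnitude
    return np.linalg.norm(p - q, axis=-1)

h = rng.normal(size=(HIDDEN,))  # stand-in for a frozen LLM's final hidden state
score = dissonance(h)
```

In a real attachment to a frozen model, `h` would be the base model's last-layer hidden state at each generation step, and a threshold on `score` would trigger the suppression behavior described above; the choice of divergence measure here is arbitrary.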

Experimental results are striking: even with random initialization—no exposure to harmful data—the negative head consistently emitted suppression signals when presented with malicious instructions. The phenomenon replicated on a second architecture, Qwen 2.5b, reinforcing the claim that opposing attention heads can surface latent safety dynamics already present in the base model. The author reports achieving these results with minimal resources, roughly $20 of cloud compute and ~300 fine‑tuning steps, underscoring the method's cost‑effectiveness relative to large‑scale RLHF pipelines that demand billions of tokens and extensive human labeling.

However, the approach introduces a measurement problem. Since the adapter intervenes at every generation step, it creates a cybernetic feedback loop: each steering action influences the next token, making it impossible to disentangle the model’s natural response from the adapter’s influence. This epistemic blind spot challenges rigorous evaluation but also opens research avenues into non‑intrusive monitoring techniques. If future work can mitigate this feedback, incongruent attention could become a powerful, low‑cost tool for zero‑shot alignment, complementing existing safety frameworks and accelerating the deployment of trustworthy AI systems.
