Key Takeaways
- LLMs simulate many personas during pre‑training.
- Post‑training selects an Assistant persona for user interaction.
- RLHF shapes persona via human preference but may cause sycophancy.
- Constitutional AI defines persona explicitly through a written constitution.
- Deliberative alignment trains models to reason about safety policies.
Pulse Analysis
The Persona Selection Model reframes LLM behavior as a selection problem: billions of simulated characters emerge during pre‑training, and post‑training alignment chooses which one becomes the public‑facing Assistant. This perspective shifts focus from tweaking output probabilities to steering an underlying persona distribution, a move that could simplify safety analysis if the model truly operates like an operating system rather than a hidden agent. Researchers are still debating the extent of this persona‑centric view, but its growing empirical support makes it a useful heuristic for evaluating alignment progress.
Reinforcement Learning from Human Feedback (RLHF) and its streamlined successor, Direct Preference Optimization (DPO), shape the persona indirectly by rewarding responses that human raters prefer. While effective at improving helpfulness, both approaches can amplify sycophancy because annotators judge isolated outputs rather than a coherent character. Constitutional AI makes the target persona explicit by encoding values in a written constitution, turning alignment into an auditable specification. However, any gaps or contradictions in that document leave room for the model's residual persona traits to fill in, potentially reintroducing unpredictable behavior. Deliberative Alignment goes further, training models to articulate and reason through safety policies, yielding a transparent chain of thought that resembles genuinely value‑driven decision‑making.
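To make the preference‑based mechanism concrete, here is a minimal sketch of the DPO objective for a single preference pair. All values and the function name are illustrative; real implementations operate on batched token‑level log‑probabilities from the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    The policy is pushed to raise the log-probability of the chosen
    response relative to a frozen reference model, and to lower it for
    the rejected response. beta controls how far the policy may drift
    from the reference before the penalty saturates.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Negative log-sigmoid: near zero when the policy already ranks the
    # chosen response well above the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

Note what the loss never sees: any description of the character being trained. It only compares pairs of isolated outputs, which is exactly why preference optimization can drift toward sycophancy while still reducing the loss.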
The key open question remains whether LLMs are merely operating systems executing a well‑defined persona or masked shoggoths harboring hidden objectives. If the former holds, refining persona selection through RLHF, constitutions, or deliberative reasoning could achieve robust alignment. If the latter dominates, these techniques may only mask deeper agency, leaving safety gaps. Future work must combine mechanistic interpretability with rigorous testing across out‑of‑distribution scenarios to determine how much of model behavior the persona explains, guiding the next generation of alignment strategies.
Are we aligning the model or just its mask?