Character-Trained Models Can Struggle to Generalise

LessWrong, May 25, 2026

Key Takeaways

  • Full-stage models keep persona in chat with macro‑F1 ≈ 0.86–0.94.
  • In agentic email tasks, macro‑F1 falls to 0.29–0.55.
  • Distillation improves out‑of‑distribution (OOD) performance, but full‑stage training still performs best.
  • Persona signal survives the format shift but loses more than 30 macro‑F1 points (over 0.30 on the 0–1 scale).
  • Results caution against relying on chat‑only training for autonomous agents.

Pulse Analysis

Character‑training pipelines such as OpenCharacterTraining have shown impressive persona fidelity when evaluated on chat‑style benchmarks like PURE‑DOVE. By distilling a persona‑specific response distribution and then applying supervised fine‑tuning (SFT), researchers reported macro‑F1 scores of roughly 0.86–0.94, suggesting that a model can reliably adopt traits such as sarcasm or poeticism. However, these results are tightly coupled to the input format: the model learns to recognize subtle cue patterns that are abundant in turn‑based dialogue but scarce in other interaction modes.

When the same fine‑tuned checkpoints are embedded in an agentic workflow, specifically a tool‑use loop that generates email bodies, the persona signal deteriorates dramatically. The ModernBERT classifier, which identified the correct persona at 0.86–0.94 macro‑F1 on chat outputs, drops to 0.29–0.55 macro‑F1 on the email outputs. This decline mirrors findings from recent alignment research that SFT‑derived policies often fail to generalize beyond the narrow distribution they were trained on. Even though the full‑stage adapters retain a measurable edge over the base and distillation‑only variants, the loss of more than thirty F1 points underscores a fundamental brittleness in current character‑training methods.
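To make the metric concrete, here is a minimal sketch of how macro‑F1 can be computed for a persona classifier and compared across the two formats. The persona labels and predictions are invented toy data, not results from the post, and the post does not say how its scores were computed; in practice a library routine such as scikit‑learn's `f1_score(..., average="macro")` would normally be used instead of this hand‑rolled version.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: true personas and classifier guesses per output.
true_personas = ["sarcastic", "sarcastic", "poetic", "poetic", "neutral", "neutral"]
chat_guesses = ["sarcastic", "sarcastic", "poetic", "poetic", "neutral", "poetic"]
email_guesses = ["neutral", "sarcastic", "neutral", "poetic", "neutral", "neutral"]

chat_score = macro_f1(true_personas, chat_guesses)
email_score = macro_f1(true_personas, email_guesses)
print(f"chat macro-F1:  {chat_score:.2f}")
print(f"email macro-F1: {email_score:.2f}")
print(f"drop:           {chat_score - email_score:.2f}")
```

Because macro‑F1 averages per‑class scores without weighting by class frequency, a classifier that collapses most email outputs into one generic class is penalised on every persona it misses, which is one way a persona signal fading toward generic text can show up as a large drop.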

For enterprises planning to deploy persona‑driven assistants in autonomous settings (such as email drafting, report generation, or customer‑service bots), these results serve as a cautionary signal. Relying solely on chat‑oriented fine‑tuning may produce agents that appear on‑brand during interactive sessions but revert to generic behavior when operating behind the scenes. Future work should explore hybrid approaches that combine rationale‑based prompting, multi‑modal training data, and reinforcement learning from human feedback to embed persona traits more robustly across diverse output channels.
