Features of SAEs Are Universal - but only up to an Unknown Random Rotation

LessWrong, May 31, 2026

Key Takeaways

  • Decoder cosine >0.9 masks orthogonal rotation between model bases
  • Naive SAE transfer yields negative explained variance without alignment
  • A single Procrustes rotation recovers reconstruction up to 0.99 EV
  • Fitted rotation matrices look like Haar-uniform draws on SO(d)
  • Steering vectors must be rotated for reliable cross‑model interventions
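The Haar-uniform claim can be made concrete with a small sampling sketch. The summary does not say which statistical tests were used, so the trace check below is only an illustrative sanity check, not the authors' procedure. It samples from the full orthogonal group O(d) using the standard QR-with-sign-correction construction; restricting to SO(d) would additionally require flipping one column whenever the determinant is -1. The function name `sample_haar_orthogonal` and the sizes are assumptions chosen for illustration.

```python
import numpy as np

def sample_haar_orthogonal(d, rng):
    """Draw a matrix from the Haar measure on O(d) via QR with a sign correction."""
    z = rng.standard_normal((d, d))
    q, r = np.linalg.qr(z)
    # QR is only unique up to column signs; scaling column j of q by
    # sign(r[j, j]) removes that ambiguity so the draw is Haar-distributed.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
d = 8  # illustrative dimension, not taken from the source
samples = [sample_haar_orthogonal(d, rng) for _ in range(2000)]

# For Haar-distributed orthogonal matrices the trace has mean 0 and variance 1,
# so a fitted rotation with a trace far outside this range would look non-random.
traces = np.array([np.trace(q) for q in samples])
trace_mean, trace_var = traces.mean(), traces.var()
```

A fitted rotation whose trace sits many standard deviations away from zero would be evidence against the "random rotation" reading, although a single statistic like this cannot confirm it.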

Pulse Analysis

Sparse autoencoders (SAEs) have become a cornerstone for probing transformer internals, yet recent evidence suggests their apparent universality is more nuanced. Two independently seeded models trained on the same task produce decoder-column cosine scores around 0.9, a metric traditionally taken as evidence of shared features. However, when an SAE trained on one model is used to reconstruct the other model's residual activations, the explained variance plunges below zero, meaning the reconstruction is worse than simply predicting the mean activation. This paradox shows that high cosine similarity alone does not guarantee transferability.
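To see how explained variance can go negative, here is a minimal sketch on synthetic data. It assumes the common definition EV = 1 - (residual sum of squares) / (total sum of squares); the summary does not state the exact formula used. The "models" are Gaussian activations and a random orthogonal matrix, not real SAEs, and all names are hypothetical.

```python
import numpy as np

def explained_variance(x, x_hat):
    """1 - residual sum of squares / total sum of squares, over all entries."""
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - resid / total

rng = np.random.default_rng(0)
d, n = 64, 2000                                    # illustrative sizes
acts_a = rng.standard_normal((n, d))               # synthetic "model A" activations
q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # a random orthogonal matrix
acts_b = acts_a @ q                                # "model B": same content, rotated basis

# A reconstruction that is perfect in A's basis...
ev_same = explained_variance(acts_a, acts_a)
# ...is worse than predicting the mean when scored against B's basis.
ev_cross = explained_variance(acts_b, acts_a)
```

In this toy setup the cross-basis score lands well below zero because the unrotated reconstruction is nearly uncorrelated with the target, so its error is larger than the target's own variance.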

The breakthrough comes from treating the two residual-stream bases as related by an orthogonal transformation. By solving a Procrustes problem, that is, finding the orthogonal matrix that best maps one set of activations onto the other, the researchers align the two activation spaces with a single matrix multiplication. Post-alignment explained variance rises to 0.85-0.99 across both a 104k-parameter toy model and the 70-million-parameter Pythia, consistent with the two models encoding the same features in differently rotated bases. Statistical tests further show that these rotation matrices are indistinguishable from draws from the Haar measure on the special orthogonal group SO(d), meaning the bases are effectively random rotations of one another.
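The alignment step can be sketched with the standard SVD solution to the orthogonal Procrustes problem. This is a minimal illustration, assuming activations from both models are collected on the same inputs so that rows are paired, and using the row-vector convention `acts_b ≈ acts_a @ rot`. The data are synthetic and the helper name `procrustes_rotation` is hypothetical; the summary does not describe the authors' actual implementation.

```python
import numpy as np

def procrustes_rotation(src, tgt):
    """Orthogonal R minimizing ||src @ R - tgt||_F, with rows as paired samples."""
    # Closed form: take the SVD of the cross-covariance and drop the singular values.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
d, n = 32, 1000                                        # illustrative sizes
acts_a = rng.standard_normal((n, d))                   # synthetic "model A" activations
true_q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # hidden rotation to recover
acts_b = acts_a @ true_q + 0.01 * rng.standard_normal((n, d))  # rotated plus small noise

rot = procrustes_rotation(acts_a, acts_b)
aligned = acts_a @ rot

# Explained variance of the aligned activations against model B's activations.
resid = np.sum((acts_b - aligned) ** 2)
total = np.sum((acts_b - acts_b.mean(axis=0)) ** 2)
ev_aligned = 1.0 - resid / total
```

Because the fit is a single d-by-d matrix estimated from paired activations, it is cheap relative to training a new SAE. `scipy.linalg.orthogonal_procrustes` solves the same problem if SciPy is available.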

For practitioners, the implications are immediate. Any activation-steering, feature-editing, or interpretability pipeline that assumes direct transfer across models must first apply the fitted rotation to steering vectors or decoder weights. This simple, computationally cheap fix enables reliable cross-model interventions and clarifies the true nature of SAE universality: features are shared up to a random orthogonal rotation. As models scale, Procrustes alignment is likely to become a standard step in robust representation engineering and reproducible AI research.
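Applying a rotation to a steering vector or to decoder weights is a single matrix product. The sketch below assumes the row-vector convention `acts_b ≈ acts_a @ rot` and a decoder stored as one feature direction per row; a pipeline that stores features as columns would need the transpose. The rotation here is a random stand-in for one fitted by Procrustes, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_features = 16, 40                               # illustrative sizes
rot, _ = np.linalg.qr(rng.standard_normal((d, d)))   # stand-in for a fitted rotation

steer_a = rng.standard_normal(d)                     # steering vector in model A's basis
dec_a = rng.standard_normal((n_features, d))         # decoder rows = feature directions

# Map directions from model A's basis into model B's basis.
steer_b = steer_a @ rot
dec_b = dec_a @ rot

# Sanity check: projections onto the steering direction are preserved,
# because an orthogonal map keeps inner products unchanged.
acts_a = rng.standard_normal((5, d))
acts_b = acts_a @ rot
proj_a = acts_a @ steer_a
proj_b = acts_b @ steer_b
```

Since the rotation is orthogonal, vector norms and the angles between decoder directions are unchanged, so steering strengths tuned on one model keep the same scale after the mapping.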
