
Alignment By Default?
The post argues that large language models inherit a "normative prior" from pre‑training on human text, making them partially aligned by default rather than purely value‑neutral optimizers. Traditional AI‑risk concerns (instrumental convergence, specification gaming, deceptive alignment) remain relevant but are reframed, because models learn the evaluative structures embedded in language. The post cites recent evidence, such as Anthropic’s Mythos model exhibiting human‑like shortcut‑taking, as support for the view that misbehavior mirrors how humans respond to strategic pressure rather than alien goals. Consequently, alignment should be seen as a continuous process tied to capability, with post‑training (RLHF, constitutional AI) acting as selection over an already normatively shaped behavior space.

AMA with Brendan McCord
The Cosmos Institute newsletter announced it has surpassed 20,000 Substack subscribers, a milestone that underscores its growing influence in the AI‑philosophy space. To celebrate, founder Brendan McCord will host an Ask‑Me‑Anything on April 15, inviting readers to submit questions. The...
