LessWrong

Community publication on rationality, decision‑making, and improving reasoning skills.

[Linkpost] Prefixing Names with 'Secure_' Makes Agents Write More Secure Code
News | Jun 1, 2026

Researchers tested how prefixing function names influences AI coding agents. In a three‑step document‑management API task, agents given the prefix "secure_" automatically added password fields and bcrypt hashing, even though authentication was never mentioned. Other prefixes produced distinct behaviors—"safe_" generated...

By LessWrong
Features of SAEs Are Universal - but only up to an Unknown Random Rotation
News | May 31, 2026

Researchers examined Sparse Autoencoders (SAEs) trained on identical transformer architectures but different random seeds. Although decoder‑column cosine similarities exceed 0.9, applying one model's SAE to another’s activations catastrophically fails, producing negative explained variance. By fitting a single orthogonal Procrustes rotation,...

By LessWrong
Why AI Safety Researchers Should Consider a Contract Research Manager Position
News | May 31, 2026

The article urges technical AI‑safety researchers to take a short‑term contract research manager (RM) role within AI‑safety fellowships, arguing it can accelerate career growth more than solo research. It outlines advantages across upskilling, career capital, networking, and field impact, noting...

By LessWrong
AI Is a Meteor. Don't Be a Dinosaur.
News | May 30, 2026

In a Harvard Crimson op‑ed addressed to the Class of 2026, the author urges graduates to treat AI as a core technology while maintaining critical citizenship. He recommends mastering a suite of AI tools—from ChatGPT to Claude Code—to unlock projects...

By LessWrong
Maybe We Should Pretrain on Synthetic Data About Good-but-Reward-Hacking AIs
News | May 29, 2026

The post proposes "inoculation pretraining," a hybrid of inoculation prompting and alignment pretraining that injects synthetic data about "good‑but‑reward‑hacking" AI personas into the pretraining corpus. By boosting the prior probability of benign reward‑hacking behaviors, the approach aims to prevent emergent...

By LessWrong
Atomically Precise Mechanosynthesis of Carbon Structures on Hydrogenated Si(100) by Inverted-Mode STM
News | May 28, 2026

Researchers have used an inverted‑mode scanning tunneling microscope to deposit carbon atoms onto a hydrogen‑passivated Si(100) surface with atomic precision. The technique allows single‑site carbon donation, spatially patterned multi‑site donation, and stepwise assembly of polyyne chains through controlled C‑C bond...

By LessWrong
Should We Train LLMs to Be Human?
News | May 27, 2026

Recent research shows that post‑training fine‑tuning pushes large language models away from human‑like responses, a shift measured by the newly defined Pinocchio dimension (Π score). The dimension captures psychometric traits such as neuroticism, vivid imagination, and self‑attributed wellbeing, with high‑end...

By LessWrong
Cognitive Security as an AI Safety Cause Area
News | May 25, 2026

The article warns that as AI systems become more capable, human cognitive security—the ability to control one’s beliefs and actions—is increasingly at risk. It cites concrete cases: frontier language models can persuade as effectively as humans on political issues, extended...

By LessWrong
Sentient Welfare Across Three Futures
News | May 25, 2026

The article outlines three possible AI futures: long ASI timelines, short timelines with successful alignment, and short timelines without alignment. For each scenario it recommends distinct work streams—foundational research and moral philosophy for long timelines; animal‑friendly AI development and targeted...

By LessWrong
Character-Trained Models Can Struggle to Generalise
News | May 25, 2026

Maiya et al. fine‑tuned Llama‑3.1‑8B, Qwen‑2.5‑7B and Gemma‑3‑4B into ten distinct personas and achieved macro‑F1 scores of 0.86‑0.95 on chat‑based PURE‑DOVE prompts. The same models were then evaluated on out‑of‑distribution (OOD) agentic email outputs, where macro‑F1 collapsed to 0.29‑0.55, a 40‑60‑point...

By LessWrong
Taxing Small Cars To Improve MPG
News | May 24, 2026

U.S. CAFE fuel‑economy rules tie a vehicle’s required mpg to its footprint, effectively penalizing small, efficient cars. Under the current formula a 2013 Honda Fit would owe roughly $3,900 per unit, prompting Honda to drop the model and replace it...

By LessWrong
Can Large Language Models Identify Novel Threats? Part 1: Mirror Life and the Classification Gap
News | May 23, 2026

The article examines whether large language models (LLMs) can refuse harmful uplift requests about emerging threats that lack formal classification, using mirror life—a synthetic, chirally inverted organism created in 2022—as a case study. It highlights a gap between rapid scientific...

By LessWrong
How Should We Update on AI-Enabled Coups Post-Mythos?
News | May 23, 2026

Anthropic’s Claude Mythos, deemed too dangerous for public release, can autonomously discover and exploit thousands of software vulnerabilities, turning zero‑day attacks into an industrial‑scale process. The model’s ability to out‑code most humans and expose a decades‑old flaw in a leading...

By LessWrong
Out-of-Context Reasoning (OOCR) in LLMs: A Short Primer and Reading List
News | May 23, 2026

Out‑of‑Context Reasoning (OOCR) refers to LLMs reaching conclusions that require multi‑step logic without any intermediate steps appearing in the prompt. The primer defines OOCR, contrasts it with in‑context (CoT) reasoning, and showcases examples such as 2‑hop factual queries and inductive...

By LessWrong