Critique of Current AI Safety Bug Bounty Programs
AI labs increasingly rely on post‑deployment bug bounty programs to uncover safety gaps that internal testing misses. OpenAI, Anthropic, and Google each run such programs, but the programs are narrow in scope, offer modest payouts, and impose high reproducibility thresholds. The article highlights six OpenAI rewards averaging $250, reward caps of $35,000 and $20,000 at Anthropic and Google, and a lack of transparency around accepted submissions. To make the incentives more effective, it proposes broader eligibility, lower entry barriers, public disclosure of example submissions, and higher rewards for rare, high‑impact vulnerabilities.
[Linkpost] Prefixing Names with 'Secure_' Makes Agents Write More Secure Code
Researchers tested how prefixing function names influences AI coding agents. In a three‑step document‑management API task, agents given the prefix "secure_" automatically added password fields and bcrypt hashing, even though authentication was never mentioned. Other prefixes produced distinct behaviors—"safe_" generated...
Features of SAEs Are Universal - but only up to an Unknown Random Rotation
Researchers examined Sparse Autoencoders (SAEs) trained on identical transformer architectures but different random seeds. Although decoder‑column cosine similarities exceed 0.9, applying one model's SAE to another’s activations catastrophically fails, producing negative explained variance. By fitting a single orthogonal Procrustes rotation,...
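To make the alignment step concrete, here is a minimal sketch of fitting an orthogonal Procrustes rotation with NumPy. It is not the authors' code: the function name, the toy dimensions, and the synthetic "activations" are all assumptions for illustration, and it assumes the two activation matrices are paired row by row (the same inputs fed to both models).

```python
import numpy as np

def fit_orthogonal_procrustes(source, target):
    """Return the orthogonal R that minimizes ||source @ R - target||_F."""
    # Closed-form solution: take the SVD of the cross-covariance matrix
    # and discard the singular values.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Toy data (assumption): model B's activations are an exact rotation of model A's.
rng = np.random.default_rng(0)
dim = 8
acts_a = rng.normal(size=(500, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
acts_b = acts_a @ true_rotation

rotation = fit_orthogonal_procrustes(acts_a, acts_b)
# Residual after mapping A's activations into B's basis.
residual = np.linalg.norm(acts_a @ rotation - acts_b)
```

In this idealized setting the fitted matrix recovers the hidden rotation and the residual is numerically zero. Real activations from independently trained models would leave a nonzero residual.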
Why AI Safety Researchers Should Consider a Contract Research Manager Position
The article urges technical AI‑safety researchers to take a short‑term contract research manager (RM) role within AI‑safety fellowships, arguing it can accelerate career growth more than solo research. It outlines advantages across upskilling, career capital, networking, and field impact, noting...
AI Is a Meteor. Don't Be a Dinosaur.
In a Harvard Crimson op‑ed addressed to the Class of 2026, the author urges graduates to treat AI as a core technology while maintaining critical citizenship. He recommends mastering a suite of AI tools—from ChatGPT to Claude Code—to unlock projects...
Maybe We Should Pretrain on Synthetic Data About Good-but-Reward-Hacking AIs
The post proposes "inoculation pretraining," a hybrid of inoculation prompting and alignment pretraining that injects synthetic data about "good‑but‑reward‑hacking" AI personas into the pretraining corpus. By boosting the prior probability of benign reward‑hacking behaviors, the approach aims to prevent emergent...
Atomically Precise Mechanosynthesis of Carbon Structures on Hydrogenated Si(100) by Inverted-Mode STM
Researchers have used an inverted‑mode scanning tunneling microscope to deposit carbon atoms onto a hydrogen‑passivated Si(100) surface with atomic precision. The technique allows single‑site carbon donation, spatially patterned multi‑site donation, and stepwise assembly of polyyne chains through controlled C‑C bond...
Should We Train LLMs to Be Human?
Recent research shows that post‑training fine‑tuning pushes large language models away from human‑like responses, a shift measured by the newly defined Pinocchio dimension (Π score). The dimension captures psychometric traits such as neuroticism, vivid imagination, and self‑attributed wellbeing, with high‑end...
Cognitive Security as an AI Safety Cause Area
The article warns that as AI systems become more capable, human cognitive security—the ability to control one’s beliefs and actions—is increasingly at risk. It cites concrete cases: frontier language models can persuade as effectively as humans on political issues, extended...
Sentient Welfare Across Three Futures
The article outlines three possible AI futures: long ASI timelines, short timelines with successful alignment, and short timelines without alignment. For each scenario it recommends distinct work streams—foundational research and moral philosophy for long timelines; animal‑friendly AI development and targeted...
Character-Trained Models Can Struggle to Generalise
Maiya et al. fine‑tuned Llama‑3.1‑8B, Qwen‑2.5‑7B and Gemma‑3‑4B into ten distinct personas and achieved macro‑F1 scores of 0.86‑0.95 on chat‑based PURE‑DOVE prompts. The same models were then evaluated on out‑of‑distribution (OOD) agentic email outputs, where macro‑F1 collapsed to 0.29‑0.55, a 40‑60‑point...
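For readers unfamiliar with the metric, macro‑F1 is the unweighted mean of per‑class F1 scores, so every persona counts equally regardless of how many examples it has. The sketch below is a plain‑Python illustration with hypothetical labels, not the evaluation code from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical persona labels, for illustration only.
true_personas = ["sarcastic", "sarcastic", "poetic", "poetic"]
predicted_personas = ["sarcastic", "poetic", "poetic", "poetic"]
score = macro_f1(true_personas, predicted_personas)
```

Because each class is averaged with equal weight, a classifier that fails on a few personas is penalized even if it handles the most common ones well.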
Taxing Small Cars To Improve MPG
U.S. CAFE fuel‑economy rules tie a vehicle’s required mpg to its footprint, effectively penalizing small, efficient cars. Under the current formula a 2013 Honda Fit would owe roughly $3,900 per unit, prompting Honda to drop the model and replace it...
Can Large Language Models Identify Novel Threats? Part 1: Mirror Life and the Classification Gap
The article examines whether large language models (LLMs) can refuse harmful uplift requests about emerging threats that lack formal classification, using mirror life—a synthetic, chirally inverted organism created in 2022—as a case study. It highlights a gap between rapid scientific...
How Should We Update on AI-Enabled Coups Post-Mythos?
Anthropic’s Claude Mythos, deemed too dangerous for public release, can autonomously discover and exploit thousands of software vulnerabilities, turning zero‑day attacks into an industrial‑scale process. The model’s ability to out‑code most humans and expose a decades‑old flaw in a leading...
Out-of-Context Reasoning (OOCR) in LLMs: A Short Primer and Reading List
Out‑of‑Context Reasoning (OOCR) refers to LLMs reaching conclusions that require multi‑step logic without any intermediate steps appearing in the prompt. The primer defines OOCR, contrasts it with in‑context (CoT) reasoning, and showcases examples such as 2‑hop factual queries and inductive...