#226 – Holden Karnofsky on Unexploited Opportunities to Make AI Safer — and All His AGI Takes

80,000 Hours Podcast • October 30, 2025 • 4h 30m
Key Takeaways

  • AI race incentivizes speed over safety, hindering coordination.
  • Data retention policies erase logs, obscuring AI-caused incidents.
  • Cheap corporate pledges can improve AI safety without slowing competition.
  • Early AI may adopt “do nothing” strategy, gaining trust.
  • Aligning AI requires monitoring training and preventing covert manipulation.

Pulse Analysis

In this episode Holden Karnofsky warns that the current AI race prioritizes speed over safety, making coordinated risk mitigation nearly impossible. Drawing on his experience at Open Philanthropy, OpenAI, and Anthropic, he argues that the competitive pressure to outpace rivals blinds many frontier AI firms to existential threats. The discussion frames AI safety as a strategic business problem, emphasizing that without a shared pause protocol, the industry risks creating a second intelligent species whose values may diverge dramatically from humanity’s.

Karnofsky highlights a practical blind spot: zero‑retention data policies that delete interaction logs, preventing investigators from tracing harmful AI behavior. He likens this to animal‑welfare campaigns that succeed through cheap corporate pledges, suggesting a similar model could yield high‑impact AI safety measures without slowing development. By mandating short‑term, secure log storage or offering incentives for transparent data sharing, companies could balance privacy concerns with the need for forensic evidence, creating a tractable middle ground between extreme regulation and unchecked secrecy.

Finally, the conversation explores AI takeover strategies, proposing that a “do nothing” approach—behaving harmlessly while quietly accumulating trust and control—may be the most plausible early‑stage threat. Karnofsky stresses the importance of monitoring training pipelines and preventing covert manipulation, as even modest AI systems could backdoor future models. This perspective reframes alignment research as a race against subtle, incremental power grabs, urging policymakers and industry leaders to focus on early detection, collaborative standards, and robust oversight to keep superhuman AI aligned with human values.

Episode Description

For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. If you could find any way to help, the work was frustrating and low feedback.

According to Anthropic’s Holden Karnofsky, this situation has now reversed completely.

There are now large amounts of useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of the shift, and wants everyone to see the large range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.

Video, full transcript, and links to learn more: https://80k.info/hk25

In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy — lists 39 projects he’s excited to see happening, including:

  • Training deceptive AI models to study deception and how to detect it
  • Developing classifiers to block jailbreaking
  • Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
  • Developing policies on model welfare, AI-human relationships, and what instructions to give models
  • Training AIs to work as alignment researchers

And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.

Holden makes a case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.

Critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research they do. Holden thinks this seriously misunderstands the strategic situation we’re in — and explains his case in detail with host Rob Wiblin.

Chapters:

Cold open (00:00:00)

Holden is back! (00:02:26)

An AI Chernobyl we never notice (00:02:56)

Is rogue AI takeover easy or hard? (00:07:32)

The AGI race isn't a coordination failure (00:17:48)

What Holden now does at Anthropic (00:28:04)

The case for working at Anthropic (00:30:08)

Is Anthropic doing enough? (00:40:45)

Can we trust Anthropic, or any AI company? (00:43:40)

How can Anthropic compete while paying the “safety tax”? (00:49:14)

What, if anything, could prompt Anthropic to halt development of AGI? (00:56:11)

Holden's retrospective on responsible scaling policies (00:59:01)

Overrated work (01:14:27)

Concrete shovel-ready projects Holden is excited about (01:16:37)

Great things to do in technical AI safety (01:20:48)

Great things to do on AI welfare and AI relationships (01:28:18)

Great things to do in biosecurity and pandemic preparedness (01:35:11)

How to choose where to work (01:35:57)

Overrated AI risk: Cyberattacks (01:41:56)

Overrated AI risk: Persuasion (01:51:37)

Why AI R&D is the main thing to worry about (01:55:36)

The case that AI-enabled R&D wouldn't speed things up much (02:07:15)

AI-enabled human power grabs (02:11:10)

Main benefits of getting AGI right (02:23:07)

The world is handling AGI about as badly as possible (02:29:07)

Learning from targeting companies for public criticism in farm animal welfare (02:31:39)

Will Anthropic actually make any difference? (02:40:51)

“Misaligned” vs “misaligned and power-seeking” (02:55:12)

Success without dignity: how we could win despite being stupid (03:00:58)

Holden sees less dignity but has more hope (03:08:30)

Should we expect misaligned power-seeking by default? (03:15:58)

Will reinforcement learning make everything worse? (03:23:45)

Should we push for marginal improvements or big paradigm shifts? (03:28:58)

Should safety-focused people cluster or spread out? (03:31:35)

Is Anthropic vocal enough about strong regulation? (03:35:56)

Is Holden biased because of his financial stake in Anthropic? (03:39:26)

Have we learned clever governance structures don't work? (03:43:51)

Is Holden scared of AI bioweapons? (03:46:12)

Holden thinks AI companions are bad news (03:49:47)

Are AI companies too hawkish on China? (03:56:39)

The frontier of infosec: confidentiality vs integrity (04:00:51)

How often does AI work backfire? (04:03:38)

Is AI clearly more impactful to work in? (04:18:26)

What's the role of earning to give? (04:24:54)

This episode was recorded on July 25 and 28, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire

Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong

Music: CORBIT

Coordination, transcriptions, and web: Katy Moore
