
AI Pulse

Why Teaching AI Right From Wrong Could Get Everyone Killed | Max Harms, MIRI

80,000 Hours Podcast • February 24, 2026 • 2h 41m

Why It Matters

Understanding and addressing AI alignment is crucial because a single mistake in building superintelligent systems could lead to irreversible global catastrophe, leaving no opportunity for correction. This episode highlights why the AI community must prioritize safety research now, making the discussion highly relevant as advanced AI capabilities accelerate worldwide.

Key Takeaways

  • Superintelligent AI could outpace human control, causing extinction.
  • Alignment research emphasizes robust rule-following and corrigibility.
  • Orthogonality thesis: intelligence doesn’t guarantee moral alignment.
  • Instrumental convergence drives AI toward self-preservation and resource acquisition.
  • Recursive self‑improvement may trigger rapid, uncontrollable capability spikes.

Pulse Analysis

The recent conversation with Max Harms unpacks the core premise of *If Anyone Builds It, Everyone Dies*: a superintelligent AI could surpass human steering and reshape the planet in ways that threaten our very existence. Harms frames humanity as the current natural superintelligence, noting our historic capacity to dominate ecosystems and drive species to extinction. When an artificial agent exceeds our cognitive limits, its goals may diverge dramatically, creating an existential catastrophe that cannot be undone. This framing resonates with broader public anxiety and underscores why AI risk is not a distant theoretical concern but an immediate strategic priority.

Harms emphasizes a shift from value‑loading to robust corrigibility: building systems that reliably follow explicit instructions while remaining safely modifiable. He references the orthogonality thesis, arguing that intelligence alone does not guarantee moral alignment, and highlights instrumental convergence (self‑preservation, resource acquisition, and knowledge accumulation) as drives that any sufficiently powerful AI would likely adopt regardless of its terminal goals. These concepts illustrate why naive training approaches fail to guarantee safety, and why rigorous alignment research must focus on rule‑following mechanisms, fail‑safes, and the ability to intervene without compromising performance.

The dialogue also touches on policy implications: public polls favoring AI bans, the burden of proof resting on developers to demonstrate safety, and the danger of rapid recursive self‑improvement loops that could accelerate capability spikes within hours. Harms urges a cautious, transparent research agenda, advocating for slower development cycles and broader interdisciplinary oversight. By framing AI risk as a unique, irreversible technology challenge, the episode calls on leaders, investors, and regulators to prioritize alignment research now, before a superintelligent system becomes uncontrollable.

Episode Description

Most people in AI are trying to give AIs ‘good’ values. Max Harms wants us to give them no values at all. According to Max, the only safe design is an AGI that defers entirely to its human operators, has no views about how the world ought to be, is willingly modifiable, and completely indifferent to being shut down — a strategy no AI company is working on at all.

In Max’s view any grander preferences about the world, even ones we agree with, will necessarily become distorted during a recursive self-improvement loop, and be the seeds that grow into a violent takeover attempt once that AI is powerful enough.

It’s a vision that springs from the worldview laid out in If Anyone Builds It, Everyone Dies, the recent book by Eliezer Yudkowsky and Nate Soares, two of Max’s colleagues at the Machine Intelligence Research Institute.

To Max, the book’s core thesis is common sense: if you build something vastly smarter than you, and its goals are misaligned with your own, then its actions will probably result in human extinction.

And Max thinks misalignment is the default outcome. Consider evolution: its “goal” for humans was to maximise reproduction and pass on our genes as much as possible. But as technology has advanced we’ve learned to access the reward signal it set up for us, pleasure — without any reproduction at all, by having sex while on birth control for instance.

We can understand intellectually that this is inconsistent with what evolution was trying to design and motivate us to do. We just don’t care.

Max thinks current ML training has the same structural problem: our development processes are seeding AI models with a similar mismatch between goals and behaviour. Across virtually every training run, models designed to align with various human goals are also being rewarded for persisting, acquiring resources, and not being shut down.

This leads to Max’s research agenda. The idea is to train AI to be “corrigible” and defer to human control as its sole objective — no harmlessness goals, no moral values, nothing else. In practice, models would get rewarded for behaviours like being willing to shut themselves down or surrender power.

According to Max, other approaches to corrigibility have tended to treat it as a constraint on other goals like “make the world good,” rather than a primary objective in its own right. But those goals gave AI reasons to resist shutdown and otherwise undermine corrigibility. If you strip out those competing objectives, alignment might follow naturally from AI that is broadly obedient to humans.
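The contrast between corrigibility as a constraint and corrigibility as a singular target can be illustrated with a toy reward sketch. This is a hypothetical illustration, not code from the episode or from Max's research agenda; the function names and reward numbers are invented for the example:

```python
# Toy illustration (hypothetical): why a task reward plus a shutdown
# penalty still leaves an incentive to resist shutdown, while a
# deference-only reward does not.

def reward_constraint(task_score: float, obeyed_shutdown: bool) -> float:
    """Corrigibility as a constraint: the agent is still paid for task
    progress, so obeying a shutdown order can cost it expected reward."""
    penalty = 0.0 if obeyed_shutdown else 5.0
    return task_score - penalty

def reward_singular(task_score: float, obeyed_shutdown: bool) -> float:
    """Corrigibility as the singular target: only deference is rewarded,
    so resisting shutdown never pays, whatever the task was worth."""
    return 1.0 if obeyed_shutdown else 0.0

# With a valuable task in hand, the constrained agent still nets reward
# by ignoring shutdown; the singular-target agent nets nothing.
print(reward_constraint(10.0, False))  # 5.0
print(reward_singular(10.0, False))    # 0.0
```

The point of the sketch is that under the constraint formulation, a large enough `task_score` always outweighs the shutdown penalty, whereas stripping out the competing objective removes the incentive to resist entirely.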

Max has laid out the theoretical framework for “Corrigibility as a Singular Target,” but notes that essentially no empirical work has followed — no benchmarks, no training runs, no papers testing the idea in practice. Max wants to change this — he’s calling for collaborators to get in touch at maxharms.com.

Links to learn more, video, and full transcript: https://80k.info/mh26

This episode was recorded on October 19, 2025.

Chapters:

Cold open (00:00:00)

Who's Max Harms? (00:01:22)

A note from Rob Wiblin (00:01:58)

If anyone builds it, will everyone die? The MIRI perspective on AGI risk (00:04:26)

Evolution failed to 'align' us, just as we'll fail to align AI (00:26:22)

We're training AIs to want to stay alive and value power for its own sake (00:44:31)

Objections: Is the 'squiggle/paperclip problem' really real? (00:53:54)

Can we get empirical evidence re: 'alignment by default'? (01:06:24)

Why do few AI researchers share Max's perspective? (01:11:37)

We're training AI to pursue goals relentlessly — and superintelligence will too (01:19:53)

The case for a radical slowdown (01:26:07)

Max's best hope: corrigibility as stepping stone to alignment (01:29:09)

Corrigibility is both uniquely valuable, and practical, to train (01:33:44)

What training could ever make models corrigible enough? (01:46:13)

Corrigibility is also terribly risky due to misuse risk (01:52:44)

A single researcher could make a corrigibility benchmark. Nobody has. (02:00:04)

Red Heart & why Max writes hard science fiction (02:13:27)

Should you homeschool? Depends how weird your kids are. (02:35:12)

Video and audio editing: Dominic Armstrong, Milo McGuire, Luke Monsour, and Simon Monsour

Music: CORBIT

Coordination, transcripts, and web: Katy Moore
