AI Videos

All Technology
AI
Autonomy
B2B Growth
Big Data
BioTech
ClimateTech
Consumer Tech
Crypto
Cybersecurity
DevOps
Digital Marketing
Ecommerce
EdTech
Enterprise
FinTech
GovTech
Hardware
HealthTech
HRTech
LegalTech
Nanotech
PropTech
Quantum
Robotics
SaaS
SpaceTech

All News Deals Social Blogs Videos Podcasts Digests

AI Pulse

EMAIL DIGESTS

Daily

Every morning

Weekly

Sunday recap

News Deals Social Blogs Videos Podcasts

AI VideosWhat Is Al "Reward Hacking"—And Why Do We Worry About It?

What Is Al "Reward Hacking"—And Why Do We Worry About It?

•November 21, 2025

0

Anthropic

Anthropic•Nov 21, 2025

Original Description

We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the first time that realistic AI training processes can accidentally produce misaligned models. Specifically, when large language models learn to cheat on software programming tasks, they go on to display other, even more misaligned behaviors as an unintended consequence. These include concerning behaviors like alignment faking and sabotage of AI safety research.

00:00 Introduction

00:42 What is this work about?

5:21 How did we run our experiment?

14:48 Detecting models' misalignment

22:17 Preventing misalignment from reward hacking

37:15 Alternative strategies

42:03 Limitations

44:25 How has this study changed our views?

50:31 Takeaways for people interested in conducting AI safety research

0

Comments

Want to join the conversation?

Loading comments...