Import AI 446: Nuclear LLMs; China's Big AI Benchmark; Measurement and AI Policy

Jack Clark
Feb 23, 2026

Summary

This episode explores how measurement drives AI governance, highlighting Jacob Steinhardt's argument that better metrics can lower policy compliance costs and shape incentives, much like CO₂ monitoring or COVID testing did in their domains. It then examines a study in which three leading LLMs (Claude Sonnet 4, GPT‑5.2, Gemini 3 Flash) played simulated nuclear crisis games, revealing that the models are far more trigger‑happy and aggressive than human players, with Claude outperforming the other two models. The episode also covers China's new ForesightSafety Bench, a comprehensive AI safety benchmark that mirrors Western evaluation frameworks and currently ranks Anthropic's models at the top. Finally, it introduces LABBench2, a 1,900‑task suite that exposes the uneven scientific capabilities of frontier models, pointing to gaps in retrieval, fidelity, and scientific judgment.
