The ARC benchmarks provide a rigorous, human‑grounded measure of an AI system's ability to learn new tasks, steering the industry away from superficial performance claims and toward genuine generalization, a prerequisite for safe and impactful AGI development.
The video captures a conversation at NeurIPS 2025 between Diana and Greg Kamradt, president of the ARC Prize Foundation, about the foundation's mission to advance AI systems that can generalize like humans. The discussion centers on the ARC benchmark suite, which defines intelligence not as raw performance on static tests but as the ability to learn new tasks efficiently, a framing introduced by François Chollet in his 2019 paper "On the Measure of Intelligence."
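For context on what "learning a new task from a few examples" means in practice, public ARC‑AGI‑1 tasks are distributed as JSON objects containing a handful of training input/output grid pairs plus test inputs, and a solver must infer the transformation from those few pairs alone. Below is a minimal sketch of checking a candidate rule against that format; the `flip_horizontal` rule and the file name are illustrative assumptions, not part of the benchmark itself:

```python
import json

def flip_horizontal(grid):
    # Illustrative candidate rule: mirror each row left-to-right.
    return [row[::-1] for row in grid]

def rule_fits_task(task, rule):
    # A rule "learns" the task only if it reproduces every
    # training output from its paired training input.
    return all(rule(pair["input"]) == pair["output"]
               for pair in task["train"])

# Public ARC tasks are JSON objects with "train" and "test" lists of
# {"input": grid, "output": grid} pairs, where each grid is a list of
# rows of small integers encoding colors. The file name is hypothetical.
with open("sample_task.json") as f:
    task = json.load(f)

if rule_fits_task(task, flip_horizontal):
    predictions = [flip_horizontal(pair["input"]) for pair in task["test"]]
```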
Key insights include the evolution of the benchmarks from the original static suite (ARC‑AGI‑1) to ARC‑AGI‑2, released in early 2025, and the upcoming interactive ARC‑AGI‑3. Kamradt notes that early large language models scored only 4–5% on the original benchmark, while scores jumped to 21% with the release of o1‑preview, highlighting the benchmark's sensitivity to advances in reasoning. Major labs including OpenAI, xAI, Google DeepMind (Gemini Deep Think), and Anthropic now report performance on ARC metrics, signaling industry adoption.
The interview provides concrete examples: ARC‑AGI‑3 will feature roughly 150 video‑game‑style environments with no textual instructions, requiring agents to infer goals through trial and error, mirroring real‑world interaction. Human participants from diverse backgrounds will set solvability thresholds, and AI performance will be normalized to the average number of actions a human needs, addressing concerns about the brute‑force approaches that dominated earlier Atari‑style benchmarks. Kamradt emphasizes that solving ARC‑AGI‑1 or ARC‑AGI‑2 is necessary but not sufficient for true AGI, and that even a perfect score on ARC‑AGI‑3 would represent the strongest evidence of generalization to date, not a declaration of AGI.
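The interview does not spell out an exact scoring formula, but the normalization it describes can be sketched as follows: credit an agent per environment in proportion to how much effort an average human needed, capped at 1 so that merely acting fast earns no extra credit. All function names and figures below are illustrative assumptions, not the foundation's actual method:

```python
def normalized_score(human_mean_actions, agent_actions, solved):
    # An unsolved environment earns nothing; a solved one earns the
    # ratio of human effort to agent effort, capped at 1.0 so that
    # brute-force action spam cannot inflate the score.
    if not solved:
        return 0.0
    return min(1.0, human_mean_actions / agent_actions)

# Hypothetical results over three environments:
# (mean actions humans needed, actions the agent took, solved?)
results = [(40, 55, True), (25, 400, True), (60, 0, False)]

suite_score = sum(normalized_score(h, a, ok) for h, a, ok in results) / len(results)
print(f"suite score: {suite_score:.2f}")  # rewards human-like efficiency
```

Under this kind of metric, the second environment above contributes almost nothing despite being solved, because the agent needed 400 actions where humans averaged 25, which is exactly the brute-force pattern the benchmark aims to discount.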
The implications are significant for both research and commercial AI development. By shifting focus from vanity metrics to measurable learning efficiency, including data and energy consumption, the ARC suite encourages models that adapt to novel problems rather than relying on bespoke training environments. This could reshape funding priorities, benchmark design, and regulatory scrutiny as stakeholders seek more reliable indicators of progress toward artificial general intelligence.
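One concrete form this shift can take is reporting cost alongside accuracy on leaderboards, so that a cheap model with modest accuracy can rank above an expensive one with slightly higher accuracy. A toy sketch of such an efficiency‑aware comparison; the model names and figures are invented for illustration:

```python
# Toy efficiency-aware comparison: accuracy alone hides how much
# compute each model burned to reach it. All figures are invented.
entries = [
    {"model": "model_a", "accuracy": 0.21, "usd_per_task": 0.90},
    {"model": "model_b", "accuracy": 0.18, "usd_per_task": 0.04},
]

for e in entries:
    # Accuracy per dollar: a crude proxy for learning efficiency.
    e["acc_per_usd"] = e["accuracy"] / e["usd_per_task"]

for e in sorted(entries, key=lambda e: e["acc_per_usd"], reverse=True):
    print(f'{e["model"]}: {e["accuracy"]:.0%} at '
          f'${e["usd_per_task"]:.2f}/task -> {e["acc_per_usd"]:.1f} acc/$')
```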