Building AlphaGo From Scratch – Eric Jang

Dwarkesh Patel
May 15, 2026

Why It Matters

By lowering the cost and complexity of building top‑tier Go AI, the approach democratizes advanced reinforcement‑learning research, enabling broader experimentation on problems once considered intractable.

Key Takeaways

  • AlphaGo combines deep neural nets with Monte‑Carlo tree search.
  • Open‑source KataGo cut training compute forty‑fold.
  • LLM‑generated code now replicates DeepMind’s effort for a few thousand dollars.
  • Go’s massive game tree requires clever node merging and exploration bonuses.
  • Understanding AlphaGo reveals how AI can tackle previously intractable problems.

Summary

Eric Jang, former DeepMind robotics researcher, walks through rebuilding AlphaGo from scratch, showing how the classic Go‑AI combines deep neural networks with Monte‑Carlo tree search (MCTS) to make an otherwise intractable game tractable.

He highlights key technical breakthroughs: neural nets provide policy and value estimates, while PUCT‑enhanced MCTS directs exploration. Open‑source KataGo demonstrated a 40× compute reduction, and modern large‑language‑model code generation now lets a small team replicate DeepMind’s original effort for just a few thousand dollars of cloud compute.
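To make the interplay between the network's estimates and the search concrete, here is a minimal sketch of PUCT child selection in the AlphaGo style. The `Node` class, the `c_puct` default of 1.5, and the convention that a child's stored value is already expressed from the parent's point of view are all assumptions made for illustration, not details taken from the episode.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical search-tree node; field names are illustrative."""
    prior: float                 # P(s, a): policy-head probability for the move into this node
    visit_count: int = 0         # N(s, a): how often search has passed through this node
    value_sum: float = 0.0       # sum of backed-up values (assumed: parent's perspective)
    children: dict = field(default_factory=dict)  # move -> Node

    def q(self) -> float:
        # Mean backed-up value; unvisited nodes default to 0.0 (an assumption).
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def puct_score(parent: Node, child: Node, c_puct: float = 1.5) -> float:
    # Exploration bonus: large for high-prior, rarely visited moves,
    # and shrinking as the child's visit count grows.
    bonus = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.q() + bonus


def select_child(parent: Node, c_puct: float = 1.5):
    # Return the (move, child) pair with the highest PUCT score.
    return max(parent.children.items(),
               key=lambda item: puct_score(parent, item[1], c_puct))


root = Node(prior=1.0, visit_count=10)
root.children["a"] = Node(prior=0.6, visit_count=8, value_sum=4.0)
root.children["b"] = Node(prior=0.4)
move, _ = select_child(root)
print(move)
```

In this toy tree the unvisited move wins the first selection because its bonus is undiluted; once it has been visited and has returned poor values, the well-explored move with the higher Q takes over again.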

Jang illustrates Go fundamentals, Tromp‑Taylor scoring, and the importance of node merging and exploration bonuses in the search tree. He notes that because Go's transitions are deterministic, each child node corresponds to exactly one move, so actions can be inferred from child nodes, and that PUCT balances exploitation (Q‑values) with exploration (a prior‑weighted bonus that shrinks as a move's visit count grows).
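Tromp‑Taylor scoring is simple enough to sketch in a few lines: a player's score is the number of their stones plus the number of empty points that reach only their color. The board encoding below (a list of strings using `.`, `X`, and `O`) and the function name are assumptions chosen for illustration, and komi is left out.

```python
from collections import deque

EMPTY, BLACK, WHITE = ".", "X", "O"


def tromp_taylor_score(board):
    """Area score for a board given as a list of equal-length strings."""
    rows, cols = len(board), len(board[0])
    score = {BLACK: 0, WHITE: 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                score[cell] += 1          # every stone counts as one point
            elif (r, c) not in seen:
                # Flood-fill this empty region and record which colors it touches.
                region_size, borders = 0, set()
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    region_size += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols:
                            neighbor = board[ny][nx]
                            if neighbor == EMPTY:
                                if (ny, nx) not in seen:
                                    seen.add((ny, nx))
                                    queue.append((ny, nx))
                            else:
                                borders.add(neighbor)
                # A region counts only if it reaches exactly one color.
                if len(borders) == 1:
                    score[borders.pop()] += region_size
    return score


print(tromp_taylor_score([".X.O.", ".X.O.", ".X.O."]))
```

Because the rule needs no judgment about dead stones, it can be evaluated mechanically at the end of any self‑play game, which is part of why it suits automated training.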

The broader implication is that sophisticated AI systems once requiring massive resources are becoming accessible to independent researchers and startups, accelerating innovation across domains that were previously deemed computationally infeasible.

Original Description

Eric Jang walks through how to build AlphaGo from scratch, but with modern AI tools.
Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.
Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.
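The contrast between the two training signals can be sketched as follows. This is a simplified illustration, not code from the episode: the function names are hypothetical, the MCTS target is taken to be the normalized root visit counts (as in AlphaGo Zero style training), and the policy gradient side is the plain REINFORCE surrogate with a single end-of-trajectory outcome and no baseline.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mcts_policy_target(visit_counts, temperature=1.0):
    # Per-move target: normalized (optionally tempered) root visit counts.
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    if total == 0:
        raise ValueError("search produced no visits")
    return [p / total for p in powered]


def cross_entropy(target, logits):
    # Dense supervised loss for ONE move: pulls the policy toward the search result.
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def reinforce_loss(logprobs_of_taken_actions, outcome):
    # One scalar outcome multiplies every step's log-probability, so the
    # gradient cannot tell which individual action deserved the credit.
    return -outcome * sum(logprobs_of_taken_actions)


target = mcts_policy_target([30, 10, 0])
print(target, cross_entropy(target, [0.0, 0.0, 0.0]))
print(reinforce_loss([-1.0, -2.0], outcome=1.0))
```

With the search target, every position in a game yields its own distribution to imitate; with the REINFORCE surrogate, all steps in a trajectory share one number.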
Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.
EPISODE LINKS
- Check out the flashcards I wrote to retain the insights: https://flashcards.dwarkesh.com/eric-jang/
SPONSORS
- Cursor's agent SDK let me build a pipeline to generate flashcards for this episode. For each card, I had an agent read the transcript, ingest blackboard screenshots, generate an SVG visual, and run everything through a critic. A durable agent is much better at this kind of work than a chain of LLM calls, and Cursor's SDK made it easy. Check out the cards at https://flashcards.dwarkesh.com and get started with the SDK at https://cursor.com/dwarkesh
- Jane Street gave me a real deep-dive tour of one of their datacenters. I got to ask a bunch of questions to Ron Minsky, who co-leads Jane Street's tech group, and Dan Pontecorvo, who runs Jane Street's physical engineering team. They were willing to literally pull up the floorboards and take out racks to explain how everything works. Check out the full tour at https://janestreet.com/dwarkesh
To sponsor a future episode, visit https://dwarkesh.com/advertise.
TIMESTAMPS
00:00:00 – Basics of Go
00:08:06 – Monte Carlo Tree Search
00:31:53 – What the neural network does
01:00:22 – Self-play
01:25:27 – Alternative RL approaches
01:45:36 – Why doesn’t MCTS work for LLMs
02:00:58 – Off-policy training
02:11:51 – RL is even more information inefficient than you thought
02:22:05 – Automated AI researchers
