Allen School Colloquium: Test-Time Training

UW CSE (Allen School)
Mar 9, 2026

Why It Matters

By letting models keep learning during inference, test-time training can reduce latency on long-context tasks, which could change how large language models are deployed in real-world applications.

Key Takeaways

  • Test-time training lets models adapt effectively during inference.
  • Sliding-window attention reduces latency but loses long-range information.
  • Backpropagation at test time compresses distant context into model weights.
  • The method runs faster than full-context transformers beyond roughly 32k tokens, at comparable loss.
  • Meta-learning trains models to excel at take-home test scenarios.

Summary

The colloquium introduced test-time training, a paradigm in which models continue to learn while deployed. Yu Sun, a postdoctoral researcher at Stanford and a researcher at NVIDIA, traced the idea back to his 2019 PhD work and compared it to a "take-home test": instead of answering with fixed weights, the model updates itself using the data available at test time.

Traditional machine-learning pipelines consist of pre-training, fine-tuning, and a static testing phase. The static phase becomes a bottleneck for long-context inputs such as legal documents or code bases. Sliding-window attention keeps per-token latency constant but discards information outside the window, which raises loss. Sun's solution runs a backward pass on each new token, using the same next-token loss as pre-training, so that out-of-window context is compressed into the model's weights.
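The update loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the speaker's architecture: the "model" is a hypothetical table of next-token logits conditioned only on the previous token (a context window of one), and the names `ttt_stream`, `W`, and `lr` are made up for this example. The point it shows is that anything older than the window can influence predictions only through the gradient steps taken at test time.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def ttt_stream(tokens, vocab_size, lr=0.5):
    """Score a token stream one step at a time, updating weights as we go.

    W[prev][nxt] holds the logit for `nxt` following `prev`. The model only
    "sees" the previous token, so older context survives solely in W.
    Setting lr=0.0 gives the static (frozen) baseline.
    """
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = softmax(W[prev])
        # Next-token cross-entropy loss, measured before the update.
        losses.append(-math.log(probs[nxt]))
        # Backward pass: d(loss)/d(logit_j) = probs[j] - onehot(nxt)[j].
        for j in range(vocab_size):
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            W[prev][j] -= lr * grad
    return W, losses


if __name__ == "__main__":
    stream = [0, 1, 2] * 20  # a repeating pattern the model can pick up
    _, adaptive = ttt_stream(stream, vocab_size=3)
    _, frozen = ttt_stream(stream, vocab_size=3, lr=0.0)
    print("mean loss with test-time updates:", sum(adaptive) / len(adaptive))
    print("mean loss with frozen weights:   ", sum(frozen) / len(frozen))
```

In this sketch the per-token cost stays constant because each step touches one row of `W`, which mirrors the constant-latency property of a fixed window; a real system would apply the same idea to a neural network's weights rather than a lookup table.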

He illustrated the concept with anecdotes, from Chinese exam culture to Andrew Wiles's years-long work on a single proof, highlighting how learning on the job can be more powerful than rote preparation. Reported results show that beyond roughly 32k tokens, the test-time training approach becomes faster than a full-context transformer, with a 2.7× speed-up in prefill and up to a sixfold speed-up during decoding, while maintaining comparable loss.

If adopted broadly, test‑time training could shift the frontier of AI from static inference toward continual adaptation, reducing latency for long‑context tasks and opening new avenues for meta‑learning and continual learning research.

Original Description

Title: Test-Time Training
Speaker: Yu Sun (Stanford)
Date: Thursday, March 5, 2026
Abstract: Most AI models are trained only before the test instances arrive and then fixed during deployment, even though making good predictions on test instances is the ultimate goal of training. What if we continue to train a model after each test instance arrives? In this talk, we discuss how this conceptual framework, known as test-time training, leads to long-term memory that scales differently with context length, and enables AI to discover new results on open scientific problems.
Bio: Yu Sun is a postdoc at Stanford University and a researcher at NVIDIA. His research focuses on continual learning, specifically a conceptual framework known as test-time training, where each test instance defines its own learning problem. Yu obtained his PhD in EECS from UC Berkeley and BS in CS from Cornell University.
