Maybe I Was Too Harsh on Deep Learning Theory (Three Days Ago)

LessWrong, Apr 30, 2026

Key Takeaways

  • NTK captures linearized dynamics but lacks feature learning
  • Mean‑Field scaling lets parameters move significantly, enabling feature evolution
  • Tensor Programs unify NTK, NNGP, and MFT under a single framework
  • μP parameterization allows hyperparameter transfer across network widths
  • Signal‑propagation MFT predates and informs modern Tensor Program results

Pulse Analysis

The resurgence of interest in infinite‑width limits reflects a broader shift toward rigorously grounding deep‑learning practice in mathematics. Early work on the Neural Tangent Kernel (NTK) showed that when the output of a width‑N network is scaled by 1/√N, training in the infinite‑width limit behaves like kernel regression with a fixed kernel. This offers elegant convergence proofs but fails to explain the empirical superiority of finite‑width models. By contrast, Mean‑Field Theory (MFT) adopts a 1/N scaling, which allows parameters to shift appreciably during training, so the kernel evolves and the network learns representations. This distinction clarifies why NTK‑based predictions often underestimate real‑world performance and underscores the need for theories that capture feature learning.
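The contrast between the two scalings can be checked numerically. The following is a minimal sketch, not taken from the original post: it assumes a two‑layer ReLU network, a toy regression dataset, full‑batch gradient descent, and a learning rate multiplied by N in the mean‑field case so that the function moves at a comparable speed in both regimes. The function name `relative_weight_movement` and all hyperparameters are illustrative choices.

```python
import numpy as np


def relative_weight_movement(width, scaling, steps=200, base_lr=0.2, seed=0):
    """Train a two-layer ReLU net by full-batch gradient descent and return
    ||W - W0||_F / ||W0||_F for the first-layer weights.

    scaling="ntk": f(x) = N**-0.5 * sum_i a_i relu(w_i . x), lr = base_lr
    scaling="mf":  f(x) = N**-1   * sum_i a_i relu(w_i . x), lr = base_lr * N
    (The lr scaling in the mean-field case is an assumption of this sketch.)
    """
    rng = np.random.default_rng(seed)
    n, d = 20, 5                                   # toy dataset size (assumed)
    X = rng.standard_normal((n, d)) / np.sqrt(d)   # inputs with |x|^2 close to 1
    y = np.sin(3.0 * X[:, 0])                      # arbitrary smooth target

    if scaling == "ntk":
        scale, lr = width ** -0.5, base_lr
    elif scaling == "mf":
        scale, lr = 1.0 / width, base_lr * width
    else:
        raise ValueError("scaling must be 'ntk' or 'mf'")

    W = rng.standard_normal((width, d))            # first-layer weights
    a = rng.standard_normal(width)                 # output weights
    W0 = W.copy()

    for _ in range(steps):
        pre = X @ W.T                              # (n, width) pre-activations
        h = np.maximum(pre, 0.0)                   # ReLU features
        err = scale * (h @ a) - y                  # residual of 0.5 * mean squared error
        grad_a = scale * (h.T @ err) / n
        grad_W = scale * ((err[:, None] * (pre > 0) * a[None, :]).T @ X) / n
        a -= lr * grad_a
        W -= lr * grad_W

    return float(np.linalg.norm(W - W0) / np.linalg.norm(W0))


if __name__ == "__main__":
    for width in (64, 1024, 4096):
        ntk = relative_weight_movement(width, "ntk")
        mf = relative_weight_movement(width, "mf")
        print(f"N={width:5d}  NTK movement={ntk:.4f}  mean-field movement={mf:.4f}")
```

In this setup the relative movement of the first‑layer weights should shrink with width under the 1/√N scaling, while it should stay of roughly the same order under the 1/N scaling. That is the sense in which only the latter leaves room for the features to change.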

Greg Yang’s Tensor Programs framework builds on both strands, providing a unifying abc‑parameterization that nests NTK, NNGP, and Mean‑Field as special cases. The breakthrough μP (maximal‑update parameterization) emerged from this synthesis, demonstrating that hyperparameters such as learning rate can be transferred across widths without retuning, a practical advantage for scaling models. Moreover, Tensor Programs leverage tools from free probability and random matrix theory to formalize signal propagation at initialization, linking back to earlier Google Brain studies on chaos and information flow. This lineage illustrates a coherent intellectual tradition rather than isolated breakthroughs.
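For readers unfamiliar with the term, the abc‑parameterization can be written out as a short sketch. The notation here is an assumption rather than something stated in the summary: n is the width, l indexes the layers of an MLP with L hidden layers, and lowercase w denotes the trainable parameters. The exponents follow the usual presentation in the Tensor Programs papers and should be checked against the originals.

```latex
% abc-parameterization (sketch): each layer has a multiplier exponent a_l,
% an initialization exponent b_l, and the learning rate has exponent c.
W^{l} = n^{-a_l}\, w^{l}, \qquad
w^{l}_{\alpha\beta} \sim \mathcal{N}\!\left(0,\; n^{-2 b_l}\right), \qquad
\text{learning rate} = \eta\, n^{-c}.

% NTK parameterization (as usually tabulated):
a_1 = 0, \quad a_l = \tfrac{1}{2} \ (l \ge 2), \quad b_l = 0, \quad c = 0.

% Maximal-update parameterization, \mu P (as usually tabulated):
a_1 = -\tfrac{1}{2}, \quad a_l = 0 \ (2 \le l \le L), \quad
a_{L+1} = \tfrac{1}{2}, \quad b_l = \tfrac{1}{2}, \quad c = 0.
```

Read this way, different choices of the exponents pick out different infinite‑width limits, which is how one framework can contain both the kernel regime and the feature‑learning regime.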

Despite these advances, the community still lacks a comprehensive theory that explains why stochastic gradient descent on highly over‑parameterized networks generalizes so well, or why specific architectural motifs (e.g., residual connections) consistently outperform others. Papers like Zhang et al. (2016) and Nagarajan and Kolter (2019) continue to challenge uniform‑convergence approaches, reminding researchers that many open questions remain. Nonetheless, the cumulative progress of MFT, signal‑propagation analyses, and Tensor Programs offers a tangible roadmap for future research, making them essential study areas for the next generation of AI scientists.
