Jon Kleinberg - Formal Models of Language Generation

April 24, 2026

Berkeley EECS

Why It Matters

Formal models of language generation give an architecture‑independent way to reason about what LLM‑style generators can and cannot guarantee. According to the talk abstract, they also show that meaningful guarantees are possible even without probabilistic assumptions.

Key Takeaways

•Kleinberg reframes language modeling as a generation, not identification, problem.
•Gold's 1967 theorem shows that language identification in the limit from positive examples alone is impossible in general.
•Angluin’s characterization limits learnable families to those whose languages have finite telltales.
•The new model requires the algorithm to output unseen strings without feedback.
•Per the talk abstract, generation guarantees are achievable even without probabilistic assumptions.

Summary

Jon Kleinberg’s talk explores a theoretical foundation for large language models by shifting focus from probabilistic prediction to the core task of language generation. He argues that instead of asking what distribution a model should learn, researchers should define the abstract problem of producing valid, previously unseen strings from an unknown language.

The discussion revisits classic learning theory, highlighting E. Mark Gold’s 1967 result that identification in the limit from positive examples alone is impossible even for the class of regular languages, and Dana Angluin’s later characterization, under which a family of languages is learnable in this sense only if every language in it has a finite telltale subset. Kleinberg uses these negative results to motivate a new formulation: the algorithm must generate unseen strings without ever receiving negative examples or correctness feedback.
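
The telltale condition can be stated compactly. The following is an informal restatement of Angluin's criterion as it is usually given for an indexed family of languages; the notation ($L_i$ for the languages, $T_i$ for the telltales) is chosen here for illustration, and the additional requirement that the telltales be effectively enumerable is omitted for brevity.

```latex
\[
\forall i \;\; \exists\, T_i \subseteq L_i \ \text{finite} : \quad
\neg \exists\, j \;\; \bigl( T_i \subseteq L_j \subsetneq L_i \bigr)
\]
```

Intuitively, once every string of $T_i$ has appeared, no strictly smaller language in the family can still explain the data, so guessing $L_i$ is safe. A family containing every finite language plus one infinite language fails the condition, because any finite subset of the infinite language is itself a smaller language in the family.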

He illustrates the formulation with the six‑month‑old infant analogy—children receive only positive linguistic input and cannot query for errors, yet they eventually produce fluent speech. In the revised game, the adversary emits strings from a secret language, and the learner must output novel strings that belong to that language, winning once it consistently does so, despite never knowing when it has succeeded.
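
The game above can be played out on a toy example. The sketch below is a simplified illustration, not the algorithm presented in the talk: it assumes a hypothetical family in which language k is the set of positive multiples of k, and it approximates subset tests on a finite window (`BOUND`), which a real treatment over infinite languages could not do so directly. The learner considers only the first t languages at step t, keeps those consistent with the strings seen so far, and generates from the last one that is contained in every earlier consistent language. All names (`make_language`, `generate`, `play`) are invented for this example.

```python
from typing import Callable, List, Optional, Set

Language = Callable[[int], bool]

BOUND = 200  # assumption: finite window standing in for the infinite universe


def make_language(k: int) -> Language:
    """Hypothetical toy language: the positive multiples of k."""
    return lambda x: x % k == 0


def is_subset(a: Language, b: Language) -> bool:
    """Approximate a <= b by checking every element of the finite window."""
    return all(b(x) for x in range(1, BOUND + 1) if a(x))


def generate(family: List[Language], seen: Set[int], t: int) -> Optional[int]:
    """Output one unseen string from the chosen hypothesis at step t."""
    # Only the first t languages are considered at step t.
    consistent = [lang for lang in family[:t] if all(lang(s) for s in seen)]
    chosen = None
    for i, lang in enumerate(consistent):
        # Keep the last language contained in all earlier consistent ones.
        if all(is_subset(lang, earlier) for earlier in consistent[:i]):
            chosen = lang
    if chosen is None:
        return None
    for x in range(1, BOUND + 1):
        if chosen(x) and x not in seen:
            return x
    return None


def play(true_k: int, steps: int, family_size: int = 12) -> List[Optional[int]]:
    """Adversary enumerates multiples of true_k; learner answers each step."""
    family = [make_language(k) for k in range(1, family_size + 1)]
    seen: Set[int] = set()
    outputs = []
    for t in range(1, steps + 1):
        seen.add(true_k * t)  # positive example only, no feedback
        outputs.append(generate(family, set(seen), t))
    return outputs


if __name__ == "__main__":
    print(play(true_k=6, steps=10))
    # → [1, 2, 2, 2, 2, 42, 48, 54, 60, 66]
```

With the secret language set to the multiples of 6, the first five outputs are wrong, and from step 6 onward every output is an unseen multiple of 6. The learner is never told which outputs were correct, which mirrors the "never knowing when it has succeeded" aspect of the game.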

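According to the talk abstract, the surprising implication is a positive one: guarantees for language generation are achievable even without probabilistic assumptions, in sharp contrast to the impossibility results for identification. The question then shifts to breadth, that is, how richly an algorithm can generate from the underlying language while still meeting the specification. This reframing offers a cleaner, architecture‑agnostic lens for evaluating and designing future large language models, emphasizing generative competence over exact distribution matching.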

Original Description

Biography:

Jon Kleinberg is the Tisch University Professor in the Departments of Computer Science and Information Science at Cornell University. His research focuses on the interaction of algorithms and networks, the roles they play in large-scale social and information systems, and their broader societal implications. He is a member of the National Academy of Sciences, the National Academy of Engineering, the American Academy of Arts and Sciences, and the American Philosophical Society, and he has served on advisory groups including the National AI Advisory Committee (NAIAC) and the National Research Council’s Computer Science and Telecommunications Board (CSTB) and Committee on Science, Technology, and Law (CSTL). He has received MacArthur, Packard, Simons, Sloan, and Vannevar Bush research fellowships, as well as awards including the Nevanlinna Prize, the World Laureates Association Prize, the ACM/AAAI Allen Newell Award, and the ACM Prize in Computing.

Abstract:

The emergence of large language models has prompted a surge of interest in theoretical models that might give us insight into both their successes and their shortcomings. We’ll give an overview of recent work in this direction, focusing on a surprising line of positive results that shows it is possible to give guarantees for language-generation algorithms even in the absence of any probabilistic assumptions, in a framework known as “language generation in the limit”. These results suggest interesting notions of “breadth” in language generation, attempting to formalize the idea that different algorithms for this problem might all meet the specification but differ significantly in their expressiveness, in how “richly” they can generate from the underlying language. We also discuss strong contrasts with classical results on language identification, showing a strong sense in which language generation and language learning are fundamentally different as computational problems. The talk will be based on joint work with Sendhil Mullainathan and Fan Wei.

EECS Colloquium

Wednesday, April 22, 2026

Banatao Auditorium

4 - 5 p.m.
