Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence

Stanford Online · Jun 4, 2026

Why It Matters

Native multimodal models unlock AI systems that can understand and generate across text, image, audio, and video, expanding commercial use cases from virtual assistants to autonomous robotics.

Key Takeaways

  • Large language models excel at text but must be extended to handle other modalities.
  • Native multimodal models tokenise images, audio, video like text.
  • Two model families: multimodal‑input/text‑output and omni models outputting all modalities.
  • Chameleon uses discrete tokenisation via vector quantisation, enabling mixed text‑image generation.
  • Transfusion blends autoregressive text with diffusion‑based continuous image generation for higher quality.

Summary

The Stanford CS25 talk introduced native multimodal intelligence, highlighting how large language models (LLMs) have become ubiquitous but remain limited to symbolic token prediction. Victoria Lin explained that real‑world applications demand models that ingest and generate across visual, auditory, and video streams, prompting a shift toward "native" multimodal architectures.

The core idea is to treat every modality as a sequence of tokens (patchifying images, segmenting audio waveforms, and flattening video frames) so that a single transformer can apply the same autoregressive training used for text. Two broad families emerged: models that accept multimodal inputs but output only text (e.g., Gemini, Claude) and "omni" models that both consume and produce multiple modalities, exemplified by GPT‑4o.
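As a rough illustration of "patchifying", the sketch below cuts a small grayscale image into non-overlapping tiles and flattens each tile into one vector, so the image becomes a sequence that a transformer could consume. The function name `patchify`, the raster ordering, and the toy 4 x 4 image are assumptions made for this example; the talk summary does not specify patch size or ordering.

```python
from typing import List

Image = List[List[float]]  # H x W grayscale image, row-major


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Each tile is flattened to a vector of length patch * patch. Tiles are
    emitted in raster order (left to right, top to bottom), so the result
    can be treated as a token sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            tokens.append(tile)
    return tokens


# A 4 x 4 toy image whose pixel values are 0..15 in raster order.
toy = [[float(4 * r + c) for c in range(4)] for r in range(4)]
sequence = patchify(toy, patch=2)
print(len(sequence), sequence[0])  # 4 [0.0, 1.0, 4.0, 5.0]
```

In a real model each flattened patch would then be projected to the transformer's embedding width (continuous tokens) or mapped to a codebook index (discrete tokens).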

Lin detailed two research pathways. The Chameleon series discretizes image patches via a learned codebook, allowing interleaved text‑image generation at the cost of information loss and poor token efficiency. Transfusion addresses these limits by merging causal text modeling with diffusion‑based continuous image synthesis, using bidirectional attention for image tokens and achieving higher visual fidelity.
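The two mechanisms named in this paragraph can be sketched in a few lines. `quantize` shows codebook lookup in the spirit of vector quantisation: each patch vector is replaced by the index of its nearest codebook entry, which is where the information loss comes from. `mixed_attention_mask` shows one way to combine causal attention with bidirectional attention among tokens of the same image. Both function names, the toy codebook, and the `segments` encoding (`None` for text, an integer id per image) are assumptions for illustration, not the published implementations.

```python
from typing import List, Optional, Sequence


def quantize(patches: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each patch vector, the index of the nearest codebook
    entry under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(p, codebook[k]))
            for p in patches]


def mixed_attention_mask(segments: Sequence[Optional[int]]) -> List[List[bool]]:
    """Build mask[i][j] = True when position i may attend to position j.

    segments[i] is None for a text token, or an image id for an image token.
    Every position attends causally (j <= i); in addition, tokens that belong
    to the same image attend to each other in both directions.
    """
    n = len(segments)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i
            same_image = segments[i] is not None and segments[i] == segments[j]
            row.append(causal or same_image)
        mask.append(row)
    return mask


# Toy codebook with three 2-dimensional entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
patches = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.3]]
codes = quantize(patches, codebook)
print(codes)  # [0, 2, 1]

# Sequence layout: text, image 0, image 0, text.
mask = mixed_attention_mask([None, 0, 0, None])
print(mask[1][2], mask[0][1])  # True False
```

The first image token can see the second one (`mask[1][2]` is true) even though it comes later, while the leading text token still cannot look ahead.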

Scaling these designs in both data volume and parameter count mirrors LLM trends, suggesting that larger native multimodal models will deliver richer reasoning, planning, and real‑time interaction capabilities. Architectural techniques such as mixture‑of‑experts can further improve efficiency, positioning multimodal AI as a strategic frontier for enterprises seeking integrated perception and language services.
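A mixture‑of‑experts layer saves compute by running only a few experts per token. The sketch below is a minimal top‑k router under simplifying assumptions: the input is a single number, the router logits are passed in directly rather than computed by a learned gate, and the experts are plain Python functions. The name `moe_forward` and all values are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float,
                router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]],
                top_k: int = 1) -> float:
    """Route x to the top_k experts with the highest gate probability and
    return their outputs weighted by the renormalised gate probabilities.
    Experts outside the top_k are never evaluated."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(experts)),
                    key=lambda k: probs[k], reverse=True)[:top_k]
    norm = sum(probs[k] for k in chosen)
    return sum(probs[k] / norm * experts[k](x) for k in chosen)


# Three toy experts; the router strongly prefers the second one.
experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: -v]
logits = [0.0, 2.0, 0.0]
out = moe_forward(3.0, logits, experts, top_k=1)
print(out)  # 6.0
```

With `top_k=1` only the doubling expert runs, so total parameters can grow with the number of experts while per-token compute stays roughly constant.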

Original Description

For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education
May 21, 2026
This seminar covers:
• The evolution from language models to native multimodal systems
• Focus on the architectural and training principles that transfer from the LLM paradigm
• Challenges introduced by multimodal learning
• The building blocks of modern multimodal LLMs, including modality representations, autoregressive modeling, and reasoning capabilities inherited from strong language models
• Emerging directions in multimodal architecture design, including sparsity and modality specialization
Follow along with the seminar schedule. Visit: https://web.stanford.edu/class/cs25/
Guest Speaker: Victoria Lin (Thinking Machines)
Instructors:
• Steven Feng, Stanford Computer Science PhD student and NSERC PGS-D scholar
• Karan P. Singh, Electrical Engineering PhD student and NSF Graduate Research Fellow in the Stanford Translational AI Lab
• Michael C. Frank, Benjamin Scott Crocker Professor of Human Biology and Director of the Symbolic Systems Program
• Christopher Manning, Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science, Co-Founder and Senior Fellow of the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
