Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence
Why It Matters
Native multimodal models unlock AI systems that can understand and generate across text, image, audio, and video, expanding commercial use cases from virtual assistants to autonomous robotics.
Key Takeaways
- Large language models excel at text but must be extended to other modalities.
- Native multimodal models tokenise images, audio, and video the same way as text.
- Two model families: multimodal‑input/text‑output models and omni models that output all modalities.
- Chameleon uses discrete tokenisation via vector quantisation, enabling mixed text‑image generation.
- Transfusion blends autoregressive text modeling with diffusion‑based continuous image generation for higher quality.
Summary
The Stanford CS25 talk introduced native multimodal intelligence, highlighting how large language models (LLMs) have become ubiquitous but remain limited to symbolic token prediction. Victoria Lynn explained that real‑world applications demand models that ingest and generate across visual, auditory, and video streams, prompting a shift toward "native" multimodal architectures.
The core idea is to treat every modality as a sequence of tokens (patchifying images, segmenting audio waveforms, and flattening video frames) so that a single transformer can apply the same autoregressive training used for text. Two broad families emerged: models that accept multimodal inputs but output only text (e.g., Gemini, Claude) and "omni" models that both consume and produce multiple modalities, exemplified by GPT‑4o.
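To make "patchifying" concrete, here is a minimal sketch of turning a small image grid into a raster-ordered token sequence. The function name `patchify`, the 4x4 toy image, and the 2x2 patch size are illustrative assumptions, not details from the talk; production models typically do this with a tensor library and then project each patch to an embedding.

```python
def patchify(image, patch):
    """Split an H x W grid (a list of rows) into flattened patch tokens.

    Patches are emitted left to right, top to bottom, so the 2-D image
    becomes a 1-D sequence that a transformer can consume like text.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch row by row into a single vector.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# Toy 4x4 "image" whose pixel values are 0..15 in reading order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
print(len(patches), patches[0])
```

Each of the four resulting vectors plays the role of one "image token"; the sequence length grows with image area, which is one reason token efficiency matters for these models.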
Lynn detailed two research pathways. The Chameleon series discretizes image patches via a learned codebook, allowing interleaved text‑image generation but suffering information loss and token‑efficiency issues. Transfusion addresses these limits by merging causal text modeling with diffusion‑based continuous image synthesis, using bidirectional attention for image tokens and achieving higher visual fidelity.
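The two pathways can be illustrated with a short sketch. The first function shows the nearest-codebook lookup behind vector quantisation; the second builds an attention mask that is causal everywhere but bidirectional inside each image span. All names (`quantize`, `mixed_attention_mask`), the tiny codebook, and the convention that a contiguous run of "image" positions is one image are assumptions made for illustration, not the Chameleon or Transfusion implementations.

```python
def quantize(patch_vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in patch_vectors]


def mixed_attention_mask(modalities):
    """Return mask[i][j] == True when position i may attend to position j.

    Text positions attend causally (to themselves and earlier positions).
    Positions in the same contiguous image span also see each other in
    both directions. Assumption: one contiguous run equals one image.
    """
    span, current, previous = [], -1, None
    for modality in modalities:
        if modality == "image" and previous != "image":
            current += 1  # a new image span starts here
        span.append(current if modality == "image" else None)
        previous = modality
    n = len(modalities)
    return [[j <= i or (span[i] is not None and span[i] == span[j])
             for j in range(n)]
            for i in range(n)]


codebook = [[0.0, 0.0], [1.0, 1.0]]
vectors = [[0.1, -0.2], [0.9, 0.8]]
ids = quantize(vectors, codebook)
# Decoding returns codebook entries, not the original vectors.
reconstructed = [codebook[k] for k in ids]

mask = mixed_attention_mask(["text", "image", "image", "text"])
```

The reconstruction differs from the input vectors, which is the information loss attributed to discrete tokenisation. In the mask, the first image position can attend to the second even though it comes later, while text positions never look ahead.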
Scaling these designs—both data volume and parameter count—mirrors LLM trends, suggesting that larger native multimodal models will deliver richer reasoning, planning, and real‑time interaction capabilities. Architectural tricks such as mixture‑of‑experts can further improve efficiency, positioning multimodal AI as a strategic frontier for enterprises seeking integrated perception and language services.
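As a rough illustration of why mixture‑of‑experts helps efficiency, the sketch below routes a token to only the top-k experts and mixes their outputs. The functions `top_k_route` and `moe_layer`, the scalar "experts", and the choice of k = 2 are hypothetical and chosen for brevity; the talk does not specify a routing scheme.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in chosen)  # subtract for stability
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]


def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and return their weighted sum."""
    routes = top_k_route(router_logits, k)
    return sum(weight * experts[e](token) for e, weight in routes)


# Four toy experts; each simply scales its scalar input.
experts = [lambda x: 1 * x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_layer(1.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

Only two of the four experts run for this token, so compute per token stays roughly constant while total parameter count grows with the number of experts.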