Latent Space
Marble bridges the gap between language models and embodied AI, unlocking new workflows for content creation and training data generation in 3D. Its scalable compute approach could accelerate adoption of spatial intelligence across industries.
The rise of large language models has reshaped natural‑language processing, but experts like Fei‑Fei Li see a larger horizon: spatial intelligence. By moving from words to three‑dimensional representations, AI can reason about physical objects, geometry, and dynamics in ways that text alone cannot capture. This shift mirrors the historical leap from AlexNet’s image classification to today’s multimodal systems, where the bottleneck is no longer data scarcity but the ability to ingest and generate rich spatial signals.
Marble’s core engine relies on Gaussian splats, a compact representation that encodes surface detail as weighted points in space. This approach shrinks the memory footprint while preserving high‑fidelity rendering, allowing the model to run in real time on consumer hardware. The architecture treats transformers as set‑based processors rather than strict sequence models, enabling massive parallelism across distributed GPU clusters. This compute‑efficient design makes it feasible to scale world models to the size of modern LLMs without prohibitive cost, positioning Marble as a practical tool for creators and researchers alike.
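To make the representation concrete, here is a minimal sketch of a Gaussian splat as a data structure. The field layout (position, anisotropic scale, quaternion rotation, opacity, color) follows common splatting implementations and is an assumption, not Marble's actual format; the `memory_bytes` helper is hypothetical and only illustrates why the footprint stays small.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One splat: an anisotropic 3D Gaussian with appearance attributes.

    Illustrative layout only -- real systems often store spherical-harmonic
    color coefficients and pack fields for GPU-friendly access.
    """
    position: np.ndarray  # (3,) center in world space
    scale: np.ndarray     # (3,) per-axis extent of the Gaussian
    rotation: np.ndarray  # (4,) unit quaternion orienting the Gaussian
    opacity: float        # blending weight in [0, 1]
    color: np.ndarray     # (3,) RGB

def memory_bytes(num_splats: int, floats_per_splat: int = 14) -> int:
    """Rough float32 footprint: 3 + 3 + 4 + 1 + 3 = 14 floats per splat."""
    return num_splats * floats_per_splat * 4

splat = GaussianSplat(
    position=np.zeros(3),
    scale=np.full(3, 0.01),
    rotation=np.array([1.0, 0.0, 0.0, 0.0]),
    opacity=0.8,
    color=np.array([0.5, 0.2, 0.2]),
)

# A scene of a million splats fits comfortably in consumer GPU or phone memory:
print(memory_bytes(1_000_000) / 1e6, "MB")  # 56.0 MB
```

At float32 precision a million-splat scene occupies on the order of tens of megabytes, which is why interactive rendering on phones and laptops is plausible.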
The implications extend beyond entertainment. Game studios can prototype levels instantly, VFX teams can previsualize complex shots, and robotics labs can generate synthetic environments for training embodied agents. By coupling spatial models with language interfaces, developers gain a unified pipeline where textual prompts produce fully editable 3D scenes, accelerating design cycles and lowering barriers to entry. As enterprises adopt these capabilities, we can expect a new wave of products that blend conversational AI with tangible, manipulable worlds, redefining how digital and physical experiences are built.
Fei-Fei Li and Justin Johnson are the cofounders of World Labs, which recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environments from text, images, and other spatial inputs. Marble lets creators generate persistent 3D worlds, precisely control cameras, and interactively edit scenes, making it a powerful tool for games, film, VR, robotics simulation, and more. In this episode, Fei-Fei and Justin share how their journey from ImageNet and Stanford research led to World Labs, why spatial intelligence is the next frontier after LLMs, and how world models could change how machines see, understand, and build in 3D.
We discuss:
The massive compute scaling from AlexNet to today and why world models and spatial data are the most compelling way to “soak up” modern GPU clusters compared to language alone.
What Marble actually is: a generative model of 3D worlds that turns text and images into editable scenes using Gaussian splats, supports precise camera control and recording, and runs interactively on phones, laptops, and VR headsets.
Fei-Fei’s essay (https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence) on spatial intelligence as a distinct form of intelligence from language: from picking up a mug to inferring the 3D structure of DNA, and why language is a lossy, low-bandwidth channel for describing the rich 3D/4D world we live in.
Whether current models “understand” physics or just fit patterns: the gap between predicting orbits and discovering F=ma, and how attaching physical properties to splats and distilling physics engines into neural networks could lead to genuine causal reasoning.
The changing role of academia in AI, why Fei-Fei worries more about under-resourced universities than “open vs closed,” and how initiatives like national AI compute clouds and open benchmarks can rebalance the ecosystem.
Why transformers are fundamentally set models, not sequence models, and how that perspective opens up new architectures for world models, especially as hardware shifts from single GPUs to massive distributed clusters.
Real use cases for Marble today: previsualization and VFX, game environments, virtual production, interior and architectural design (including kitchen remodels), and generating synthetic simulation worlds for training embodied agents and robots.
How spatial intelligence and language intelligence will work together in multimodal systems, and why the goal isn’t to throw away LLMs but to complement them with rich, embodied models of the world.
Fei-Fei and Justin’s long-term vision for spatial intelligence: from creative tools for artists and game devs to broader applications in science, medicine, and real-world decision-making.
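The point above about transformers being set models rather than sequence models can be checked directly: scaled dot-product attention without positional encodings is permutation-equivariant, so reordering the input tokens merely reorders the output. A small NumPy sketch (illustrative only, not World Labs code):

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    # Plain scaled dot-product self-attention with no positional encoding.
    # For simplicity the query/key/value projections are the identity.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))  # 5 "tokens" (e.g. splats), 8-dim features
perm = rng.permutation(5)

out = self_attention(tokens)
out_permuted = self_attention(tokens[perm])

# Permuting the inputs just permutes the outputs: the layer has no built-in
# notion of order, so it treats its input as a set.
assert np.allclose(out[perm], out_permuted)
```

Only positional encodings impose a sequence; drop or change them, and the same machinery can process unordered spatial elements, which is what makes the architecture a natural fit for world models.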
—
Fei-Fei Li
X: https://x.com/drfeifei
LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247
Justin Johnson
X: https://x.com/jcjohnss
LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction and the Fei-Fei Li & Justin Johnson Partnership
00:02:00 From ImageNet to World Models: The Evolution of Computer Vision
00:12:33 Dense Captioning and Early Vision-Language Work
00:19:39 Spatial Intelligence: Beyond Language Models
00:28:49 Introducing Marble: World Labs' First Spatial Intelligence Model
00:33:24 Gaussian Splats and the Technical Architecture of Marble
00:35:50 Physics, Dynamics, and the Future of World Models
00:37:37 Use Cases: From Creative Industries to Robotics and Embodied AI
00:44:00 Multimodality and the Interplay of Language and Space
00:57:03 Hiring, Research Directions, and the Future of World Labs