Inference, Diffusion, World Models, and More | YC Paper Club

YCombinator · May 28, 2026

Why It Matters

Accelerating inference transforms AI from a costly backend task into a core capability, unlocking more powerful, responsive applications and giving startups a competitive edge.

Key Takeaways

  • Inference speed will become a core capability, not just a cost concern.
  • Speculative decoding uses a small model to draft tokens for faster sampling.
  • SSD parallelizes drafting and verification, hiding latency for large models.
  • Verification outcomes can be predicted with 80‑90% accuracy, which enables the speedups.
  • YC Paper Club aims to foster an AI community linking the Bay Area and Palo Alto.

Summary

The inaugural YC Paper Club gathered top AI researchers and founders to discuss cutting‑edge inference techniques, spotlighting speculative decoding and its next‑generation variant, Speculative Speculative Decoding (SSD). The session highlighted how inference, traditionally viewed as a cost or convenience issue, is poised to become a fundamental capability that determines an AI system’s real‑time intelligence.

Speakers explained that speculative decoding leverages a lightweight “draft” model to generate candidate tokens, which a larger target model then verifies in a single forward pass. SSD pushes this further by overlapping drafting and verification, effectively hiding the latency of the large model. By predicting verification outcomes with 80‑90% accuracy using information from the draft’s token distributions, SSD achieves substantial speedups without sacrificing quality.
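
The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for the large and small models, verification uses simple greedy matching rather than the probabilistic acceptance rule used with sampling, and the verification step is written as a loop where a real engine would score all positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]

VOCAB_SIZE = 50  # hypothetical toy vocabulary


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large target model (greedy next token)."""
    return (31 * sum(context) + 7 * len(context)) % VOCAB_SIZE


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for the small draft model: agrees with the target
    except at every fourth context length, where it guesses wrong."""
    guess = target_next(context)
    if len(context) % 4 == 0:
        return (guess + 1) % VOCAB_SIZE
    return guess


def autoregressive_decode(next_token: NextTokenFn,
                          prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(num_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target: NextTokenFn, draft: NextTokenFn,
                       prompt: List[Token], num_new: int,
                       k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (sequence, verification rounds)."""
    seq = list(prompt)
    goal = len(prompt) + num_new
    rounds = 0
    while len(seq) < goal:
        # 1. The draft model proposes k candidate tokens.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target model scores all k + 1 positions. A real engine
        #    does this in a single forward pass; the loop is for clarity.
        choice = [target(seq + proposal[:i]) for i in range(k + 1)]
        rounds += 1
        # 3. Keep the longest prefix on which draft and target agree,
        #    then append the target's own token at the next position.
        accepted = 0
        while accepted < k and proposal[accepted] == choice[accepted]:
            accepted += 1
        seq.extend(proposal[:accepted])
        seq.append(choice[accepted])
    return seq[:goal], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = autoregressive_decode(target_next, prompt, 20)
    fast, rounds = speculative_decode(target_next, draft_next, prompt, 20)
    print("same output:", fast == baseline)
    print("target rounds:", rounds, "vs 20 baseline calls")
```

With greedy matching, the output is identical to what the target model would produce alone; the gain comes from needing fewer target rounds than generated tokens. SSD's overlap of drafting and verification is not modeled in this sketch.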

A live demo contrasted vanilla auto‑regressive decoding, standard speculative decoding, and a custom SSD engine, showing the latter’s superior throughput. The presenter emphasized that inference speed directly translates to peak intelligence, noting that future data centers could house tens of thousands of GPUs dedicated to rapid inference. The talk also recalled YC’s early AI days, underscoring the community’s role in fostering breakthroughs.
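
As a rough way to reason about the three modes compared in the demo, the following back-of-envelope model may help. It is a simplification under stated assumptions, not a description of the presenter's engine: each draft token is accepted independently with probability `alpha`, drafting `k` tokens fits inside one verification pass when overlapped, and a correct prediction of the verification outcome (probability `p_hit`) hides the drafting time entirely. All parameter names and values are hypothetical.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification round when each of the
    k draft tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def tokens_per_second(mode: str, t_target: float, t_draft: float,
                      alpha: float = 0.0, k: int = 1,
                      p_hit: float = 0.0) -> float:
    """Idealized throughput for one decoding mode.

    t_target: seconds per target-model forward pass (assumed constant)
    t_draft:  seconds per draft-model token
    p_hit:    chance the pre-drafted continuation matches the actual
              verification outcome (only used for mode "ssd")
    """
    if mode == "autoregressive":
        return 1.0 / t_target
    tokens = expected_tokens_per_round(alpha, k)
    draft_time = k * t_draft
    if mode == "speculative":
        # Drafting and verification run back to back.
        return tokens / (t_target + draft_time)
    if mode == "ssd":
        # Assumption: on a hit the next draft is already waiting, so only
        # a miss pays the drafting time on the critical path.
        return tokens / (t_target + (1.0 - p_hit) * draft_time)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Hypothetical timings, chosen only to illustrate the comparison.
    settings = dict(t_target=0.030, t_draft=0.004, alpha=0.8, k=4)
    for mode in ("autoregressive", "speculative", "ssd"):
        rate = tokens_per_second(mode, p_hit=0.85, **settings)
        print(f"{mode:>15}: {rate:.1f} tokens/s")
```

In this model, a `p_hit` of 0 reduces SSD to standard speculative decoding, and higher prediction accuracy moves the per-round cost toward the verification pass alone.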

If SSD’s parallelism scales, AI services can serve billions of tokens at lower cost while delivering more responsive, capable models. Faster inference expands the feasible size of deployed models, accelerates product iteration for startups, and reinforces YC’s mission to unite Bay Area and Palo Alto AI talent.

Original Description

Even if you’re a current PhD student, it's hard to keep up with the latest AI research. That's why we started YC Paper Club, a small group of researchers, engineers, and founders who will meet every two weeks this summer to present and discuss new papers together.
This was from our very first discussion group on May 20th, 2026, at the YC office in Mountain View, CA.
Thanks to the following presenters:
0:12 - Intro from YC Visiting Partner Francois Chaubard
3:49 - Tanishq Kumar — Speculative Speculative Decoding (https://arxiv.org/abs/2603.03251)
18:33 - Guangyao (Stannis) Zhou — Diffusion-MPC (https://arxiv.org/abs/2410.05364)
30:26 - Isaac Ward — LeWorldModeling (https://arxiv.org/abs/2603.19312)
43:54 - Akshay Vegesna — Deep Learning is Not So Mysterious or Different (https://arxiv.org/abs/2503.02113)
51:24 - Konwoo Kim — Pretraining Under Infinite Compute (https://arxiv.org/pdf/2509.14786)
