IROS 2025 Keynotes - AI and Robot Learning: Xifeng Yan

IEEE Robotics & Automation Society
Feb 18, 2026

Why It Matters

Token‑level layer skipping dramatically cuts inference cost, enabling real‑time, low‑power deployment of large transformer models in robotics and other latency‑sensitive applications.

Key Takeaways

  • Transformers can skip layers per token to reduce computation.
  • Adaptive inference yields up to 25% layer reduction without accuracy loss.
  • Direct multiple token decoding doubles speed with minimal performance drop.
  • Easy tokens need fewer layers; difficult tokens require full depth.
  • Open‑source code and data enable broader research on efficient transformers.

Summary

In this keynote, Xifeng Yan from UC Santa Barbara introduced a token‑level adaptive inference framework for transformer models, arguing that charging every token the same computational cost is inefficient for many robotics and language tasks. By inserting a lightweight router before each attention block, the system decides dynamically whether to execute the full layer or bypass it with a small adapter, balancing a layer‑usage loss against the standard language modeling loss.
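The router‑and‑adapter idea can be sketched in a few lines. The gate shape, adapter form (a low‑rank update), and the sigmoid threshold below are illustrative assumptions for exposition, not the speaker's actual architecture; during training, one would add a layer‑usage penalty (e.g. the mean router probability, scaled by a coefficient λ) to the language modeling loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy hidden size

def full_layer(h):
    # stand-in for a full attention + FFN block
    W = rng.standard_normal((D, D)) * 0.05
    return h + np.tanh(h @ W)

def adapter(h):
    # lightweight bypass: a small low-rank residual update (assumed form)
    A = rng.standard_normal((D, 4)) * 0.1
    B = rng.standard_normal((4, D)) * 0.1
    return h + (h @ A) @ B

def router(h, w_r, threshold=0.5):
    # tiny linear gate per token: probability the token needs the full layer
    p = 1.0 / (1.0 + np.exp(-(h @ w_r)))
    return p, p > threshold

def adaptive_layer(h, w_r):
    # easy tokens take the cheap adapter path; hard tokens run the full layer
    p, use_full = router(h, w_r)
    out = np.where(use_full[:, None], full_layer(h), adapter(h))
    # training objective (not computed here): L_LM + lambda * p.mean()
    return out, use_full

h = rng.standard_normal((8, D))      # 8 token hidden states
w_r = rng.standard_normal(D)         # router weights
out, mask = adaptive_layer(h, w_r)
print("tokens using full layer:", int(mask.sum()), "of", len(mask))
```

Stacking such layers lets each token traverse a different effective depth, which is exactly what produces the per‑token layer savings reported in the talk.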

Experimental results across question answering, summarization, and arithmetic tasks demonstrate that easy tokens—such as copied words or simple additions—consume far fewer layers, while complex tokens—like new numbers or novel sentences—require the full stack. The approach achieved up to a 25% reduction in layer usage without sacrificing accuracy, and even improved performance in some cases by discarding noisy layers. A second technique, direct multiple token decoding, re‑purposes idle layers to generate subsequent tokens, delivering up to a 2× speedup with only a 2‑4% drop in quality.
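The claimed 2× speedup follows directly from the pass count: if idle layers are repurposed so that each forward pass emits k tokens instead of one, the number of passes shrinks by roughly a factor of k. This toy cost model illustrates the arithmetic only; `decode_cost` is a hypothetical helper, not the authors' implementation.

```python
def decode_cost(num_tokens, num_layers, k=1):
    """Toy cost model for MTD-k decoding.

    MTD-k emits k tokens per forward pass, so the number of passes
    is ceil(num_tokens / k); cost is counted in layer executions.
    """
    passes = -(-num_tokens // k)  # ceiling division
    return passes * num_layers

base = decode_cost(100, 32, k=1)   # baseline: one token per pass
mtd2 = decode_cost(100, 32, k=2)   # MTD2: two tokens per pass
print(base / mtd2)  # prints 2.0
```

In practice the speedup is bounded by how often the later layers are truly idle, which is why the talk reports "up to" 2× alongside a small (2‑4%) quality drop.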

Concrete examples include the math‑calculation benchmark, where generating a new integer demanded many layers whereas copying an existing integer used few. The MTD2 and MTD4 configurations showed near‑identical outputs to the baseline while halving inference time. The speaker emphasized that larger models exhibit more underutilized computation, making these methods increasingly valuable as model sizes grow.

The broader implication is a pathway to more efficient deployment of large transformer models in real‑time robotics, where latency and power constraints are critical. By open‑sourcing the code and datasets, the team invites the community to further explore adaptive computation, potentially reshaping how AI workloads are scaled across industry and research.

Original Description

Keynote Title: "Adaptive Inference in Transformers"
Speaker Biography
Xifeng Yan is a professor at the University of California, Santa Barbara, where he holds the Venkatesh Narayanamurti Chair in Computer Science. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2006 and was a research staff member at the IBM T. J. Watson Research Center from 2006 to 2008. His current research objective is to explore foundation models in artificial intelligence, leverage these models for knowledge discovery, and develop cross-disciplinary applications. His work has been widely cited. He has received numerous honors, including the NSF CAREER Award, IBM Invention Achievement Award, ACM SIGMOD Dissertation Runner-Up Award, IEEE ICDM 10-Year Highest Impact Paper Award, 2022 PLDI Distinguished Paper Award, 2022 VLDB Test of Time Award, and first place in the Amazon SocialBot Grand Challenge 5. His team is the creator of the first Transformer-based time series forecasting model, initiating a new research direction in the field.
Abstract
Transformer-based large language models (LLMs) have achieved remarkable success across both language and vision tasks, with their impact now extending into robotics—for example, through VLA models in robotic manipulation. Despite these advances, many open questions remain. In this talk, I will focus on one fundamental question: Do all tokens require the same amount of computation within a Transformer? I will share insights into this question and present preliminary approaches to adaptive inference, in which different tokens are generated using varying numbers of Transformer layers. In fact, many layers can be skipped automatically without compromising output quality. The overarching goal is to demonstrate how such methods can enhance the efficiency of Transformer-based models and improve their applicability to domains beyond LLMs.
