Adaptive token‑level layer skipping dramatically cuts transformer inference costs while maintaining accuracy, enabling faster, cheaper AI for robotics and other real‑time applications.
Xifeng Yan, a UC Santa Barbara researcher, presented an adaptive inference framework for transformer models, highlighting its relevance to emerging robotics applications that increasingly rely on large‑scale language and vision transformers. He argued that spending a fixed amount of computation on every token is inefficient: easy tokens need far less processing than difficult ones, yet standard transformers treat them identically.
The core idea is a lightweight router inserted before each attention block that outputs a score between 0 and 1; tokens whose scores fall below a threshold skip the full attention‑FFN stack and instead pass through a tiny adapter. A combined loss penalizes layer usage while preserving language‑model accuracy, enabling the model to learn token‑level depth allocation without altering the pretrained weights.
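The per-token routing decision can be sketched in a few lines. This is a minimal illustration, not the team's implementation: the function names (`router_score`, `full_block`, `adapter`), the linear-sigmoid router, and the toy arithmetic inside the blocks are all assumptions made for clarity; a real model would use learned weights and full attention/FFN computations.

```python
import math

def router_score(token_vec, weights, bias):
    """Hypothetical linear router: sigmoid of a dot product -> score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, token_vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def full_block(x):
    # Stand-in for the full attention + FFN stack (real cost is much higher).
    return [2.0 * v + 1.0 for v in x]

def adapter(x):
    # Tiny near-identity adapter: negligible compute for skipped tokens.
    return [1.01 * v for v in x]

def layer_forward(token_vec, score, threshold=0.5):
    """Route one token through one layer.

    Tokens scoring at or above the threshold run the full block;
    the rest take the cheap adapter path. Returns (output, used_full_block).
    """
    if score >= threshold:
        return full_block(token_vec), True
    return adapter(token_vec), False
```

During training, a combined objective like `lm_loss + lambda * mean(scores)` (with `lambda` a usage-penalty weight, again an illustrative form) pushes the router to skip layers wherever accuracy allows, while the frozen pretrained weights stay untouched.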
Experiments on question answering, summarization, and arithmetic tasks show that easy tokens (e.g., copied numbers) use few layers, while complex tokens (e.g., new numbers, reasoning steps) traverse the full depth. Skipping an average of four to eight layers yields comparable or even superior BLEU/ROUGE scores, and a novel “direct multiple‑token decoding” scheme reassigns idle layers to generate subsequent tokens, achieving up to 2× speed‑up with less than 2% performance loss.
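A back-of-the-envelope model shows why skipping a handful of layers pays off. The function below is an illustrative estimate (the `adapter_cost` fraction and the cost model are assumptions, not figures from the talk); note that layer skipping alone gives a modest gain, and the reported 2× speed-up additionally relies on reassigning the idle layers to decode later tokens.

```python
def expected_speedup(total_layers, avg_skipped, adapter_cost=0.02):
    """Rough compute-savings estimate from token-level layer skipping.

    Assumes each skipped layer still pays a small adapter cost
    (adapter_cost, as a fraction of a full layer's compute).
    """
    full = total_layers - avg_skipped          # layers run at full cost
    cost = full + avg_skipped * adapter_cost   # total relative compute
    return total_layers / cost                 # dense cost / adaptive cost
```

For a 32-layer model skipping 8 layers per token on average, this yields roughly a 1.3× reduction in layer compute; the remaining gain comes from the multi-token decoding scheme described above.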
The approach promises substantial inference cost reductions for large‑scale models, especially as model size grows and idle computation expands. By open‑sourcing the code and datasets, Yan’s team invites the community to further explore token‑wise conditional computation, a step toward more sustainable and real‑time AI deployments in robotics and beyond.