IROS 2025 Keynotes - AI and Robot Learning: Xifeng Yan
Why It Matters
Token‑level layer skipping dramatically cuts inference cost, enabling real‑time, low‑power deployment of large transformer models in robotics and other latency‑sensitive applications.
Key Takeaways
- Transformers can skip layers per token to reduce computation.
- Adaptive inference yields up to a 25% layer reduction without accuracy loss.
- Direct multiple token decoding doubles speed with minimal performance drop.
- Easy tokens need fewer layers; difficult tokens require full depth.
- Open‑source code and data enable broader research on efficient transformers.
Summary
In this keynote, Xifeng Yan from UC Santa Barbara introduced a token‑level adaptive inference framework for transformer models, arguing that spending a uniform computational cost on every token is inefficient for many robotics and language tasks. By inserting a lightweight router before each attention block, the system decides dynamically whether to execute the full layer or bypass it with a small adapter, balancing a layer‑usage loss against the standard language modeling loss.
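The routing idea can be sketched in a few lines. The code below is a minimal, hypothetical illustration (not the speaker's implementation): a tiny linear scorer plays the role of the lightweight router, the "full layer" is stood in by one dense transform, and the skip path is a low‑rank adapter. The names (`route_layer`, `W_full`, `w_router`) and all shapes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 5

# Stand-in for an expensive transformer layer (attention + FFN).
W_full = rng.normal(size=(d_model, d_model)) * 0.1
# Cheap low-rank adapter used on the bypass path (hypothetical shape).
W_down = rng.normal(size=(d_model, 2)) * 0.1
W_up = rng.normal(size=(2, d_model)) * 0.1
# Lightweight router: a single linear score per token.
w_router = rng.normal(size=d_model)

def route_layer(x, threshold=0.5):
    """Per-token decision: run the full layer or the small adapter."""
    scores = 1.0 / (1.0 + np.exp(-x @ w_router))  # sigmoid gate per token
    execute = scores > threshold                   # boolean mask, one per token
    full_out = x + x @ W_full                      # residual + full layer
    skip_out = x + (x @ W_down) @ W_up             # residual + adapter bypass
    out = np.where(execute[:, None], full_out, skip_out)
    # During training, mean(scores) would serve as the layer-usage penalty
    # added to the language modeling loss.
    return out, execute

x = rng.normal(size=(n_tokens, d_model))
out, mask = route_layer(x)
print(mask)  # which tokens paid for the full layer
```

In a trained model the hard threshold would be relaxed (e.g. with a straight‑through or Gumbel‑style estimator) so the router stays differentiable; the usage penalty is what pushes easy tokens toward the cheap path.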
Experimental results across question answering, summarization, and arithmetic tasks demonstrate that easy tokens—such as copied words or simple additions—consume far fewer layers, while complex tokens—like new numbers or novel sentences—require the full stack. The approach achieved up to a 25% reduction in layer usage without sacrificing accuracy, and even improved performance in some cases by discarding noisy layers. A second technique, direct multiple token decoding, re‑purposes idle layers to generate subsequent tokens, delivering up to a 2× speedup with only a 2‑4% drop in quality.
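Direct multiple token decoding can be sketched similarly: when a token exits early after k layers, the otherwise‑idle upper layers draft the next token in the same forward pass. The snippet below is an assumed toy model, not the actual method; `mtd2_step`, the layer count, and the greedy `argmax` readout are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers, vocab = 8, 6, 20
# Toy residual layers and an output projection to a small vocabulary.
layers = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]
W_out = rng.normal(size=(d, vocab)) * 0.1

def run_layers(h, lo, hi):
    for W in layers[lo:hi]:
        h = h + h @ W  # residual update per layer
    return h

def mtd2_step(h, exit_layer):
    """Decode two tokens in one pass: token t uses layers [0, exit_layer),
    and the otherwise-idle layers [exit_layer, n_layers) draft token t+1."""
    h_t = run_layers(h, 0, exit_layer)
    tok_t = int(np.argmax(h_t @ W_out))            # greedy readout for token t
    h_next = run_layers(h_t, exit_layer, n_layers)  # reuse the idle depth
    tok_t1 = int(np.argmax(h_next @ W_out))         # draft of token t+1
    return tok_t, tok_t1

h0 = rng.normal(size=d)
t0, t1 = mtd2_step(h0, exit_layer=3)
print(t0, t1)
```

An MTD4‑style variant would split the remaining depth further to draft up to four tokens, trading a little quality for additional speedup, which matches the 2‑4% quality drop reported in the talk.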
Concrete examples include the math‑calculation benchmark, where generating a new integer demanded many layers while copying an existing integer used few. The MTD2 and MTD4 configurations produced near‑identical outputs to the baseline while halving inference time. The speaker emphasized that larger models exhibit more underutilized computation, making these methods increasingly valuable as model sizes grow.
The broader implication is a pathway to more efficient deployment of large transformer models in real‑time robotics, where latency and power constraints are critical. By open‑sourcing the code and datasets, the team invites the community to further explore adaptive computation, potentially reshaping how AI workloads are scaled across industry and research.