Hardware Videos

Hardware • AI • DevOps

Allen School Colloquium: Productively Programming Accelerated Computing Systems

UW CSE (Allen School) • February 18, 2026

Why It Matters

As accelerators become more specialized and systems scale hierarchically, Rohan's tools aim to restore productivity and portability, letting scientists and ML engineers exploit peak performance without rewriting code for each new architecture. That reduces development cost and accelerates progress on compute‑heavy scientific and AI workloads.

Summary

Rohan Yadav, a final-year Stanford Ph.D. student and part-time NVIDIA researcher, outlined his work on making high-performance accelerated and distributed computing systems easier to program as hardware grows more heterogeneous and complex. He described a full‑stack approach: high‑level composable distributed libraries that present familiar interfaces (like NumPy) while automatically scaling across clusters; distributed runtime techniques to compose and orchestrate computations efficiently and correctly; and low‑level systems for writing high‑performance kernels across accelerators. He highlighted the DISTAL compiler for distributed dense and sparse tensor algebra as a concrete example of this strategy. Overall, his research targets both single‑node accelerator specialization and the orchestration challenges of large hierarchical supercomputers.
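To illustrate the "familiar interface, automatic scaling" idea, here is a minimal sketch (my illustration, not code from the talk): with a distributed drop-in NumPy replacement in the style of NVIDIA's Legate/cuNumeric projects, only the import line would change, and the rest of the program stays ordinary NumPy.

```python
import numpy as np          # a distributed library would replace only this line,
# import cunumeric as np    # e.g. (hypothetical swap): same code, runs on a cluster

def jacobi_step(grid):
    """One Jacobi relaxation step over the interior of a 2-D grid,
    written as plain array slicing -- no explicit parallelism."""
    return 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                   grid[1:-1, :-2] + grid[1:-1, 2:])

grid = np.zeros((6, 6))
grid[0, :] = 1.0                 # hot top boundary
interior = jacobi_step(grid)     # shape (4, 4)
print(interior.shape)
```

The point of such libraries is that the slicing expressions above carry enough structure for a runtime to partition the arrays and schedule the halo exchanges automatically.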

Original Description

Title: Productively Programming Accelerated Computing Systems
Speaker: Rohan Yadav (Stanford)
Date: Tuesday, February 17, 2026
Abstract: Modern accelerated computing systems are increasing in scale, becoming more specialized and diverse, and evolving more quickly. While these changes bring significant performance improvements, they also come with the challenges of productively developing software that targets complex and rapidly changing hardware. For software to keep up with modern hardware, programming systems must also evolve to provide new levels of abstraction, portability and composability. In this talk, I will focus on two pieces of work that advance programming systems along these axes.
First, I will discuss a connection between actor-based and task-based programming models, two popular classes of programming models for distributed and accelerated machines. Task-based models provide high-level abstractions over the underlying hardware that enable composability and portability, while actor-based models expose a lower-level interface that offers the best performance. I will show that these two families of programming models are duals of each other, and then leverage this duality to close the performance gap between the models by compiling task-based programs into efficient actor-based programs.
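The task/actor contrast the abstract draws can be sketched with a toy example (entirely my illustration, not the talk's actual compilation scheme): the same two-node dataflow written once in a task style, where a central runtime walks a dependency graph, and once in an actor style, where each node fires itself as soon as its last input message arrives.

```python
# Task-based view: declare a DAG of tasks; a runtime schedules it.
dag = {
    "sum": (lambda a, b: a + b, ["x", "y"]),
    "out": (lambda s: s * 10,   ["sum"]),
}

def run_tasks(dag, inputs):
    """Central scheduler: repeatedly run any task whose deps are ready."""
    values, pending = dict(inputs), dict(dag)
    while pending:
        for name in list(pending):
            fn, deps = pending[name]
            if all(d in values for d in deps):
                values[name] = fn(*(values[d] for d in deps))
                del pending[name]
    return values

# Actor-based view (the "lowered" form): no central scheduler; each
# actor reacts to incoming messages and pushes its result downstream.
class Actor:
    def __init__(self, name, fn, deps, subscribers):
        self.name, self.fn, self.deps = name, fn, deps
        self.subscribers = subscribers   # who receives my result
        self.inbox = {}
    def receive(self, sender, value, system):
        self.inbox[sender] = value
        if set(self.inbox) == set(self.deps):      # last input arrived: fire
            result = self.fn(*(self.inbox[d] for d in self.deps))
            system.results[self.name] = result
            for sub in self.subscribers:
                system.actors[sub].receive(self.name, result, system)

class System:
    def __init__(self):
        self.actors, self.results = {}, {}

system = System()
system.actors["sum"] = Actor("sum", lambda a, b: a + b, ["x", "y"], ["out"])
system.actors["out"] = Actor("out", lambda s: s * 10, ["sum"], [])
system.actors["sum"].receive("x", 3, system)
system.actors["sum"].receive("y", 4, system)
```

Both views compute the same values; the duality the talk exploits is that the task graph's edges become exactly the actors' message channels, which is what makes a task-to-actor compilation possible.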
Second, I will discuss Twill, a system that automatically discovers optimal software pipelining (SWP) and warp specialization (WS) strategies for Tensor Core GPUs. Optimal strategies for SWP and WS continue to change across modern GPU generations and are currently derived through expert intuition and compiler heuristics. We show that these strategies are derivable from first-principles in a machine-parametrizable and heuristic-free manner, and re-discover strategies found by experts.
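The flavor of "machine-parameterizable, heuristic-free" derivation can be conveyed with the classic software-pipelining rule of thumb (a toy model of my own, not Twill's actual analysis): the pipeline must keep enough tile loads in flight that a load issued now completes before the compute stage needs it.

```python
import math

def pipeline_stages(load_latency_cycles, compute_cycles_per_tile):
    """Toy first-principles rule: in-flight loads must cover the load
    latency, i.e. ceil(latency / compute time per tile) stages of
    prefetch, plus the one stage currently being computed."""
    return 1 + math.ceil(load_latency_cycles / compute_cycles_per_tile)

# Hypothetical machine parameters for two GPU generations: as memory
# latency grows relative to tensor-core tile time, the derived schedule
# deepens automatically -- no per-architecture tuning heuristic.
print(pipeline_stages(load_latency_cycles=400, compute_cycles_per_tile=200))  # 3
print(pipeline_stages(load_latency_cycles=800, compute_cycles_per_tile=150))  # 7
```

Plugging in each generation's measured latencies and throughputs re-derives a deeper or shallower pipeline in the same way the abstract describes expert-found strategies being re-discovered from first principles.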
Bio: Rohan Yadav is a final-year computer science Ph.D. student at Stanford University, advised by Alex Aiken and Fredrik Kjolstad, as well as a part-time researcher at NVIDIA. He is generally interested in programming languages and computer systems, with a focus in systems for parallel and accelerated computing.
This video is in the process of being closed captioned.