FAST '26 - AdaCheck: An Adaptive Checkpointing System for Efficient LLM Training with Redundancy...
Why It Matters
By slashing checkpoint size and bandwidth needs, AdaCheck makes massive LLM training cheaper and more fault‑tolerant, directly impacting the scalability and cost structure of AI development.
Key Takeaways
- Adaptive checkpointing dramatically cuts LLM training waste by leveraging state redundancy
- Tensor-redundancy detection reduces checkpoint size by up to 896×
- Offline and online methods exploit redundancy across parallelism and training iterations
- Gradient-only incremental checkpoints lower bandwidth and storage demands
- AdaCheck outperforms the Checkpoint and Gemmini baselines across diverse GPU clusters
Summary
The video introduces AdaCheck, an adaptive checkpointing framework designed to curb the massive resource waste inherent in large‑language‑model (LLM) training. By recognizing that parallelism and model‑parallel architectures create duplicated tensors across workers, the authors propose a system that dynamically trims redundant data before writing checkpoints, enabling far more frequent saves without overwhelming storage or network bandwidth.
AdaCheck combines three core components: an offline redundancy‑utilization stage that models tensor‑level duplication, a lightweight tensor‑redundancy detector that hashes and exchanges tensor signatures within communication groups, and an online incremental checkpointing scheme that stores only mixed‑precision gradients between iterations. Experiments on both data‑center‑grade and commodity GPU clusters show checkpoint size reductions ranging from 6× to 896× compared with state‑of‑the‑art baselines such as Checkpoint and Gemmini, while sustaining the optimal frequency of one checkpoint per iteration.
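To make the detector idea concrete, here is a minimal sketch of hash-based tensor-signature deduplication. It is not AdaCheck's implementation; the function names and the coordinator-style comparison are illustrative assumptions. The key property it shows is that workers in a communication group exchange only small signatures, and each distinct tensor is written by exactly one owner:

```python
import hashlib

def tensor_signature(tensor_bytes: bytes) -> str:
    """Cheap content signature for a tensor's raw bytes."""
    return hashlib.sha256(tensor_bytes).hexdigest()

def deduplicate_group(group_tensors: dict[str, bytes]) -> dict[str, str]:
    """Within one communication group, decide per rank whether it
    writes its tensor ("write") or skips because an earlier rank
    holds an identical copy ("skip:<owner>"). Only signatures are
    compared, never full tensors."""
    owners: dict[str, str] = {}  # signature -> first rank that owns it
    plan: dict[str, str] = {}    # rank -> write/skip decision
    for rank, tensor in sorted(group_tensors.items()):
        sig = tensor_signature(tensor)
        if sig not in owners:
            owners[sig] = rank
            plan[rank] = "write"
        else:
            plan[rank] = f"skip:{owners[sig]}"
    return plan

# Four data-parallel replicas hold identical weights; a fifth rank differs,
# so only two tensors need to reach storage instead of five.
replicas = {f"rank{r}": b"\x01\x02\x03\x04" * 1024 for r in range(4)}
replicas["rank4"] = b"\xff" * 4096
plan = deduplicate_group(replicas)
```

In this toy run, `rank0` and `rank4` write while ranks 1-3 record a pointer to `rank0`'s copy, which is the effect that lets checkpoint size shrink with the degree of replication.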
The presenters highlight a striking example: training Llama 3.1 on 16,000 GPUs incurred 419 failures, wasting roughly two million GPU‑hours. AdaCheck’s gradient‑only incremental approach cuts the data transferred during recovery, and its tensor‑redundancy detector leverages ring‑based communication to avoid unnecessary comparisons, dramatically lowering inter‑node traffic. The authors also note that existing systems either ignore redundancy across training iterations or are limited to specific parallelism schemes, gaps that AdaCheck explicitly fills.
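The gradient-only incremental idea can be sketched in a few lines. This is a simplified model, not AdaCheck's code: it assumes plain SGD and ignores mixed precision, but it shows why per-step checkpoint traffic drops from the size of the weights to the size of the gradients, and how recovery replays the saved deltas on top of the last full snapshot:

```python
class IncrementalCheckpoint:
    """Toy gradient-only incremental checkpointing: one full weight
    snapshot is written up front, then each step persists only the
    gradient update, so recovery = base snapshot + replayed deltas."""

    def __init__(self, weights: list[float]):
        self.base = list(weights)  # full snapshot (written once)
        self.deltas: list[tuple[float, list[float]]] = []

    def step(self, grads: list[float], lr: float = 0.1) -> None:
        # Persist only the gradient record, not the updated weights.
        self.deltas.append((lr, list(grads)))

    def recover(self) -> list[float]:
        # Rebuild the latest weights by replaying SGD updates.
        weights = list(self.base)
        for lr, grads in self.deltas:
            weights = [w - lr * g for w, g in zip(weights, grads)]
        return weights

ckpt = IncrementalCheckpoint([1.0, 2.0])
ckpt.step([0.5, -0.5])
ckpt.step([0.5, -0.5])
restored = ckpt.recover()  # ~[0.9, 2.1]
```

A real system periodically consolidates deltas into a fresh full snapshot so recovery time stays bounded; the sketch omits that for brevity.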
For practitioners, AdaCheck promises to shrink checkpoint storage footprints, reduce network congestion, and improve overall training throughput, making ultra‑large model training more economically viable and resilient to frequent hardware failures.