FAST '26 - AdaCheck: An Adaptive Checkpointing System for Efficient LLM Training with Redundancy...

USENIX Association
Apr 7, 2026

Why It Matters

By slashing checkpoint size and bandwidth needs, AdaCheck makes massive LLM training cheaper and more fault‑tolerant, directly impacting the scalability and cost structure of AI development.

Key Takeaways

  • Adaptive checkpointing cuts LLM training waste dramatically by leveraging state redundancy
  • Tensor redundancy detection reduces checkpoint size by up to 896×
  • Offline and online methods exploit parallelism and iteration redundancy
  • Gradient-only incremental checkpoints lower bandwidth and storage demands
  • AdaCheck outperforms state‑of‑the‑art baselines such as Gemini across diverse GPU clusters

Summary

The video introduces AdaCheck, an adaptive checkpointing framework designed to curb the massive resource waste inherent in large‑language‑model (LLM) training. Recognizing that parallel training strategies and model architectures duplicate tensors across workers, the authors propose a system that dynamically trims redundant data before writing checkpoints, enabling far more frequent saves without overwhelming storage or network bandwidth.

AdaCheck combines three core components: an offline redundancy‑utilization stage that models tensor‑level duplication, a lightweight tensor‑redundancy detector that hashes and exchanges tensor signatures within communication groups, and an online incremental checkpointing scheme that stores only mixed‑precision gradients between iterations. Experiments on both data‑center‑grade and commodity GPU clusters show checkpoint size reductions ranging from 6× to 896× compared with state‑of‑the‑art baselines such as Gemini, while sustaining the optimal per‑iteration checkpointing frequency.
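The hash‑based consistency check at the heart of the redundancy detector can be sketched in a few lines. This is a minimal pure‑Python illustration, not AdaCheck's implementation: tensors are plain float lists, and `tensor_signature` and `elect_writers` are hypothetical names.

```python
import hashlib
import struct

def tensor_signature(values):
    """Hash a tensor's raw bytes so replicas can be compared without
    shipping the full tensor between workers."""
    raw = struct.pack(f"{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()

def elect_writers(shards):
    """shards maps worker rank -> {tensor name: values}.
    For each distinct (name, signature) pair, elect the lowest rank
    holding that copy to write it; all other replicas are skipped."""
    writers = {}
    for rank in sorted(shards):
        for name, values in shards[rank].items():
            writers.setdefault((name, tensor_signature(values)), rank)
    return writers

# Two data-parallel replicas hold identical weights; only rank 0 writes.
shards = {
    0: {"layer0.w": [0.1, 0.2, 0.3]},
    1: {"layer0.w": [0.1, 0.2, 0.3]},   # redundant copy, skipped
    2: {"layer0.w": [0.4, 0.5, 0.6]},   # a genuinely different shard
}
writers = elect_writers(shards)
```

Here `writers` ends up with two entries (ranks 0 and 2), so rank 1's duplicate copy never reaches storage.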

The presenters highlight a striking example: training Llama 3.1 on 16,000 GPUs incurred 419 failures, wasting roughly two million GPU‑hours. AdaCheck’s gradient‑only incremental approach cuts the data transferred during recovery, and its tensor‑redundancy detector leverages ring‑based communication to avoid unnecessary comparisons, dramatically lowering inter‑node traffic. The authors also note that existing systems either ignore redundancy across training iterations or are limited to specific parallelism schemes, gaps that AdaCheck explicitly fills.
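The ring‑based exchange mentioned above can be simulated compactly. This is an illustrative sketch under the assumption that each worker forwards one signature per round to its right neighbor; the function names are hypothetical, not AdaCheck's API.

```python
def ring_all_gather(signatures):
    """Simulate a ring all-gather of tensor signatures: in each of the
    n-1 rounds every worker forwards what it currently holds to its
    right neighbor, so all workers learn every signature using only
    neighbor-to-neighbor messages instead of all-to-all traffic."""
    n = len(signatures)
    held = [(i, sig) for i, sig in enumerate(signatures)]  # (origin rank, signature)
    known = [{i: signatures[i]} for i in range(n)]
    for _ in range(n - 1):
        held = [held[(i - 1) % n] for i in range(n)]  # one shift around the ring
        for i, (origin, sig) in enumerate(held):
            known[i][origin] = sig
    return known

def redundant_ranks(known, my_rank):
    """Ranks whose signature matches my_rank's, i.e. redundant copies."""
    mine = known[my_rank][my_rank]
    return sorted(r for r, s in known[my_rank].items() if s == mine)

# Ranks 0, 1, and 3 hold identical tensors; rank 2 holds a different shard.
known = ring_all_gather(["a1", "a1", "b2", "a1"])
```

After the three rounds, `redundant_ranks(known, 0)` reports ranks 0, 1, and 3 as holding the same copy, so only one of them needs to checkpoint it.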

For practitioners, AdaCheck promises to shrink checkpoint storage footprints, reduce network congestion, and improve overall training throughput, making ultra‑large model training more economically viable and resilient to frequent hardware failures.
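As a rough illustration of the gradient‑only incremental idea, the sketch below keeps one full base snapshot and then persists only per‑iteration gradients, replaying them on recovery. Plain SGD is assumed purely for illustration (AdaCheck records mixed‑precision gradients, and the real update rule is the optimizer's); all names are hypothetical.

```python
def incremental_checkpoint(base_weights, grad_history, lr):
    """Persist one full snapshot plus only the per-iteration gradients,
    which are far smaller than repeated full model/optimizer dumps."""
    return {"base": list(base_weights),
            "deltas": [list(g) for g in grad_history],
            "lr": lr}

def recover(ckpt):
    """Rebuild the latest weights by replaying the saved gradients
    onto the base snapshot (SGD update assumed for illustration)."""
    w = list(ckpt["base"])
    for g in ckpt["deltas"]:
        w = [wi - ckpt["lr"] * gi for wi, gi in zip(w, g)]
    return w

# Two training steps after the base snapshot; recovery replays both.
ckpt = incremental_checkpoint([1.0, 2.0], [[0.5, 0.5], [0.5, 0.5]], lr=0.1)
restored = recover(ckpt)
```

Each incremental save is the size of one gradient rather than the full model‑plus‑optimizer state, which is where the bandwidth and storage savings come from.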

Original Description

AdaCheck: An Adaptive Checkpointing System for Efficient LLM Training with Redundancy Utilization
Weijie Liu, Shengwei Li, Zhiquan Lai, and Keshi Ge, National University of Defense Technology; Qiaoling Chen, Nanyang Technological University; Peng Sun, Shanghai AI Laboratory; Dongsheng Li and Kai Lu, National University of Defense Technology
The development of large language models (LLMs) relies on sophisticated parallel training techniques, involving prolonged training runs with thousands of workers. Checkpointing systems are essential for handling failures in large-scale training. However, existing checkpointing systems are almost offline solutions tailored to specific parallelisms or model architectures. They lack adaptability to diverse parallel strategies and fail to recognize that most model states can be excluded from checkpoints, missing optimization opportunities.
In this paper, we present AdaCheck, an adaptive checkpointing system that achieves minimized checkpoint size by characterizing and exploiting state redundancy across various parallelisms, model architectures, and training iterations. We model the state redundancy induced by parallelisms and model architectures using the abstraction tensor redundancy, and propose an offline redundancy utilization method to create checkpoints with a reduced set of states. To fully identify tensor redundancy, we design an efficient redundancy detector, which employs a hash-based data consistency check method and a ring-based communication algorithm. Besides, we introduce a novel online redundancy utilization method, which further reduces checkpoint size by exploiting the state redundancy across training iterations.
Experimental results demonstrate that AdaCheck is adaptable to various parallelisms, including irregular parallelisms generated by automatic planners, as well as diverse model architectures, encompassing both dense and sparse architectures. Compared with state-of-the-art checkpointing approaches, AdaCheck can reduce checkpoint size by 6.00–896×, increase the checkpointing frequency by 1.46–111×, and incur almost no overhead on training throughput for LLM training.
