
Inference Energy Consumption Diagnosed: LLM Tasks Show 25× Energy Differences
Key Takeaways
- LLM task type drives up to 25× energy variance
- Video generation can exceed image-generation energy by >100×
- Memory pressure and GPU utilisation are the primary latent energy factors
- Lower precision does not always reduce inference energy consumption
- Scaling to more GPUs can sometimes cut energy by unlocking larger memory capacity
Summary
Researchers at the University of Michigan and The ML.ENERGY Initiative conducted a large-scale measurement campaign across 46 generative AI models, seven tasks, and 1,858 configurations on NVIDIA H100 and B200 GPUs. They found order-of-magnitude energy differences: LLM task type alone drives up to 25× variation in per-response energy, and video generation consumes over 100× the energy of image generation. The team introduced a diagnostic framework that links inference time and energy to latent metrics such as memory pressure and GPU utilisation. The findings challenge common assumptions about precision and scaling, offering actionable levers for power-constrained datacenters.
Pulse Analysis
The rapid expansion of generative AI has turned inference energy into a critical operational expense, especially as GPUs account for 50-70% of datacenter power draw. By instrumenting 1,858 model-system configurations on both H100 and B200 platforms, the researchers provide the first large-scale, task-level breakdown of inference energy consumption. Their data reveal that the nature of the task, such as problem-solving versus casual conversation, can inflate per-response energy by up to 25 times, while multimodal video generation can demand more than a hundredfold the energy of a comparable image task. These stark contrasts underscore that not all AI workloads are created equal from a sustainability perspective.
Beyond raw measurements, the study proposes a diagnostic framework that attributes energy and latency to hidden variables such as memory bandwidth, KV‑cache utilisation, and overall GPU occupancy. Counterintuitively, the authors show that reducing precision (e.g., moving from BF16 to FP8) does not guarantee lower energy, and that adding GPUs can sometimes reduce total joules by unlocking greater memory capacity for larger batch sizes. This nuanced view equips engineers with concrete levers—batch sizing, precision tuning, and hardware scaling—to optimise throughput per watt without sacrificing model performance.
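The scaling result above can be made concrete with back-of-the-envelope arithmetic. The sketch below uses entirely hypothetical power, latency, and batch-size figures (not measurements from the study) to show why adding a GPU can lower joules per request: two GPUs draw more total power, but the extra memory permits a much larger batch, so the batch's energy is split across more requests.

```python
# Hypothetical figures for illustration only -- not measurements from the study.
def energy_per_request_j(avg_power_w: float, batch_latency_s: float,
                         batch_size: int) -> float:
    """Energy attributed to one request: batch joules divided by requests served."""
    return avg_power_w * batch_latency_s / batch_size

# One GPU: memory pressure limits the batch to 8 requests.
single_gpu = energy_per_request_j(avg_power_w=700, batch_latency_s=2.0, batch_size=8)

# Two GPUs: roughly double the power draw and slightly higher latency,
# but the added memory capacity allows a 4x larger batch.
dual_gpu = energy_per_request_j(avg_power_w=1400, batch_latency_s=2.4, batch_size=32)

print(f"1x GPU: {single_gpu:.1f} J/request")
print(f"2x GPU: {dual_gpu:.1f} J/request")
```

Under these assumed numbers the two-GPU configuration spends fewer joules per request despite drawing twice the power, which is the counterintuitive lever the authors identify; whether it holds for a given model depends on how batch size actually scales with the freed memory.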
For industry stakeholders, the implications are immediate. Datacenter operators can leverage the framework to predict service capacity under strict power caps, prioritize model‑task pairings that align with energy budgets, and inform procurement decisions between H100 and B200 accelerators. Moreover, the methodology sets a benchmark for future research, encouraging deeper exploration of software‑stack optimisations and hardware‑aware model design. As AI workloads continue to proliferate, such evidence‑based strategies will be essential for balancing innovation with environmental and cost constraints.
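Predicting service capacity under a power cap, as described above, reduces to simple division once per-configuration power and throughput are known. The sketch below uses hypothetical figures (the study's framework would supply measured per-configuration numbers) to compare how many requests per second a capped cluster could sustain for a chat workload versus a far more energy-hungry video-generation workload.

```python
# Hypothetical figures for illustration; a real planner would use measured
# per-configuration power draw and throughput from a framework like the study's.
def servable_throughput(power_cap_w: float, gpu_power_w: float,
                        requests_per_s_per_gpu: float) -> float:
    """Requests/s a power-capped cluster can sustain, counting GPU draw only."""
    n_gpus = int(power_cap_w // gpu_power_w)  # how many GPUs fit under the cap
    return n_gpus * requests_per_s_per_gpu

# Same 100 kW cap, two hypothetical model-task pairings on 700 W GPUs:
chat = servable_throughput(100_000, gpu_power_w=700, requests_per_s_per_gpu=4.0)
video = servable_throughput(100_000, gpu_power_w=700, requests_per_s_per_gpu=0.04)

print(f"chat:  {chat:.0f} req/s")
print(f"video: {video:.0f} req/s")
```

The two-orders-of-magnitude gap in servable throughput mirrors the >100× task-level energy gap the study reports, which is why model-task pairing matters as much as hardware choice when planning under a cap.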