AMD Makes a Big Splash with the MI355X in MLPerf Inference 6.0: Over One Million Tokens per Second in Multi-Node Inference

Igor’sLAB · Apr 15, 2026

Key Takeaways

  • MI355X hits 1.04M tokens/s on Llama 2 70B across 11 nodes
  • Multi‑node scaling efficiency reaches 92‑93% for GPT‑OSS‑120B
  • MLPerf Inference 6.0 adds new GenAI benchmarks, reflecting real workloads
  • MI355X built on 3‑nm CDNA‑4, delivering up to 10 PFLOPS FP4/FP6
  • AMD reports 3.1‑fold performance boost over MI325X on Llama 2 70B

Pulse Analysis

MLPerf Inference 6.0 marks the most extensive overhaul of the benchmark suite to date, adding five new or updated data‑center tests that mirror today’s generative‑AI demands. By introducing a GPT‑OSS‑120B workload, a text‑to‑video scenario, and expanded vision‑language models, the consortium shifts the focus from isolated, single‑node scores to real‑world, multi‑node scaling. This evolution gives vendors a clearer stage to demonstrate how their hardware handles the massive parallelism required for production‑grade inference, making the numbers far more actionable for cloud providers and enterprises.

AMD’s Instinct MI355X is built on a 3‑nm CDNA‑4 architecture, packing 185 billion transistors and up to 288 GB of HBM3E memory. The accelerator delivers up to 10 petaflops of FP4/FP6 performance and can host models with up to 520 billion parameters. In the MLPerf run, the MI355X achieved 1,042,110 tokens per second on Llama 2 70B and 1,031,070 tokens per second on GPT‑OSS‑120B, with scaling efficiencies hovering around 92‑93% across 11‑12 nodes. These figures represent a 3.1‑fold jump over the previous MI325X, highlighting the impact of the new process node and architectural refinements.
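Scaling efficiency in multi‑node results like these is typically read as achieved aggregate throughput divided by ideal linear scaling (single‑node throughput times node count). A minimal sketch in Python; the single‑node figure below is a hypothetical illustration, not a number from AMD's submission:

```python
def scaling_efficiency(multi_node_tps: float, single_node_tps: float, nodes: int) -> float:
    """Efficiency = achieved aggregate throughput / ideal linear scaling."""
    return multi_node_tps / (single_node_tps * nodes)

# Reported Llama 2 70B aggregate across 11 nodes, paired with a
# hypothetical single-node throughput of ~102,000 tokens/s:
eff = scaling_efficiency(1_042_110, 102_000, 11)
print(f"{eff:.1%}")  # ≈ 92.9%, consistent with the 92-93% range reported
```

Efficiencies in the low 90s mean the interconnect and scheduling overheads consume only a few percent of the throughput that perfect linear scaling would provide.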

For the data‑center market, the milestone signals that AMD is closing the performance gap with dominant rivals in large‑scale generative‑AI inference. Enterprises that prioritize multi‑node efficiency can now consider the MI355X as a viable, potentially lower‑cost alternative for serving massive language models. As benchmark suites continue to align with production workloads, the industry will likely see a broader adoption of heterogeneous GPU fleets, driving competition, price pressure, and faster innovation cycles in AI infrastructure.
