Red Hat AI Tops MLPerf Inference v6.0 with vLLM on Qwen3-VL, Whisper, and GPT-OSS-120B

Red Hat – DevOps · Apr 1, 2026

Why It Matters

The performance edge validates Red Hat’s AI stack as a hardware‑agnostic, enterprise‑grade solution, giving customers confidence to deploy large language and multimodal models at scale without vendor lock‑in. It also positions Red Hat to influence the next generation of MLPerf benchmarks focused on agentic, multi‑turn AI workloads.

Key Takeaways

  • vLLM leads MLPerf inference across NVIDIA and AMD GPUs.
  • Red Hat's stack outperforms competitors on GPT‑OSS‑120B and Qwen3‑VL.
  • OpenShift AI enables Kubernetes‑based distributed inference at scale.
  • Hardware‑agnostic stack simplifies AI deployment across GPU vendors.
  • MLPerf results position Red Hat for upcoming agentic workload benchmarks.

Pulse Analysis

MLPerf Inference has become the de facto yardstick for data‑center AI performance, rewarding not just raw hardware power but the efficiency of the software stack. Red Hat’s latest submission showcases how an open‑source inference engine, vLLM, can extract maximum throughput from both cutting‑edge NVIDIA H200/B200 GPUs and AMD MI350X accelerators. By integrating tightly with Red Hat Enterprise Linux and OpenShift AI, the company delivers a unified platform that rivals proprietary solutions while maintaining transparency and community support.
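
For readers who have not used vLLM directly, the sketch below shows its offline inference API in its simplest form. The model identifier, parallelism degree, and sampling settings are illustrative assumptions and do not reflect the MLPerf submission’s actual configuration.

```python
# Minimal vLLM offline-inference sketch. The model id and settings are
# illustrative assumptions; they do not reproduce the MLPerf setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL",      # hypothetical model id, for illustration only
    tensor_parallel_size=8,     # shard weights across an eight-GPU replica
)
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["Summarize the MLPerf Inference benchmark."], params)
for out in outputs:
    print(out.outputs[0].text)
```

In a benchmarked deployment the same engine runs behind a serving endpoint rather than this offline API, but the core abstractions (model, parallelism degree, sampling parameters) are the same.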

The technical edge stems from a blend of advanced kernel optimizations (FlashInfer MoE kernels, FP8 multimodal attention, and Triton‑based vision‑encoder tweaks) and scheduling and routing heuristics such as shortest‑job‑first ordering and KV‑cache utilization scoring. llm‑d orchestrates eight‑GPU replicas on Kubernetes, enabling dynamic load balancing and latency‑aware routing that meets stringent P99 latency targets. This hardware‑agnostic approach means enterprises can migrate between GPU generations, or even vendors, without re‑engineering their inference pipelines, preserving investment and accelerating time‑to‑value.
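
To make the scheduling ideas concrete, here is a hypothetical Python sketch of shortest‑job‑first ordering combined with KV‑cache‑aware replica routing. The data structures, scoring weights, and heuristics are invented for illustration and are not drawn from llm‑d’s actual implementation.

```python
# Hypothetical sketch of SJF scheduling plus KV-cache-aware routing.
# Field names, weights, and heuristics are assumptions for illustration,
# not llm-d's real logic.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    est_decode_tokens: int              # SJF priority: shorter jobs sort first
    prompt: str = field(compare=False)

@dataclass
class Replica:
    name: str
    kv_cache_used: float                # fraction of KV-cache blocks in use
    queue_depth: int                    # requests already waiting

def route(replicas: list[Replica]) -> Replica:
    # Lower score = more headroom; the 0.05 weight is arbitrary here.
    def score(r: Replica) -> float:
        return r.kv_cache_used + 0.05 * r.queue_depth
    return min(replicas, key=score)

# Shortest-job-first: the heap pops the request expected to finish soonest,
# which keeps queues draining quickly and helps hit tail-latency targets.
queue = [Request(900, "long analysis..."), Request(40, "quick lookup...")]
heapq.heapify(queue)
replicas = [Replica("r0", 0.82, 6), Replica("r1", 0.35, 2)]

while queue:
    req = heapq.heappop(queue)          # the 40-token job runs first
    target = route(replicas)
    print(f"{req.est_decode_tokens:>4} tokens -> {target.name}")
```

Popping the shortest estimated job first is what makes aggressive P99 targets reachable, while the routing score steers new work away from replicas whose KV cache is nearly full.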

Looking ahead, Red Hat’s strong showing positions it to shape upcoming MLPerf focus areas such as multi‑turn, agentic workloads that mimic real‑world conversational AI. The company’s roadmap includes enhancements to llm‑d for prefix‑aware scoring and tighter integration with OpenShift AI, promising even lower latency and higher scalability. For businesses eyeing large‑scale AI deployments, Red Hat’s results signal a mature, enterprise‑ready stack capable of delivering competitive performance across diverse hardware ecosystems, reducing total cost of ownership while supporting future AI workloads.
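
Prefix‑aware scoring is only named in the roadmap, but the underlying idea is well known: route requests that share a prompt prefix to a replica whose KV cache is already warm for that prefix. The following is a speculative sketch of that idea, not a description of Red Hat’s design; every name in it is hypothetical.

```python
# Speculative sketch of prefix-aware routing: requests sharing a prompt
# prefix (e.g. a common system prompt) go to the replica that already
# holds the matching KV-cache blocks. All names are hypothetical.
import hashlib

def prefix_key(prompt: str, prefix_chars: int = 512) -> str:
    # Hash the leading characters so requests with the same system
    # prompt map to the same key.
    return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()[:16]

def route_prefix_aware(prompt: str, replicas: list[str],
                       cache_index: dict[str, str],
                       load: dict[str, int]) -> str:
    key = prefix_key(prompt)
    if key in cache_index:
        return cache_index[key]                    # warm-cache hit
    target = min(replicas, key=lambda r: load[r])  # miss: least-loaded replica
    cache_index[key] = target                      # remember where it landed
    return target

# Usage: two requests with the same system prompt land on the same replica.
index: dict[str, str] = {}
load = {"r0": 3, "r1": 1}
sys_prompt = "You are a helpful assistant. "
print(route_prefix_aware(sys_prompt + "Question A", ["r0", "r1"], index, load))
print(route_prefix_aware(sys_prompt + "Question B", ["r0", "r1"], index, load))
```

A warm‑cache hit lets the server skip recomputing the shared prefix, so affinity routing of this kind trades a little load imbalance for lower time‑to‑first‑token.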
