HM - SGM _ System GPU Management - Workstream - (2026-01-16)

Open Compute Project
Jun 1, 2026

Why It Matters

A unified GPU message registry enables reliable monitoring and automation for emerging accelerator workloads, which in turn speeds product releases and reduces operational friction.

Key Takeaways

  • Cancel light tool meeting; prioritize GPU management presentation
  • Retag GitHub issues to 1.1/1.7, ignore legacy 1.0 bugs
  • Propose extending network device registry for GPU messages
  • Debate creating separate GPU device vs. GPU fabric registries
  • Define terminology; consider ‘accelerator fabric’ for broader scope

Summary

The meeting focused on the System GPU Management workstream, reviewing agenda changes, attendance constraints, and the need to push forward a GPU‑related presentation despite several participants being unavailable. The team revisited open GitHub issues, agreeing to retag those still relevant to the newer 1.1 and 1.7 releases while archiving legacy 1.0 items, streamlining the backlog for upcoming releases.

Key technical discussion centered on how to model GPU‑related events within the existing DMTF message registry framework. Participants examined whether to reuse the network‑device registry, extend it, or create a dedicated GPU device or GPU‑fabric registry. They highlighted that many proposed GPU messages map cleanly onto network‑device definitions, but a subset lacks an appropriate mapping, prompting a short‑term proposal to add specific port‑related messages and rename ambiguous terms like “degraded.”
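
The mapping exercise described above can be sketched as a small lookup that separates proposed GPU events with an existing network‑device equivalent from those that would need new messages. This is only an illustration: the GPU event names, the message keys listed for the network‑device registry, and the helper `split_by_mapping` are all assumptions made for the example and are not taken from the meeting or from a published registry.

```python
# Illustrative only: every name below is a hypothetical placeholder.

# Assumed subset of message keys available in a network-device registry.
NETWORK_DEVICE_KEYS = {
    "ConnectionEstablished",
    "ConnectionDropped",
    "LinkFlapDetected",
}

# Hypothetical proposed GPU events and the existing key each might reuse.
# None marks an event with no suitable existing message.
PROPOSED_GPU_EVENTS = {
    "GpuLinkUp": "ConnectionEstablished",
    "GpuLinkDown": "ConnectionDropped",
    "GpuLinkFlapping": "LinkFlapDetected",
    "GpuPortLaneReduced": None,
}


def split_by_mapping(proposed, existing_keys):
    """Return (mapped, unmapped): events that reuse an existing key,
    and a sorted list of events that still need a new message."""
    mapped = {}
    unmapped = []
    for name, target in proposed.items():
        if target is not None and target in existing_keys:
            mapped[name] = target
        else:
            unmapped.append(name)
    return mapped, sorted(unmapped)


mapped, unmapped = split_by_mapping(PROPOSED_GPU_EVENTS, NETWORK_DEVICE_KEYS)
print("reuse existing messages:", mapped)
print("need new port-related messages:", unmapped)
```

Under these assumptions, the `unmapped` list corresponds to the subset for which the short‑term proposal would add port‑related messages.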

The dialogue also explored subscription mechanics, emphasizing that consumers need precise filters to receive only GPU‑specific events without enumerating numerous origin conditions. Examples from sensor registries illustrated the challenges of dynamic URIs and the risk of stale subscriptions. Consensus emerged around a hybrid approach: employ the network‑device registry for generic link events, introduce a GPU‑device registry for point‑to‑point connections, and consider a separate GPU‑fabric registry when topology resembles switch‑level fabrics.
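
A minimal sketch of the subscription idea, assuming Redfish‑style event subscriptions in which a message ID begins with its registry name and a subscription can filter on registry prefixes. The registry names `GpuDevice` and `GpuFabric`, the destination URL, and the helper functions are hypothetical; the point is only that a consumer can select GPU events with one prefix rather than listing every origin resource, which also avoids subscriptions going stale when resource URIs change.

```python
# Sketch under stated assumptions; registry names and URL are hypothetical.

def registry_prefix(message_id):
    """Return the registry name, i.e. the text before the first dot.

    Assumes IDs shaped like "Registry.Major.Minor.MessageKey".
    """
    return message_id.split(".", 1)[0]


def build_subscription(destination, prefixes):
    """Build a subscription body that filters by registry prefix only,
    so no list of origin resource URIs has to be maintained."""
    return {
        "Destination": destination,
        "Protocol": "Redfish",
        "RegistryPrefixes": list(prefixes),
    }


def matches(subscription, event):
    """True if the event's registry is one the subscriber asked for."""
    wanted = subscription["RegistryPrefixes"]
    return registry_prefix(event["MessageId"]) in wanted


# Hypothetical consumer interested only in GPU device events.
sub = build_subscription("https://listener.example/events", ["GpuDevice"])

events = [
    {"MessageId": "GpuDevice.1.0.LinkDown"},
    {"MessageId": "NetworkDevice.1.0.ConnectionDropped"},
    {"MessageId": "GpuFabric.1.0.LinkDown"},
]

delivered = [e["MessageId"] for e in events if matches(sub, e)]
print(delivered)
```

With a hybrid layout like the one discussed, a consumer that also wants generic link events would add the network‑device prefix to the same list instead of creating per‑resource subscriptions.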

Implications include a clearer, more maintainable message taxonomy, reduced duplication across registries, and faster integration of GPU monitoring capabilities into existing tooling. By aligning terminology and registry design now, the group aims to support future accelerator workloads, such as AI inference, while minimizing long‑term engineering overhead.

Original Description

Public call recording of HM - SGM _ System GPU Management workstream.
