FTI - Scaling AI Clusters at Neoclouds - Workstream - (2026-06-08)
Why It Matters
Standardizing out-of-band telemetry and Redfish profiles would let AI agents validate failures and remediate them autonomously. That reduces manual, OEM-specific scripting, improves GPU uptime, and shortens time-to-repair, all of which NeoClouds need in order to scale large AI clusters reliably.
Summary
FTI’s workstream meeting reviewed outcomes from the OCP AMIA panel and presented an initial mapping of OCP hardware management profiles against NeoCloud cluster-evaluation criteria, using the Redfish 1.24 spec as the reference. The team ran automated reviews to identify gaps in the existing OCP management repo and highlighted three practical pain points: inconsistent OEM Redfish implementations, missing out-of-band GPU telemetry, and a lack of standardized data for AI-driven agents. Participants framed these gaps as actionable targets: defining mandated GPU health telemetry (including XID/error classification), harmonizing vendor-specific provisioning playbooks, and feeding updates into upcoming OCP Global Summit announcements. The next steps are detailed hardware-level reviews, collaboration with OEMs, and extending Redfish usage to enable autonomous detection and remediation workflows.
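To make the XID/error classification target more concrete, here is a minimal Python sketch of how an agent might turn GPU error events into remediation decisions once the telemetry is available in a standard form. The XID numbers are NVIDIA driver event codes, but everything else is an assumption for illustration: the `GpuEvent` shape, the `Action` categories, the policy table, and the severity ordering are hypothetical and do not come from the workstream or from any OCP or Redfish profile. How the XID reaches the agent out-of-band is exactly the gap the meeting identified, so the sketch starts from already-collected events.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"
    RESET_GPU = "reset_gpu"
    DRAIN_AND_REPLACE = "drain_and_replace"
    ESCALATE = "escalate"


# Hypothetical policy table. The keys are NVIDIA XID codes; the mapping to
# actions is an illustrative assumption, not a recommended policy.
XID_POLICY = {
    13: Action.IGNORE,             # graphics engine exception, often application-level
    31: Action.IGNORE,             # GPU memory page fault, often application-level
    48: Action.RESET_GPU,          # double-bit ECC error
    79: Action.DRAIN_AND_REPLACE,  # GPU has fallen off the bus
}

# Assumed ordering from least to most severe; unknown codes escalate to a human.
SEVERITY_ORDER = [
    Action.IGNORE,
    Action.RESET_GPU,
    Action.DRAIN_AND_REPLACE,
    Action.ESCALATE,
]


@dataclass(frozen=True)
class GpuEvent:
    node: str
    gpu_index: int
    xid: int


def classify(event: GpuEvent) -> Action:
    """Map one event to an action, escalating anything not in the table."""
    return XID_POLICY.get(event.xid, Action.ESCALATE)


def plan_remediation(events):
    """Return the most severe action seen for each (node, gpu_index) pair."""
    plan = {}
    for event in events:
        key = (event.node, event.gpu_index)
        action = classify(event)
        current = plan.get(key)
        if current is None or SEVERITY_ORDER.index(action) > SEVERITY_ORDER.index(current):
            plan[key] = action
    return plan
```

A table-driven policy like this keeps the classification reviewable by operators and OEMs, which matters if the mapping is meant to be standardized rather than embedded in per-vendor scripts.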