FTI - Scaling AI Clusters at Neoclouds - Workstream - (2026-06-08)

Open Compute Project
Jun 10, 2026

Why It Matters

Standardizing out-of-band telemetry and Redfish profiles will let AI agents validate failures and perform autonomous remediation, reducing manual OEM-specific scripting, improving GPU uptime, and enabling NeoClouds to scale reliably. That increases operational efficiency and lowers time-to-repair for large AI clusters.

Summary

FTI’s workstream meeting reviewed outcomes from the OCP AMIA panel and presented an initial mapping of OCP hardware management profiles against NeoCloud cluster-evaluation criteria using the Redfish 1.24 spec. The team ran automated reviews to identify gaps in the existing OCP management repo and highlighted practical pain points: inconsistent OEM Redfish implementations, missing out-of-band GPU telemetry, and lack of standardized data for AI-driven agents. Participants framed these gaps as actionable targets—defining mandated GPU health telemetry (including XID/error classification), harmonizing vendor-specific provisioning playbooks, and feeding updates into upcoming OP Global announcements. The next steps are detailed hardware-level reviews, collaboration with OEMs, and extending Redfish usage to enable autonomous detection and remediation workflows.

Original Description

Public call recording of FTI - Scaling AI Clusters at Neoclouds workstream.
