HM - FMFM _ Fleetscale Memory Fault Management - Workstream - (2026-01-13)

Open Compute Project
May 21, 2026

Why It Matters

Standardizing memory fault logging and preemptive mitigation improves data‑center reliability, lowering downtime and operational costs for cloud providers.

Key Takeaways

  • The group still needs to decide whether fault‑management logging should focus on DDR5 or LPDDR.
  • RAS API spec expected by mid‑January, implementation in 3‑6 months.
  • Page offlining can preemptively mitigate multi‑row/column memory failures.
  • Patrol‑scrub errors need separate weighting to avoid unnecessary DIMM swaps.
  • OCP spec will allow configurable policies, preserving designer flexibility.

Summary

The FMFM workstream met to review progress on Fleetscale Memory Fault Management, focusing on logging requirements, standards adoption, and recent research presented by Roy. Participants debated whether to prioritize DDR5 or LPDDR logging, noting that DDR5 integrates more easily with existing specifications while LPDDR has broader relevance for AI inference platforms.

Key updates included the upcoming RAS API specification, slated for release by mid‑January, with a projected 3‑6‑month window for full implementation across data‑center fleets. The team highlighted preemptive page‑offlining as an effective mitigation for multi‑row and column failures across diverse server configurations, and raised concerns that patrol‑scrub errors inflate DIMM‑swap metrics. Roy revisited his paper, showing that page‑offlining reduced error rates on both 4‑by‑8 and 2‑by‑4 DIMM layouts, and referenced Samsung’s proposal to adjust error thresholds before offlining actions are taken.

Discussion emphasized that OCP specifications should provide a flexible policy framework rather than rigid mandates, so that designers can weight different error sources appropriately. The consensus points to a near‑term rollout of the RAS API, followed by open‑source code distribution to enable BMC and fleet‑control integration. Successful adoption is expected to improve fault visibility, reduce false positives, and enhance overall memory reliability in increasingly heterogeneous data‑center environments.
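
To make the idea of a configurable policy concrete, the following is a minimal Python sketch of how per‑source weights and thresholds might interact. It is not taken from the call or from any OCP or RAS API document. The names (FaultPolicy, ErrorEvent, evaluate), the source labels, the weights, and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical error-source labels; a real spec may name these differently.
DEMAND_READ = "demand_read"
PATROL_SCRUB = "patrol_scrub"


@dataclass
class FaultPolicy:
    """Illustrative operator-configurable policy. All values are assumptions."""
    # Patrol-scrub errors count for less than demand-read errors by default.
    weights: Dict[str, float] = field(
        default_factory=lambda: {DEMAND_READ: 1.0, PATROL_SCRUB: 0.25}
    )
    page_offline_threshold: float = 3.0   # weighted score per page
    dimm_swap_threshold: float = 20.0     # weighted score per DIMM


@dataclass(frozen=True)
class ErrorEvent:
    dimm: str     # DIMM identifier, e.g. "DIMM0"
    page: int     # page frame number where the error was observed
    source: str   # one of the source labels above


def evaluate(events: List[ErrorEvent], policy: FaultPolicy) -> dict:
    """Accumulate weighted scores and report which actions the policy suggests."""
    page_scores: Dict[Tuple[str, int], float] = {}
    dimm_scores: Dict[str, float] = {}
    for ev in events:
        # Unknown sources fall back to full weight (a conservative assumption).
        weight = policy.weights.get(ev.source, 1.0)
        key = (ev.dimm, ev.page)
        page_scores[key] = page_scores.get(key, 0.0) + weight
        dimm_scores[ev.dimm] = dimm_scores.get(ev.dimm, 0.0) + weight
    return {
        "offline_pages": sorted(
            k for k, s in page_scores.items() if s >= policy.page_offline_threshold
        ),
        "swap_dimms": sorted(
            d for d, s in dimm_scores.items() if s >= policy.dimm_swap_threshold
        ),
    }
```

Under these assumed defaults, eight patrol‑scrub errors on one page score 2.0 and trigger nothing, while three demand‑read errors on another page score 3.0 and mark that page for offlining; neither pushes the DIMM toward a swap. Changing the weights or thresholds changes the outcome without touching the evaluation logic, which is the kind of flexibility the discussion asked the spec to preserve.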

Original Description

Public call recording of HM - FMFM _ Fleetscale Memory Fault Management workstream.