BIG AI Cluster, Little Power: the 8x NVIDIA GB10 Cluster
Why It Matters
The cluster demonstrates that enterprises can run multi-hundred-gigabyte AI models on-premises without waiting for vendor-certified solutions, reducing latency and total cost of ownership. It also provides a replicable blueprint for scaling AI workloads with affordable networking and shared storage.
Key Takeaways
- 8-node NVIDIA GB10 cluster linked by a 400 GbE MikroTik switch
- RDMA over ConnectX-7 enables low-latency NCCL scaling
- Shared QNAP NAS cuts storage cost by roughly $10k across ten nodes
- Kimi K2.5/K2.6 runs locally, demonstrating massive-model inference
Pulse Analysis
The NVIDIA GB10 Grace Blackwell superchip combines a 20-core Arm CPU with a Blackwell-generation GPU and 128 GB of LPDDR5X unified memory, positioning it as a powerful yet power-efficient AI accelerator. Aggregating eight of these units gives the cluster a combined 160 Arm cores and roughly 1 TB of memory, enough to host state-of-the-art large language models that would otherwise require multiple data-center-scale servers. The key to unlocking this performance lies in the networking layer: the onboard ConnectX-7 200 GbE NICs run RDMA (RoCE) through a 400 GbE MikroTik CRS804 DDQ switch, cutting inter-node latency dramatically and enabling NCCL-driven tensor parallelism across all eight nodes.
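To make the pattern concrete, here is a minimal sketch of an NCCL all-reduce spanning the eight nodes. The RDMA device and interface names (mlx5_0, enp1s0f0) are placeholders rather than values from the actual build; on real hardware they would come from `ibv_devices` and `ip link`.

```python
# Minimal sketch: NCCL all-reduce across the 8 GB10 nodes over RoCE.
# Device/interface names are assumptions -- check `ibv_devices` and
# `ip link` on the actual hardware before pinning them.
import os

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")           # assumed ConnectX-7 RDMA device
os.environ.setdefault("NCCL_SOCKET_IFNAME", "enp1s0f0")  # assumed 200 GbE data-plane NIC
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")          # GID index 3 is the common RoCE v2 entry

def main():
    # torchrun supplies RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(0)  # one Blackwell GPU per GB10 node

    # Each rank contributes its rank id; the element-wise sum over 8 ranks is 28.
    t = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        print(f"all_reduce across {dist.get_world_size()} nodes -> {t[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```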
Choosing the right switch architecture was critical. While high-end Dell switches could handle the bandwidth, their 900 W power draw made them overkill for a modest eight-node deployment. The MikroTik solution offered a cost-effective, low-power alternative that still provided dual-port QSFP56-DD connectivity, with each 400 GbE port breaking out into two 200 GbE links so that two GB10 nodes share a single port. Complementing this data plane, a Cisco Catalyst C1300-12XT-2X supplies a dedicated 10 GbE management network, isolating control traffic and simplifying VLAN configuration. This dual-network design keeps high-throughput AI traffic from contending with administrative operations, preserving both performance and reliability.
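As a sketch of how the dual-network split carries through to software, the hypothetical launcher below performs the torchrun rendezvous over the management network while pinning NCCL's traffic to the RoCE fabric. The head-node address, interface names, and entry-point script are all assumptions.

```python
# Hypothetical launcher illustrating the dual-network split: torchrun's
# c10d rendezvous uses the 10 GbE management network, while NCCL collectives
# are pinned to the 200 GbE RoCE fabric. Run the same command on every node.
import os
import subprocess

MGMT_MASTER = "10.0.10.1"  # assumed head-node address on the Cisco management network

env = dict(os.environ)
env.update({
    "NCCL_IB_HCA": "mlx5_0",           # assumed RDMA device: collectives stay on the data plane
    "NCCL_SOCKET_IFNAME": "enp1s0f0",  # assumed data-plane NIC for NCCL's bootstrap sockets
})

subprocess.run(
    [
        "torchrun",
        "--nnodes", "8",
        "--nproc_per_node", "1",  # one GPU per GB10 node
        "--rdzv_backend", "c10d",
        "--rdzv_endpoint", f"{MGMT_MASTER}:29500",  # rendezvous over the management network
        "serve.py",  # hypothetical inference/training entry point
    ],
    env=env,
    check=True,
)
```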
Beyond compute and networking, shared storage proved decisive for total cost reduction. Centralizing model files on a QNAP TS-h1290FX NAS with ZFS snapshots and an auxiliary GPU for embedding workloads saved roughly $10k compared with provisioning 4 TB of local NVMe per node. This approach not only cuts hardware spend but also streamlines model versioning and deployment, since every node mounts the same export and sees the same weights. As AI workloads continue to balloon, the demonstrated architecture offers a scalable, cost-effective template for enterprises seeking on-premises inference power without the overhead of traditional data-center solutions.
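A minimal sketch of the shared-storage pattern follows, assuming the QNAP export is NFS-mounted at the same path on every node; the mount point and model directory names are hypothetical.

```python
# Sketch: load weights from the shared NAS export instead of per-node NVMe.
# The mount point and model directory are hypothetical; the export would be
# NFS-mounted at an identical path on all 8 nodes.
from pathlib import Path

from transformers import AutoModelForCausalLM, AutoTokenizer

SHARED_MODELS = Path("/mnt/nas/models")  # hypothetical NFS mount of the NAS export
MODEL_DIR = SHARED_MODELS / "kimi-k2"    # hypothetical snapshot kept under ZFS

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    device_map="auto",       # requires `accelerate`; places layers on the local GPU
    torch_dtype="auto",      # keep whatever precision the checkpoint ships in
    trust_remote_code=True,  # Kimi-style checkpoints typically ship custom model code
)
```

For a multi-hundred-gigabyte checkpoint the serving stack would still shard layers across nodes; the point here is only that every node reads one canonical copy of the weights rather than maintaining its own.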