768GB of Cheap Intel Optane DIMM Memory Sticks Used to Run 1-Trillion-Parameter LLM on a System with a Single GPU — Local Kimi K2.5 Install Achieved Roughly 4 Tokens per Second

768GB of Cheap Intel Optane DIMM Memory Sticks Used to Run 1-Trillion-Parameter LLM on a System with a Single GPU — Local Kimi K2.5 Install Achieved Roughly 4 Tokens per Second

Tom's Hardware
Tom's HardwareMay 23, 2026

Why It Matters

The setup proves that high‑parameter LLMs can be run locally on modest hardware, opening doors for smaller enterprises and researchers without cloud budgets. It also highlights a market gap that emerging standards like CXL aim to fill.

Key Takeaways

  • 768 GB Intel Optane used as RAM for 1‑trillion‑parameter LLM
  • System achieved ~4 tokens per second with single RTX 3060 GPU
  • Optane memory mode paired with DDR4 cache reduced latency vs SSD
  • Hybrid CPU/GPU inference via llama.cpp enabled model execution on workstation
  • Highlights need for affordable byte‑addressable memory as DRAM gap narrows

Pulse Analysis

The memory requirements of trillion‑parameter models have traditionally forced developers into expensive, multi‑GPU clusters or cloud services. Intel’s Optane Persistent Memory, positioned between DRAM and SSD in latency, offers a niche solution: large capacity at a fraction of DRAM cost. By configuring six 128 GB DCPMM modules in memory mode and using DDR4 as a cache, the Redditor created a 768 GB addressable pool that kept the model’s parameters resident, avoiding costly NVMe swaps while staying within a single‑GPU budget.

Technical execution relied on llama.cpp’s flexible inference engine, which can offload routing layers to the GPU via the override‑tensor flag while the CPU handles the bulk of the computation. The Xeon Gold 6246 and RTX 3060 combination yielded about four tokens per second—modest by commercial standards but remarkable for a workstation built on a shoestring budget. The hybrid approach demonstrates that careful software tuning can extract meaningful performance from legacy hardware, especially when the memory subsystem is optimized for the model’s access patterns.

The broader implication is a clear signal to the industry: as LLMs scale, the memory hierarchy becomes a critical bottleneck. While Optane’s discontinuation leaves a gap, emerging standards like Compute Express Link (CXL) promise scalable, cost‑effective, byte‑addressable memory pools that could democratize access to frontier AI models. Vendors and cloud providers that invest early in such technologies may capture a competitive edge, while enterprises seeking on‑premise AI will benefit from reduced reliance on costly cloud compute.

768GB of cheap Intel Optane DIMM memory sticks used to run 1-trillion-parameter LLM on a system with a single GPU — local Kimi K2.5 install achieved roughly 4 tokens per second

Comments

Want to join the conversation?

Loading comments...