Tenstorrent Unveils Next-Gen Servers for Fast Tokens, No Disaggregation Needed


EE Times – Designlines/AI & ML
Apr 28, 2026

Why It Matters

By eliminating the need for separate prefill and decode hardware, Tenstorrent reduces infrastructure complexity and cost per token, giving enterprises a more economical path to high-throughput generative AI.

Key Takeaways

  • Galaxy Blackhole server delivers 23 PFLOPS of FP8 compute using 32 Blackhole chips
  • Blitz Mode reaches 350 tokens/sec per user, with time to first token under 4 s
  • Tenstorrent avoids disaggregation, handling prefill and decode on the same server
  • Each server provides 1 TB of DRAM, 6.2 GB of SRAM, and 2.9 PB/s of bandwidth
  • Tenstorrent partners with Equinix's Distributed AI Hub to reach the enterprise market
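The per-user Blitz Mode figures can be turned into a rough end-to-end response time. This is a back-of-envelope sketch, not a benchmark: it assumes the 4 s first-token bound and a decode rate that stays at 350 tokens/s for the whole answer, and the helper name `response_time_s` is hypothetical.

```python
# Back-of-envelope latency for one user, using the quoted Blitz Mode figures:
# at most 4 s to the first token, then ~350 tokens/s per user.
# Assumption: the decode rate stays constant for the whole response.

def response_time_s(output_tokens: int, ttft_s: float = 4.0,
                    tokens_per_s: float = 350.0) -> float:
    """Upper-bound wall-clock time: time to first token plus steady-state decode."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    # The first token arrives at ttft_s; the remaining tokens stream at tokens_per_s.
    return ttft_s + (output_tokens - 1) / tokens_per_s

# Gap between consecutive tokens once streaming has started.
per_token_ms = 1000.0 / 350.0

print(f"inter-token gap: {per_token_ms:.2f} ms")
print(f"1,000-token answer: {response_time_s(1000):.2f} s")
```

For a 1,000-token answer this works out to 4 + 999/350, or about 6.85 s, with roughly 2.86 ms between tokens; real figures will depend on prompt length and load.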

Pulse Analysis

Tenstorrent’s Galaxy Blackhole line marks a strategic shift away from the industry’s growing reliance on disaggregated AI clusters. By packing 32 Blackhole chips into a single 6U chassis, the company delivers 23 PFLOPS of FP8 compute while maintaining a unified memory hierarchy: 1 TB of DRAM and 6.2 GB of on-chip SRAM per server. This architecture lets the prefill and decode stages run on the same hardware, simplifying deployment and reducing latency, since the KV cache built during prefill does not have to be shipped over a network to a separate decode machine. The result is a token-generation pipeline that can sustain 350 tokens per second per user in Blitz Mode, with first-token latency under four seconds, positioning Tenstorrent as a cost-effective alternative to Nvidia’s multi-rack solutions.
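Dividing the server-level figures by the 32 chips gives a feel for what each Blackhole chip contributes. This is a simple sketch that assumes resources are split evenly across chips; it treats 1 TB as 1,024 GB (1,000 GB would give 31.25 GB per chip) and 6.2 GB of SRAM as 6,200 MB.

```python
# Divide the quoted per-server figures evenly across the 32 Blackhole chips.
# Assumptions: resources are split uniformly per chip, "1 TB" means 1,024 GB,
# and "6.2 GB" of SRAM means 6,200 MB.

CHIPS = 32
SERVER = {
    "FP8 compute (TFLOPS)": 23_000.0,  # 23 PFLOPS
    "DRAM (GB)": 1_024.0,              # 1 TB
    "SRAM (MB)": 6_200.0,              # 6.2 GB
}

def per_chip(totals: dict[str, float], chips: int) -> dict[str, float]:
    """Return each server-level total divided evenly by the chip count."""
    return {name: value / chips for name, value in totals.items()}

for name, value in per_chip(SERVER, CHIPS).items():
    print(f"{name:>22}: {value:,.2f} per chip")
```

Under these assumptions each chip accounts for about 719 TFLOPS of FP8 compute, 32 GB of DRAM, and roughly 194 MB of SRAM.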

The performance claims are underpinned by an ambitious software stack that includes a domain-specific language, TTLang, and a compiler that can convert CUDA code automatically. Tenstorrent reports an 80–90% success rate when running off-the-shelf models from Hugging Face, and its open-source tools speed the integration of image, video, and LLM workloads. By supporting batch sizes up to 64 and context windows of 128K tokens, the platform serves both conversational AI and high-throughput content generation, offering enterprises a versatile compute engine without the need for workload-specific accelerators.
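Batch size and context length together determine how much of the 1 TB of DRAM the attention KV cache consumes. The sketch estimates that footprint for a hypothetical 70B-class model shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) with the cache stored in FP8; none of these parameters come from Tenstorrent, and `kv_cache_bytes` is a hypothetical helper.

```python
# Rough KV-cache footprint for the quoted batch and context limits.
# Hypothetical model shape: 80 layers, 8 KV heads (grouped-query attention),
# head dimension 128. Assumption: keys and values stored in FP8 (1 byte each).

GIB = 1024 ** 3

def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 1) -> int:
    """Bytes needed to cache keys and values (hence the factor of 2) for `tokens` tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

context = 128 * 1024  # 128K-token window
batch = 64            # maximum quoted batch size

per_sequence = kv_cache_bytes(context)
print(f"one full-length sequence: {per_sequence / GIB:.0f} GiB")
print(f"{batch} full-length sequences: {batch * per_sequence / GIB:.0f} GiB")
```

With this model shape, one full 128K-token sequence needs 20 GiB of cache and 64 of them would need 1,280 GiB before any weights are stored, so the batch that can actually run at full context depends on the model's shape and cache precision; shorter contexts scale the figure down linearly.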

Strategically, Tenstorrent leverages its partnership with Equinix’s Distributed AI Hub to bridge the gap between hardware innovation and enterprise adoption. The hub provides a managed service layer, allowing customers to run workloads on-prem or in a cloud-like environment while benefiting from Tenstorrent’s high-speed Ethernet interconnects (up to 11.2 GB/s per server). This integrated stack addresses the market’s demand for lower cost per token and reduced operational complexity, especially in latency-sensitive sectors such as finance and real-time video generation. As AI models continue to scale, Tenstorrent’s general-purpose, non-disaggregated approach could reshape how organizations balance performance, cost, and flexibility.

