Groq, Etched, SambaNova, Taalas // The AI Hardware Show S2E4
Why It Matters
Specialized inference chips promise dramatically lower latency and cost, forcing data‑center operators and investors to rethink the balance between flexibility and performance in AI deployments.
Key Takeaways
- Groq’s LPU architecture offers deterministic inference via on‑chip SRAM.
- Etched’s Sohu ASIC trades flexibility for a transformer‑only speed advantage.
- NeuChips’ Raptor targets low‑latency enterprise inference with moderate throughput.
- SambaNova’s SN40L combines massive SRAM and DDR for trillion‑parameter models.
- Taalas and Positron pursue extreme specialization via model‑compiled silicon and FPGAs.
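The “deterministic inference” point above can be made concrete with a toy sketch: because an architecture like Groq’s LPU has no caches, external DRAM, or out‑of‑order execution, every operation’s cycle count is fixed, so end‑to‑end latency is a simple sum known at compile time. The op names and cycle counts below are invented for illustration, not Groq specifics.

```python
# Toy illustration of compile-time-deterministic latency: with no
# caches or dynamic scheduling, each op has a fixed cycle cost, so
# total latency is exactly computable before the program ever runs.
# All op names and cycle counts here are made-up examples.

STATIC_SCHEDULE = [
    ("load_weights_sram", 120),  # SRAM access: fixed cycles, no cache misses
    ("matmul_tile", 512),
    ("activation", 64),
    ("write_back", 96),
]

def compiled_latency_cycles(schedule):
    # Deterministic: a pure sum over fixed per-op cycle counts.
    return sum(cycles for _, cycles in schedule)

print(compiled_latency_cycles(STATIC_SCHEDULE))  # 792
```

On a cached, out‑of‑order machine the same sum would only be a lower bound; here it is the exact latency, which is what lets the compiler guarantee service‑level timing.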
Summary
The AI Hardware Show episode dives deep into the rapidly evolving LLM inference market, profiling a suite of startups that are redefining data‑center acceleration. Hosts Sally Ward‑Foxton and Ian Cutress outline why inference at scale is the next cash‑flow engine, noting that dozens of unicorns are racing to lock down deterministic performance, power efficiency, and cost advantages.
Groq’s Language Processing Unit is a 14 nm chip that eliminates caches, DRAM, and out‑of‑order execution to guarantee compile‑time latency; its upcoming 4 nm, stacked‑DRAM successor is funded by a $700 million Series D. Etched’s Sohu ASIC, built on TSMC’s 4 nm node, forgoes all flexibility to run transformers exclusively, claiming 500,000 Llama 70B tokens per second—an order of magnitude ahead of Nvidia’s Blackwell.
Meanwhile, NeuChips’ Raptor accelerator pairs a modest 8–10 tokens per second per chip with on‑device vector search, targeting enterprise workloads where power and latency trump raw throughput. SambaNova’s SN40L leverages a coarse‑grained reconfigurable array, 520 MB of SRAM, and 64 GB of HBM to serve multi‑trillion‑parameter models with microsecond model switching, sold as a fully integrated rack.
Taalas bets on a hard‑wired “model‑as‑silicon” approach, recompiling each model onto a custom chip for thousand‑fold efficiency gains, while Positron’s FPGA‑based Atlas card promises 70% faster token rates than Nvidia’s Hopper by exploiting HBM‑equipped Altera Agilex FPGAs. Positron’s founders, former Groq engineers, also tout 93% memory‑bandwidth utilization on DDR‑only ASICs as a path to competitive performance without HBM.
The episode underscores the stakes: Groq’s acquisition by Nvidia was announced on Christmas Eve 2025, Etched’s CEO admits, “If transformers lose, we lose,” and Taalas’s founder emphasizes eliminating every runtime abstraction.
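Positron’s memory‑bandwidth‑utilization pitch reflects a standard back‑of‑envelope model: single‑stream decode is usually memory‑bound, because generating each token streams the full weight set through memory once, so tokens/s ≈ effective bandwidth ÷ model bytes. A rough sketch of that arithmetic, with all bandwidth and model‑size numbers assumed purely for illustration:

```python
# Back-of-envelope decode throughput for a memory-bandwidth-bound LLM:
# each generated token must read every weight byte once, so the rate
# is effective bandwidth divided by model size. All figures below are
# illustrative assumptions, not any vendor's published specs.

def tokens_per_second(bandwidth_gbs: float,
                      utilization: float,
                      params_billion: float,
                      bytes_per_param: float = 2.0) -> float:
    """Theoretical single-stream decode rate in tokens/s."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    effective_bw = bandwidth_gbs * 1e9 * utilization
    return effective_bw / model_bytes

# A hypothetical 70B-parameter model in FP16 (~140 GB of weights)
# on a device with 2 TB/s of raw memory bandwidth:
print(tokens_per_second(2000, 0.93, 70))  # ~13 tokens/s at 93% utilization
print(tokens_per_second(2000, 0.50, 70))  # ~7 tokens/s at 50% utilization
```

The point of the sketch is that raising utilization from 50% to 93% nearly doubles decode throughput at the same raw bandwidth, which is why high utilization on cheaper DDR can plausibly rival a poorly utilized HBM part.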
These anecdotes illustrate the spectrum from ultra‑flexible CPUs to single‑purpose ASICs, each carving a niche in the inference hierarchy. The implications are clear: investors must choose between flexibility and peak efficiency, while hyperscalers weigh deterministic latency against the risk of architectural lock‑in. As power‑hungry GPUs approach diminishing returns, specialized silicon—whether deterministic LPUs, transformer‑only ASICs, or model‑compiled chips—could reshape AI infrastructure economics, driving down cost per token and enabling new edge‑centric generative applications.