Show HN: Real-Time AI (Audio/Video in, Voice Out) on an M3 Pro with Gemma E2B

Hacker News · Apr 5, 2026

Why It Matters

Running multimodal AI locally cuts operating expenses and safeguards user data, making advanced conversational tools accessible to learners and future mobile devices.

Key Takeaways

  • Real-time speech+vision AI runs on Apple M3 Pro
  • End-to-end latency of roughly 2.5–3 seconds
  • No server needed; free, open‑source deployment
  • Multilingual model supports native language fallback
  • Uses Gemma 3n E2B (~2.6 GB) and Kokoro TTS

Pulse Analysis

The AI landscape is rapidly shifting from cloud‑centric models to edge‑first solutions, driven by powerful silicon like Apple’s M3 Pro and Nvidia’s latest GPUs. On‑device inference reduces latency, lowers bandwidth costs, and addresses growing privacy regulations, allowing developers to embed sophisticated capabilities directly into consumer hardware. This trend is especially relevant for multimodal applications that blend audio, video, and text, where real‑time responsiveness is critical for natural interaction.

Parlor leverages Google’s Gemma 3n E2B, a compact yet capable multimodal transformer, through the LiteRT‑LM runtime to run on the M3 Pro’s GPU. Coupled with Kokoro’s cross‑platform TTS, the system streams audio responses while text is still being generated, closing the conversational loop in under three seconds. The open‑source stack, built on FastAPI and WebSocket communication, gives developers a reproducible benchmark and a template for deploying similar setups on Linux GPUs or other Apple Silicon devices. Its modest 3 GB RAM footprint and automatic model download make it practical for hobbyists and educators alike.
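The key latency trick described above — handing text to the TTS engine as it streams out of the language model, rather than waiting for the full reply — comes down to a producer/consumer loop. A minimal sketch using only stdlib asyncio (all names here are illustrative stand-ins, not Parlor's actual API; a real system would plug in LiteRT-LM token generation and Kokoro synthesis):

```python
import asyncio


async def generate_tokens():
    # Stand-in for streaming LLM decoding (e.g. LiteRT-LM); illustrative only.
    for tok in ["Hello", " there", "!", " How", " can", " I", " help", "?"]:
        await asyncio.sleep(0)  # yield control, as a real decoder would
        yield tok


async def tts_worker(queue: asyncio.Queue, audio_out: list):
    # Stand-in for a TTS engine (e.g. Kokoro): consumes text fragments
    # from the queue and emits audio chunks for playback.
    while True:
        text = await queue.get()
        if text is None:  # sentinel: generation finished
            break
        audio_out.append(f"<audio:{text.strip()}>")


async def converse() -> list:
    # Overlap generation and synthesis: flush the buffer to TTS at
    # sentence boundaries so playback can start before decoding finishes.
    queue: asyncio.Queue = asyncio.Queue()
    audio: list = []
    worker = asyncio.create_task(tts_worker(queue, audio))
    buffer = ""
    async for tok in generate_tokens():
        buffer += tok
        if buffer.endswith((".", "!", "?")):
            await queue.put(buffer)
            buffer = ""
    if buffer:  # flush any trailing partial sentence
        await queue.put(buffer)
    await queue.put(None)
    await worker
    return audio


audio_chunks = asyncio.run(converse())
print(audio_chunks)
```

In a server like the one described, the same loop would sit inside a FastAPI WebSocket handler, with the "audio chunks" sent back to the browser frame by frame; sentence-boundary flushing is one simple chunking policy, chosen here because TTS engines generally need a complete phrase to produce natural prosody.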

For the broader market, Parlor demonstrates that high‑quality, multilingual AI can be democratized without expensive hardware or recurring cloud fees. As mobile processors continue to close the performance gap, we can expect a wave of on‑device language tutors, visual assistants, and accessibility tools that operate offline. The key challenges will involve optimizing model size, ensuring consistent cross‑platform performance, and integrating seamless user experiences, but the groundwork laid by projects like Parlor signals a viable path toward ubiquitous, privacy‑first AI.
