Ollama Is Now Powered by MLX on Apple Silicon in Preview

Hacker News · Mar 31, 2026

Why It Matters

By marrying MLX with NVFP4 quantization, Ollama offers developers near‑server‑grade performance on a laptop, accelerating AI‑driven workflows and lowering hardware barriers for advanced coding assistants.

Key Takeaways

  • Ollama now runs on Apple MLX for faster inference (a quick-start sketch follows this list)
  • NVFP4 quantization improves accuracy while cutting memory
  • Cache upgrades lower memory use and speed responses
  • Requires a Mac with over 32 GB of unified memory
  • Optimizes coding agents like Claude Code and OpenClaw
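
For readers who want to kick the tires, the sketch below sends one request to Ollama's standard local REST API, which is unchanged by the new backend; nothing in the snippet selects MLX, since the backend choice is handled server-side. The model tag is an assumption, so substitute any model you have pulled.

    import json
    import urllib.request

    # Illustrative model tag -- substitute anything pulled locally
    # (e.g. with `ollama pull <tag>`).
    payload = json.dumps({
        "model": "llama3.2",
        "prompt": "Summarize unified memory in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8")

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())

    print(body["response"])
    # eval_count / eval_duration (nanoseconds) give a rough tokens/sec figure.
    if body.get("eval_duration"):
        print(body["eval_count"] / (body["eval_duration"] / 1e9), "tok/s")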

Pulse Analysis

Local large‑language‑model deployment has long been hampered by the gap between desktop hardware and cloud‑grade GPUs. Apple’s MLX framework, built on the unified memory architecture of M‑series silicon, narrows that divide by letting Ollama execute heavy models directly on the GPU’s Neural Accelerators. The shift not only trims time to first token but also raises sustained generation speed, making Mac laptops viable platforms for real‑time AI assistants and developer tools.
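
A few lines of MLX make the unified-memory point concrete. In the sketch below (array sizes are illustrative, not from the article), the tensors are allocated once in memory that both CPU and GPU address; MLX builds a lazy compute graph and mx.eval runs it on the default device, which is the GPU on Apple Silicon.

    import mlx.core as mx

    # Arrays live in unified memory; the GPU computes on them in place,
    # with no explicit host-to-device copy as on a discrete GPU.
    a = mx.random.normal((4096, 4096))
    b = mx.random.normal((4096, 4096))

    # MLX is lazy: this line builds a graph rather than computing anything.
    c = (a @ b).mean()

    # mx.eval forces evaluation on the default device (the GPU here).
    mx.eval(c)
    print(c.item())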

The adoption of NVIDIA’s NVFP4 format is a strategic move that aligns Ollama’s on‑device inference with industry‑standard low‑precision techniques. NVFP4 maintains model fidelity while dramatically reducing bandwidth and storage demands, enabling developers to run 35‑billion‑parameter models without sacrificing response quality. As more inference providers embrace NVFP4, Ollama users gain production‑parity results, simplifying the transition from local testing to cloud deployment and fostering a more consistent AI ecosystem.
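
To see why a 4‑bit float format can hold up, consider the shape of NVFP4: each weight becomes an FP4 (E2M1) element, and small blocks of elements share a scale factor. The toy quantizer below mimics that structure with plain float scales for clarity (the real format stores block scales in FP8 and adds a tensor-level scale); the 16‑element block size and the E2M1 value grid are the only NVFP4-specific details it borrows.

    import numpy as np

    # The eight non-negative values an FP4 E2M1 element can represent.
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def quantize_nvfp4_like(x, block=16):
        """Toy NVFP4-style quantizer: 4-bit E2M1 elements with one
        shared scale per 16-element block (float scales here, not FP8)."""
        x = x.reshape(-1, block)
        scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
        scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
        scaled = x / scale
        # Snap each scaled magnitude to the nearest representable value.
        idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
        return np.sign(scaled) * FP4_GRID[idx] * scale

    w = np.random.randn(64).astype(np.float32)
    w_q = quantize_nvfp4_like(w).reshape(-1)
    print("max abs error:", np.abs(w - w_q).max())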

Beyond raw speed, Ollama’s upgraded caching architecture delivers tangible productivity gains. By reusing cache across conversations and intelligently checkpointing prompts, memory consumption drops and response times improve, especially for multi‑step coding agents that rely on extensive system prompts. For developers building tools like Claude Code or OpenClaw, these efficiencies translate into smoother user experiences and lower hardware costs. Looking ahead, Ollama’s roadmap promises broader model support and streamlined custom model imports, positioning the platform as a cornerstone for on‑device AI innovation.
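
Prompt caching of this kind is observable from the outside. In the hedged sketch below, two /api/chat calls share a long system prompt standing in for a coding agent's instructions; if the server reuses the cached prefix, the second response's prompt_eval_count (the number of prompt tokens evaluated fresh) should drop. The model tag and prompt text are stand-ins, and the exact savings depend on the build.

    import json
    import urllib.request

    def chat(messages):
        payload = json.dumps({
            "model": "llama3.2",  # assumption: any locally pulled model
            "messages": messages,
            "stream": False,
        }).encode("utf-8")
        req = urllib.request.Request(
            "http://localhost:11434/api/chat",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    # A long, repeated system prompt plays the role of an agent's instructions.
    system = {"role": "system", "content": "You are a careful coding agent. " * 200}

    first = chat([system, {"role": "user", "content": "List three Python linters."}])
    second = chat([system, {"role": "user", "content": "Pick one and say why."}])

    # Fewer freshly evaluated prompt tokens on the second call suggests
    # the shared system-prompt prefix was served from cache.
    print(first.get("prompt_eval_count"), second.get("prompt_eval_count"))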

