
I Ran Qwen3.5-35B-A3B Locally with Cline Code Agent For Free, Forever

Key Takeaways
- Qwen3.5-35B-A3B runs on Apple Silicon with 64 GB of RAM
- The MoE architecture yields ~35 tokens/s versus ~10 tokens/s for a dense model
- 4-bit quantization compresses the model to ~20 GB
- omlx provides an OpenAI-compatible server for local inference
- Cline integrates the model for AI-assisted coding without API fees
Summary
A developer ran the 35-billion-parameter Qwen3.5-35B-A3B-4bit model on a Mac Mini M4 with 64 GB of RAM, using the omlx inference server and the Cline AI agent for VS Code. The MoE architecture and 4-bit quantization shrink the model to roughly 20 GB, and it delivers an average of 35 tokens per second, about 3.5× the throughput of a dense 32B model. Integration required fixing four streaming-API bugs, but the result is a fully local, cost-free coding assistant that keeps data on-device and eliminates API billing.
Pulse Analysis
The surge in on‑device large language models reflects growing concerns over data privacy and the unsustainable expense of cloud APIs. Apple Silicon, with its unified memory architecture and high‑bandwidth cores, offers a compelling platform for running sophisticated models without external compute. By leveraging quantization techniques, developers can fit multi‑billion‑parameter networks into consumer‑grade hardware, opening the door to offline AI that respects corporate confidentiality and reduces operational overhead.
Qwen3.5‑35B‑A3B’s mixture‑of‑experts design activates only a fraction of its 35 billion parameters per token, effectively behaving like a 3‑billion‑parameter model while retaining the reasoning depth of the larger network. The 4‑bit quantization further trims the footprint to roughly 20 GB, enabling the model to sit comfortably in 64 GB of unified memory on a Mac Mini M4. Benchmarks show an average throughput of 35 tokens per second, a stark contrast to the 10 tokens per second achieved by a dense 32‑B counterpart, translating into smoother, near‑real‑time coding assistance.
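The footprint and speedup figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only; real deployments add KV-cache and runtime overhead on top of the raw weight storage, which is why 17.5 GB of 4-bit weights lands near the ~20 GB the article reports.

```python
# Rough memory and throughput math for a 35B-total / 3B-active MoE model.
# All figures are estimates, not measurements from the article's setup.

def quantized_size_gb(params_billion: float, bits: int) -> float:
    """Approximate weight storage for a model at a given bit-width."""
    return params_billion * 1e9 * bits / 8 / 1e9

total_params = 35.0   # total parameters (billions)
active_params = 3.0   # parameters the MoE router activates per token

weights_4bit = quantized_size_gb(total_params, 4)
print(f"4-bit weights: {weights_4bit:.1f} GB")  # ~17.5 GB, ~20 GB with overhead

# Decode speed is roughly bound by how many weight bytes are read per token,
# so a 3B-active MoE behaves far more like a small dense model than a 32B one.
# The naive ratio below overstates the gain; the observed 3.5x is lower because
# bandwidth is not the only bottleneck.
naive_ratio = 32.0 / active_params
print(f"naive active-parameter ratio vs dense 32B: {naive_ratio:.1f}x")
```

The key point the arithmetic makes concrete: quantization decides whether the model *fits* in unified memory, while the MoE's small active-parameter count decides how *fast* it decodes.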
Coupling the model with omlx—a lightweight, OpenAI‑compatible inference server optimized for Apple Silicon—allows any tool that speaks the ChatGPT API to point to a local endpoint. The Cline VS Code extension capitalizes on this by providing multi‑step planning, file access, and terminal execution directly within the editor. After resolving streaming‑API quirks related to the model’s internal <think> blocks, the author demonstrates a seamless, free‑forever AI coding assistant that delivers enterprise‑grade performance without sacrificing security or budget.
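Because the local server speaks the OpenAI chat-completions wire format, any client can target it with a plain HTTP POST. The stdlib-only sketch below shows that shape; the port, endpoint path, and model identifier are assumptions (substitute whatever your server reports), and the `<think>`-stripping helper mirrors the kind of cleanup the author needed, not the author's exact fix.

```python
# Minimal client for a local OpenAI-compatible server, stdlib only.
# Port 10240 and the model name are placeholder assumptions.
import json
import re
import urllib.request

def strip_think(text: str) -> str:
    """Drop the model's internal <think>...</think> reasoning blocks."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def ask_local_model(prompt: str,
                    base_url: str = "http://localhost:10240/v1") -> str:
    """POST a chat completion to a local OpenAI-compatible endpoint."""
    payload = {
        "model": "qwen3.5-35b-a3b-4bit",  # assumed id; check your server
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return strip_think(body["choices"][0]["message"]["content"])
```

Cline plays the role of this client inside VS Code: point its OpenAI-compatible provider at the same base URL and it gains planning, file access, and terminal execution against the local model, with no key or billing involved.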