Troubleshooting Guide: Running Qwen3.5-35B with Reasoning & Tool Calling Using vLLM on Nvidia DGX Spark


Agentic AI Mar 14, 2026

Key Takeaways

  • Standard vLLM images lack support for Qwen3.5 MOE
  • Upgrade Transformers and vLLM to recognize `qwen3_5_moe`
  • 4‑bit AWQ quantization reduces memory for the 35B model
  • 131K context enables extensive reasoning and tool calling
  • OpenCode integration adds programmable tool execution

Summary

The post details how to run the Qwen3.5-35B MOE model, which combines 35B parameters, 4‑bit AWQ quantization, and a 131K‑token context window, on Nvidia DGX Spark using vLLM. Standard vLLM Docker images (e.g., nvcr.io/nvidia/vllm:26.01-py3) ship with Transformers versions that do not recognize the `qwen3_5_moe` architecture, causing loading failures. Upgrading to newer Transformers and vLLM releases, adjusting memory settings, and configuring host networking yields a functional deployment with reasoning and tool‑calling capabilities. The guide also shows how to integrate OpenCode for programmable tool execution.
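The root cause described above is a Transformers release that predates the `qwen3_5_moe` architecture. A minimal pre-flight version gate can catch this before a long model download; a sketch follows, where the minimum version numbers are placeholders, not values taken from the post (check the vLLM release notes for the real requirements):

```python
def parse_version(v: str) -> tuple:
    """Parse a dotted version string like '4.57.1' into a comparable tuple."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def version_at_least(installed: str, required: str) -> bool:
    """Return True if the installed version meets or exceeds the required one."""
    return parse_version(installed) >= parse_version(required)

# Hypothetical minimums -- consult the vLLM release notes for real ones.
MIN_VERSIONS = {"transformers": "4.57.0", "vllm": "0.8.0"}

def check(package: str, installed: str) -> str:
    """Report whether an installed package version clears the minimum."""
    required = MIN_VERSIONS[package]
    if version_at_least(installed, required):
        return f"{package} {installed}: OK"
    return f"{package} {installed}: upgrade to >= {required}"
```

Running such a check inside the container before launching the server turns an opaque model-loading failure into an actionable upgrade message.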

Pulse Analysis

Deploying large‑scale mixture‑of‑experts models like Qwen3.5‑35B on on‑premise GPU clusters has historically been hampered by software compatibility gaps. The model’s architecture, identified as `qwen3_5_moe`, was not recognized by the Transformers library bundled with early vLLM Docker images, leading to validation errors during initialization. By pulling newer vLLM releases and aligning the underlying Transformers version, engineers can bridge this gap, allowing the model to load correctly and leverage its 35 billion‑parameter capacity while keeping active parameters at roughly 10.7B through expert routing.
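The validation error surfaces because the checkpoint's `config.json` declares a `model_type` the installed stack has never heard of. A cheap sanity check before downloading tens of gigabytes of weights is to inspect that field; here is a sketch, where the supported-types set is illustrative (in practice you would query the installed Transformers registry):

```python
import json

# Model types the installed stack can load -- illustrative, not exhaustive.
SUPPORTED_MODEL_TYPES = {"llama", "qwen2", "qwen3_moe", "qwen3_5_moe"}

def model_type_supported(config_json: str) -> bool:
    """Check the `model_type` field of a Hugging Face-style config.json."""
    config = json.loads(config_json)
    return config.get("model_type") in SUPPORTED_MODEL_TYPES

# An unrecognized model_type is exactly what older images choke on.
ok = model_type_supported('{"model_type": "qwen3_5_moe"}')
```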

Memory efficiency is another critical factor. The guide demonstrates that 4‑bit AWQ quantization dramatically cuts VRAM consumption, making a 35B model feasible on a single DGX Spark node. Coupled with careful batch sizing and the `--net=host` flag to avoid port‑binding conflicts, the setup achieves stable inference throughput. These optimizations are essential for enterprises that need to run long‑context (131K‑token) workloads without incurring prohibitive hardware costs.
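The scale of the savings is easy to estimate from first principles. A rough back-of-the-envelope sketch for the weights alone, ignoring KV cache, activations, and quantization overhead such as scale tensors:

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS = 35e9  # Qwen3.5-35B total parameter count

bf16_gb = weight_memory_gb(PARAMS, 16)  # 70.0 GB in bfloat16
awq4_gb = weight_memory_gb(PARAMS, 4)   # 17.5 GB at 4-bit AWQ
print(f"bf16: {bf16_gb:.1f} GB, 4-bit AWQ: {awq4_gb:.1f} GB")
```

Dropping from roughly 70 GB of weights to roughly 17.5 GB is the difference between a multi-node deployment and a single node, which is why the quantized build fits the DGX Spark.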

Beyond raw performance, the integration with OpenCode transforms the model from a static text generator into an interactive tool‑calling engine. This capability enables automated data retrieval, code execution, and real‑time decision support within a single LLM pipeline. For sectors such as finance, healthcare, and logistics, the combination of extensive context windows, reasoning depth, and programmable tool access opens new avenues for complex workflow automation and insight generation, positioning Qwen3.5‑35B as a strategic asset in the AI stack.
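Mechanically, this works because vLLM exposes an OpenAI-compatible API: a client such as OpenCode sends a tool schema with each request and receives structured `tool_calls` back, which it then executes. A minimal sketch of parsing that response shape follows; the sample response dict is a hand-written stand-in mimicking the OpenAI chat-completions format, not actual server output:

```python
import json

def extract_tool_calls(response: dict) -> list:
    """Pull (name, parsed_arguments) pairs from an OpenAI-style chat response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON string and must be decoded before use.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls

# Hand-written stand-in mimicking the OpenAI response shape.
sample_response = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "function": {
                    "name": "get_weather",
                    "arguments": '{"city": "Berlin"}',
                }
            }]
        }
    }]
}
```

The client loop then runs each extracted call, appends the tool result as a new message, and re-queries the model, which is the pattern OpenCode automates on top of the vLLM endpoint.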

