
By offloading deterministic computation to a secure executor, DeepMath cuts inference latency, reduces arithmetic errors, and offers auditable, concise reasoning—key advantages for enterprise AI deployments that demand reliability and cost efficiency.
Mathematical problem solving has long been a stumbling block for large language models, which excel at language but falter on precise arithmetic. Traditional chain‑of‑thought approaches generate lengthy textual traces that are both slow to process and prone to calculation mistakes. DeepMath tackles this gap by integrating a lightweight Python executor directly into the inference loop, allowing the model to delegate deterministic steps to code rather than prose. This hybrid strategy aligns with a broader industry shift toward tool‑augmented AI, where external utilities enhance model reliability without inflating parameter counts.
The technical core of DeepMath combines the Qwen‑3‑4B Thinking foundation with the smolagents framework, which orchestrates agent calls and sandboxed execution. GRPO fine‑tuning further shapes the model’s behavior by rewarding correct answers, the generation of code snippets, and shorter outputs, creating a strong incentive for concise, computation‑driven reasoning. Training leverages the OpenMathReasoning TIR subset, exposing the model to problem statements without solutions, so it learns to request calculations rather than fabricate them. Benchmarks across four challenging datasets demonstrate that the agentic configuration not only slashes token output by two‑thirds but also lifts accuracy, especially when GRPO and the agent are used together.
For businesses deploying AI at scale, DeepMath offers a cost‑effective alternative to massive, compute‑hungry models. Shorter traces translate to faster inference, lower bandwidth, and easier post‑processing, while sandboxed code execution mitigates security risks associated with unrestricted tool use. The open‑source release invites integration into existing pipelines, paving the way for more trustworthy, interpretable AI solutions in finance, engineering, and education where precise numerical reasoning is non‑negotiable.
Published December 4, 2025 · By Intel AI Software Group
Daniel Fleischer (danf) – Intel
Moshe Berchansky (mber) – Intel
Moshe Wasserblat (moshew) – Intel

DeepMath is an aligned math‑reasoning agent built on Qwen‑3‑4B Thinking and fine‑tuned with GRPO (Group Relative Policy Optimization).
Instead of verbose text, the model emits tiny Python snippets for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length. The agent is implemented using the smolagents library.
We evaluate DeepMath on four math datasets: MATH‑500, AIME, HMMT, and HLE, and show that:
🤖 The math agent alone reduces output lengths by up to 66 %, while often improving accuracy.
⚡ GRPO training improves the agent's performance even further on almost all benchmarks.
Code and evaluation scripts: https://github.com/IntelLabs/DeepMath
Model: https://huggingface.co/Intel/deepmath-v1
Large language models (LLMs) have advanced reasoning capabilities, but mathematical problem‑solving remains challenging; chain‑of‑thought traces can be lengthy and prone to arithmetic mistakes. Recent works demonstrate that small models can reach strong performance, and other studies investigate tool use to improve reliability. What those papers generally do not emphasize is reducing trace verbosity or explicitly training models to prefer short, computation‑oriented traces executed in a constrained, auditable environment.
We focused on two goals:
Offload deterministic computation to a safe executor.
Train models to prefer concise, computation‑oriented traces over verbose text.
DeepMath tackles this by combining a small Python executor with a fine‑tuned LLM, enabling concise, computation‑driven reasoning. The model learns to generate short Python snippets, which are executed in a sandbox and reintegrated into the context. GRPO fine‑tuning reinforces this behavior by rewarding correctness and favoring shorter outputs.
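For illustration, an intermediate step delegated to the executor might look like the following (a hypothetical snippet; the exact call format is defined by the agent framework):

```python
# Hypothetical intermediate step the model might emit instead of long-hand arithmetic.
# The sandbox executes the snippet and its printed output is folded back into the trace.
from math import comb

ways = comb(12, 3) * comb(9, 3)  # e.g., choose two labeled groups of 3 from 12 items (illustrative)
print(ways)  # -> 18480, which the agent reuses in its subsequent reasoning
```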
Base model: Qwen‑3‑4B Thinking.
Executor constraints: sandboxed environment, allow‑list of imported modules, per‑snippet timeout.
Inference: built on smolagents, with a dedicated math agent; vLLM is used as the inference engine (a minimal setup sketch follows this list).
Training: based on the GRPO trainer in TRL; we modified TRL’s vLLM client and server to generate GRPO completions using our DeepMath agent.
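The sketch below shows one way such an agent could be wired up with smolagents and a vLLM server. The model ID, endpoint, allow-list, and step limit are illustrative assumptions, not the authors' exact configuration, which lives in the linked repository.

```python
# Minimal sketch (not the authors' exact setup): a smolagents CodeAgent backed by a model
# served through vLLM's OpenAI-compatible endpoint. All values below are illustrative.
from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(
    model_id="Intel/deepmath-v1",         # assumed to be served by a local vLLM server
    api_base="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint (assumption)
    api_key="EMPTY",
)

# CodeAgent executes the Python snippets the model emits; the authorized-imports list
# acts as the module allow-list described above.
agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["math", "sympy", "itertools"],  # illustrative allow-list
    max_steps=8,
)

answer = agent.run("What is the sum of the first 100 positive odd integers?")
print(answer)  # expected: 10000
```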

Figure 1: The vLLM client and server were modified so that GRPO candidate completions are generated by the DeepMath agent while still using the vLLM backend.
During inference, the model can output normal tokens or special agent calls containing Python snippets.
Snippets run in a sandboxed environment with strict safety constraints (no file I/O, no network, per‑snippet timeouts); a simplified executor sketch follows the list below.
Concision: Replace multi‑line textual calculations with short, focused snippets.
Determinism & Safety: Enforce strict execution limits.
Interpretability: Snippets are readable and auditable.
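To make these constraints concrete, here is a deliberately simplified executor sketch: an import allow-list, a reduced set of builtins (no open, so no file I/O), and a per-snippet timeout enforced by running each snippet in a worker process. It is an illustration of the idea, not the production sandbox.

```python
# Simplified sketch of a constrained executor: import allow-list, restricted builtins,
# and a per-snippet timeout via a worker process. A real sandbox enforces far more.
import builtins
import multiprocessing

ALLOWED_MODULES = {"math", "sympy", "fractions", "itertools"}  # illustrative allow-list

def _restricted_import(name, *args, **kwargs):
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"module '{name}' is not on the allow-list")
    return __import__(name, *args, **kwargs)

def _run(snippet, queue):
    # Only a handful of builtins are exposed; 'open' is excluded, so basic file I/O is unavailable.
    safe_builtins = {k: getattr(builtins, k)
                     for k in ("print", "range", "len", "sum", "min", "max", "abs", "enumerate")}
    safe_builtins["__import__"] = _restricted_import
    try:
        exec(snippet, {"__builtins__": safe_builtins})
        queue.put("ok")
    except Exception as exc:  # surface executor errors back to the agent
        queue.put(f"error: {exc}")

def run_sandboxed(snippet, timeout_s=2.0):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(snippet, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():  # per-snippet timeout
        proc.terminate()
        return "error: timeout"
    return queue.get()

if __name__ == "__main__":
    print(run_sandboxed("import math\nprint(math.factorial(10))"))  # allowed: prints 3628800, then 'ok'
    print(run_sandboxed("import os\nos.listdir('.')"))              # blocked import: error message
```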

Figure 2: Output example in which Python code is generated and evaluated, and the result is inserted into the trace and used as context.
We fine‑tune the model using GRPO, a reward‑based optimization that balances the following signals (a reward‑function sketch follows this list):
Accuracy reward: +1 for correct answers.
Code-usage reward: +1 for generating code snippets (weighted 10:1 relative to the accuracy reward).
Length reduction: shorter outputs are encouraged by capping GRPO completion candidates at 5k tokens.
Temperature scheduling: a linear schedule (T = 1.2 → 0.7) balances exploration early on with stability later.
In-context learning: 4 solved examples with agent calls and executor outputs are included so the model learns the syntax and call/response pattern.
Dataset: the Tool‑Integrated Reasoning (TIR) subset of the OpenMathReasoning dataset. GRPO uses only the problem statements (not the solutions); the TIR subset ensures the problems are ones that benefit from the external tool.
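For intuition, here is a hedged sketch of reward functions in the style TRL's GRPO trainer accepts: callables that receive the generated completions (plus any dataset columns as keyword arguments) and return one score per completion. The answer-checking logic, the answers column name, the agent-call marker, and the direction of the 10:1 weighting are all illustrative assumptions; the actual reward code is in the repository.

```python
# Hedged sketch of GRPO reward functions (TRL-style: one float per completion).
# Column names, markers, matching logic, and weights below are illustrative assumptions.
import re

AGENT_CALL_MARKER = "<python>"  # hypothetical marker for an executor call in the trace

def accuracy_reward(completions, answers=None, **kwargs):
    """+1 when the final \\boxed{} answer matches the reference (simplified string match)."""
    answers = answers or [None] * len(completions)
    scores = []
    for completion, ref in zip(completions, answers):
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        ok = match is not None and ref is not None and match.group(1).strip() == str(ref).strip()
        scores.append(1.0 if ok else 0.0)
    return scores

def code_usage_reward(completions, **kwargs):
    """+1 when the trace contains at least one executor call (marker is an assumption)."""
    return [1.0 if AGENT_CALL_MARKER in c else 0.0 for c in completions]

# The post describes a 10:1 weighting between the accuracy and code-usage rewards;
# the direction assumed here (accuracy weighted higher) is an illustrative assumption.
reward_funcs = [accuracy_reward, code_usage_reward]
reward_weights = [10.0, 1.0]

# Length is controlled implicitly by capping GRPO completions at ~5k tokens, while a
# linear temperature schedule trades early exploration for late-stage stability.
def temperature_at(step, total_steps, t_start=1.2, t_end=0.7):
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return t_start + frac * (t_end - t_start)
```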
We benchmarked DeepMath against baselines on four datasets. Metrics include:
majority@16: robustness across samples, as used in previous math‑reasoning work (a small scoring sketch follows this list).
Mean output length: a measure of brevity.
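A small sketch of the majority@16 metric, under the assumption that it scores the most common final answer among 16 samples against the reference:

```python
# Minimal sketch of majority@16: sample k answers per problem and score the modal one.
from collections import Counter

def majority_at_k(sampled_answers, reference, k=16):
    """Return 1.0 if the most common answer among k samples matches the reference."""
    votes = Counter(a for a in sampled_answers[:k] if a is not None)
    if not votes:
        return 0.0
    majority_answer, _ = votes.most_common(1)[0]
    return 1.0 if majority_answer == reference else 0.0

# Example: 16 sampled final answers for one problem (illustrative values)
samples = ["10000"] * 9 + ["9900"] * 5 + [None] * 2
print(majority_at_k(samples, "10000"))  # -> 1.0
```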

We compare a baseline configuration (Qwen‑3‑4B‑Thinking‑2507, no agent) with our DeepMath model. As ablations we evaluate:
+Agent: the agentic framework running with the untrained Qwen‑3 model.
+GRPO: GRPO training applied to non‑agentic inference.
The two ablations are independent, not additive.
Findings
Agentic inference alone reduces output lengths, with mixed accuracy results.
The full DeepMath model (GRPO‑trained and run in agentic mode) achieves the highest accuracy together with shortened traces.
Both GRPO training and agentic inference are needed for the best results.
Key Insight: DeepMath reduces output length by up to 66 % while improving accuracy on challenging datasets.
Accuracy: Offloading computation reduces arithmetic errors.
Efficiency: Shorter outputs mean faster inference and easier interpretability.
Safety: Sandbox execution mitigates risks of running arbitrary code.
DeepMath demonstrates a practical and lightweight way to combine a small executor with an LLM and to train the model to prefer short, computation‑driven traces. Offloading deterministic computation reduces arithmetic and numerical errors and shortens traces, and GRPO fine‑tuning further encourages concise, correct answers. The result is a more accurate and more interpretable math‑solving agent without requiring a massive model or heavy infrastructure.