The model demonstrates that frontier‑level multimodal intelligence can be achieved with modest compute, offering enterprises a cost‑effective alternative to massive LLMs.
The AI landscape is increasingly dominated by gigantic multimodal models, yet many organizations cannot justify the associated compute and licensing costs. Apriel-1.6-15B-Thinker challenges this trend by delivering state-of-the-art reasoning with just 15 billion parameters. Trained on NVIDIA's GB200 Grace Blackwell superchips in roughly 10 000 GPU-hours, a fraction of the resources expended by larger competitors, it shows how strategic data curation and an efficient architecture can push the cost-efficiency frontier.
A two-stage training pipeline underpins the model's capabilities. Mid-training begins with a depth-upscaling corpus that blends high-quality web, scientific, and code data with NVIDIA Nemotron sources, followed by two continual pre-training phases that extend the sequence length to 49 K tokens. Supervised fine-tuning on 2.4 M step-by-step reasoning samples and a lightweight multimodal phase then refine both text and vision skills. Reinforcement learning with Group Sequence Policy Optimization further incentivizes concise, correct answers, cutting token consumption by more than 30% while preserving accuracy across benchmarks.
For enterprises, Apriel-1.6 offers a compelling blend of performance and efficiency. Its AI Index score of 57 places it alongside models with over 200 B parameters, yet its modest footprint translates to lower inference costs and easier deployment on on‑premise or edge hardware. As businesses seek to embed sophisticated reasoning—ranging from complex math to visual document analysis—without ballooning infrastructure budgets, Apriel-1.6 positions itself as a practical, high‑value alternative in the rapidly evolving generative AI market.
Apriel‑1.6‑15b‑Thinker
We release Apriel-1.6-15b-Thinker, a 15-billion-parameter multimodal reasoning model in ServiceNow's Apriel SLM series that achieves state-of-the-art performance against models ten times its size. Apriel-1.6 builds on Apriel-1.5-15b-Thinker with an extensive focus on strengthening text and vision reasoning while improving token efficiency. This version was trained on NVIDIA DGX™ Cloud with GB200 Grace™ Blackwell Superchips.
Apriel‑1.6 scores 57 on the Artificial Analysis Index, outperforming models like Gemini 2.5 Flash, Claude Haiku 4.5 and GPT OSS 20b. It obtains a score on par with Qwen3 235B A22B, while being significantly more efficient. This new release improves or maintains task performance compared with the previous Apriel‑1.5‑15B‑Thinker, while reducing reasoning token usage by more than 30 %.
We follow the same overall training process used for Apriel‑1.5‑15B‑Thinker, which includes a depth‑upscaling phase followed by two Continual Pre‑training (CPT) stages (detailed in [1]).
Depth-upscaling corpus – 35 % diverse sources (high-quality web content, scientific/technical literature, mathematical problem sets, programming code), 15 % high-quality datasets from NVIDIA Nemotron™, and 50 % pre-training-style replay data; a sampling sketch follows this list.
Stage‑1 CPT – Expanded mixture focusing on textual reasoning and image understanding, with additional synthetic text‑only samples (general reasoning, knowledge, coding, creative writing) and image‑text pairs (document/chart understanding, OCR, visual‑reasoning, SVG/web‑code synthesis).
Stage‑2 CPT – After Stage‑1, a text‑only CPT run at an extended 49 K sequence length, followed by a second stage to further refine visual‑reasoning capabilities.
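To make the corpus proportions concrete, here is a minimal sketch of a weighted sampling configuration matching the depth-upscaling mixture above; the category names, weights dictionary, and `sample_source` helper are illustrative assumptions, not the actual training code.

```python
import random

# Illustrative sampling weights for the depth-upscaling corpus,
# mirroring the proportions described above (not the real training config).
DEPTH_UPSCALING_MIX = {
    "diverse_sources": 0.35,     # web content, scientific/technical, math, code
    "nvidia_nemotron": 0.15,     # high-quality Nemotron datasets
    "pretraining_replay": 0.50,  # pre-training-style replay data
}

def sample_source(mixture: dict[str, float]) -> str:
    """Pick the data source for the next training example according to its weight."""
    names, weights = zip(*mixture.items())
    return random.choices(names, weights=weights, k=1)[0]

# Example: draw 1,000 documents and check the empirical proportions.
draws = [sample_source(DEPTH_UPSCALING_MIX) for _ in range(1_000)]
print({name: draws.count(name) / len(draws) for name in DEPTH_UPSCALING_MIX})
```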
Running this mid-training pipeline required approximately 10 000 GPU-hours on NVIDIA GB200s, a small compute footprint enabled by their high throughput and our careful data strategy.
Using the mid‑trained model, we perform post‑training consisting of large‑scale Supervised Fine‑tuning (SFT) and Reinforcement Learning (RL) targeting both vision and text abilities.
SFT is performed on a curated dataset of 2.4 M high-signal text samples, each containing explicit step-by-step reasoning traces.
Dataset composition: synthetic execution‑verifiable samples for math, coding, scientific problem‑solving, plus a broad mix of instruction‑following, conversational, API/function‑calling, creative writing, safety, and other knowledge‑intensive samples.
Quality pipeline: multi‑stage de‑duplication, content filtering, heuristic pruning, LLM‑as‑Judge validation, execution‑based verification, and strict de‑contamination against evaluation benchmarks.
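As an illustration of the de-contamination step, the sketch below flags training samples that share long word n-grams with evaluation prompts, a common screening heuristic; the 13-gram window, function names, and placeholder data are assumptions, not ServiceNow's actual pipeline.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams used as contamination fingerprints."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_grams: set[tuple[str, ...]], n: int = 13) -> bool:
    """Flag a training sample if it shares any n-gram with a benchmark item."""
    return bool(ngrams(sample, n) & benchmark_grams)

# Example: pre-compute benchmark fingerprints once, then filter candidate SFT samples.
benchmark_prompts = ["..."]   # prompts from the evaluation suites being protected
sft_candidates = ["..."]      # candidate training samples
bench_grams = set().union(*(ngrams(p) for p in benchmark_prompts))
clean_samples = [s for s in sft_candidates if not is_contaminated(s, bench_grams)]
```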
Two phases (both at 32 K context length):
Text-only training on the 2.4 M samples for 4 epochs, with a simplified chat template and four new special tokens (<tool_calls>, </tool_calls>, [BEGIN FINAL RESPONSE], <|end|>); see the tokenizer sketch after this list.
Lightweight multimodal run for 3 epochs, using rejection‑sampled data from Apriel‑1.5‑15b‑Thinker to preserve image performance after token changes.
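Introducing new special tokens means registering them with the tokenizer and resizing the model's embedding matrix so the new IDs map to trainable vectors. A minimal sketch using the Hugging Face transformers API; the checkpoint path is a placeholder and the exact model class for the multimodal checkpoint may differ, so this is not the project's actual training code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/mid-trained-checkpoint"  # placeholder, not a real model ID

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Register the four new special tokens used by the simplified chat template.
new_tokens = ["<tool_calls>", "</tool_calls>", "[BEGIN FINAL RESPONSE]", "<|end|>"]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})

# Grow the embedding (and tied output) matrix so the new token IDs have trainable rows.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```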
This SFT foundation yields strong multimodal understanding, improved text reasoning, and enhanced agentic behavior.
We use a multi-stage RL setup to improve reasoning capability and efficiency.
Training domains: visual reasoning, VQA, OCR, simple Q&A (encouraging short answers), math, STEM multiple‑choice, and function calling (structured tool use).
Rewards for correctness; penalties for verbosity, incorrect formats, and similar issues, encouraging fewer reasoning tokens and more direct answers (a reward-shaping sketch follows this list).
Optimized with Group Sequence Policy Optimization (GSPO) [2] using the VeRL framework and rule‑based verification.
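The reward design can be pictured as correctness-driven scoring with explicit penalties for format violations and verbosity; GSPO then turns such rewards into group-normalized, sequence-level advantages. Below is a minimal rule-based sketch; the penalty weights, token budget, and `check_answer` verifier are illustrative assumptions, not the production reward.

```python
def check_answer(predicted: str, reference: str) -> bool:
    """Illustrative rule-based verifier: exact match after light normalization."""
    return predicted.strip().lower() == reference.strip().lower()

def reward(response: str, reference: str, max_reasoning_tokens: int = 2048) -> float:
    """Correctness reward minus penalties for bad formatting and excessive reasoning."""
    # Format check: the final answer must be introduced by the expected marker.
    if "[BEGIN FINAL RESPONSE]" not in response:
        return -1.0

    reasoning, final_answer = response.split("[BEGIN FINAL RESPONSE]", 1)

    # Correctness via rule-based verification (string/numeric match, unit tests, etc.).
    score = 1.0 if check_answer(final_answer, reference) else 0.0

    # Verbosity penalty: discourage reasoning traces that exceed the token budget.
    reasoning_tokens = len(reasoning.split())  # crude whitespace proxy for token count
    if reasoning_tokens > max_reasoning_tokens:
        score -= 0.2 * (reasoning_tokens / max_reasoning_tokens - 1.0)

    return score
```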
Apriel‑1.6 is evaluated on a wide range of benchmarks (tool use, math, coding, instruction following, long context). Scores from the Artificial Analysis Index v3.0 are reported alongside internal evaluations.
| Category | Benchmark | Apriel-1.6-15B-Thinker | Apriel-1.5-15B-Thinker | GPT OSS 120B | DeepSeek R1 0528 | Gemini 2.5 Flash (Sep) | GPT 5 mini (high) | Claude 4.5 Sonnet (thinking) | o3-mini (high) |
|---|---|---|---|---|---|---|---|---|---|
| Average Score** | – | 53.22 | 46.56 | 52.56 | 51.92 | 50.71 | 62.58 | 60.37 | 48.85 |
| Function Calling | BFCL v3 only | 63.50 | 51.88 | 50.62 | 39.75 | 39.75 | 17.62 | – | 50 |
| | Tau2 Telecom | 69 | 57.8 | 66 | 37 | 32 | 68 | 50.8 | 31 |
| | Tau2 Retail | 66.67 | 46.78 | 61.4 | 59.94 | 61.69 | 73.39 | 69.8 | 75.73 |
| | Tau2 Airline | 58 | 52 | 45.3 | 47.33 | 56.66 | 59.33 | 58 | 61.33 |
| | ComplexFuncBench | 33.2 | 19 | 24.6 | 24.2 | 26.3 | 37.5 | 24.6 | 18.9 |
| Instruction Following | Agent IF | 57.2 | 55 | 54.20 | 52.20 | 49.70 | 57.60 | 54.50 | 54.90 |
| | Multi IF | 83.34 | 76.91 | 82.95 | 73.76 | 82.49 | 85.37 | 84.32 | 87.28 |
| | Multi-Challenge | 46.15 | 41.39 | 46.90 | 44.50 | 49.08 | 57.90 | 42.49 | 38.46 |
| | IF Bench | 69 | 62 | 69 | 40 | 50 | 75 | 57 | 70.07 |
| Math | AIME 25 | 88 | 88 | 93 | 76 | 73 | 91 | 88 | 86.67 |
| Coding | Struct Eval | 79 | 48.50 | 71 | 73 | 70 | 69.92 | 76 | 73 |
| | LCB | 81 | 73 | 88 | 77 | 70 | 84 | 71 | 73 |
| | SciCode | 37 | 35 | 39 | 40 | 41 | 39 | 45 | 40 |
| Agentic | DeepResearchBench | 36.47 | 32.73 | 36.30 | 34.19 | 38.15 | – | – | 33.40 |
| | GAIA | 40 | 30.91 | 21.21 | 32.12 | 47.88 | 65.45 | 69.09 | 23.03 |
| | Work-Arena L1 | 50.2 | 51.5 | 50.9 | 63.9 | 51.8 | 65.5 | 62.7 | 52.4 |
| | OS World Small | 16.70 | 13.90 | 16.70 | 25 | 19.40 | 22.20 | 30.60 | 19.40 |
| | SWE Bench Verified | 23 | 16 | 31 | 29.60 | 34.20 | 61 | 64.2 | 22.60 |
| | Terminal Bench | 14 | 10 | 22 | 15 | 13 | 31 | 33 | 5.67 |
| | Aider Polyglot | 37.68 | 26.37 | 42 | 71.40 | 40 | 71.60 | 78 | 60.40 |
| Knowledge | MMLU Pro | 79 | 77 | 81 | 85 | 83 | 84 | 88 | 80 |
| Creative Writing | EQ Bench | 59.73 | 60.24 | 53.70 | 79.40 | 74.25 | 75.25 | 80.70 | 30.40 |
| Others | GPQA Diamond | 73 | 71 | 78 | 81 | 79 | 83 | 83 | 77 |
| | HLE | 10 | 12 | 18.5 | 14.9 | 11.1 | 19.7 | 17.3 | 12.3 |
| Long Context | AA LCR* | 50* | 20 | 51 | 55 | 62 | 68 | 66 | 30*** |
* Score with DCA enabled; without DCA the model scores 36.
** Average excludes BFCL v3 Only and DeepResearchBench (missing scores).
*** o3‑mini‑high score is projected from its AA Index score.
We evaluate Apriel-1.6 on a representative set of vision-centric benchmarks (mathematical reasoning, VQA, logical reasoning, STEM, chart reasoning) using VLMEvalKit. Apriel-1.6 improves on its predecessor by 4 points on the 13-benchmark average of the Image Index:


Apriel-1.6-15B-Thinker sits in the sweet spot of the cost-efficient frontier: it delivers intelligence scores that rival or surpass much larger models while using only 15 B parameters. On the chart it lies in the most attractive quadrant, balancing efficiency with top-tier reasoning. This makes it an exceptionally efficient choice for real-world deployments, especially in the enterprise.

Our post‑training heavily improves reasoning‑token efficiency. The plot shows a high intelligence score with far fewer tokens than many similarly capable or larger models. Compared with Apriel‑1.5‑15b‑Thinker, token usage is reduced by over 30 %.
Overall, Apriel-1.6 is a highly capable reasoner that retains the memory and efficiency characteristics required for enterprise deployment.
We gratefully acknowledge the contributions of: Varun Pandey, Shashank Maiya, Dhruv Jhamb, Massimo Caccia, Dheeraj Vattikonda, Nicolas Gontier, Patrice Bechard, Tayfun Tuna, Kavya Sriram, Denis Akhiyarov, Hari Subramani, Tara Bogavelli.
We are a small lab with limited compute relative to frontier labs, but we demonstrate that SOTA performance can be achieved with careful data, design, and methodology.
The model prioritizes extensive reasoning for difficult questions and minimal reasoning for simpler ones. Ongoing work aims to make models even more efficient and concise.
Vision‑related limitations:
Complex or low‑quality images can reduce OCR accuracy.
Dense scenes (crowds, many similar objects) make counting and subtle detail extraction harder.
Highly detailed or unusually formatted charts may lead to imperfect interpretations.
Fine‑grained visual grounding (e.g., bounding‑box predictions) can be approximate or inconsistent.
[1] Radhakrishna, S., Tiwari, A., Shukla, A., Hashemi, M., Maheshwary, R., Malay, S.K.R., Mehta, J., Pattnaik, P., Mittal, S., Slimi, K., Ogueji, K., Oladipo, A., Parikh, S., Bamgbose, O., Liang, T., Masry, A., Mahajan, K., Mudumba, S.R., Yadav, V., Madhusudhan, S.T., Scholak, T., Davasam, S., Sunkara, S., & Chapados, N. (2025). Apriel-1.5-15b-Thinker. arXiv preprint arXiv:2510.01141.
[2] Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., & Lin, J. (2025). Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.