Introducing Falcon H1R 7B

Hugging Face · Jan 5, 2026

Why It Matters

Falcon H1R 7B demonstrates that compact models can deliver state‑of‑the‑art reasoning efficiency, lowering compute costs for enterprises and researchers alike.

Introducing Falcon H1R 7B

Check out our official blogpost for an interactive experience.

We’re excited to unveil Falcon H1R 7B, a decoder‑only large language model developed by the Technology Innovation Institute (TII) in Abu Dhabi. Building on the robust foundation of the Falcon‑H1 Base model, Falcon H1R 7B takes a major leap forward in reasoning capabilities.

Despite its modest 7‑billion‑parameter size, Falcon H1R 7B matches or outperforms state‑of‑the‑art reasoning models that are 2–7× larger, and it does so consistently across a wide range of reasoning‑intensive benchmarks, demonstrating exceptional parameter efficiency.

Its performance stems from a carefully curated training set and a two‑stage pipeline of efficient supervised fine‑tuning followed by RL scaling.

Falcon H1R 7B’s design rests on three key axes of reasoning efficiency: speed, token efficiency, and accuracy, which together set the “3‑D limits” of performance. By integrating Deep Think with Confidence (DeepConf) during test‑time scaling, the model achieves state‑of‑the‑art efficiency, delivering substantial accuracy gains while generating fewer tokens than competing models.



Training recipe

Falcon H1R 7B’s training regimen is a two‑stage, data‑driven pipeline designed to maximize reasoning quality.

  • Cold‑start supervised fine‑tuning (SFT) – Starting from the Falcon‑H1‑7B backbone, we train on curated datasets that contain step‑by‑step long‑form reasoning traces across three domains: mathematics, coding, and science. Non‑reasoning domains (chat, tool‑calling, safety, etc.) are also included. Difficulty‑aware filtering prioritises challenging examples. Training targets extremely long response lengths (up to 48 k tokens).

  • Reinforcement learning with GRPO – The SFT checkpoint is further refined with the GRPO algorithm. Rewards favour correct reasoning chains, encouraging high‑quality, diverse outputs while staying within the token budget, so the RL stage balances exploration and exploitation under the same token constraints (a toy reward of this kind is sketched below).
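
The post does not spell out the exact reward shaping, so the snippet below is only a minimal sketch of the kind of verifiable, budget‑aware reward that GRPO‑style training commonly uses for math‑style traces. The `\boxed{}` answer convention, the budget constant, the penalty shape, and the function names are all illustrative assumptions, not TII’s actual recipe.

```python
import re

# Hypothetical token budget for the RL stage (the post only says a budget exists).
TOKEN_BUDGET = 48_000

def extract_boxed_answer(text: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a reasoning trace (illustrative)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def reasoning_reward(completion: str, reference: str, num_tokens: int) -> float:
    """Toy GRPO-style reward: +1 for a correct final answer, minus a soft
    penalty once the trace exceeds the assumed token budget."""
    answer = extract_boxed_answer(completion)
    correct = 1.0 if answer is not None and answer == reference.strip() else 0.0
    overflow = max(0, num_tokens - TOKEN_BUDGET)
    length_penalty = 0.1 * (overflow / TOKEN_BUDGET)  # assumed penalty shape
    return correct - length_penalty

# GRPO compares sampled completions within a group for the same prompt,
# so rewards are computed per completion and then normalised group-wise.
group = [("... \\boxed{42}", 1200), ("... \\boxed{41}", 30_000)]
rewards = [reasoning_reward(c, "42", n) for c, n in group]
print(rewards)  # [1.0, 0.0]
```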


Model’s capabilities

The comparison below summarises Falcon H1R 7B’s performance across math, code & agentic, and general benchmarks against leading 7 B‑to‑47 B models.

  • Math – Falcon H1R 7B leads (73.96 %) by a wide margin, beating the next best (Apriel 1.5 15B at 69.32 %) and outpacing larger baselines such as Qwen3‑32B (63.66 %) and Nemotron H 47B (49.72 %).

  • Code & Agentic – Falcon H1R 7B has the highest score in this group (33.95 %), ahead of Qwen3‑32B (33.40 %) and Apriel 1.5 (31.60 %).

  • General – Falcon H1R 7B remains highly competitive (49.48 %), sitting just below Apriel 1.5 (53.10 %) and Phi 4 Reasoning Plus 14B (51.18 %).

Math benchmarks

| Benchmark | Falcon H1R 7B | Next best |
|-----------|---------------|-----------|
| AIME‑24 | 88.1 % | Apriel 1.5 15B – 86.2 % |
| AIME‑25 | 83.1 % | Apriel 1.5 15B – 80.0 % |
| HMMT‑25 | 64.9 % | Apriel 1.5 15B – 61.0 % |
| AMO‑Bench | 36.3 % | DeepSeek R1‑0528 Qwen3‑8B – 23.3 % |

Code & agentic benchmarks

| Benchmark | Falcon H1R 7B | Relative standing |
|-----------|---------------|--------------------|
| LCB v6 | 68.6 % | Highest of all models – outperforms even the 32 B Qwen3 by ~7 pp |
| SciCode (sub‑problem) | 28.3 % | Best for <8 B models |
| TB Hard | 4.96 % | Second best (Apriel 1.5 15B at 9.9 %) and beats the 8 B/32 B Qwen3 models |

General benchmarks

| Benchmark | Falcon H1R 7B | Relative standing |
|-----------|---------------|--------------------|
| GPQA‑D | 61.3 % | On‑par with other 8 B models (Qwen3‑8B 61.2 %, DeepSeek 61.4 %) |
| MMLU‑Pro | 72.1 % | Outperforms all 8 B rivals and close to the 14/32 B cohort |
| HLE | 11.1 % | Slightly behind Apriel 1.5 15B and beats every other 8 B/32 B variant |
| IFBench | 53.4 % | Second best after Apriel (55.8 %) and outpaces all 8 B models; demonstrates robust instruction‑following at a compact scale |


Inference

We benchmark Falcon H1R 7B’s token throughput per GPU against Qwen3 8B under realistic test‑time scaling workloads.

Falcon H1R 7B outperforms Qwen3 8B across the board, especially as batch size grows. In the typical test‑time scaling case (512 → 32 k tokens), Falcon reaches roughly 1,000 tokens / s / GPU at batch 32 and ≈ 1,500 tokens / s / GPU at batch 64, nearly double Qwen3’s rates. The advantage widens further for longer inputs (8 k → 16 k), where Falcon delivers ≈ 1,800 tokens / s / GPU while Qwen3 stays below 900. The hybrid Transformer–Mamba backbone is the key to this superior scaling and memory efficiency.
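
The figures above come from the team’s own benchmarking setup, which is not detailed in this post. As a rough local sanity check, the sketch below times batched generation with plain `transformers`; the repository id, dtype, and batch/length settings are assumptions, and absolute numbers will differ from the blog’s measurements.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository name is an assumption; check the Falcon H1R 7B collection
# on Hugging Face for the exact model id.
MODEL_ID = "tiiuae/Falcon-H1R-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

batch_size, max_new_tokens = 32, 1024  # far below the 32k outputs used in the post
prompts = ["Sum the first 100 positive integers and explain each step."] * batch_size
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
elapsed = time.perf_counter() - start

# Count only newly generated positions, not the prompt tokens.
new_tokens = out.shape[0] * (out.shape[1] - inputs["input_ids"].shape[1])
print(f"{new_tokens / elapsed:,.0f} generated tokens / s on this GPU")
```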


Test‑time scaling

Test‑time scaling (TTS) boosts a model’s reasoning by running many parallel solution chains and aggregating the best answer, unlocking latent capability without extra training. In Falcon H1R 7B we employ Deep Think with Confidence (DeepConf), a lightweight, confidence‑aware filtering method that dynamically discards low‑quality reasoning traces during or after generation. DeepConf leverages the model’s own next‑token confidence scores to identify and prune noisy traces, requiring no additional training or hyper‑parameter tuning.
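
The exact confidence statistic used for Falcon H1R 7B is not spelled out in this post, so the sketch below is only an offline approximation of DeepConf‑style filtered voting: each sampled trace is scored by a windowed mean of its token log‑probabilities, the least confident traces are pruned, and the survivors are majority‑voted. The window size, keep ratio, and function names are assumptions.

```python
from collections import Counter
from statistics import mean

def trace_confidence(token_logprobs: list[float], window: int = 128) -> float:
    """Score one reasoning trace by the *lowest* sliding-window mean of its
    token log-probabilities, so a single noisy stretch drags the score down.
    (Windowed scoring is an assumption; the post only says next-token
    confidence is used to prune noisy traces.)"""
    if len(token_logprobs) <= window:
        return mean(token_logprobs)
    return min(
        mean(token_logprobs[i : i + window])
        for i in range(len(token_logprobs) - window + 1)
    )

def deepconf_vote(traces: list[tuple[str, list[float]]], keep_ratio: float = 0.5) -> str:
    """Offline DeepConf-style aggregation: drop the least confident traces,
    then majority-vote over the final answers of the survivors."""
    scored = sorted(traces, key=lambda t: trace_confidence(t[1]), reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_ratio))]
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]

# Example: three sampled traces, each reduced to (final_answer, per-token logprobs).
traces = [
    ("42", [-0.1, -0.2, -0.1]),
    ("42", [-0.3, -0.2, -0.4]),
    ("17", [-2.5, -3.0, -2.8]),  # low-confidence trace gets pruned
]
print(deepconf_vote(traces))  # -> "42"
```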

Falcon H1R 7B thrives at high batch sizes and is token‑efficient, generating fewer tokens for a given accuracy level. This makes TTS especially effective and places the model on a new Pareto frontier of performance versus inference compute.

The grid below shows how many tokens were generated for a given accuracy. Falcon H1R 7B sits on the Pareto frontier of low cost, high performance:

  • AIME 24 / 25 – 96.7 % accuracy with <100 M tokens, beating every other 8 B model and matching the best 14/32 B systems.

  • AMO‑Bench – 35.9 % accuracy with just 217 M tokens, surpassing every other model.

Falcon H1R 7B demonstrates that a 7 billion‑parameter model can rival larger peers in reasoning tasks while delivering efficient inference, making it an attractive choice for developers and researchers alike.


Open‑source commitment

In line with our mission to foster AI accessibility and collaboration, Falcon H1R 7B is released under the Falcon LLM license. We hope the AI community finds these models valuable for research, application development, and further experimentation. Falcon H1R 7B is a continuation of our efforts to create more capable and efficient foundation models. We welcome feedback and collaboration from the community as we continue to refine and advance the capabilities of these models.


Useful links

  • Access the models through the Falcon H1R 7B HuggingFace collection (a minimal quick‑start sketch follows this list).

  • Try it in the HuggingFace demo or directly in Falcon Chat.

  • View the technical report.

  • Join our Discord server for questions and interaction with our researchers and developers.

  • See the Falcon‑LLM License for more details about the license.
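
For a first local test, a plain `transformers` text‑generation pipeline (on a recent transformers release) should be enough. The repository id and the sampling settings below are assumptions, so check the collection for the exact model name; reasoning models also need generous `max_new_tokens` headroom for their chains of thought.

```python
from transformers import pipeline

# Model id is an assumption; see the Falcon H1R 7B collection for the exact name.
generator = pipeline(
    "text-generation",
    model="tiiuae/Falcon-H1R-7B",
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "How many primes lie between 10 and 50?"}]

# Leave plenty of room for the model's long-form reasoning trace.
result = generator(messages, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(result[0]["generated_text"][-1]["content"])
```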


Acknowledgments

We would like to thank the following TII colleagues for their valuable support during this project: Younes Belkada, Ilyas Chahed, Dhia Eddine Rhaiem, Maksim Velikanov and Jingwei Zuo.


Citation


@article{falconh1r,
    title = {Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling},
    url = {https://github.com/tiiuae/falcon-h1r/blob/main/tech_report.pdf},
    author = {Falcon Reasoning Team, Iheb Chaabane, Puneesh Khanna, Suhail Mohmad, Slim Frikha, Shi Hu, Abdalgader Abubaker, Reda Alami, Mikhail Lubinets, Mohamed El Amine Seddik, Hakim Hacid},
    month = {December},
    year = {2025}
}
