Introducing Falcon-H1-Arabic: Pushing the Boundaries of Arabic Language AI with Hybrid Architecture

Hugging Face · Jan 5, 2026

Why It Matters

The breakthrough in long‑context capability and hybrid architecture gives Arabic AI applications unprecedented accuracy and scalability, positioning Falcon‑H1‑Arabic as a new standard for enterprise and research deployments.

Falcon‑H1‑Arabic: State‑of‑the‑Art Arabic Language Models with Hybrid Mamba‑Transformer Architecture

The journey of building world‑class Arabic language models has been one of continuous learning and iteration. Today, we're excited to announce Falcon‑H1‑Arabic, our most advanced Arabic language model family to date, representing a significant leap forward in both architecture and capabilities. This release embodies months of research, community feedback, and technical innovation, culminating in three powerful models that set new standards for Arabic natural language processing.


Building on Success: The Evolution from Falcon‑Arabic

When we launched Falcon‑Arabic a few months ago, the response from the community was both humbling and enlightening. Developers, researchers, and students across the Arab world used the model for real use cases, pushing it to its limits and providing invaluable feedback. We learned where the model excelled and, more importantly, where it struggled. Long‑context understanding, dialectal variation, mathematical reasoning, and domain‑specific knowledge emerged as key areas requiring deeper attention.

We didn't just want to make incremental improvements; we wanted to fundamentally rethink our approach. The result is Falcon‑H1‑Arabic, a model family that addresses every piece of feedback we received while introducing architectural innovations that were previously unexplored in Arabic language modeling.

Figure: Falcon‑H1‑Arabic 3B, 7B, and 34B models outperform state‑of‑the‑art models of similar, and sometimes larger, sizes.


A First for Arabic NLP: Hybrid Mamba‑Transformer Architecture

Falcon‑H1‑Arabic is built on the Falcon‑H1 hybrid architecture, which integrates State Space Models (Mamba) and Transformer attention within every block. Both components run in parallel and their representations are fused before the block’s output projection. This design provides the linear‑time scalability of Mamba for extremely long sequences while preserving the precise long‑range modeling capabilities of attention. For Arabic, with its rich morphology and flexible sentence structures, this approach significantly improves coherence and reasoning across extended text. The architecture has been deployed across three scales (3B, 7B, 34B parameters), each balancing capacity, efficiency, and deployability for different use cases from edge devices to enterprise applications.

Figure: Falcon‑H1 architecture. Attention and SSM run in parallel within each block; their outputs are concatenated before the block’s output projection. The number of SSM and attention heads depends on the model size. More details are available in the Falcon‑H1 technical report.
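
To make the parallel design concrete, below is a minimal, illustrative PyTorch sketch of a hybrid block. It is not the actual Falcon‑H1 implementation (that lives in the `transformers` library); the module names, head counts, and the depthwise convolution standing in for the Mamba mixer are assumptions chosen only to show how the two branches run side by side and are fused before the output projection.

```python
# Illustrative sketch of a parallel attention + SSM block in the spirit of
# Falcon-H1. Names, dimensions, and the SSM placeholder are assumptions, not
# the implementation shipped in `transformers`.
import torch
import torch.nn as nn


class SSMBranch(nn.Module):
    """Stand-in for a Mamba-style state-space mixer (linear-time in sequence length)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        # Depthwise causal convolution used purely as a lightweight placeholder.
        self.mixer = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        h = self.in_proj(x).transpose(1, 2)               # (batch, d_model, seq)
        h = self.mixer(h)[..., : x.shape[1]].transpose(1, 2)
        return self.out_proj(nn.functional.silu(h))


class HybridBlock(nn.Module):
    """Attention and SSM run in parallel; their outputs are concatenated, then projected."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm = SSMBranch(d_model)
        self.out_proj = nn.Linear(2 * d_model, d_model)   # fuses both branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        ssm_out = self.ssm(h)
        fused = torch.cat([attn_out, ssm_out], dim=-1)    # channel-wise concatenation
        return x + self.out_proj(fused)                   # residual connection


x = torch.randn(2, 128, 1024)
print(HybridBlock()(x).shape)  # torch.Size([2, 128, 1024])
```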


Breaking Context Boundaries

We have dramatically increased context capabilities from Falcon‑Arabic’s 32K limit to 128K tokens for the 3B model and 256K tokens for both the 7B and 34B models. At 256K tokens (~200,000 words), these models can process several novels or hundreds of pages of technical documentation, enabling applications in legal analysis, medical records, academic research, and extended conversations that were previously impractical. Our post‑training specifically addresses “lost in the middle” challenges to ensure models effectively utilize their full context range, not just accept long inputs.

| Parameters | Context Window | Architecture | Ideal Uses |
|------------|----------------|--------------|------------|
| 3B | 128K | Hybrid | Fast agents, high‑QPS systems, lightweight analytics |
| 7B | 256K | Hybrid | Production assistants, reasoning, enterprise chat |
| 34B | 256K | Hybrid | Long‑document analysis, research, high‑stakes tasks |
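
As a rough usage sketch, the snippet below loads one of the instruct models with `transformers` and asks it to summarize a long document. The repository id `tiiuae/Falcon-H1-Arabic-7B-Instruct` is an assumption for illustration; check the model cards on the Hub for the exact names, recommended dtypes, and generation settings.

```python
# Minimal long-context usage sketch with Hugging Face `transformers`.
# The repository id below is assumed for illustration; check the Hub for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-Arabic-7B-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # bfloat16/float16 on GPU when available
    device_map="auto",
)

# A long Arabic document followed by a request to summarize it (placeholder text).
long_document = "..."  # up to ~256K tokens for the 7B and 34B models
messages = [
    # "Summarize the key points of this document."
    {"role": "user", "content": f"{long_document}\n\nلخص النقاط الرئيسية في هذا المستند."}
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```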


Data Quality and Diversity: The Foundation of Excellence

We rebuilt our pre‑training data pipeline from the ground up to better reflect the complexity of Arabic. This began with a multi‑stage quality‑filtering process tailored to Arabic orthography, morphology, diacritics, and syntactic patterns. Instead of heuristic filtering, we used deep linguistic analysis to isolate coherent, well‑structured text and remove noise commonly found in open‑web corpora. The result is a significantly cleaner, more stylistically consistent Arabic dataset.

Dialect coverage was another key priority. Arabic is not monolithic; Modern Standard Arabic coexists with dialects such as Egyptian, Levantine, Gulf, and Maghrebi, each with distinct vocabularies and grammatical constructions. We expanded dialectal sources substantially so the models would understand and generate the full spectrum of real‑world Arabic rather than leaning disproportionately toward formal MSA. To maintain global reasoning and domain diversity, we also preserved the multilingual capabilities of Falcon‑H1 by training the Arabic models on an almost equal mix of Arabic, English, and multilingual content totalling around 300 billion tokens. This ensures strong performance in code, STEM, and cross‑lingual reasoning.

Figure: Distribution of the pre‑training data across languages and categories (values expressed in billions of tokens).
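
For a concrete picture of what such a mixture looks like in practice, the short sketch below encodes an illustrative token budget consistent with the description above: roughly equal Arabic, English, and multilingual shares over a total of about 300 billion tokens. The exact percentages are assumptions for illustration, not the published breakdown.

```python
# Illustrative pre-training mixture sketch. The per-category shares are assumptions
# chosen to match "an almost equal mix ... around 300 billion tokens", not the
# published breakdown shown in the figure above.
TOTAL_TOKENS_B = 300  # approximate total budget, in billions of tokens

mixture = {
    "arabic": 0.34,        # MSA plus dialects (Egyptian, Levantine, Gulf, Maghrebi, ...)
    "english": 0.33,       # general web, code, and STEM-heavy sources
    "multilingual": 0.33,  # other languages to preserve cross-lingual ability
}

assert abs(sum(mixture.values()) - 1.0) < 1e-9

for name, share in mixture.items():
    print(f"{name:>12}: ~{share * TOTAL_TOKENS_B:.0f}B tokens ({share:.0%})")
```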


Post‑Training: Refining Capabilities Without Compromising Competence

After pre‑training, Falcon‑H1‑Arabic undergoes a focused post‑training pipeline consisting of supervised fine‑tuning (SFT) followed by direct preference optimization (DPO).

  • SFT exposes the models to high‑quality Arabic instructions, curated long‑context examples, and structured reasoning tasks that teach them to follow directives, maintain coherence over extended sequences, and ground their responses in relevant information. This stage is crucial for ensuring that the models can actually use their large context windows, which does not emerge automatically from architecture alone.

  • DPO refines alignment, conversational quality, and preference consistency. It helps the models balance long‑context reasoning with general linguistic competence, improving helpfulness and reducing common failure modes such as drifting, overuse of context, or neglecting earlier information. Throughout both stages, we carefully monitor for catastrophic forgetting and maintain a controlled curriculum so gains in long‑context behavior do not come at the expense of core reasoning or factual accuracy.

Beyond benchmark‑oriented optimization, our post‑training process deliberately strengthens areas that traditional evaluations do not fully capture, including conversational faithfulness, rhetorical organization, structured follow‑ups, and discourse coherence. These enhancements significantly boost the model’s practical usefulness, making Falcon‑H1‑Arabic more dependable in real multi‑turn dialogue, instruction execution, and long‑context conversational flows.
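
To make the second stage more concrete, here is a minimal sketch of preference optimization with the TRL library's `DPOTrainer`. The dataset name is a placeholder, the hyperparameters are illustrative, and argument names can differ between TRL versions; this is a sketch of the general recipe, not our internal training setup.

```python
# Minimal DPO sketch with TRL. The dataset name and hyperparameters are placeholders;
# argument names may differ slightly between TRL versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "tiiuae/Falcon-H1-Arabic-7B-Instruct"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference data with "prompt", "chosen", and "rejected" columns (placeholder dataset).
train_dataset = load_dataset("your-org/arabic-preferences", split="train")

training_args = DPOConfig(
    output_dir="falcon-h1-arabic-dpo",
    beta=0.1,                        # strength of the preference regularization
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,      # called `tokenizer=` in older TRL releases
)
trainer.train()
```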


Benchmark Performance: Setting New Standards

Numbers tell an important part of the story. On the Open Arabic LLM Leaderboard (OALL), a comprehensive benchmark evaluating Arabic language understanding across diverse tasks, Falcon‑H1‑Arabic achieves state‑of‑the‑art results at every scale we tested. Note that our scores may vary slightly from those reported on the leaderboard, as we used vLLM as the evaluation backend instead of the leaderboard’s Accelerate‑based implementation; the differences are typically under one point, and the vLLM backend runs significantly faster.

We also report results on:

  • 3LM – STEM‑related tasks (synthetic and native splits)

  • ArabCulture – Arabic culture assessment

  • AraDice – Arabic dialect coverage (Levantine, Egyptian, Gulf, etc.)

3B Model

  • OALL: ~62 % (outperforming small‑scale models such as Gemma‑4B, Qwen3‑4B, Phi‑4‑mini by ~10 pts)

  • 3LM: 82 % (native), 73 % (synthetic)

  • ArabCulture: ~62 %

  • AraDice: ~50 % across dialects

7B Model

  • OALL: 71.7 % (surpassing ~10B class models)

  • 3LM: 92 % (native), 85 % (synthetic)

  • ArabCulture: ~80 %

  • AraDice: mid‑50s across dialects

34B Model (flagship)

  • OALL: ~75 % (outperforming even 70B‑scale transformers)

  • 3LM: 96 % (native), 94 % (synthetic)

  • ArabCulture: ~80 %

  • AraDice: ~53 % across dialects

These benchmark results validate our approach but also highlight an important reality: the frontier of Arabic language modeling is advancing rapidly. Each percentage point on these benchmarks represents countless hours of engineering effort, careful dataset curation, and architectural refinement. The margins by which Falcon‑H1‑Arabic leads aren’t just statistical artifacts; they translate to meaningfully better user experiences in real‑world applications.


Practical Applications: From Edge to Enterprise

| Model | Ideal Deployment |
|-------|------------------|
| 3B | Speed‑critical, cost‑efficient, high‑throughput systems – agentic workflows, on‑device apps, low‑latency chat, resource‑constrained environments |
| 7B | General‑purpose production – document understanding, chatbots, summarization pipelines, content generation tools |
| 34B | High‑stakes domains – legal analysis, medical summarization, academic research, large‑scale enterprise automation; capable of processing hundreds of pages in a single pass |
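
Because we used vLLM as the backend for our own evaluations, a vLLM-based deployment is a natural starting point. The sketch below is illustrative only: the repository id and `max_model_len` are assumptions, and you should confirm that your vLLM version supports the hybrid architecture before relying on it in production.

```python
# Illustrative offline-inference sketch with vLLM. The repository id and
# max_model_len are assumptions; confirm hybrid-architecture support in your
# vLLM version before deploying.
from vllm import LLM, SamplingParams

llm = LLM(
    model="tiiuae/Falcon-H1-Arabic-7B-Instruct",  # assumed repo id
    max_model_len=32768,                          # raise toward 256K as memory allows
)

sampling = SamplingParams(temperature=0.2, max_tokens=512)

# "What are the most important applications of AI in education?" (Arabic prompt)
prompts = ["ما هي أهم تطبيقات الذكاء الاصطناعي في التعليم؟"]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

In recent vLLM releases, the same model id can also be passed to `vllm serve` to expose an OpenAI-compatible endpoint for chat and agent workloads.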


Responsible AI and Limitations

  • Falcon‑H1‑Arabic may reflect biases present in the training data and can produce hallucinated information.

  • Model outputs should not be used as sole authorities for medical, legal, or financial decisions without professional verification.

  • Long‑context performance may degrade at extreme ranges.

  • We recommend task‑specific evaluation and appropriate guardrails before deployment in production or sensitive applications.


Acknowledgments

We extend our gratitude to the Arabic NLP research community, whose open sharing of benchmarks, datasets, and methodologies enables progress across the field. Special thanks to our colleagues at TII: Ilyas Chahed, Younes Belkada, Dhia Eddine Rhaiem, Puneesh Khanna, Jingwei Zuo, Mikhail Lubinets, Slim Frikha, Maksim Velikanov, Kacper Piskorski, and Suhail Mohmad for their invaluable support during this project.


Citation


```bibtex
@misc{Falcon-H1-Arabic-2025,
  title={Falcon-H1-Arabic: State-of-the-Art Arabic Language Models with Hybrid Mamba-Transformer Architecture},
  author={Basma El Amel Boussaha and Mohammed Alyafeai and Ahmed Alzubaidi and Leen AlQadi and Shaikha Alsuwaidi and Omar Alkaabi and Hamza Alobeidli and Hakim Hacid},
  url={https://huggingface.co/blog/tiiuae/falcon-h1-arabic},
  month={December},
  year={2025},
  note={Available in 3B, 7B, and 34B parameter versions}
}
```

NB: The scores of ALLaM‑7B‑Instruct‑preview in our evaluation are higher than those reported on the OALL leaderboard, as we used the newest release (7b‑alpha‑v2.33.0.30), while the leaderboard currently reflects results from the older version (7b‑alpha‑v1.27.2.25).


Falcon‑H1‑Arabic models are available on the Hugging Face Hub. For questions, collaborations, or feedback, reach us at falcon.info@tii.ae or join our community.
