The dataset democratizes access to high‑quality, privacy‑safe training data, enabling Brazilian AI developers to build culturally accurate models and improve fairness across the nation’s diverse population.
Brazil’s AI ecosystem has long grappled with a shortage of locally relevant training data, as most large‑scale corpora are dominated by English‑centric sources. Synthetic data offers a pragmatic solution, allowing developers to generate massive, statistically sound datasets without exposing personal information. By anchoring personas to IBGE census figures, Nemotron‑Personas‑Brazil mirrors the country’s regional, occupational, and linguistic nuances, providing a foundation for models that understand Brazilian Portuguese idioms, naming conventions, and cultural references.
The technical backbone of the release is NVIDIA’s NeMo Data Designer, a compound‑AI pipeline that combines a probabilistic graphical model with the GPT‑OSS‑120B language model. This hybrid approach ensures each persona adheres to real‑world distributions while delivering fluent, natural‑language descriptions. The dataset’s 20‑field schema includes age, education, occupation, and location, as well as contextual attributes such as hobbies and goals, enabling fine‑grained scenario generation for dialogue systems, recommendation engines, and bias‑testing frameworks. Because the personas are fully synthetic, they comply with Brazil’s LGPD privacy regulations, removing legal hurdles for commercial deployment.
For businesses and startups, the open CC BY 4.0 license removes cost barriers and encourages rapid experimentation. Companies can fine‑tune large language models on this data to improve customer support bots, virtual assistants, and sector‑specific AI tools that resonate with Brazilian users. Moreover, the dataset serves as a benchmark for fairness assessments, allowing stakeholders to evaluate model behavior across urban‑rural divides, age groups, and socioeconomic strata. As sovereign AI initiatives gain momentum worldwide, Nemotron‑Personas‑Brazil positions Brazil as a leader in responsibly sourced, culturally attuned synthetic data.
Authors: Andre Manoel, Yev Meyer, Shyamala Prayaga, Will Jennings, bardiya27
Building AI systems that serve national populations requires data that reflects local language, demographics, and cultural context. For Brazil—home to more than 200 million people across diverse regions—this remains a persistent challenge, as much of today’s high‑quality training data is English‑centric or unavailable for commercial use.
Nemotron‑Personas‑Brazil is an open dataset (CC BY 4.0) of 6 million fully synthetic personas, statistically grounded in official census and labor data from the Brazilian Institute of Geography and Statistics (IBGE). Every persona is aligned to real demographic, geographic, and occupational distributions—but no real person is represented.
This release extends NVIDIA’s growing Nemotron‑Personas Collection, which already includes the USA, Japan, India, and Singapore. Like the other collections, the Brazil dataset covers attributes such as age, sex, education, occupation, and location.
The dataset is designed for Brazilian developers and researchers building sovereign AI, with data that is locally grounded, culturally informed, and commercially usable (CC BY 4.0). It was built in collaboration with WideLabs, an NVIDIA Inception member with deep experience supporting government and regulated‑sector AI deployments across Latin America.

At a glance
6 million Brazilian personas (1 million records × 6 personas each)
~1.4 billion tokens total, including ~450 million persona tokens
20 fields per record: 6 persona fields + 14 contextual fields grounded in official statistics
Full geographic coverage: all 26 Brazilian states + the Federal District
~457 k unique Portuguese names
1 500+ occupation categories reflecting Brazil’s workforce
Multiple persona types including professional, sports, arts, travel, among others
Each persona is written in natural Brazilian Portuguese and includes cultural background, skills, goals, hobbies, and interests.
Nemotron‑Personas‑Brazil was built using NeMo Data Designer, NVIDIA’s compound AI system for synthetic data generation. The pipeline supports structured generation, validation, and retry mechanisms required to produce large‑scale, population‑aware datasets.
Key components include:
Probabilistic Graphical Model (Apache‑2.0) for statistical grounding
GPT‑OSS‑120B (Apache‑2.0) for narrative generation in Brazilian Portuguese
An extended version of Nemotron‑Personas‑Brazil will be available directly within NeMo Data Designer, enabling developers to generate, refine, and extend Brazilian Portuguese personas as part of their own synthetic‑data pipelines.
To capture the socio‑demographic and geographic diversity of Brazil’s population, the dataset leverages census and labor data published by the Brazilian Institute of Geography and Statistics (IBGE).
Geography – Personas are anchored at the state and municipality level, reflecting regional variation across Brazil’s five macro‑regions.
Occupation – Includes skills, expertise, and career trajectories, covering micro‑entrepreneurs and regional trades.
Life Stages – Incorporates student status, unemployment, and retirement to reflect real population dynamics.
Cultural Traits – Natural‑language personas capture Brazilian social norms, interests, and lifestyle dimensions such as arts, sports, and travel.
Language Fidelity – All personas are written in natural Brazilian Portuguese, reflecting local naming conventions and communication styles.
The result is a dataset that is statistically grounded, culturally representative, and fully synthetic by design.
The dataset contains no personally identifiable information. While real‑world distributions of ages, names, and occupations from public sources are used, nothing is tied to any real person, living or deceased. Every persona is fully synthetic, so you can train on authentic cultural patterns without compromising privacy.
Nemotron‑Personas‑Brazil is designed primarily for Brazilian developers and researchers building sovereign AI systems. By providing high‑quality, population‑representative data in Brazilian Portuguese, the dataset addresses gaps left by predominantly English‑language training corpora.
Global developers may also leverage the dataset to improve model performance and alignment in Brazilian cultural and linguistic contexts.
Multi‑turn conversation – Use personas as seeds to generate authentic dialogue datasets.
Domain‑specific training – Build culturally aware AI assistants.
Bias testing & fairness – Evaluate model performance across rural vs. urban populations, age groups, and education levels, ensuring AI works fairly across all segments of Brazilian society.
AI model builders have long struggled with access to diverse, high‑quality training data that reflects real‑world populations. Proprietary datasets dominate enterprise AI, creating barriers for researchers, startups, and developers in under‑represented regions.
Data diversity – Prevents narrow training and model collapse by reflecting Brazil’s full population spectrum.
Cultural authenticity – Reduces reliance on Western‑centric datasets and supports sovereign AI development.
Privacy preservation – Designed to meet Brazil’s data‑protection requirements and emerging AI‑governance standards.
By releasing Nemotron‑Personas‑Brazil under CC BY 4.0, NVIDIA is democratizing access to enterprise‑grade synthetic data—enabling anyone to build culturally authentic AI without barriers of cost, privacy concerns, or geography.
from datasets import load_dataset
dataset = load_dataset("nvidia/nemotron-personas-brazil")
Want to learn more about NVIDIA’s open data products, or interested in co‑designing a future dataset? Join the conversation on NVIDIA’s Discord.
Comments
Want to join the conversation?
Loading comments...