
InstaDeep Introduces Nucleotide Transformer v3 (NTv3): A New Multi-Species Genomics Foundation Model Designed for 1 Mb Context Lengths at Single-Nucleotide Resolution
Why It Matters
By unifying long‑range context, cross‑species learning, and generative design, NTv3 accelerates functional genomics research and biotech applications that require precise DNA engineering.
Genomic prediction and design now require models that connect local motifs with megabase‑scale regulatory context and that operate across many organisms. Nucleotide Transformer v3 (NTv3) is InstaDeep’s new multi‑species genomics foundation model for this setting. It unifies representation learning, functional‑track and genome‑annotation prediction, and controllable sequence generation in a single backbone that runs on 1 Mb contexts at single‑nucleotide resolution.
Earlier Nucleotide Transformer models already showed that self‑supervised pre‑training on thousands of genomes yields strong features for molecular‑phenotype prediction. The original series included models from 50 M to 2.5 B parameters trained on 3,200 human genomes and 850 additional genomes from diverse species. NTv3 keeps this sequence‑only pre‑training idea but extends it to longer contexts and adds explicit functional supervision and a generative mode.
Architecture for 1 Mb genomic windows
NTv3 uses a U‑Net‑style architecture that targets very long genomic windows. A convolutional down‑sampling tower compresses the input sequence, a transformer stack models long‑range dependencies in that compressed space, and an up‑sampling (de‑convolution) tower restores base‑level resolution for prediction and generation. Inputs are tokenized at the character level over the five bases A, T, C, G, N plus six special tokens (<unk>, <pad>, <mask>, <cls>, <eos>, and <bos>), so all public checkpoints use single‑base tokenization with a vocabulary of 11 tokens. Sequence length must be a multiple of 128 tokens, and the reference implementation pads inputs to enforce this constraint.
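To make the tokenization concrete, here is a minimal Python sketch of character‑level encoding with padding to a multiple of 128. The token‑id ordering is an illustrative assumption; only the 11‑token vocabulary and the length constraint come from the model description above.

```python
# Illustrative NTv3-style character-level tokenizer. The id assignments are
# assumptions for this sketch; only the token set (5 bases + 6 special
# tokens = 11) and the multiple-of-128 length rule come from the description.

SPECIALS = ["<unk>", "<pad>", "<mask>", "<cls>", "<eos>", "<bos>"]
BASES = ["A", "T", "C", "G", "N"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + BASES)}  # 11 tokens total

def tokenize(seq: str, block: int = 128) -> list[int]:
    """Map a DNA string to token ids, right-padded to a multiple of `block`."""
    ids = [VOCAB.get(base, VOCAB["<unk>"]) for base in seq.upper()]
    remainder = len(ids) % block
    if remainder:
        ids.extend([VOCAB["<pad>"]] * (block - remainder))
    return ids

ids = tokenize("ACGTN" * 50)  # 250 bases are padded up to 256 tokens
assert len(ids) % 128 == 0
```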
- Smallest public model – NTv3 8M pre
  * ≈ 7.69 M parameters
  * Hidden dimension 256, FFN dimension 1,024
  * 2 transformer layers, 8 attention heads
  * 7 down‑sample stages
- Largest public model – NTv3 650M
  * Hidden dimension 1,536, FFN dimension 6,144
  * 12 transformer layers, 24 attention heads
  * 7 down‑sample stages, plus conditioning layers for species‑specific prediction heads
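The PyTorch sketch below shows the overall layout for the smaller configuration: seven stride‑2 convolution stages (a 2^7 = 128× compression, consistent with the multiple‑of‑128 length rule), a transformer trunk over the compressed sequence, and seven transposed‑convolution stages back to base resolution. Kernel sizes, the absence of skip connections, and other details are simplifying assumptions, not NTv3's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the U-Net-style layout described above. All layer
# hyperparameters other than dim/heads/layers/stages (taken from the
# NTv3 8M pre specs) are assumptions for illustration.

class TinyNTv3Skeleton(nn.Module):
    def __init__(self, vocab=11, dim=256, heads=8, layers=2, stages=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.down = nn.Sequential(*[
            nn.Conv1d(dim, dim, kernel_size=4, stride=2, padding=1)
            for _ in range(stages)                   # 2**7 = 128x compression
        ])
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.trunk = nn.TransformerEncoder(block, layers)
        self.up = nn.Sequential(*[
            nn.ConvTranspose1d(dim, dim, kernel_size=4, stride=2, padding=1)
            for _ in range(stages)                   # restore base resolution
        ])
        self.head = nn.Linear(dim, vocab)            # base-level logits

    def forward(self, tokens):                       # tokens: (batch, length)
        x = self.embed(tokens).transpose(1, 2)       # -> (batch, dim, length)
        x = self.down(x).transpose(1, 2)             # -> (batch, length/128, dim)
        x = self.trunk(x).transpose(1, 2)            # long-range modeling
        x = self.up(x).transpose(1, 2)               # -> (batch, length, dim)
        return self.head(x)                          # (batch, length, vocab)

model = TinyNTv3Skeleton()
logits = model(torch.randint(0, 11, (1, 1024)))      # length divisible by 128
assert logits.shape == (1, 1024, 11)
```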
Training data
The NTv3 model is pre‑trained on 9 trillion base pairs from the OpenGenome2 resource using base‑resolution masked language modeling. After this stage, the model is post‑trained with a joint objective that integrates continued self‑supervision with supervised learning on approximately 16,000 functional tracks and annotation labels from 24 animal and plant species.
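A joint objective of this kind can be sketched as a weighted sum of the self‑supervised and supervised terms. The equal default weighting and the Poisson negative log‑likelihood (a common choice for coverage‑style functional tracks) are assumptions for illustration; the source does not specify NTv3's exact losses.

```python
import torch
import torch.nn.functional as F

# Sketch of a joint post-training objective: continued masked-language-
# modeling self-supervision plus supervised learning on functional tracks.
# Loss choices and weighting are assumptions, not documented NTv3 details.

def joint_loss(mlm_logits,      # (batch, length, vocab) token logits
               token_targets,   # (batch, length) original token ids
               mask,            # (batch, length) bool, True at masked positions
               track_preds,     # (batch, length, tracks) log-rate predictions
               track_targets,   # (batch, length, tracks) observed track values
               supervised_weight=1.0):
    # MLM term: cross-entropy at masked positions only.
    mlm = F.cross_entropy(mlm_logits[mask], token_targets[mask])
    # Supervised term: Poisson NLL against observed track values.
    supervised = F.poisson_nll_loss(track_preds, track_targets, log_input=True)
    return mlm + supervised_weight * supervised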
Performance and the NTv3 Benchmark
After post‑training, NTv3 achieves state‑of‑the‑art accuracy for functional‑track prediction and genome annotation across species. It outperforms strong sequence‑to‑function models and previous genomic foundation models on existing public benchmarks and on the new NTv3 Benchmark, a controlled downstream fine‑tuning suite with standardized 32 kb input windows and base‑resolution outputs.
- The NTv3 Benchmark currently consists of 106 long‑range, single‑nucleotide, cross‑assay, cross‑species tasks (a minimal windowing sketch follows this list).
- Because NTv3 sees thousands of tracks across 24 species during post‑training, the model learns a shared regulatory grammar that transfers between organisms and assays and supports coherent long‑range genome‑to‑function inference.
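As referenced above, here is a hypothetical sketch of how benchmark‑style examples with standardized 32 kb windows and base‑resolution targets might be assembled. Taking 32 kb as 32,768 bases is a power‑of‑two assumption, and the stride and array layout are likewise illustrative.

```python
import numpy as np

# Hypothetical preparation of benchmark-style examples: fixed 32 kb input
# windows paired with per-base targets. Window size (32,768 here), stride,
# and layout are illustrative assumptions, not the benchmark's definition.

WINDOW = 32_768  # "32 kb" taken as 2**15 bases for this sketch

def iter_windows(token_ids: np.ndarray, targets: np.ndarray, stride: int = WINDOW):
    """Yield (input window, base-resolution target window) pairs."""
    assert len(token_ids) == len(targets)
    for start in range(0, len(token_ids) - WINDOW + 1, stride):
        yield token_ids[start:start + WINDOW], targets[start:start + WINDOW]

# Example: one synthetic "chromosome" of 100k bases with a scalar track.
chrom = np.random.randint(0, 11, size=100_000)
track = np.random.rand(100_000).astype(np.float32)
pairs = list(iter_windows(chrom, track))  # -> 3 non-overlapping windows
```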
From prediction to controllable sequence generation
Beyond prediction, NTv3 can be fine‑tuned into a controllable generative model via masked diffusion language modeling. In this mode the model receives conditioning signals that encode desired enhancer activity levels and promoter selectivity, and it fills masked spans in the DNA sequence in a way that is consistent with those conditions.
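As a sketch of how masked diffusion generation typically proceeds, the loop below starts from a sequence whose design region is masked, predicts all masked positions under the conditioning signal, and commits the most confident predictions over several rounds. The `model(tokens, condition)` interface and the confidence‑based unmasking schedule are assumptions, not NTv3's documented procedure.

```python
import torch

# Illustrative conditional generation with a masked diffusion language
# model. `model` and its conditioning interface are hypothetical; the
# confidence-based unmasking schedule is one common choice.

MASK_ID = 2  # matches <mask> in the illustrative tokenizer sketch above

@torch.no_grad()
def generate(model, tokens, condition, steps=8):
    """Fill masked spans in `tokens`, guided by a conditioning vector
    (e.g. desired enhancer activity and promoter selectivity)."""
    tokens = tokens.clone()
    for step in range(steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        logits = model(tokens, condition)     # (batch, length, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        conf[~masked] = -1.0                  # only consider masked slots
        # Commit the top fraction of remaining masked positions this round.
        k = max(1, int(masked.sum() / (steps - step)))
        idx = conf.flatten().topk(k).indices
        flat = tokens.flatten()
        flat[idx] = pred.flatten()[idx]
        tokens = flat.view_as(tokens)
    return tokens
```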
In experiments described in the launch materials, the team designed 1,000 enhancer sequences with specified activity and promoter specificity and validated them in vitro using STARR‑seq assays in collaboration with the Stark Lab. The generated enhancers recover the intended ordering of activity levels and achieve a more than 2× improvement in promoter specificity compared with baselines.
Key Takeaways
- NTv3 is a long‑range, multi‑species genomics foundation model – it unifies representation learning, functional‑track prediction, genome annotation, and controllable sequence generation in a single U‑Net‑style architecture that supports 1 Mb nucleotide‑resolution context across 24 animal and plant species.
- Training on 9 trillion base pairs with joint self‑supervised and supervised objectives – pre‑training on OpenGenome2 is followed by post‑training on more than 16,000 functional tracks and annotation labels from 24 species.
- State‑of‑the‑art performance on the NTv3 Benchmark – NTv3 reaches top accuracy for functional‑track prediction and genome annotation across species, outperforming previous sequence‑to‑function models and other genomics foundation models.
- Same backbone supports controllable enhancer design validated with STARR‑seq – masked diffusion language modeling enables design of enhancers with specified activity levels and promoter selectivity, experimentally confirmed to follow the intended activity ordering and to improve promoter specificity.
Author
Asif Razzaq – CEO of Marktechpost Media Inc.
Asif is a visionary entrepreneur and engineer committed to harnessing the potential of artificial intelligence for social good. He leads Marktechpost, a platform that provides in‑depth, technically sound coverage of machine‑learning and deep‑learning news to a broad audience.