
By unifying long‑range context, cross‑species learning, and generative design, NTv3 accelerates functional genomics research and biotech applications that require precise DNA engineering.
The rise of foundation models in biology mirrors breakthroughs in natural language processing, yet genomics presents unique challenges: sequences span millions of bases, and functional signals are dispersed across vast regulatory landscapes. NTv3 tackles this with a U‑Net‑style architecture that compresses 1 Mb genomic windows, applies transformer attention in the reduced space, and then restores base‑level detail. This design preserves fine‑grained nucleotide information while capturing megabase‑scale dependencies, a combination earlier models could not achieve.
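The compress‑attend‑restore pattern can be illustrated with a minimal sketch. This is a toy illustration of the general U‑Net idea, not NTv3's actual layers: the sequence length, embedding width, pooling factor, and single‑head attention here are all stand‑in assumptions, and a skip connection carries base‑level detail past the coarse attention stage.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(x, factor):
    """Average-pool along the sequence axis: (L, d) -> (L//factor, d)."""
    L, d = x.shape
    return x.reshape(L // factor, factor, d).mean(axis=1)

def self_attention(x):
    """Single-head scaled dot-product attention in the reduced space."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def upsample(x, factor):
    """Nearest-neighbour expansion back to base resolution."""
    return np.repeat(x, factor, axis=0)

seq_len, dim, factor = 1024, 16, 8     # toy stand-ins for 1 Mb windows
tokens = rng.standard_normal((seq_len, dim))

skip = tokens                          # skip connection keeps base-level detail
coarse = downsample(tokens, factor)    # attention cost shrinks by ~factor^2
coarse = self_attention(coarse)
out = upsample(coarse, factor) + skip  # restore base-level resolution

print(out.shape)  # (1024, 16)
```

Because attention runs on the pooled sequence, its quadratic cost drops by roughly the square of the pooling factor, which is what makes megabase‑scale windows tractable.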
Training scale is another differentiator. Leveraging 9 trillion base pairs from the OpenGenome2 repository, NTv3 learns a universal DNA language across 24 diverse species. A joint objective—blending masked language modeling with supervision from over 16,000 functional tracks—creates a shared regulatory grammar that transfers across organisms and assay types. This multi‑task exposure drives superior performance on the newly introduced NTv3 Benchmark, where NTv3 outperforms prior sequence‑to‑function models on 106 long‑range, cross‑species tasks.
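The shape of such a joint objective can be sketched as a weighted sum of a masked‑language‑modeling cross‑entropy and a functional‑track regression loss. The sizes, the 0.5 weighting, and the mean‑squared‑error track term below are illustrative assumptions, not NTv3's published loss.

```python
import numpy as np

rng = np.random.default_rng(1)

VOCAB = 4                  # A, C, G, T
n_pos, n_tracks = 32, 8    # toy sizes; the real model supervises >16,000 tracks

# Toy model outputs: per-position nucleotide logits and track predictions.
logits = rng.standard_normal((n_pos, VOCAB))
track_pred = rng.standard_normal((n_pos, n_tracks))

# Targets: true bases and measured functional-track values.
bases = rng.integers(0, VOCAB, size=n_pos)
track_true = rng.standard_normal((n_pos, n_tracks))
mask = np.zeros(n_pos, dtype=bool)
mask[::6] = True           # positions hidden from the model (MLM loss only here)

def mlm_loss(logits, targets, mask):
    """Cross-entropy over masked positions only."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()

def track_loss(pred, true):
    """Mean-squared error over all functional tracks."""
    return ((pred - true) ** 2).mean()

lam = 0.5  # hypothetical weighting; the actual schedule may differ
loss = mlm_loss(logits, bases, mask) + lam * track_loss(track_pred, track_true)
print(float(loss))
```

A single scalar thus back‑propagates through both the self‑supervised and the supervised heads, which is what couples sequence grammar to functional readouts.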
Beyond prediction, NTv3’s generative mode opens practical avenues for synthetic biology. By conditioning masked diffusion models on desired enhancer activity and promoter specificity, researchers can design DNA sequences that meet precise functional criteria. Validation with STARR‑seq assays demonstrated more than a two‑fold improvement in promoter specificity compared with baseline generators. As biotech firms seek rapid, data‑driven design cycles, NTv3’s blend of accurate annotation and controllable synthesis positions it as a pivotal tool for next‑generation genome engineering.
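The iterative‑unmasking loop at the heart of masked diffusion can be shown with a toy example. Everything here is a stand‑in: a fully masked sequence is revealed in a few steps, and a trivial GC‑content bias plays the role of the learned conditional distribution that, in the real model, would encode the desired enhancer activity or promoter specificity.

```python
import random

random.seed(0)
MASK = "N"

def gc_fraction(seq):
    """GC content over positions revealed so far."""
    known = [b for b in seq if b != MASK]
    return sum(b in "GC" for b in known) / max(len(known), 1)

def generate(length=60, target_gc=0.7, steps=6):
    """Iteratively unmask a fully masked sequence, steering toward a target."""
    seq = [MASK] * length
    masked = list(range(length))
    random.shuffle(masked)
    per_step = max(1, length // steps)
    while masked:
        for pos in masked[:per_step]:
            # Stand-in for the learned conditional: a trained model would
            # predict per-position base probabilities from the conditioning.
            pool = "GC" if random.random() < target_gc else "AT"
            seq[pos] = random.choice(pool)
        masked = masked[per_step:]
    return "".join(seq)

designed = generate()
print(designed, round(gc_fraction(designed), 2))
```

Swapping the toy bias for a model conditioned on assay readouts is what turns this loop into a controllable sequence designer.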