Tahoe, Arc Institute, and Biohub Join Forces on Massive Virtual Cell Dataset

GEN (Genetic Engineering & Biotechnology News) • January 12, 2026

Companies Mentioned

Tahoe Bio

Arc Institute

NVIDIA (NVDA)

10x Genomics (TXG)

Noetik

GSK

Xaira Therapeutics

Why It Matters

By delivering unprecedented perturbation depth, the dataset will enable more accurate AI models of gene regulation, speeding therapeutic target identification and reducing reliance on costly wet‑lab experiments.

Key Takeaways

• Over 120 million single‑cell profiles and 225,000 perturbations generated.
• Dataset four times larger than Tahoe‑100M, with 50 cell lines included.
• Mosaic technology cuts sequencing cost roughly 100‑fold via multiplexing.
• Open‑source release will fuel AI model training and benchmarks.
• Partners aim to bridge the data gap for clinical outcome prediction.

Pulse Analysis

The collaboration between Tahoe Therapeutics, Arc Institute, and Biohub addresses a critical bottleneck in the emerging field of virtual‑cell modeling: the scarcity of large‑scale, high‑quality perturbation data. While single‑cell sequencing has become more affordable, generating diverse chemical and cytokine perturbations at scale remains prohibitively expensive for most labs. Tahoe’s Mosaic platform pools dozens of cell lines into a single tumor and then deconvolves each line’s individual response, which lets the partnership produce a dataset that dwarfs existing resources such as Tahoe‑100M and X‑Atlas. This influx of data is poised to reshape AI‑driven drug discovery pipelines, providing the statistical power models need to learn complex gene‑regulatory networks.

Mosaic’s approach reduces the cost of single‑cell sequencing by roughly a hundred‑fold, making it feasible to explore 1,400 distinct chemical scaffolds across multiple doses and cytokine conditions. The resulting dataset, encompassing 120 million cells and 225,000 perturbations, is intended as a cornerstone for training next‑generation transformer‑style models like TranscriptFormer and STATE. The open‑source release, which follows an initial exclusive period, will also feed directly into community benchmarks such as Arc’s Virtual Cell Challenge, where competitors strive to predict cellular responses with ever‑greater fidelity. Opening up the data in this way should accelerate methodological advances and encourage reproducible research across academia and industry.

Beyond methodological gains, the partnership signals a strategic shift toward data‑centric drug development. Large‑scale perturbation maps enable researchers to simulate how candidate molecules shift cells from diseased to healthy states, potentially shortening the preclinical timeline and lowering attrition rates. With pharmaceutical companies such as GSK already investing in foundation models for oncology, a comprehensive, openly licensed virtual‑cell dataset could become a critical asset for both startups and established firms seeking to harness AI for precision therapeutics. In the long run, such resources may help bridge the gap between in‑silico predictions and clinical outcomes.
