
Common Ancestry Limits Protein Sequence Exploration, Computational Study Shows
Why It Matters
If AI‑driven protein design cannot extrapolate beyond the limited ancestral subspace, commercial pipelines risk missing novel functionalities, slowing biotech innovation. Broader experimental coverage could dramatically expand the design horizon for therapeutics and industrial enzymes.
Key Takeaways
- Ancestry limits protein sequence diversification more than selection
- Effective dimensionality of many families is near one
- AI tools trained on known proteins may miss vast space
- Early protein evolution likely required DNA recombination, not single mutations
- More experimental data needed to expand explored sequence space
Pulse Analysis
The surge of AI‑powered platforms such as AlphaFold has transformed structural biology, yet most models are built on databases that capture only a fraction of the theoretical protein universe. While these repositories contain millions of sequences, a 100‑residue chain built from the 20 standard amino acids has 20^100 possible sequences (roughly 10^130), a number far beyond any current sampling. This disparity raises a fundamental question for both academia and industry: how representative are the known proteins of the broader functional landscape, and what does that mean for predictive design tools?
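To put that gap in numbers, here is a minimal Python sketch of the arithmetic. The chain length and alphabet size come from the text; `KNOWN_SEQUENCES` is a hypothetical, deliberately generous round figure used only for illustration, not a count taken from any real database.

```python
import math

ALPHABET_SIZE = 20          # standard amino acids
CHAIN_LENGTH = 100          # residues, as in the example above
KNOWN_SEQUENCES = 10 ** 9   # hypothetical database size (assumption)

# Python integers are arbitrary precision, so the exact count is available.
total = ALPHABET_SIZE ** CHAIN_LENGTH

# Work in log10 to compare magnitudes without floating-point overflow.
log10_total = CHAIN_LENGTH * math.log10(ALPHABET_SIZE)
log10_fraction = math.log10(KNOWN_SEQUENCES) - log10_total

print(f"digits in 20^100: {len(str(total))}")
print(f"log10(20^100) = {log10_total:.1f}")
print(f"log10(fraction sampled) = {log10_fraction:.1f}")
```

Even under this generous assumption, the sampled fraction is on the order of 10^-121 of the space for a single chain length.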
The new PNAS study tackles this question by quantifying the "effective dimensionality" of protein families, essentially how many independent evolutionary directions a family has explored. Simulations reveal that many families cluster along a nearly one‑dimensional trajectory anchored to their earliest ancestors, which suggests that ancestry imposes a far stronger bottleneck than selective pressure or epistatic interactions. Such low dimensionality implies that the functional space reached by natural evolution is a narrow corridor, and that early protein emergence likely depended on recombination events that generated entirely new sequence scaffolds rather than on incremental mutations.
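For readers unfamiliar with the idea, the sketch below illustrates one common way to put a number on effective dimensionality: the participation ratio of the covariance spectrum, (Σλ)² / Σλ². This is an assumption chosen for illustration; the article does not say which measure the study used. The data are synthetic point clouds, not protein sequences, and the function and variable names are hypothetical.

```python
import math
import random


def participation_ratio(points):
    """Return (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    sample covariance, computed as trace(C)^2 / trace(C^2)."""
    n = len(points)
    dim = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    trace = sum(cov[i][i] for i in range(dim))
    # For a symmetric matrix, trace(C^2) equals the sum of squared entries.
    trace_of_square = sum(v * v for row in cov for v in row)
    return trace ** 2 / trace_of_square


rng = random.Random(0)  # fixed seed so the run is repeatable
DIM, N = 10, 500
axis = [1 / math.sqrt(DIM)] * DIM  # unit vector along the "ancestral" direction

# Cloud 1: points spread along a single axis with small noise elsewhere.
line_cloud = []
for _ in range(N):
    t = rng.gauss(0, 5)
    line_cloud.append([t * a + rng.gauss(0, 0.1) for a in axis])

# Cloud 2: points spread evenly in all DIM directions.
ball_cloud = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]

pr_line = participation_ratio(line_cloud)
pr_ball = participation_ratio(ball_cloud)
print(f"line-like cloud: {pr_line:.2f}")
print(f"isotropic cloud: {pr_ball:.2f}")
```

The line-like cloud should score close to 1 and the isotropic cloud close to the ambient dimension of 10, which is the sense in which a family confined to one trajectory has an effective dimensionality "near one".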
For protein engineers, the findings serve as a cautionary note. AI generators trained on existing sequences may excel at optimizing within the known corridor but falter when asked to venture into truly novel regions. To break this ceiling, the field must invest in high‑throughput experimental pipelines that map uncharted sequence space, feeding richer data back into machine‑learning models. By expanding the empirical foundation, biotech firms can unlock unprecedented enzyme activities, therapeutic targets, and sustainable bioprocesses, turning the vast, untapped protein landscape into a competitive advantage.