EvoAI enables extreme compression and reconstruction of the protein sequence space

Abstract

Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here we establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental–computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 10^48. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.

Fig. 1: EvoScan scheme, development and validation on a protein–protein interaction evolution.
Fig. 2: Thorough segment scanning for protein–ligand interaction evolution.
Fig. 3: Applying EvoScan on AmeR for protein–DNA interaction evolution.
Fig. 4: Genetic relationships and features of the 82 anchors generated by EvoScan.
Fig. 5: Anchors and deep learning reconstruct the design space for high-fitness genotypes.

Data availability

Source data are provided with this paper. Other data and materials used in this study are available from the corresponding author upon reasonable request. PDB files used in this study include PDB ID 7CB7 and PDB ID 7VLO. All the prediction features and results of mutants are available via Zenodo at https://doi.org/10.5281/zenodo.10686156 (ref. 61).

Code availability

The EvoAI models are implemented in PyTorch v2.2.0. All code is freely downloadable via Zenodo at https://doi.org/10.5281/zenodo.10686156 (ref. 61) or via GitHub at https://github.com/Gonglab-THU/EvoAI.

References

  1. Lovelock, S. L. et al. The road to fully programmable protein catalysis. Nature 606, 49–58 (2022).

  2. Labanieh, L. & Mackall, C. L. CAR immune cells: design principles, resistance and the next generation. Nature 614, 635–648 (2023).

  3. Dumontet, C., Reichert, J. M., Senter, P. D., Lambert, J. M. & Beck, A. Antibody–drug conjugates come of age in oncology. Nat. Rev. Drug Discov. 22, 641–661 (2023).

  4. Macken, C. A. & Perelson, A. S. Protein evolution on rugged landscapes. Proc. Natl Acad. Sci. USA 86, 6191–6195 (1989).

  5. Lutz, S. Beyond directed evolution—semi-rational protein engineering and design. Curr. Opin. Biotechnol. 21, 734–743 (2010).

  6. Ding, X., Zou, Z. & Brooks, C. L. III Deciphering protein evolution and fitness landscapes with latent space models. Nat. Commun. 10, 5644 (2019).

  7. Tian, P. & Best, R. B. Exploring the sequence fitness landscape of a bridge between protein folds. PLoS Comput. Biol. 16, e1008285 (2020).

  8. Fernandez-de-Cossio-Diaz, J., Uguzzoni, G. & Pagnani, A. Unsupervised inference of protein fitness landscape from deep mutational scan. Mol. Biol. Evol. 38, 318–328 (2021).

  9. D’Costa, S., Hinds, E. C., Freschlin, C. R., Song, H. & Romero, P. A. Inferring protein fitness landscapes from laboratory evolution experiments. PLoS Comput. Biol. 19, e1010956 (2023).

  10. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014).

  11. Stiffler, M. A., Hekstra, D. R. & Ranganathan, R. Evolvability as a function of purifying selection in TEM-1 β-lactamase. Cell 160, 882–892 (2015).

  12. Zheng, L., Baumann, U. & Reymond, J.-L. An efficient one-step site-directed and site-saturation mutagenesis protocol. Nucleic Acids Res. 32, e115 (2004).

  13. McLaughlin, R. N. Jr, Poelwijk, F. J., Raman, A., Gosal, W. S. & Ranganathan, R. The spatial architecture of protein function and adaptation. Nature 491, 138–142 (2012).

  14. Cadwell, R. C. & Joyce, G. F. Randomization of genes by PCR mutagenesis. Genome Res. 2, 28–33 (1992).

  15. Vanhercke, T., Ampe, C., Tirry, L. & Denolf, P. Reducing mutational bias in random protein libraries. Anal. Biochem. 339, 9–14 (2005).

  16. Esvelt, K. M., Carlson, J. C. & Liu, D. R. A system for the continuous directed evolution of biomolecules. Nature 472, 499–503 (2011).

  17. Miller, S. M., Wang, T. & Liu, D. R. Phage-assisted continuous and non-continuous evolution. Nat. Protoc. 15, 4101–4127 (2020).

  18. Ravikumar, A., Arzumanyan, G. A., Obadi, M. K. A., Javanpour, A. A. & Liu, C. C. Scalable, continuous evolution of genes at mutation rates above genomic error thresholds. Cell 175, 1946–1957.e13 (2018).

  19. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).

  20. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).

  21. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

  22. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).

  23. Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).

  24. Wu, Z., Johnston, K. E., Arnold, F. H. & Yang, K. K. Protein sequence design with deep generative models. Curr. Opin. Chem. Biol. 65, 18–27 (2021).

  25. Somermeyer, L. G. et al. Heterogeneity of the GFP fitness landscape and data-driven protein design. eLife 11, e75842 (2022).

  26. Shen, M. W., Zhao, K. T. & Liu, D. R. Reconstruction of evolving gene variants and fitness from short sequencing reads. Nat. Chem. Biol. 17, 1188–1198 (2021).

  27. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

  28. Papkou, A., Garcia-Pastor, L., Escudero, J. A. & Wagner, A. A rugged yet easily navigable fitness landscape. Science 382, eadh3860 (2023).

  29. Halperin, S. O. et al. CRISPR-guided DNA polymerases enable diversification of all nucleotides in a tunable window. Nature 560, 248–252 (2018).

  30. Baas, P. DNA replication of single-stranded Escherichia coli DNA phages. Biochim. Biophys. Acta Gene Struct. Expr. 825, 111–139 (1985).

  31. Jinek, M. et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–821 (2012).

  32. Ran, F. A. et al. Genome engineering using the CRISPR–Cas9 system. Nat. Protoc. 8, 2281–2308 (2013).

  33. Dietsch, F. et al. Small p53 derived peptide suitable for robust nanobodies dimerization. J. Immunol. Methods 498, 113144 (2021).

  34. Di Lallo, G., Castagnoli, L., Ghelardini, P. & Paolozzi, L. A two-hybrid system based on chimeric operator recognition for studying protein homo/heterodimerization in Escherichia coli. Microbiology 147, 1651–1656 (2001).

  35. Gao, K. et al. Perspectives on SARS-CoV-2 main protease inhibitors. J. Med. Chem. 64, 16922–16955 (2021).

  36. Li, J. et al. Structural basis of the main proteases of coronavirus bound to drug candidate PF-07321332. J. Virol. 96, e02013–e02021 (2022).

  37. Fu, L. et al. Both Boceprevir and GC376 efficaciously inhibit SARS-CoV-2 by targeting its main protease. Nat. Commun. 11, 4417 (2020).

  38. Owen, D. R. et al. An oral SARS-CoV-2 Mpro inhibitor clinical candidate for the treatment of COVID-19. Science 374, 1586–1593 (2021).

  39. Iketani, S. et al. Functional map of SARS-CoV-2 3CL protease reveals tolerant and immutable sites. Cell Host Microbe 30, 1354–1362 (2022).

  40. Iketani, S. et al. Multiple pathways for SARS-CoV-2 resistance to nirmatrelvir. Nature 613, 558–564 (2023).

  41. Dickinson, B. C., Packer, M. S., Badran, A. H. & Liu, D. R. A system for the continuous directed evolution of proteases rapidly reveals drug-resistance mutations. Nat. Commun. 5, 5352 (2014).

  42. Packer, M. S., Rees, H. A. & Liu, D. R. Phage-assisted continuous evolution of proteases with altered substrate specificity. Nat. Commun. 8, 956 (2017).

  43. Blum, T. R. et al. Phage-assisted evolution of botulinum neurotoxin proteases with reprogrammed specificity. Science 371, 803–810 (2021).

  44. Duan, Y. et al. Molecular mechanisms of SARS-CoV-2 resistance to nirmatrelvir. Nature 622, 376–382 (2023).

  45. Nashed, N. T., Aniana, A., Ghirlando, R., Chiliveri, S. C. & Louis, J. M. Modulation of the monomer-dimer equilibrium and catalytic activity of SARS-CoV-2 main protease by a transition-state analog inhibitor. Commun. Biol. 5, 160 (2022).

  46. Stanton, B. C. et al. Genomic mining of prokaryotic repressors for orthogonal logic gates. Nat. Chem. Biol. 10, 99–105 (2014).

  47. Ramos, J. L. et al. The TetR family of transcriptional repressors. Microbiol. Mol. Biol. Rev. 69, 326–356 (2005).

  48. Nielsen, A. A. et al. Genetic circuit design automation. Science 352, aac7341 (2016).

  49. Brophy, J. A. N. & Voigt, C. A. Principles of genetic circuit design. Nat. Methods 11, 508–520 (2014).

  50. DeBenedictis, E. A. et al. Systematic molecular evolution enables robust biomolecule discovery. Nat. Methods 19, 55–64 (2021).

  51. Weinreich, D. M. & Chao, L. Rapid evolutionary escape by large populations from local fitness peaks is likely in nature. Evolution 59, 1175–1182 (2005).

  52. Weissman, D. B., Feldman, M. W. & Fisher, D. S. The rate of fitness-valley crossing in sexual populations. Genetics 186, 1389–1410 (2010).

  53. Dickinson, B. C., Leconte, A. M., Allen, B., Esvelt, K. M. & Liu, D. R. Experimental interrogation of the path dependence and stochasticity of protein evolution using phage-assisted continuous evolution. Proc. Natl Acad. Sci. USA 110, 9007–9012 (2013).

  54. Carlson, J. C., Badran, A. H., Guggiana-Nilo, D. A. & Liu, D. R. Negative selection and stringency modulation in phage-assisted continuous evolution. Nat. Chem. Biol. 10, 216–222 (2014).

  55. Green, M. R. & Sambrook, J. The Inoue method for preparation and transformation of competent Escherichia coli: “ultracompetent” cells. Cold Spring Harb. Protoc. 2020, 101196 (2020).

  56. Chen, R., Li, L. & Weng, Z. ZDOCK: an initial‐stage protein‐docking algorithm. Proteins Struct. Funct. Bioinf. 52, 80–87 (2003).

  57. Tamura, K., Stecher, G. & Kumar, S. MEGA11: molecular evolutionary genetics analysis version 11. Mol. Biol. Evol. 38, 3022–3027 (2021).

  58. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).

  59. Liang, J. C., Chang, A. L., Kennedy, A. B. & Smolke, C. D. A high-throughput, quantitative cell-based screen for efficient tailoring of RNA device activity. Nucleic Acids Res. 40, e154 (2012).

  60. Xu, Y., Liu, D. & Gong, H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Nat. Comput. Sci. https://doi.org/10.1038/s43588-024-00716-2 (2024).

  61. Ma, Z. et al. EvoAI enables extreme compression and reconstruction of the protein sequence space. Zenodo https://doi.org/10.5281/zenodo.10686156 (2024).

Acknowledgements

This study was supported by the Ministry of Science and Technology of China grant 2023YFA0915601 (S.Z.), National Natural Science Foundation of China grants U22A20552 (S.Z.) and 32171416 (S.Z.), the Tsinghua University Dushi Plan Foundation (S.Z.), the Beijing Frontier Research Center for Biological Structure (S.Z.) and US NIH grants R01 EB022376/EB031172 (D.R.L.) and R35 GM118062 (D.R.L.). We thank J. Zheng (Westlake University) for helpful discussions. We thank C. Zhang (Tsinghua University) for the kind gift of the EvolvR gene. We apologize to authors whose work cannot be cited owing to referencing restrictions.

Author information

Contributions

S.Z. conceptualized and supervised the project. S.Z., Z.M. and W.L. designed the experiments. Z.M., W.L. and H.Q. performed the evolution experiments in EvoScan. Z.M., W.L., Y.S. and Z.L. performed the flow cytometry assays and phage propagation assays of obtained variants. G.L. conducted the mammalian cell experiments. H.G., B.T., Y.X. and J.C. designed and developed the deep learning models. Z.M. and Y.S. wrote the first draft. B.W.T., D.R.L., C.A.V. and S.Z. wrote the final manuscript. All authors contributed to the drafting and revision of the manuscript.

Corresponding author

Correspondence to Shuyi Zhang.

Ethics declarations

Competing interests

S.Z. and Z.M. have filed a patent application based on this work. The other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 PANCE for EGFP-nanobody.

(a) Genetic circuit design of PANCE for EGFP-nanobody. (b) Diagram of the PANCE process for EGFP-nanobody. (c) Phage propagation assays of the cI-mutation nanobody compared to WT and ΔgIII. (d) Flow cytometry assays of the interaction between the mutated cIp22 and the WT cI434. IPTG concentrations were 0, 50, 100, 200, 500 and 1000 μM. Data are mean ± SD of three experiments.

Extended Data Fig. 2 The protease activity testing system.

(a) Schematic diagram of the reverse two-hybrid protease activity testing system. The matched substrate was used as the linker of the cI repressor pair; the active protease cut the substrate and rescued the activity of the p434 promoter, which is otherwise repressed by the fused cI repressor. Expression of the downstream reporters (fluorescent protein, pIII for phage propagation, etc.) was controlled by the p434 promoter. (b) Flow cytometry assays for HCV protease using the reverse two-hybrid protease activity testing system. Different concentrations of vanillic acid (0, 50, 100, 200 μM) and IPTG (100 μM) were used. (c) Schematic diagram of the T7-based protease activity testing system. T7 lysozyme and T7 polymerase were joined by a linker carrying the protease substrate. When the protease cut the substrate, the released T7 polymerase recognized the T7 promoter and expressed the downstream reporter. (d) Flow cytometry assays for Mpro using the T7-based protease activity system (IPTG = 100, 200 μM; vanillic acid = 0, 50, 100, 200 μM). (e) Phage propagation assays for Mpro and its variants under different concentrations of IPTG (IPTG = 0, 100, 200, 500, 1000 μM). (f) Phage propagation assays at different inhibitor concentrations. The inhibitor concentration gradient was 0, 5, 10, 20, 40 μM. The concentration of IPTG was 1000 μM. (g) Flow cytometry assays for representative Mpro variants obtained by EvoScan (IPTG = 100 μM, vanillic acid = 100 μM, GC376 = 50 μM, PF-07321332 = 50 μM). (h) Resistance Index of Mpro variants for both inhibitors (50 μM). The concentration of IPTG was 100 μM and that of vanillic acid was 100 μM. Data are mean ± SD of three experiments.

Extended Data Fig. 3 PANCE to obtain Mpro variants that escape the inhibition of GC376.

(a) Genetic circuit design of PANCE for Mpro. (b) Diagram of the PANCE process for Mpro. Four replicate groups were performed. (c) Evolution results of PANCE after 96 passages. (d) Flow cytometry assays of Mpro variants. The A191V and N119D single mutations were also measured as control groups. The concentration of IPTG was 100 μM, that of vanillic acid was 100 μM and that of GC376 was 50 μM. (e) Resistance Index of variants against GC376 after 96 passages. The A191V and N119D single mutations are also shown as control groups. (f) Mutations from PANCE mapped to the crystal structure of Mpro (PDB ID: 7CB7). Data are mean ± SD of three experiments.

Extended Data Fig. 4 PANCE for AmeR evolution.

(a) Genetic circuit design of PANCE for AmeR evolution. (b) Propagation assays of combinatorial AP designs. Different RBS combinations for PhlF and gIII on the APs were tested. (c) Workflow of PANCE for AmeR. Twelve replicate groups were performed. (d) Evolution results of PANCE for AmeR after 16 passages. (e) Flow cytometry assay of variants from PANCE for AmeR. Data are mean ± SD of three experiments.

Extended Data Fig. 5 AmeR variant activity measurement and evolution path analysis.

(a) Fold repression of the 82 AmeR variants from EvoScan. Data are mean ± SD of three experiments. The circuit used for the flow cytometry assay is shown. (b) Possible evolutionary paths that lead to the same variants with multiple mutations.

Extended Data Fig. 6 Genetic circuit construction with WT AmeR and the S57R variant.

(a) Schematic diagram of the AmeR flow cytometry assay in mammalian cells (HEK293T). (b) The genetic circuit for the AmeR flow cytometry assay and the results. 6M is the AmeR variant R43G A53T S57R A75V P94L D119N. (c) Genetic circuit design of A IMPLY B. (d) Flow cytometry assay of the genetic circuit A IMPLY B. (e) Genetic circuit design of A NIMPLY B. (f) Flow cytometry assay of the genetic circuit A NIMPLY B. (g) Genetic circuit design of NAND. (h) Flow cytometry assay of the genetic circuit NAND. All circuits in this figure used 100 μM vanillic acid (input A) and 1000 μM IPTG (input B) as input signals. Data are mean ± SD of three experiments.
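
For readers less familiar with the gate names in this caption, the following minimal Python sketch lists the standard Boolean truth tables for A IMPLY B, A NIMPLY B and NAND. It describes only the textbook logic; treating A as the presence of vanillic acid and B as the presence of IPTG follows the caption, while the names `GATES` and `truth_table` are illustrative and not taken from the paper or its code.

```python
from itertools import product

# Textbook two-input Boolean functions for the gates named in the caption.
# Assumption: A = vanillic acid present, B = IPTG present (per the caption).
GATES = {
    "A IMPLY B": lambda a, b: (not a) or b,
    "A NIMPLY B": lambda a, b: a and (not b),
    "NAND": lambda a, b: not (a and b),
}


def truth_table(gate):
    """Return {(A, B): output} for all four input combinations."""
    return {(a, b): bool(gate(a, b)) for a, b in product((False, True), repeat=2)}


if __name__ == "__main__":
    for name, gate in GATES.items():
        print(name)
        for (a, b), out in truth_table(gate).items():
            print(f"  A={int(a)} B={int(b)} -> {int(out)}")
```

IMPLY is OFF only when A is present without B, NIMPLY is ON only in that same state, and NAND is OFF only when both inputs are present.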

Extended Data Fig. 7 Overall algorithm flow of the GeoFitness geometric encoder.

For each protein variant, sequence information was encoded as a 1-dimensional vector, and the structure of the sample, either collected from the PDB or predicted by AlphaFold2, was encoded as a 2-dimensional vector. These two inputs were integrated by the geometric encoder block, and the output was passed to a multi-layer perceptron (MLP) to generate the predicted fitness.

Extended Data Fig. 8 Cross-Validation of EvoAI model training.

(a) Schematic diagram of the data set in 10-fold cross-validation. (b) CV-test Spearman correlation coefficient of different layers during cross-validation. Curves show the mean of the 10 groups. The shaded area shows the 95% confidence interval of the Spearman correlation values during training. (c) Influence of the number of MLP layers on 10-fold cross-validation of model training. The 2-layer MLP performs best, with a higher Spearman correlation coefficient than the 1-layer MLP and a smaller variance than the 3-layer MLP. The centre line represents the median, and the box spans the first to third quartiles of the dataset. The whiskers show the minima and maxima.
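
The evaluation scheme in this caption (10-fold cross-validation scored by Spearman correlation) can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' training code: the function names, the shuffling seed and the sample count in the usage note are all hypothetical, and ties are handled with average ranks, which is one common convention.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for j in range(start, end + 1):
            ranks[order[j]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

In use, each fold from `k_fold_indices(100)` (100 is an arbitrary sample count) would serve once as the test set while the remaining folds train the model, and `spearman(predicted, measured)` would be computed on that held-out fold. The function is undefined when either input is constant, because the rank variance is then zero.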

Extended Data Fig. 9 GeoFitness values of AmeR mutations.

(a) GeoFitness values of all single-site mutations generated by the pre-trained model. The first 28 sites at the N terminus were discarded in prediction because of low confidence. A larger score indicates a higher likelihood that the mutation will improve protein function. (b) GeoFitness value ranking of mutations from all the anchors. The selected top 11 sites (13 mutations in total) are colored in red. (c) Predicted scores of the top 10 variants designed by EvoAI, each with a combination of 6 mutations. (d) Spearman correlation coefficient between the predicted fold repression rank and the experimental fold repression rank of the top 10 variants. The shaded area around the fitted line represents the 95% confidence interval. (e) Top 15 mutations with the highest predicted GeoFitness values among all single mutations. (f) Experimental fold repression of the variants designed using models trained on EvoScan anchors (Top, Middle, Bottom) or on deep mutational scanning (DMS) information. Data points are the mean of three biological replicates. The centre line represents the median, and the box spans the first to third quartiles of the dataset. The whiskers show the minima and maxima. (g) AUPRC plot for EvoAI-generated variants. (h) AUPRC plot for the test set during EvoAI training.
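
To give a sense of the combinatorial design space implied by panels (b) and (c), the sketch below enumerates every 6-mutation combination drawn from 13 candidate mutations at 11 sites, allowing at most one mutation per site. Everything specific here is an assumption for illustration: the caption does not say how the 13 mutations are distributed over the 11 sites (the sketch assumes two sites carry two alternatives each), the labels are placeholders rather than real AmeR mutations, and `score` stands in for whatever fitness predictor is used.

```python
from itertools import combinations

# Placeholder labels, NOT the mutations from the paper: 13 candidates over
# 11 sites, assuming site1 and site2 each have two alternative substitutions.
CANDIDATES = [(f"site{i}", "a") for i in range(1, 12)] + [("site1", "b"), ("site2", "b")]


def design_space(candidates, n_mutations=6):
    """Yield each combination of n_mutations with at most one mutation per site."""
    for combo in combinations(candidates, n_mutations):
        sites = [site for site, _ in combo]
        if len(set(sites)) == len(sites):
            yield combo


def top_k(candidates, score, k=10, n_mutations=6):
    """Rank the design space by a caller-supplied score and keep the best k."""
    ranked = sorted(design_space(candidates, n_mutations), key=score, reverse=True)
    return ranked[:k]
```

Under the assumed site distribution, `design_space(CANDIDATES)` yields 1,092 combinations out of the 1,716 unconstrained 6-subsets of 13 mutations. A call such as `top_k(CANDIDATES, score=my_model_score)` would then return the ten highest-scoring designs, where `my_model_score` is a hypothetical function mapping a combination to a predicted fitness.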

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ma, Z., Li, W., Shen, Y. et al. EvoAI enables extreme compression and reconstruction of the protein sequence space. Nat Methods 22, 102–112 (2025). https://doi.org/10.1038/s41592-024-02504-2

  • DOI: https://doi.org/10.1038/s41592-024-02504-2
