AlphaCD: a machine learning model capable of highly accurate characterization for 21,335 cytidine deaminases

Abstract

The vast scope of sequence databases, combined with their limited supporting evidence, hinders the identification of proteins with specific functions. Here, we experimentally characterized the catalytic efficiency, target-site window, motif preference, and off-target activity of 1100 apolipoprotein B mRNA-editing enzyme, catalytic polypeptide (APOBEC)-like family cytidine deaminases (CDs) fused with nCas9 in HEK293T cells, thereby generating the largest dataset of experimentally validated functions for a single protein family to date. These data, together with amino acid sequence, three-dimensional structure, and eight additional features, were used to construct a machine learning (ML) model, AlphaCD, which showed high accuracy in predicting catalytic efficiency (0.92) and off-target activity (0.84), as well as target windows (0.73) and catalytic motifs (0.78). We applied the trained model to predict these catalytic features for 21,335 CDs in UniProt, and validation on a subsample of 28 CDs confirmed its prediction accuracy (0.84, 0.87, 0.75, and 0.73, respectively). Alanine-scanning mutagenesis was then used to reduce off-target activity in one example CD, producing a high-fidelity, high-efficiency cytosine base editor (CBE). These results demonstrate the application of AlphaCD to high-accuracy, high-throughput protein functional characterization and provide a strategy for accelerating the characterization of other proteins.
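To make the alanine-scanning step concrete, the following is a minimal sketch of how single-residue alanine substitutions can be enumerated for a protein sequence. It is an illustration only: the function name `alanine_scan` and the toy peptide are hypothetical, and the abstract does not state which residues were scanned or whether substitutions were combined, so this should not be read as the authors' protocol.

```python
from typing import List, Tuple


def alanine_scan(sequence: str) -> List[Tuple[str, str]]:
    """Enumerate single alanine substitutions of a protein sequence.

    Returns (label, mutant_sequence) pairs, where the label uses the
    conventional <wild-type residue><1-based position>A form, e.g. "E100A".
    Positions that are already alanine are skipped.
    """
    variants = []
    for index, residue in enumerate(sequence):
        if residue == "A":
            continue  # nothing to substitute at an existing alanine
        label = f"{residue}{index + 1}A"
        mutant = sequence[:index] + "A" + sequence[index + 1:]
        variants.append((label, mutant))
    return variants


if __name__ == "__main__":
    # Hypothetical toy peptide, not a real deaminase sequence.
    for label, mutant in alanine_scan("MEAHC"):
        print(label, mutant)
```

Each variant produced this way would then be tested for on-target and off-target editing; the labelling convention matches names such as E100A used for the engineered editor in Fig. 5.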

Fig. 1: Characterization of diversity in CD catalytic characteristics used to train an ML-based functional prediction model.
Fig. 2: Identification of features potentially influencing catalytic efficiency, off-target catalytic efficiency, catalytic window, and preferential catalytic motif.
Fig. 3: Construction of ML-based models for characterization of 21,335 CDs.
Fig. 4: A high-efficiency CD engineered to reduce off-target effects.
Fig. 5: A0A2R2Z4E4-E100A CBE editing of pathogenic genes.

Data availability

The raw sequence data have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive under accession code PRJNA1157606. Source data are provided with this paper.

Acknowledgements

This study was supported by the Biological Breeding-Major Projects (2023ZD04074), the National Natural Science Foundation of China (32371549 to E.Z. and 32202645 to K.X.), the Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-CSIAF-202401), and the China Postdoctoral Science Foundation (2021M693442 and 2023T160703 to K.X.).

Author information

Contributions

E.Z., K.X., G.H., M.W., and H.Z. designed the study. K.X., H.Z., and J.L. performed experiments. K.X., G.H., M.W., and H.F. performed data analysis. E.Z. supervised the project. K.X., E.Z., and M.W. wrote the manuscript.

Corresponding author

Correspondence to Erwei Zuo.

Ethics declarations

Competing interests

The engineered CBEs are covered in a pending patent application (E.Z. and K.X.).

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xu, K., Hua, G., Wu, M. et al. AlphaCD: a machine learning model capable of highly accurate characterization for 21,335 cytidine deaminases. Cell Res 35, 750–761 (2025). https://doi.org/10.1038/s41422-025-01164-x

  • DOI: https://doi.org/10.1038/s41422-025-01164-x
