Abstract
The vast scope of sequence databases, combined with their limited supporting evidence, hinders the identification of proteins with specific functionality. Here, we experimentally characterized the catalytic efficiency, target site window, motif preference, and off-target activity of 1100 apolipoprotein B mRNA-editing enzyme, catalytic polypeptide (APOBEC)-like family cytidine deaminases (CDs) fused with nCas9 in HEK293T cells, thereby generating the largest dataset of experimentally validated functions for a single protein family to date. These data, together with amino acid sequence, three-dimensional structure, and eight additional features, were used to construct a machine learning (ML) model, AlphaCD, which showed high accuracy in predicting catalytic efficiency (0.92) and off-target activity (0.84), as well as target windows (0.73) and catalytic motifs (0.78). We applied the trained model to predict these four catalytic features for 21,335 CDs in UniProt, and a subsample of 28 CDs further validated its prediction accuracy (0.84, 0.87, 0.75, and 0.73, respectively). Alanine scanning-based mutagenesis was then employed to reduce off-target activity in one example CD, which produced a remarkably high-fidelity, high-efficiency cytosine base editor. This demonstrates the application of AlphaCD in high-accuracy, high-throughput protein functional characterization and provides a strategy for the accelerated characterization of other proteins.
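The abstract reports one accuracy value for each of the four predicted properties. As a rough illustration of how such per-property scores can be tabulated, the sketch below assumes that accuracy means the fraction of categorical labels that match between prediction and experiment. The abstract does not define the metric, so this is an assumption. The task names, label values, and the helper `per_task_accuracy` are hypothetical, and the toy labels are not data from the study.

```python
from typing import Dict, Sequence

# Hypothetical task names mirroring the four properties named in the abstract.
TASKS = ("efficiency", "off_target", "window", "motif")


def per_task_accuracy(
    predicted: Dict[str, Sequence[str]],
    observed: Dict[str, Sequence[str]],
) -> Dict[str, float]:
    """Return the fraction of matching labels for each task.

    Assumes each task is a categorical prediction; this is an
    illustrative choice, not the metric confirmed by the paper.
    """
    scores: Dict[str, float] = {}
    for task in TASKS:
        pred, obs = predicted[task], observed[task]
        if len(pred) != len(obs) or len(obs) == 0:
            raise ValueError(f"label lists for {task!r} must be non-empty and equal in length")
        matches = sum(p == o for p, o in zip(pred, obs))
        scores[task] = matches / len(obs)
    return scores


# Toy labels for four made-up deaminases (not data from the study).
predicted = {
    "efficiency": ["high", "low", "high", "low"],
    "off_target": ["low", "low", "high", "high"],
    "window": ["narrow", "wide", "wide", "narrow"],
    "motif": ["TC", "CC", "NC", "TC"],
}
observed = {
    "efficiency": ["high", "low", "high", "high"],
    "off_target": ["low", "low", "high", "high"],
    "window": ["narrow", "narrow", "wide", "wide"],
    "motif": ["TC", "CC", "NC", "AC"],
}

scores = per_task_accuracy(predicted, observed)
for task in TASKS:
    print(f"{task}: {scores[task]:.2f}")
```

Reporting one score per property, rather than a single pooled number, keeps weaker tasks (here the toy `window` labels) visible instead of averaging them away.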
Data availability
The raw sequence data were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive database under accession code PRJNA1157606. Source data are provided with this paper.
Acknowledgements
This study was supported by the Biological Breeding-Major Projects (2023ZD04074), the National Natural Science Foundation of China (32371549 to E.Z. and 32202645 to K.X.), the Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-CSIAF-202401), and the China Postdoctoral Science Foundation (2021M693442 and 2023T160703 to K.X.).
Author information
Contributions
E.Z., K.X., G.H., M.W., and H.Z. designed the study. K.X., H.Z., and J.L. performed experiments. K.X., G.H., M.W., and H.F. performed data analysis. E.Z. supervised the project. K.X., E.Z., and M.W. wrote the manuscript.
Ethics declarations
Competing interests
The engineered CBEs are covered in a pending patent application (E.Z. and K.X.).
About this article
Cite this article
Xu, K., Hua, G., Wu, M. et al. AlphaCD: a machine learning model capable of highly accurate characterization for 21,335 cytidine deaminases. Cell Res 35, 750–761 (2025). https://doi.org/10.1038/s41422-025-01164-x
DOI: https://doi.org/10.1038/s41422-025-01164-x


