Abstract
The relationship between pH and enzyme catalytic activity, especially the optimal pH (pHopt) at which enzymes function, is critical for biotechnological applications. Hence, computational methods to predict pHopt will enhance enzyme discovery and design by facilitating accurate identification of enzymes that function optimally at specific pH levels, and by elucidating sequence–function relationships. Here we propose and evaluate various machine learning methods for predicting pHopt, conducting extensive hyperparameter optimization and training over 11,000 model instances. Our results demonstrate that models utilizing language model embeddings markedly outperform other methods in predicting pHopt. We present EpHod, the best-performing model for predicting pHopt, and make it publicly available to researchers. From sequence data, EpHod directly learns structural and biophysical features that relate to pHopt, including proximity of residues to the catalytic centre and the accessibility of solvent molecules. Overall, EpHod presents a promising advancement in pHopt prediction and will potentially speed up the development of enzyme technologies.
Data availability
Data for training the models, as well as the weights of the optimal EpHod model, are available via Zenodo at https://doi.org/10.5281/zenodo.14252615 (ref. 104).
Code availability
Code for EpHod and training the other models is available via GitHub at https://github.com/beckham-lab/EpHod (ref. 105).
References
Barroca, M. et al. Deciphering the factors defining the pH-dependence of a commercial glycoside hydrolase family 8 enzyme. Enzyme Microb. Technol. 96, 163–169 (2017).
Reed, C. J., Lewis, H., Trejo, E., Winston, V. & Evilia, C. Protein adaptations in Archaeal extremophiles. Archaea 2013, 373275 (2013).
Protze, J. et al. An extracellular tetrathionate hydrolase from the thermoacidophilic archaeon Acidianus ambivalens with an activity optimum at pH 1. Front. Microbiol. 2, 68 (2011).
Pradeep, G. C. et al. An extremely alkaline novel chitinase from Streptomyces sp. CS495. Process Biochem. 49, 223–229 (2014).
Ferrer, M., Golyshina, O., Beloqui, A. & Golyshin, P. N. Mining enzymes from extreme environments. Curr. Opin. Microbiol. 10, 207–214 (2007).
Thomas, N. et al. Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening. Preprint at bioRxiv https://doi.org/10.1101/2024.03.21.585615 (2024).
Verma, D. & Satyanarayana, T. Xylanolytic extremozymes retrieved from environmental metagenomes: characteristics, genetic engineering, and applications. Front. Microbiol. 11, 551109 (2020).
Shahraki, M. F. et al. A computational learning paradigm to targeted discovery of biocatalysts from metagenomic data: a case study of lipase identification. Biotechnol. Bioeng. 119, 1115–1128 (2022).
Erickson, E. et al. Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity. Nat. Commun. 13, 7850 (2022).
Wang, C.-H., Liu, X.-L., Huang, R.-B., He, B.-F. & Zhao, M.-M. Enhanced acidic adaptation of Bacillus subtilis Ca-independent alpha-amylase by rational engineering of pKa values. Biochem. Eng. J. 139, 146–153 (2018).
dos Santos, J. P., da Rosa Zavareze, E., Dias, A. R. G. & Vanier, N. L. Immobilization of xylanase and xylanase-β-cyclodextrin complex in polyvinyl alcohol via electrospinning improves enzyme activity at a wide pH and temperature range. Int. J. Biol. Macromol. 118, 1676–1684 (2018).
Giri, P., Pagar, A. D., Patil, M. D. & Yun, H. Chemical modification of enzymes to improve biocatalytic performance. Biotechnol. Adv. 53, 107868 (2021).
Xue, Y. et al. Chemical modification of stem bromelain with anhydride groups to enhance its stability and catalytic activity. J. Mol. Catal. B 63, 188–193 (2010).
Li, C. Effects of chemical modification by chitooligosaccharide on enzyme activity and stability of yeast β-d-fructofuranosidase. Enzyme Microb. Technol. 64–65, 24–32 (2014).
Li, S.-F., Cheng, F., Wang, Y.-J. & Zheng, Y.-G. Strategies for tailoring pH performances of glycoside hydrolases. Crit. Rev. Biotechnol. 43, 121–141 (2023).
Shi, X., Wu, D., Xu, Y. & Yu, X. Engineering the optimum pH of β-galactosidase from Aspergillus oryzae for efficient hydrolysis of lactose. J. Dairy Sci. 105, 4772–4782 (2022).
Hebditch, M. & Warwicker, J. Web-based display of protein surface and pH-dependent properties for assessing the developability of biotherapeutics. Sci. Rep. 9, 1969 (2019).
Schmitz, M. et al. patcHwork: a user-friendly pH sensitivity analysis web server for protein sequences and structures. Nucleic Acids Res. 50, W560–W567 (2022).
Oeller, M. et al. Sequence-based prediction of pH-dependent protein solubility using CamSol. Brief. Bioinform. 24, bbad004 (2023).
Zhang, G., Li, H. & Fang, B. Discriminating acidic and alkaline enzymes using a random forest model with secondary structure amino acid composition. Process Biochem. 44, 654–660 (2009).
Lin, H., Chen, W. & Ding, H. AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes. PLoS ONE 8, e75726 (2013).
Fan, G.-L., Li, Q.-Z. & Zuo, Y.-C. Predicting acidic and alkaline enzymes by incorporating the average chemical shift and gene ontology informations into the general form of Chou’s PseAAC. Process Biochem. 48, 1048–1053 (2013).
Khan, Z. U., Hayat, M. & Khan, M. A. Discrimination of acidic and alkaline enzyme using Chou’s pseudo amino acid composition in conjunction with probabilistic neural network model. J. Theor. Biol. 365, 197–203 (2015).
Yan, S. & Wu, G. Predicting pH optimum for activity of beta-glucosidases. J. Biomed. Sci. Eng. 12, 354–367 (2019).
Wang, X., Li, H., Gao, P., Liu, Y. & Zeng, W. Combining support vector machine with dual g-gap dipeptides to discriminate between acidic and alkaline enzymes. Lett. Org. Chem. 16, 325–331 (2019).
Li, X. et al. A sequence embedding method for enzyme optimal condition analysis. BMC Bioinform. 21, 512 (2020).
Schomburg, I. et al. The BRENDA enzyme information system—from a database to an expert system. J. Biotechnol. 261, 194–206 (2017).
Puissant, J. et al. The pH optimum of soil exoenzymes adapt to long term changes in soil pH. Soil Biol. Biochem. 138, 107601 (2019).
Li, G. et al. Learning deep representations of enzyme thermal adaptation. Protein Sci. 31, e4480 (2022).
Reimer, L. C. et al. BacDive in 2022: the knowledge base for standardized bacterial and archaeal data. Nucleic Acids Res. 50, D741–D746 (2022).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 49, D10–D17 (2021).
Teufel, F. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 40, 1023–1025 (2022).
Booth, I. R. Regulation of cytoplasmic pH in bacteria. Microbiol. Rev. 49, 359–378 (1985).
Baker-Austin, C. & Dopson, M. Life in acid: pH homeostasis in acidophiles. Trends Microbiol. 15, 165–171 (2007).
Hough, D. W. & Danson, M. J. Extremozymes. Curr. Opin. Chem. Biol. 3, 39–46 (1999).
Tan, C. et al. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks (eds Kůrková, V. et al.) 270–279 (Springer, 2018).
Gado, J. E., Beckham, G. T. & Payne, C. M. Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning. J. Chem. Inf. Model. 60, 4098–4107 (2020).
Branco, P., Torgo, L. & Ribeiro, R. P. Pre-processing approaches for imbalanced distributions in regression. Neurocomputing 343, 76–99 (2019).
Yang, Y., Zha, K., Chen, Y.-C., Wang, H. & Katabi, D. Delving into deep imbalanced regression. In Proc. 38th International Conference on Machine Learning (eds Meila, M. and Zhang, T.) 11842–11851 (PMLR, 2021).
Chen, Z. et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34, 2499–2502 (2018).
Unsal, S. et al. Learning functional properties of proteins with language models. Nat. Mach. Intell. 4, 227–245 (2022).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems (eds Ranzato, M. et al.) 29287–29303 (Curran Associates, 2021).
Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 16990–17017 (PMLR, 2022).
Nijkamp, E., Ruffolo, J., Weinstein, E. N., Naik, N. & Madani, A. ProGen2: exploring the boundaries of protein language models. Cell Syst. 14, 968–978.e3 (2023).
Yang, K. K., Lu, A. X. & Fusi, N. Convolutions are competitive with transformers for protein sequence pretraining. Cell Syst. 15, 286–294 (2024).
Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinform. Adv. 1, vbab035 (2021).
Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
Li, G. et al. Performance of regression models as a function of experiment noise. Bioinform. Biol. Insights 15, 11779322211020315 (2021).
Detlefsen, N. S., Hauberg, S. & Boomsma, W. Learning meaningful representations of protein sequences. Nat. Commun. 13, 1914 (2022).
Kroll, A. & Lercher, M. J. Machine learning models for the prediction of enzyme properties should be tested on proteins not used for model training. Preprint at bioRxiv https://doi.org/10.1101/2023.02.06.526991 (2023).
Suplatov, D. et al. Computational design of a pH stable enzyme: understanding molecular mechanism of penicillin acylase’s adaptation to alkaline conditions. PLoS ONE 9, e100643 (2014).
Huang, Y., Krauss, G., Cottaz, S., Driguez, H. & Lipps, G. A highly acid-stable and thermostable endo-β-glucanase from the thermoacidophilic archaeon Sulfolobus solfataricus. Biochem. J. 385, 581–588 (2005).
Mamo, G., Thunnissen, M., Hatti-Kaul, R. & Mattiasson, B. An alkaline active xylanase: insights into mechanisms of high pH catalytic adaptation. Biochimie 91, 1187–1196 (2009).
Wang, Y., Xu, M., Yang, T., Zhang, X. & Rao, Z. Surface charge-based rational design of aspartase modifies the optimal pH for efficient β-aminobutyric acid production. Int. J. Biol. Macromol. 164, 4165–4172 (2020).
Jakob, F. et al. Surface charge engineering of a Bacillus gibsonii subtilisin protease. Appl. Microbiol. Biotechnol. 97, 6793–6802 (2013).
Yang, T. et al. N20D/N116E combined mutant downward shifted the pH optimum of Bacillus subtilis NADH oxidase. Biology 12, 522 (2023).
Masui, A., Fujiwara, N., Yamamoto, K., Takagi, M. & Imanaka, T. Rational design for stabilization and optimum pH shift of serine protease AprN. J. Ferment. Bioeng. 85, 30–36 (1998).
Turunen, O., Vuorio, M., Fenel, F. & Leisola, M. Engineering of multiple arginines into the Ser/Thr surface of Trichoderma reesei endo-1,4-β-xylanase II increases the thermotolerance and shifts the pH optimum towards alkaline pH. Protein Eng. 15, 141–145 (2002).
Li, Q., Jiang, T., Liu, R., Feng, X. & Li, C. Tuning the pH profile of β-glucuronidase by rational site-directed mutagenesis for efficient transformation of glycyrrhizin. Appl. Microbiol. Biotechnol. 103, 4813–4823 (2019).
Pokhrel, S., Joo, J. C. & Yoo, Y. J. Shifting the optimum pH of Bacillus circulans xylanase towards acidic side by introducing arginine. Biotechnol. Bioprocess Eng. 18, 35–42 (2013).
Carvalho, D. V., Pereira, E. M. & Cardoso, J. S. Machine learning interpretability: a survey on methods and metrics. Electronics 8, 832 (2019).
Olsson, M. H. M., Søndergaard, C. R., Rostkowski, M. & Jensen, J. H. PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions. J. Chem. Theory Comput. 7, 525–537 (2011).
Talley, K. & Alexov, E. On the pH-optimum of activity and stability of proteins. Proteins 78, 2699–2706 (2010).
Alexov, E. Numerical calculations of the pH of maximal protein stability. The effect of the sequence composition and three-dimensional structure. Eur. J. Biochem. 271, 173–185 (2004).
Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).
Pak, M. A., Dovidchenko, N. V., Sharma, S. M. & Ivankov, D. N. New mega dataset combined with deep neural network makes a progress in predicting impact of mutation on protein stability. Preprint at bioRxiv https://doi.org/10.1101/2022.12.31.522396 (2023).
Kroll, A., Rousset, Y., Hu, X.-P., Liebrand, N. A. & Lercher, M. J. Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning. Nat. Commun. 14, 4139 (2023).
Gelman, S., Fahlberg, S. A., Heinzelman, P., Romero, P. A. & Gitter, A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).
Dallago, C. et al. FLIP: benchmark tasks in fitness landscape inference for proteins. Preprint at bioRxiv https://doi.org/10.1101/2021.11.09.467890 (2022).
Sledzieski, S. et al. Democratizing protein language models with parameter-efficient fine-tuning. Proc. Natl Acad. Sci. USA 121, e2405840121 (2024).
Xu, M. et al. PEER: a comprehensive and multi-task benchmark for protein sequence understanding. Adv. Neural Inf. Process. Syst. 35, 35156–35173 (2022).
Ferdous, S., Shihab, I. F. & Reuel, N. F. Effects of sequence features on machine-learned enzyme classification fidelity. Biochem. Eng. J. 187, 108612 (2022).
Li, G., Rabe, K. S., Nielsen, J. & Engqvist, M. K. M. Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima. ACS Synth. Biol. 8, 1411–1420 (2019).
Liu, H., HaoChen, J. Z., Gaidon, A. & Ma, T. Self-supervised learning is more robust to dataset imbalance. Preprint at https://arxiv.org/abs/2110.05025 (2022).
Zaretckii, M., Buslaev, P., Kozlovskii, I., Morozov, A. & Popov, P. Approaching optimal pH enzyme prediction with large language models. ACS Synth. Biol. https://doi.org/10.1021/acssynbio.4c00465 (2024).
Song, Y. et al. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat. Commun. 15, 8180 (2024).
Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).
UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Hie, B. L. & Yang, K. K. Adaptive machine learning for protein engineering. Curr. Opin. Struct. Biol. 72, 145–152 (2022).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
Strokach, A. & Kim, P. M. Deep generative modeling for protein design. Curr. Opin. Struct. Biol. 72, 226–236 (2022).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Cui, Y., Jia, M., Lin, T.-Y., Song, Y. & Belongie, S. Class-balanced loss based on effective number of samples. In Proc. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2019).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).
Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).
van den Oord, A. et al. WaveNet: a generative model for raw audio. Preprint at https://arxiv.org/abs/1609.03499 (2016).
Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (eds Moschitti, A. et al.) 1724–1734 (Association for Computational Linguistics, 2014); https://doi.org/10.3115/v1/D14-1179
Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 1310–1318 (PMLR, 2013).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 8024–8035 (Curran Associates, 2019).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
Shrake, A. & Rupley, J. A. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol. 79, 351–371 (1973).
Rost, B. & Sander, C. Conservation and prediction of solvent accessibility in protein families. Proteins 20, 216–226 (1994).
Savojardo, C., Manfredi, M., Martelli, P. L. & Casadio, R. Solvent accessibility of residues undergoing pathogenic variations in humans: from protein structures to protein sequences. Front. Mol. Biosci. 7, 626363 (2021).
Gado, J. E. et al. Machine learning prediction of enzyme optimal pH. Zenodo https://doi.org/10.5281/ZENODO.14252615 (2023).
Gado, J. jafetgado/EpHod: v1.0.0. Zenodo https://doi.org/10.5281/ZENODO.15015125 (2025).
Austin, H. P. et al. Characterization and engineering of a plastic-degrading aromatic polyesterase. Proc. Natl Acad. Sci. USA 115, E4350–E4357 (2018).
Acknowledgements
We thank R. Estanboulieh and R. Orenbuch for their support in visualizations and analyses. We thank J. Law, E. Komp, A. Kollasch, D. Ritter, P. Notin, J. Frazer and M. Dias for helpful discussions. This material is based upon work supported by the US Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomic Science Program under award no. DE-SC0022024 to N.P.G. and G.T.B. This work was authored in part by Alliance for Sustainable Energy, LLC, the manager and operator of the National Renewable Energy Laboratory for the US Department of Energy (DOE) under contract no. DE-AC36-08GO28308. Partial funding to J.E.G. and G.T.B. was provided by the US Department of Energy, Office of Energy Efficiency and Renewable Energy, Advanced Materials and Manufacturing Technologies Office (AMMTO) and Bioenergy Technologies Office (BETO) as part of the BOTTLE Consortium and was supported by AMMTO and BETO under contract no. DE-AC36-08GO28308 with the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC. Partial funding to J.E.G. and G.T.B. was also provided by the US Department of Energy Office of Energy Efficiency and Renewable Energy BETO for the Agile BioFoundry. We thank G. Bentley at DOE and members of the Agile BioFoundry for helpful discussions. Partial funding to J.E.G. and G.T.B. was also provided by the US Department of Energy Office of Science Biological and Environmental Research via DE-SC0023278. The views expressed herein do not necessarily represent the views of the DOE or the US Government. The US Government, and the publisher, by accepting the article for publication, acknowledges that the US Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for US Government purposes.
Author information
Contributions
G.T.B. and J.E.G. conceived of the project, and J.E.G., D.M., N.P.G., C.S. and G.T.B. designed the study. J.E.G. trained the predictive models, and A.Y.S., M.K. and J.E.G. evaluated the models. The paper was written by J.E.G. and edited and approved by all authors.
Ethics declarations
Competing interests
D.M. is an advisor for Dyno Therapeutics, Octant, Jura Bio, Tectonic Therapeutic and Genentech, and a cofounder of Seismic Therapeutics. C.S. is an advisor for CytoReason Ltd. G.T.B. is an advisor for Bluestem Biosciences, Crysalis Biosciences and Samsara Eco. The other authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Jiri Damborsky, Martin Lercher and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–13 and Tables 1–7.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gado, J.E., Knotts, M., Shaw, A.Y. et al. Machine learning prediction of enzyme optimum pH. Nat Mach Intell 7, 716–729 (2025). https://doi.org/10.1038/s42256-025-01026-6
DOI: https://doi.org/10.1038/s42256-025-01026-6