Machine learning prediction of enzyme optimum pH

Gado, Japheth E.; Knotts, Matthew; Shaw, Ada Y.; Marks, Debora; Gauthier, Nicholas P.; Sander, Chris; Beckham, Gregg T.

doi:10.1038/s42256-025-01026-6

Article
Published: 29 April 2025

Nature Machine Intelligence volume 7, pages 716–729 (2025)

A preprint version of the article is available at bioRxiv.

Abstract

The relationship between pH and enzyme catalytic activity, especially the optimal pH (pH_opt) at which enzymes function, is critical for biotechnological applications. Hence, computational methods to predict pH_opt will enhance enzyme discovery and design by facilitating accurate identification of enzymes that function optimally at specific pH levels, and by elucidating sequence–function relationships. Here we propose and evaluate various machine learning methods for predicting pH_opt, conducting extensive hyperparameter optimization and training over 11,000 model instances. Our results demonstrate that models utilizing language model embeddings markedly outperform other methods in predicting pH_opt. We present EpHod, the best-performing model, to predict pH_opt, making it publicly available to researchers. From sequence data, EpHod directly learns structural and biophysical features that relate to pH_opt, including proximity of residues to the catalytic centre and the accessibility of solvent molecules. Overall, EpHod presents a promising advancement in pH_opt prediction and will potentially speed up the development of enzyme technologies.

**Fig. 1: Overview of model training and the distribution of the training data.**

**Fig. 2: Performance of optimal models from each method on the complete held-out pH_opt testing set (n = 1,971) and on a subset of the testing set with less than 20% sequence identity to the training set (n = 999).**

**Fig. 3: EpHod training and evaluation of performance.**

**Fig. 4: Analysis of attention weights in EpHod’s RLAT model revealing physicochemical and structural features associated with pH_opt using the full dataset (n = 9,855).**

**Fig. 5: Visualization of per-residue attention weights of EpHod’s RLAT model on selected enzyme protein structures.**

**Fig. 6: Comparison of the predictive performance of EpHod with alternative structural and biophysical methods on the full held-out testing set (n = 1,971).**

Data availability

Data for training the models, as well as the weights of the optimal EpHod model, are available via Zenodo at https://doi.org/10.5281/zenodo.14252615 (ref. ¹⁰⁴).

Code availability

Code for EpHod and training the other models is available via GitHub at https://github.com/beckham-lab/EpHod (ref. ¹⁰⁵).

References

Barroca, M. et al. Deciphering the factors defining the pH-dependence of a commercial glycoside hydrolase family 8 enzyme. Enzyme Microb. Technol. 96, 163–169 (2017).
Reed, C. J., Lewis, H., Trejo, E., Winston, V. & Evilia, C. Protein adaptations in Archaeal extremophiles. Archaea 2013, 373275 (2013).
Protze, J. et al. An extracellular tetrathionate hydrolase from the thermoacidophilic archaeon Acidianus ambivalens with an activity optimum at pH 1. Front. Microbiol. 2, 68 (2011).
Pradeep, G. C. et al. An extremely alkaline novel chitinase from Streptomyces sp. CS495. Process Biochem. 49, 223–229 (2014).
Ferrer, M., Golyshina, O., Beloqui, A. & Golyshin, P. N. Mining enzymes from extreme environments. Curr. Opin. Microbiol. 10, 207–214 (2007).
Thomas, N. et al. Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening. Preprint at bioRxiv https://doi.org/10.1101/2024.03.21.585615 (2024).
Verma, D. & Satyanarayana, T. Xylanolytic extremozymes retrieved from environmental metagenomes: characteristics, genetic engineering, and applications. Front. Microbiol. 11, 551109 (2020).
Shahraki, M. F. et al. A computational learning paradigm to targeted discovery of biocatalysts from metagenomic data: a case study of lipase identification. Biotechnol. Bioeng. 119, 1115–1128 (2022).
Erickson, E. et al. Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity. Nat. Commun. 13, 7850 (2022).
Wang, C.-H., Liu, X.-L., Huang, R.-B., He, B.-F. & Zhao, M.-M. Enhanced acidic adaptation of Bacillus subtilis Ca-independent alpha-amylase by rational engineering of pK_a values. Biochem. Eng. J. 139, 146–153 (2018).
dos Santos, J. P., da Rosa Zavareze, E., Dias, A. R. G. & Vanier, N. L. Immobilization of xylanase and xylanase-β-cyclodextrin complex in polyvinyl alcohol via electrospinning improves enzyme activity at a wide pH and temperature range. Int. J. Biol. Macromol. 118, 1676–1684 (2018).
Giri, P., Pagar, A. D., Patil, M. D. & Yun, H. Chemical modification of enzymes to improve biocatalytic performance. Biotechnol. Adv. 53, 107868 (2021).
Xue, Y. et al. Chemical modification of stem bromelain with anhydride groups to enhance its stability and catalytic activity. J. Mol. Catal. B 63, 188–193 (2010).
Li, C. Effects of chemical modification by chitooligosaccharide on enzyme activity and stability of yeast β-d-fructofuranosidase. Enzyme Microb. Technol. 64–65, 24–32 (2014).
Li, S.-F., Cheng, F., Wang, Y.-J. & Zheng, Y.-G. Strategies for tailoring pH performances of glycoside hydrolases. Crit. Rev. Biotechnol. 43, 121–141 (2023).
Shi, X., Wu, D., Xu, Y. & Yu, X. Engineering the optimum pH of β-galactosidase from Aspergillus oryzae for efficient hydrolysis of lactose. J. Dairy Sci. 105, 4772–4782 (2022).
Hebditch, M. & Warwicker, J. Web-based display of protein surface and pH-dependent properties for assessing the developability of biotherapeutics. Sci. Rep. 9, 1969 (2019).
Schmitz, M. et al. patcHwork: a user-friendly pH sensitivity analysis web server for protein sequences and structures. Nucleic Acids Res. 50, W560–W567 (2022).
Oeller, M. et al. Sequence-based prediction of pH-dependent protein solubility using CamSol. Brief. Bioinform. 24, bbad004 (2023).
Zhang, G., Li, H. & Fang, B. Discriminating acidic and alkaline enzymes using a random forest model with secondary structure amino acid composition. Process Biochem. 44, 654–660 (2009).
Lin, H., Chen, W. & Ding, H. AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes. PLoS ONE 8, e75726 (2013).
Fan, G.-L., Li, Q.-Z. & Zuo, Y.-C. Predicting acidic and alkaline enzymes by incorporating the average chemical shift and gene ontology informations into the general form of Chou’s PseAAC. Process Biochem. 48, 1048–1053 (2013).
Khan, Z. U., Hayat, M. & Khan, M. A. Discrimination of acidic and alkaline enzyme using Chou’s pseudo amino acid composition in conjunction with probabilistic neural network model. J. Theor. Biol. 365, 197–203 (2015).
Yan, S. & Wu, G. Predicting pH optimum for activity of beta-glucosidases. J. Biomed. Sci. Eng. 12, 354–367 (2019).
Wang, X., Li, H., Gao, P., Liu, Y. & Zeng, W. Combining support vector machine with dual g-gap dipeptides to discriminate between acidic and alkaline enzymes. Lett. Org. Chem. 16, 325–331 (2019).
Li, X. et al. A sequence embedding method for enzyme optimal condition analysis. BMC Bioinform. 21, 512 (2020).
Schomburg, I. et al. The BRENDA enzyme information system—from a database to an expert system. J. Biotechnol. 261, 194–206 (2017).
Puissant, J. et al. The pH optimum of soil exoenzymes adapt to long term changes in soil pH. Soil Biol. Biochem. 138, 107601 (2019).
Li, G. et al. Learning deep representations of enzyme thermal adaptation. Protein Sci. 31, e4480 (2022).
Reimer, L. C. et al. BacDive in 2022: the knowledge base for standardized bacterial and archaeal data. Nucleic Acids Res. 50, D741–D746 (2022).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 49, D10–D17 (2021).
Teufel, F. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 40, 1023–1025 (2022).
Booth, I. R. Regulation of cytoplasmic pH in bacteria. Microbiol. Rev. 49, 359–378 (1985).
Baker-Austin, C. & Dopson, M. Life in acid: pH homeostasis in acidophiles. Trends Microbiol. 15, 165–171 (2007).
Hough, D. W. & Danson, M. J. Extremozymes. Curr. Opin. Chem. Biol. 3, 39–46 (1999).
Tan, C. et al. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks (eds Kůrková, V. et al.) 270–279 (Springer, 2018).
Gado, J. E., Beckham, G. T. & Payne, C. M. Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning. J. Chem. Inf. Model. 60, 4098–4107 (2020).
Branco, P., Torgo, L. & Ribeiro, R. P. Pre-processing approaches for imbalanced distributions in regression. Neurocomputing 343, 76–99 (2019).
Yang, Y., Zha, K., Chen, Y.-C., Wang, H. & Katabi, D. Delving into deep imbalanced regression. In Proc. 38th International Conference on Machine Learning (eds Meila, M. and Zhang, T.) 11842–11851 (PMLR, 2021).
Chen, Z. et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34, 2499–2502 (2018).
Unsal, S. et al. Learning functional properties of proteins with language models. Nat. Mach. Intell. 4, 227–245 (2022).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems (eds Ranzato, M. et al.) 29287–29303 (Curran Associates, 2021).
Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 16990–17017 (PMLR, 2022).
Nijkamp, E., Ruffolo, J., Weinstein, E. N., Naik, N. & Madani, A. ProGen2: exploring the boundaries of protein language models. Cell Syst. 14, 968–978.e3 (2023).
Yang, K. K., Lu, A. X. & Fusi, N. Convolutions are competitive with transformers for protein sequence pretraining. Cell Syst. 15, 286–294 (2022).
Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinform. Adv. 1, vbab035 (2021).
Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
Li, G. et al. Performance of regression models as a function of experiment noise. Bioinform. Biol. Insights 15, 11779322211020315 (2021).
Detlefsen, N. S., Hauberg, S. & Boomsma, W. Learning meaningful representations of protein sequences. Nat. Commun. 13, 1914 (2022).
Kroll, A. & Lercher, M. J. Machine learning models for the prediction of enzyme properties should be tested on proteins not used for model training. Preprint at bioRxiv https://doi.org/10.1101/2023.02.06.526991 (2023).
Suplatov, D. et al. Computational design of a pH stable enzyme: understanding molecular mechanism of penicillin acylase’s adaptation to alkaline conditions. PLoS ONE 9, e100643 (2014).
Huang, Y., Krauss, G., Cottaz, S., Driguez, H. & Lipps, G. A highly acid-stable and thermostable endo-β-glucanase from the thermoacidophilic archaeon Sulfolobus solfataricus. Biochem. J. 385, 581–588 (2005).
Mamo, G., Thunnissen, M., Hatti-Kaul, R. & Mattiasson, B. An alkaline active xylanase: insights into mechanisms of high pH catalytic adaptation. Biochimie 91, 1187–1196 (2009).
Wang, Y., Xu, M., Yang, T., Zhang, X. & Rao, Z. Surface charge-based rational design of aspartase modifies the optimal pH for efficient β-aminobutyric acid production. Int. J. Biol. Macromol. 164, 4165–4172 (2020).
Jakob, F. et al. Surface charge engineering of a Bacillus gibsonii subtilisin protease. Appl. Microbiol. Biotechnol. 97, 6793–6802 (2013).
Yang, T. et al. N20D/N116E combined mutant downward shifted the pH optimum of Bacillus subtilis NADH oxidase. Biology 12, 522 (2023).
Masui, A., Fujiwara, N., Yamamoto, K., Takagi, M. & Imanaka, T. Rational design for stabilization and optimum pH shift of serine protease AprN. J. Ferment. Bioeng. 85, 30–36 (1998).
Turunen, O., Vuorio, M., Fenel, F. & Leisola, M. Engineering of multiple arginines into the Ser/Thr surface of Trichoderma reesei endo-1,4-β-xylanase II increases the thermotolerance and shifts the pH optimum towards alkaline pH. Protein Eng. 15, 141–145 (2002).
Li, Q., Jiang, T., Liu, R., Feng, X. & Li, C. Tuning the pH profile of β-glucuronidase by rational site-directed mutagenesis for efficient transformation of glycyrrhizin. Appl. Microbiol. Biotechnol. 103, 4813–4823 (2019).
Pokhrel, S., Joo, J. C. & Yoo, Y. J. Shifting the optimum pH of Bacillus circulans xylanase towards acidic side by introducing arginine. Biotechnol. Bioprocess Eng. 18, 35–42 (2013).
Carvalho, D. V., Pereira, E. M. & Cardoso, J. S. Machine learning interpretability: a survey on methods and metrics. Electronics 8, 832 (2019).
Olsson, M. H. M., Søndergaard, C. R., Rostkowski, M. & Jensen, J. H. PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions. J. Chem. Theory Comput. 7, 525–537 (2011).
Talley, K. & Alexov, E. On the pH-optimum of activity and stability of proteins. Proteins 78, 2699–2706 (2010).
Alexov, E. Numerical calculations of the pH of maximal protein stability. The effect of the sequence composition and three-dimensional structure. Eur. J. Biochem. 271, 173–185 (2004).
Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).
Pak, M. A., Dovidchenko, N. V., Sharma, S. M. & Ivankov, D. N. New mega dataset combined with deep neural network makes a progress in predicting impact of mutation on protein stability. Preprint at bioRxiv https://doi.org/10.1101/2022.12.31.522396 (2023).
Kroll, A., Rousset, Y., Hu, X.-P., Liebrand, N. A. & Lercher, M. J. Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning. Nat. Commun. 14, 4139 (2023).
Gelman, S., Fahlberg, S. A., Heinzelman, P., Romero, P. A. & Gitter, A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).
Dallago, C. et al. FLIP: benchmark tasks in fitness landscape inference for proteins. Preprint at bioRxiv https://doi.org/10.1101/2021.11.09.467890 (2022).
Sledzieski, S. et al. Democratizing protein language models with parameter-efficient fine-tuning. Proc. Natl Acad. Sci. USA 121, e2405840121 (2024).
Xu, M. et al. PEER: a comprehensive and multi-task benchmark for protein sequence understanding. Adv. Neural Inf. Process. Syst. 35, 35156–35173 (2022).
Ferdous, S., Shihab, I. F. & Reuel, N. F. Effects of sequence features on machine-learned enzyme classification fidelity. Biochem. Eng. J. 187, 108612 (2022).
Li, G., Rabe, K. S., Nielsen, J. & Engqvist, M. K. M. Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima. ACS Synth. Biol. 8, 1411–1420 (2019).
Liu, H., HaoChen, J. Z., Gaidon, A. & Ma, T. Self-supervised learning is more robust to dataset imbalance. Preprint at https://arxiv.org/abs/2110.05025 (2022).
Zaretckii, M., Buslaev, P., Kozlovskii, I., Morozov, A. & Popov, P. Approaching optimal pH enzyme prediction with large language models. ACS Synth. Biol. https://doi.org/10.1021/acssynbio.4c00465 (2024).
Song, Y. et al. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat. Commun. 15, 8180 (2024).
Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).
UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Hie, B. L. & Yang, K. K. Adaptive machine learning for protein engineering. Curr. Opin. Struct. Biol. 72, 145–152 (2022).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
Strokach, A. & Kim, P. M. Deep generative modeling for protein design. Curr. Opin. Struct. Biol. 72, 226–236 (2022).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Cui, Y., Jia, M., Lin, T.-Y., Song, Y. & Belongie, S. Proc. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2019).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).
Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).
van den Oord, A. et al. WaveNet: a generative model for raw audio. Preprint at https://arxiv.org/abs/1609.03499 (2016).
Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (eds Moschitti, A. et al.) 1724–1734 (Association for Computational Linguistics, 2014); https://doi.org/10.3115/v1/D14-1179
Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 1310–1318 (PMLR, 2013).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 8024–8035 (Curran Associates, 2019).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
Shrake, A. & Rupley, J. A. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol. 79, 351–371 (1973).
Rost, B. & Sander, C. Conservation and prediction of solvent accessibility in protein families. Proteins 20, 216–226 (1994).
Savojardo, C., Manfredi, M., Martelli, P. L. & Casadio, R. Solvent accessibility of residues undergoing pathogenic variations in humans: from protein structures to protein sequences. Front. Mol. Biosci. 7, 626363 (2021).
Gado, J. E. et al. Machine learning prediction of enzyme optimal pH. Zenodo https://doi.org/10.5281/ZENODO.14252615 (2023).
Gado, J. jafetgado/EpHod: v1.0.0. Zenodo https://doi.org/10.5281/ZENODO.15015125 (2025).
Austin, H. P. et al. Characterization and engineering of a plastic-degrading aromatic polyesterase. Proc. Natl Acad. Sci. USA 115, E4350–E4357 (2018).

Acknowledgements

We thank R. Estanboulieh and R. Orenbuch for their support in visualizations and analyses. We thank J. Law, E. Komp, A. Kollasch, D. Ritter, P. Notin, J. Frazer and M. Dias for helpful discussions. This material is based upon work supported by the US Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomic Science Program under award no. DE-SC0022024 to N.P.G. and G.T.B. This work was authored in part by Alliance for Sustainable Energy, LLC, the manager and operator of the National Renewable Energy Laboratory for the US Department of Energy (DOE) under contract no. DE-AC36-08GO28308. Partial funding to J.E.G. and G.T.B. was provided by the US Department of Energy, Office of Energy Efficiency and Renewable Energy, Advanced Materials and Manufacturing Technologies Office (AMMTO) and Bioenergy Technologies Office (BETO) as part of the BOTTLE Consortium and was supported by AMMTO and BETO under contract no. DE-AC36-08GO28308 with the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC. Partial funding to J.E.G. and G.T.B. was also provided by the US Department of Energy Office of Energy Efficiency and Renewable Energy BETO for the Agile BioFoundry. We thank G. Bentley at DOE and members of the Agile BioFoundry for helpful discussions. Partial funding to J.E.G. and G.T.B. was also provided by the US Department of Energy Office of Science Biological and Environmental Research via DE-SC0023278. The views expressed herein do not necessarily represent the views of the DOE or the US Government. The US Government, and the publisher, by accepting the article for publication, acknowledges that the US Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for US Government purposes.

Author information

Authors and Affiliations

Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
Japheth E. Gado & Gregg T. Beckham
BOTTLE Consortium, Golden, CO, USA
Japheth E. Gado & Gregg T. Beckham
Agile BioFoundry, Emeryville, CA, USA
Japheth E. Gado & Gregg T. Beckham
Department of Systems Biology, Harvard Medical School, Boston, MA, USA
Japheth E. Gado, Matthew Knotts, Ada Y. Shaw, Debora Marks, Nicholas P. Gauthier & Chris Sander
Broad Institute of Harvard and MIT, Cambridge, MA, USA
Debora Marks & Chris Sander
Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
Nicholas P. Gauthier

Contributions

G.T.B. and J.E.G. conceived of the project, and J.E.G., D.M., N.P.G., C.S. and G.T.B. designed the study. J.E.G. trained the predictive models, and A.Y.S., M.K. and J.E.G. evaluated the models. The paper was written by J.E.G. and edited and approved by all authors.

Corresponding author

Correspondence to Gregg T. Beckham.

Ethics declarations

Competing interests

D.M. is an advisor for Dyno Therapeutics, Octant, Jura Bio, Tectonic Therapeutic and Genentech, and a cofounder of Seismic Therapeutics. C.S. is an advisor for CytoReason Ltd. G.T.B. is an advisor for Bluestem Biosciences, Crysalis Biosciences and Samsara Eco. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Jiri Damborsky, Martin Lercher and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–13 and Tables 1–7.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gado, J.E., Knotts, M., Shaw, A.Y. et al. Machine learning prediction of enzyme optimum pH. Nat Mach Intell 7, 716–729 (2025). https://doi.org/10.1038/s42256-025-01026-6

Received: 23 June 2023
Accepted: 18 March 2025
Published: 29 April 2025
Version of record: 29 April 2025
Issue date: May 2025
DOI: https://doi.org/10.1038/s42256-025-01026-6
