

Machine learning prediction of enzyme optimum pH

A preprint version of the article is available at bioRxiv.

Abstract

The relationship between pH and enzyme catalytic activity, especially the optimal pH (pHopt) at which enzymes function, is critical for biotechnological applications. Computational methods that predict pHopt would therefore enhance enzyme discovery and design, both by enabling accurate identification of enzymes that function optimally at a given pH and by elucidating sequence–function relationships. Here we propose and evaluate various machine learning methods for predicting pHopt, conducting extensive hyperparameter optimization and training over 11,000 model instances. Our results demonstrate that models utilizing language model embeddings markedly outperform other methods in predicting pHopt. We present the best-performing model, EpHod, and make it publicly available to researchers. Directly from sequence data, EpHod learns structural and biophysical features that relate to pHopt, including the proximity of residues to the catalytic centre and their solvent accessibility. Overall, EpHod presents a promising advance in pHopt prediction and could accelerate the development of enzyme technologies.
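The core idea behind embedding-based top models of this kind (each residue's language model embedding receives a learned attention score; the softmax-weighted pooled vector feeds a regression head that outputs pHopt) can be sketched as follows. This is a minimal NumPy illustration of attention-weighted pooling under stated assumptions, not the published EpHod architecture; the function name and toy weights are hypothetical.

```python
import numpy as np

def attention_pool_predict(embeddings, w_att, w_out, b_out):
    """Toy sketch of attention-weighted pooling for pHopt regression.

    embeddings : (L, D) array of per-residue language model embeddings
    w_att      : (D,) attention projection (hypothetical learned weights)
    w_out, b_out : linear head mapping the pooled vector to a pHopt value
    """
    scores = embeddings @ w_att                # one score per residue, shape (L,)
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()                   # attention weights sum to 1
    pooled = weights @ embeddings              # attention-weighted mean, shape (D,)
    return float(pooled @ w_out + b_out), weights

# Toy example: 2 residues with 2-dimensional embeddings
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
pred, att = attention_pool_predict(emb, np.zeros(2), np.array([2.0, 2.0]), 6.0)
# zero attention scores give uniform weights [0.5, 0.5];
# pooled = [0.5, 0.5], so prediction = 0.5*2 + 0.5*2 + 6 = 8.0
```

In a trained model, the attention weights themselves are interpretable: inspecting which residues receive high weight is how per-residue relevance maps (such as those in Figs. 4 and 5) can be derived.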


Fig. 1: Overview of model training and the distribution of the training data.
Fig. 2: Performance of optimal models from each method on the complete held-out pHopt testing set (n = 1,971) and on a subset of the testing set with less than 20% sequence identity to the training set (n = 999).
Fig. 3: EpHod training and evaluation of performance.
Fig. 4: Analysis of attention weights in EpHod’s RLAT model revealing physicochemical and structural features associated with pHopt using the full dataset (n = 9,855).
Fig. 5: Visualization of per-residue attention weights of EpHod’s RLAT model on selected enzyme protein structures.
Fig. 6: Comparison of the predictive performance of EpHod with alternative structural and biophysical methods on the full held-out testing set (n = 1,971).

Data availability

Data for training the models, as well as the weights of the optimal EpHod model, are available via Zenodo at https://doi.org/10.5281/zenodo.14252615 (ref. 104).

Code availability

Code for EpHod and training the other models is available via GitHub at https://github.com/beckham-lab/EpHod (ref. 105).

References

  1. Barroca, M. et al. Deciphering the factors defining the pH-dependence of a commercial glycoside hydrolase family 8 enzyme. Enzyme Microb. Technol. 96, 163–169 (2017).

  2. Reed, C. J., Lewis, H., Trejo, E., Winston, V. & Evilia, C. Protein adaptations in Archaeal extremophiles. Archaea 2013, 373275 (2013).

  3. Protze, J. et al. An extracellular tetrathionate hydrolase from the thermoacidophilic archaeon Acidianus ambivalens with an activity optimum at pH 1. Front. Microbiol. 2, 68 (2011).

  4. Pradeep, G. C. et al. An extremely alkaline novel chitinase from Streptomyces sp. CS495. Process Biochem. 49, 223–229 (2014).

  5. Ferrer, M., Golyshina, O., Beloqui, A. & Golyshin, P. N. Mining enzymes from extreme environments. Curr. Opin. Microbiol. 10, 207–214 (2007).

  6. Thomas, N. et al. Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening. Preprint at bioRxiv https://doi.org/10.1101/2024.03.21.585615 (2024).

  7. Verma, D. & Satyanarayana, T. Xylanolytic extremozymes retrieved from environmental metagenomes: characteristics, genetic engineering, and applications. Front. Microbiol. 11, 551109 (2020).

  8. Shahraki, M. F. et al. A computational learning paradigm to targeted discovery of biocatalysts from metagenomic data: a case study of lipase identification. Biotechnol. Bioeng. 119, 1115–1128 (2022).

  9. Erickson, E. et al. Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity. Nat. Commun. 13, 7850 (2022).

  10. Wang, C.-H., Liu, X.-L., Huang, R.-B., He, B.-F. & Zhao, M.-M. Enhanced acidic adaptation of Bacillus subtilis Ca-independent alpha-amylase by rational engineering of pKa values. Biochem. Eng. J. 139, 146–153 (2018).

  11. dos Santos, J. P., da Rosa Zavareze, E., Dias, A. R. G. & Vanier, N. L. Immobilization of xylanase and xylanase-β-cyclodextrin complex in polyvinyl alcohol via electrospinning improves enzyme activity at a wide pH and temperature range. Int. J. Biol. Macromol. 118, 1676–1684 (2018).

  12. Giri, P., Pagar, A. D., Patil, M. D. & Yun, H. Chemical modification of enzymes to improve biocatalytic performance. Biotechnol. Adv. 53, 107868 (2021).

  13. Xue, Y. et al. Chemical modification of stem bromelain with anhydride groups to enhance its stability and catalytic activity. J. Mol. Catal. B 63, 188–193 (2010).

  14. Li, C. Effects of chemical modification by chitooligosaccharide on enzyme activity and stability of yeast β-d-fructofuranosidase. Enzyme Microb. Technol. 64–65, 24–32 (2014).

  15. Li, S.-F., Cheng, F., Wang, Y.-J. & Zheng, Y.-G. Strategies for tailoring pH performances of glycoside hydrolases. Crit. Rev. Biotechnol. 43, 121–141 (2023).

  16. Shi, X., Wu, D., Xu, Y. & Yu, X. Engineering the optimum pH of β-galactosidase from Aspergillus oryzae for efficient hydrolysis of lactose. J. Dairy Sci. 105, 4772–4782 (2022).

  17. Hebditch, M. & Warwicker, J. Web-based display of protein surface and pH-dependent properties for assessing the developability of biotherapeutics. Sci. Rep. 9, 1969 (2019).

  18. Schmitz, M. et al. patcHwork: a user-friendly pH sensitivity analysis web server for protein sequences and structures. Nucleic Acids Res. 50, W560–W567 (2022).

  19. Oeller, M. et al. Sequence-based prediction of pH-dependent protein solubility using CamSol. Brief. Bioinform. 24, bbad004 (2023).

  20. Zhang, G., Li, H. & Fang, B. Discriminating acidic and alkaline enzymes using a random forest model with secondary structure amino acid composition. Process Biochem. 44, 654–660 (2009).

  21. Lin, H., Chen, W. & Ding, H. AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes. PLoS ONE 8, e75726 (2013).

  22. Fan, G.-L., Li, Q.-Z. & Zuo, Y.-C. Predicting acidic and alkaline enzymes by incorporating the average chemical shift and gene ontology informations into the general form of Chou’s PseAAC. Process Biochem. 48, 1048–1053 (2013).

  23. Khan, Z. U., Hayat, M. & Khan, M. A. Discrimination of acidic and alkaline enzyme using Chou’s pseudo amino acid composition in conjunction with probabilistic neural network model. J. Theor. Biol. 365, 197–203 (2015).

  24. Yan, S. & Wu, G. Predicting pH optimum for activity of beta-glucosidases. J. Biomed. Sci. Eng. 12, 354–367 (2019).

  25. Wang, X., Li, H., Gao, P., Liu, Y. & Zeng, W. Combining support vector machine with dual g-gap dipeptides to discriminate between acidic and alkaline enzymes. Lett. Org. Chem. 16, 325–331 (2019).

  26. Li, X. et al. A sequence embedding method for enzyme optimal condition analysis. BMC Bioinform. 21, 512 (2020).

  27. Schomburg, I. et al. The BRENDA enzyme information system—from a database to an expert system. J. Biotechnol. 261, 194–206 (2017).

  28. Puissant, J. et al. The pH optimum of soil exoenzymes adapt to long term changes in soil pH. Soil Biol. Biochem. 138, 107601 (2019).

  29. Li, G. et al. Learning deep representations of enzyme thermal adaptation. Protein Sci. 31, e4480 (2022).

  30. Reimer, L. C. et al. BacDive in 2022: the knowledge base for standardized bacterial and archaeal data. Nucleic Acids Res. 50, D741–D746 (2022).

  31. Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 49, D10–D17 (2021).

  32. Teufel, F. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 40, 1023–1025 (2022).

  33. Booth, I. R. Regulation of cytoplasmic pH in bacteria. Microbiol. Rev. 49, 359–378 (1985).

  34. Baker-Austin, C. & Dopson, M. Life in acid: pH homeostasis in acidophiles. Trends Microbiol. 15, 165–171 (2007).

  35. Hough, D. W. & Danson, M. J. Extremozymes. Curr. Opin. Chem. Biol. 3, 39–46 (1999).

  36. Tan, C. et al. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks (eds Kůrková, V. et al.) 270–279 (Springer, 2018).

  37. Gado, J. E., Beckham, G. T. & Payne, C. M. Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning. J. Chem. Inf. Model. 60, 4098–4107 (2020).

  38. Branco, P., Torgo, L. & Ribeiro, R. P. Pre-processing approaches for imbalanced distributions in regression. Neurocomputing 343, 76–99 (2019).

  39. Yang, Y., Zha, K., Chen, Y.-C., Wang, H. & Katabi, D. Delving into deep imbalanced regression. In Proc. 38th International Conference on Machine Learning (eds Meila, M. and Zhang, T.) 11842–11851 (PMLR, 2021).

  40. Chen, Z. et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34, 2499–2502 (2018).

  41. Unsal, S. et al. Learning functional properties of proteins with language models. Nat. Mach. Intell. 4, 227–245 (2022).

  42. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

  43. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems (eds Ranzato, M. et al.) 29287–29303 (Curran Associates, 2021).

  44. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).

  45. Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 16990–17017 (PMLR, 2022).

  46. Nijkamp, E., Ruffolo, J., Weinstein, E. N., Naik, N. & Madani, A. ProGen2: exploring the boundaries of protein language models. Cell Syst. 14, 968–978.e3 (2023).

  47. Yang, K. K., Lu, A. X. & Fusi, N. Convolutions are competitive with transformers for protein sequence pretraining. Cell Syst. 15, 286–294 (2022).

  48. Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinform. Adv. 1, vbab035 (2021).

  49. Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).

  50. Li, G. et al. Performance of regression models as a function of experiment noise. Bioinform. Biol. Insights 15, 11779322211020315 (2021).

  51. Detlefsen, N. S., Hauberg, S. & Boomsma, W. Learning meaningful representations of protein sequences. Nat. Commun. 13, 1914 (2022).

  52. Kroll, A. & Lercher, M. J. Machine learning models for the prediction of enzyme properties should be tested on proteins not used for model training. Preprint at bioRxiv https://doi.org/10.1101/2023.02.06.526991 (2023).

  53. Suplatov, D. et al. Computational design of a pH stable enzyme: understanding molecular mechanism of penicillin acylase’s adaptation to alkaline conditions. PLoS ONE 9, e100643 (2014).

  54. Huang, Y., Krauss, G., Cottaz, S., Driguez, H. & Lipps, G. A highly acid-stable and thermostable endo-β-glucanase from the thermoacidophilic archaeon Sulfolobus solfataricus. Biochem. J. 385, 581–588 (2005).

  55. Mamo, G., Thunnissen, M., Hatti-Kaul, R. & Mattiasson, B. An alkaline active xylanase: insights into mechanisms of high pH catalytic adaptation. Biochimie 91, 1187–1196 (2009).

  56. Wang, Y., Xu, M., Yang, T., Zhang, X. & Rao, Z. Surface charge-based rational design of aspartase modifies the optimal pH for efficient β-aminobutyric acid production. Int. J. Biol. Macromol. 164, 4165–4172 (2020).

  57. Jakob, F. et al. Surface charge engineering of a Bacillus gibsonii subtilisin protease. Appl. Microbiol. Biotechnol. 97, 6793–6802 (2013).

  58. Yang, T. et al. N20D/N116E combined mutant downward shifted the pH optimum of Bacillus subtilis NADH oxidase. Biology 12, 522 (2023).

  59. Masui, A., Fujiwara, N., Yamamoto, K., Takagi, M. & Imanaka, T. Rational design for stabilization and optimum pH shift of serine protease AprN. J. Ferment. Bioeng. 85, 30–36 (1998).

  60. Turunen, O., Vuorio, M., Fenel, F. & Leisola, M. Engineering of multiple arginines into the Ser/Thr surface of Trichoderma reesei endo-1,4-β-xylanase II increases the thermotolerance and shifts the pH optimum towards alkaline pH. Protein Eng. 15, 141–145 (2002).

  61. Li, Q., Jiang, T., Liu, R., Feng, X. & Li, C. Tuning the pH profile of β-glucuronidase by rational site-directed mutagenesis for efficient transformation of glycyrrhizin. Appl. Microbiol. Biotechnol. 103, 4813–4823 (2019).

  62. Pokhrel, S., Joo, J. C. & Yoo, Y. J. Shifting the optimum pH of Bacillus circulans xylanase towards acidic side by introducing arginine. Biotechnol. Bioprocess Eng. 18, 35–42 (2013).

  63. Carvalho, D. V., Pereira, E. M. & Cardoso, J. S. Machine learning interpretability: a survey on methods and metrics. Electronics 8, 832 (2019).

  64. Olsson, M. H. M., Søndergaard, C. R., Rostkowski, M. & Jensen, J. H. PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions. J. Chem. Theory Comput. 7, 525–537 (2011).

  65. Talley, K. & Alexov, E. On the pH-optimum of activity and stability of proteins. Proteins 78, 2699–2706 (2010).

  66. Alexov, E. Numerical calculations of the pH of maximal protein stability. The effect of the sequence composition and three-dimensional structure. Eur. J. Biochem. 271, 173–185 (2004).

  67. Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).

  68. Pak, M. A., Dovidchenko, N. V., Sharma, S. M. & Ivankov, D. N. New mega dataset combined with deep neural network makes a progress in predicting impact of mutation on protein stability. Preprint at bioRxiv https://doi.org/10.1101/2022.12.31.522396 (2023).

  69. Kroll, A., Rousset, Y., Hu, X.-P., Liebrand, N. A. & Lercher, M. J. Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning. Nat. Commun. 14, 4139 (2023).

  70. Gelman, S., Fahlberg, S. A., Heinzelman, P., Romero, P. A. & Gitter, A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).

  71. Dallago, C. et al. FLIP: benchmark tasks in fitness landscape inference for proteins. Preprint at bioRxiv https://doi.org/10.1101/2021.11.09.467890 (2022).

  72. Sledzieski, S. et al. Democratizing protein language models with parameter-efficient fine-tuning. Proc. Natl Acad. Sci. USA 121, e2405840121 (2024).

  73. Xu, M. et al. PEER: a comprehensive and multi-task benchmark for protein sequence understanding. Adv. Neural Inf. Process. Syst. 35, 35156–35173 (2022).

  74. Ferdous, S., Shihab, I. F. & Reuel, N. F. Effects of sequence features on machine-learned enzyme classification fidelity. Biochem. Eng. J. 187, 108612 (2022).

  75. Li, G., Rabe, K. S., Nielsen, J. & Engqvist, M. K. M. Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima. ACS Synth. Biol. 8, 1411–1420 (2019).

  76. Liu, H., HaoChen, J. Z., Gaidon, A. & Ma, T. Self-supervised learning is more robust to dataset imbalance. Preprint at https://arxiv.org/abs/2110.05025 (2022).

  77. Zaretckii, M., Buslaev, P., Kozlovskii, I., Morozov, A. & Popov, P. Approaching optimal pH enzyme prediction with large language models. ACS Synth. Biol. https://doi.org/10.1021/acssynbio.4c00465 (2024).

  78. Song, Y. et al. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat. Commun. 15, 8180 (2024).

  79. Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).

  80. UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).

  81. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).

  82. Hie, B. L. & Yang, K. K. Adaptive machine learning for protein engineering. Curr. Opin. Struct. Biol. 72, 145–152 (2022).

  83. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).

  84. Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).

  85. Strokach, A. & Kim, P. M. Deep generative modeling for protein design. Curr. Opin. Struct. Biol. 72, 226–236 (2022).

  86. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

  87. Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).

  88. Cui, Y., Jia, M., Lin, T.-Y., Song, Y. & Belongie, S. Class-balanced loss based on effective number of samples. In Proc. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2019).

  89. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

  90. Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).

  91. Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).

  92. van den Oord, A. et al. WaveNet: a generative model for raw audio. Preprint at https://arxiv.org/abs/1609.03499 (2016).

  93. Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (eds Moschitti, A. et al.) 1724–1734 (Association for Computational Linguistics, 2014); https://doi.org/10.3115/v1/D14-1179

  94. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 1310–1318 (PMLR, 2013).

  95. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  96. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

  97. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  98. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 8024–8035 (Curran Associates, 2019).

  99. Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).

  100. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).

  101. Shrake, A. & Rupley, J. A. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol. 79, 351–371 (1973).

  102. Rost, B. & Sander, C. Conservation and prediction of solvent accessibility in protein families. Proteins 20, 216–226 (1994).

  103. Savojardo, C., Manfredi, M., Martelli, P. L. & Casadio, R. Solvent accessibility of residues undergoing pathogenic variations in humans: from protein structures to protein sequences. Front. Mol. Biosci. 7, 626363 (2021).

  104. Gado, J. E. et al. Machine learning prediction of enzyme optimal pH. Zenodo https://doi.org/10.5281/ZENODO.14252615 (2023).

  105. Gado, J. jafetgado/EpHod: v1.0.0. Zenodo https://doi.org/10.5281/ZENODO.15015125 (2025).

  106. Austin, H. P. et al. Characterization and engineering of a plastic-degrading aromatic polyesterase. Proc. Natl Acad. Sci. USA 115, E4350–E4357 (2018).

Acknowledgements

We thank R. Estanboulieh and R. Orenbuch for their support in visualizations and analyses. We thank J. Law, E. Komp, A. Kollasch, D. Ritter, P. Notin, J. Frazer and M. Dias for helpful discussions. This material is based upon work supported by the US Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomic Science Program under award no. DE-SC0022024 to N.P.G. and G.T.B. This work was authored in part by Alliance for Sustainable Energy, LLC, the manager and operator of the National Renewable Energy Laboratory for the US Department of Energy (DOE) under contract no. DE-AC36-08GO28308. Partial funding to J.E.G. and G.T.B. was provided by the US Department of Energy, Office of Energy Efficiency and Renewable Energy, Advanced Materials and Manufacturing Technologies Office (AMMTO) and Bioenergy Technologies Office (BETO) as part of the BOTTLE Consortium and was supported by AMMTO and BETO under contract no. DE-AC36-08GO28308 with the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC. Partial funding to J.E.G. and G.T.B. was also provided by the US Department of Energy Office of Energy Efficiency and Renewable Energy BETO for the Agile BioFoundry. We thank G. Bentley at DOE and members of the Agile BioFoundry for helpful discussions. Partial funding to J.E.G. and G.T.B. was also provided by the US Department of Energy Office of Science Biological and Environmental Research via DE-SC0023278. The views expressed herein do not necessarily represent the views of the DOE or the US Government. The US Government, and the publisher, by accepting the article for publication, acknowledges that the US Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for US Government purposes.

Author information

Contributions

G.T.B. and J.E.G. conceived of the project, and J.E.G., D.M., N.P.G., C.S. and G.T.B. designed the study. J.E.G. trained the predictive models, and A.Y.S., M.K. and J.E.G. evaluated the models. The paper was written by J.E.G. and edited and approved by all authors.

Corresponding author

Correspondence to Gregg T. Beckham.

Ethics declarations

Competing interests

D.M. is an advisor for Dyno Therapeutics, Octant, Jura Bio, Tectonic Therapeutic and Genentech, and a cofounder of Seismic Therapeutics. C.S. is an advisor for CytoReason Ltd. G.T.B. is an advisor for Bluestem Biosciences, Crysalis Biosciences and Samsara Eco. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Jiri Damborsky, Martin Lercher and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gado, J.E., Knotts, M., Shaw, A.Y. et al. Machine learning prediction of enzyme optimum pH. Nat Mach Intell 7, 716–729 (2025). https://doi.org/10.1038/s42256-025-01026-6

