

Advancing molecular machine learning representations with stereoelectronics-infused molecular graphs

A preprint version of the article is available at arXiv.

Abstract

Molecular representation is a critical element in our understanding of the physical world and the foundation for modern molecular machine learning. Previous molecular machine learning models have used strings, fingerprints, global features and simple molecular graphs that are inherently information-sparse representations. However, as the complexity of prediction tasks increases, the molecular representation needs to encode higher fidelity information. This work introduces a new approach to infusing quantum-chemical-rich information into molecular graphs via stereoelectronic effects, enhancing expressivity and interpretability. Learning to predict the stereoelectronics-infused representation with a tailored double graph neural network workflow enables its application to any downstream molecular machine learning task without expensive quantum-chemical calculations. We show that the explicit addition of stereoelectronic information substantially improves the performance of message-passing two-dimensional machine learning models for molecular property prediction. We show that the learned representations trained on small molecules can accurately extrapolate to much larger molecular structures, yielding chemical insight into orbital interactions for previously intractable systems, such as entire proteins, opening new avenues of molecular design. Finally, we have developed a web application (simg.cheme.cmu.edu) where users can rapidly explore stereoelectronic information for their own molecular systems.
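
To make the "double graph neural network" workflow concrete, the sketch below chains two message-passing networks in PyTorch Geometric: the first predicts per-atom stereoelectronic (orbital-level) descriptors from a plain molecular graph, and the second consumes the descriptor-augmented graph for property prediction. This is only an illustrative assumption about the data flow, not the authors' published architecture; the class names, layer types and dimensions are hypothetical.

    import torch
    from torch_geometric.data import Data
    from torch_geometric.nn import GCNConv, global_mean_pool

    class OrbitalFeatureGNN(torch.nn.Module):
        """GNN #1 (hypothetical): plain molecular graph -> per-atom orbital descriptors."""
        def __init__(self, in_dim=16, hidden=64, orb_dim=8):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden)
            self.conv2 = GCNConv(hidden, orb_dim)

        def forward(self, x, edge_index):
            h = torch.relu(self.conv1(x, edge_index))
            return self.conv2(h, edge_index)  # predicted stereoelectronic features

    class PropertyGNN(torch.nn.Module):
        """GNN #2 (hypothetical): descriptor-augmented graph -> molecular property."""
        def __init__(self, in_dim=16 + 8, hidden=64):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden)
            self.readout = torch.nn.Linear(hidden, 1)

        def forward(self, x, edge_index, batch):
            h = torch.relu(self.conv1(x, edge_index))
            return self.readout(global_mean_pool(h, batch))

    # Toy three-atom molecule with random node features, shown only to trace the data flow.
    mol = Data(x=torch.randn(3, 16),
               edge_index=torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]]))
    orbital_features = OrbitalFeatureGNN()(mol.x, mol.edge_index)
    x_augmented = torch.cat([mol.x, orbital_features], dim=-1)  # "infuse" orbital info
    prediction = PropertyGNN()(x_augmented, mol.edge_index,
                               torch.zeros(3, dtype=torch.long))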


Fig. 1: Common molecular representations and overview of our approach.
Fig. 2: Approach to SIMG* construction.
Fig. 3: Active learning approach to select additional NBO data to generate for the model.
Fig. 4: Quality of SIMG* predictions.
Fig. 5: Assessment of model performance in identifying structural features and graph–distant proximal orbital interactions in proteins.
Fig. 6: Property prediction performance of models employing different molecular representations.


Data availability

The data and model weights are available at https://huggingface.co/gomesgroup/simg.
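
For readers who want to retrieve the data and weights programmatically, the snippet below is a minimal sketch using the huggingface_hub client; the internal layout of the repository is not described here, so only a full snapshot download is shown.

    # Minimal sketch: download the gomesgroup/simg repository (data and model weights)
    # into the local Hugging Face cache and return the local path. If the files are
    # hosted as a dataset rather than a model repository, pass repo_type="dataset".
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="gomesgroup/simg")
    print(f"Repository contents downloaded to {local_dir}")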

Code availability

The code is available at https://github.com/gomesgroup/simg (ref. 64). A web application is available at https://simg.cheme.cmu.edu where users can rapidly explore stereoelectronic information for their own molecular systems.

References

  1. Hoffmann, R. & Laszlo, P. Representation in chemistry. Angew. Chem. Int. Ed. Engl. 30, 1–16 (1991).


  2. Cooke, H. A historical study of structures for communication of organic chemistry information prior to 1950. Org. Biomol. Chem. 2, 3179 (2004).


  3. Springer, M. T. Improving students’ understanding of molecular structure through broad-based use of computer models in the undergraduate organic chemistry lecture. J. Chem. Educ. 91, 1162–1168 (2014).


  4. Gómez-Bombarelli, R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 15, 1120–1127 (2016).


  5. Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).


  6. Dara, S., Dhamercherla, S., Jadav, S. S., Babu, C. M. & Ahsan, M. J. Machine learning in drug discovery: a review. Artif. Intell. Rev. 55, 1947–1999 (2022).


  7. Gallegos, L. C., Luchini, G., St. John, P. C., Kim, S. & Paton, R. S. Importance of engineered and learned molecular representations in predicting organic reactivity, selectivity, and chemical properties. Acc. Chem. Res. 54, 827–836 (2021).


  8. Sandfort, F., Strieth-Kalthoff, F., Kühnemund, M., Beecks, C. & Glorius, F. A structure-based platform for predicting chemical reactivity. Chem 6, 1379–1390 (2020).


  9. Ross, J. et al. Large-scale chemical language representations capture molecular structure and properties. Nat. Mach. Intell. 4, 1256–1264 (2022).


  10. Yang, Z., Chakraborty, M. & White, A. D. Predicting chemical shifts with graph neural networks. Chem. Sci. 12, 10802–10809 (2021).


  11. Zhou, J. et al. Graph neural networks: a review of methods and applications. AI Open 1, 57–81 (2020).


  12. Fang, X. et al. Geometry-enhanced molecular representation learning for property prediction. Nat. Mach. Intell. 4, 127–134 (2022).


  13. Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).


  14. Qi, Y., Gong, W. & Yan, Q. Bridging deep learning force fields and electronic structures with a physics-informed approach. Preprint at https://doi.org/10.48550/arXiv.2403.13675 (2024).

  15. Fabrizio, A., Briling, K. R. & Corminboeuf, C. SPAHM: the spectrum of approximated Hamiltonian matrices representations. Digital Discovery 1, 286–294 (2022).


  16. Elton, D. C., Boukouvalas, Z., Butrico, M. S., Fuge, M. D. & Chung, P. W. Applying machine learning techniques to predict the properties of energetic materials. Sci. Rep. 8, 9059 (2018).


  17. Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108, 058301 (2012).


  18. Pozdnyakov, S. N. & Ceriotti, M. Smooth, exact rotational symmetrization for deep learning on point clouds. Preprint at https://doi.org/10.48550/arXiv.2305.19302 (2023).

  19. Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).


  20. Černý, J. & Hobza, P. Non-covalent interactions in biomacromolecules. Phys. Chem. Chem. Phys. 9, 5291 (2007).


  21. Anighoro, A. in Quantum Mechanics in Drug Discovery (ed. Heifetz, A.) 75–86 (Humana, Springer, 2020).

  22. Wheeler, S. E., Seguin, T. J., Guan, Y. & Doney, A. C. Noncovalent interactions in organocatalysis and the prospect of computational catalyst design. Acc. Chem. Res. 49, 1061–1069 (2016).


  23. Weinhold, F. & Landis, C. R. Natural bond orbitals and extensions of localized bonding concepts. Chem. Educ. Res. Pract. 2, 91–104 (2001).


  24. Llenga, S. & Gryn’ova, G. Matrix of orthogonalized atomic orbital coefficients representation for radicals and ions. J. Chem. Phys. 158, 214116 (2023).


  25. Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115 (2013).


  26. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 31, 6638–6648 (2018).


  27. NVIDIA. MegaMolBART. GitHub https://github.com/NVIDIA/MegaMolBART (2022).

  28. Heid, E. et al. Chemprop: a machine learning package for chemical property prediction. J. Chem. Inf. Model. 64, 9–17 (2024).


  29. Alabugin, I. V. Stereoelectronic Effects: A Bridge Between Structure and Reactivity (Wiley, 2016).

  30. Echenique, P. & Alonso, J. L. A mathematical and computational review of Hartree–Fock SCF methods in quantum chemistry. Mol. Phys. 105, 3057–3098 (2007).


  31. Burke, K. & Wagner, L. O. DFT in a nutshell. Int. J. Quantum. Chem. 113, 96–101 (2013).


  32. Goerigk, L. & Grimme, S. Double-hybrid density functionals. Wiley Interdiscip. Rev. Comput. Mol. Sci. 4, 576–600 (2014).


  33. Kneiding, H. et al. Deep learning metal complex properties with natural quantum graphs. Digital Discovery 2, 618–633 (2023).


  34. Johnson, E. R. et al. Revealing noncovalent interactions. J. Am. Chem. Soc. 132, 6498–6506 (2010).


  35. Axelrod, S. & Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci. Data 9, 185 (2022).


  36. Malinin, A., Prokhorenkova, L. & Ustimenko, A. Uncertainty in gradient boosting via ensembles. Preprint at https://doi.org/10.48550/arXiv.2006.10562 (2020).

  37. Chua, K., Calandra, R., McAllister, R. & Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Proc. 32nd International Conference on Neural Information Processing Systems 4759–4770 (NIPS, 2018).

  38. Goan, E. & Fookes, C. in Case Studies in Applied Bayesian Data Science (eds Mengerson, K. L. et al.) 45–87 (Springer, 2020).

  39. Beluch, W. H., Genewein, T., Nurnberger, A. & Kohler, J. M. The power of ensembles for active learning in image classification. In Proc. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 9368–9377 (IEEE, 2018).

  40. León, I., Alonso, E. R., Cabezas, C., Mata, S. & Alonso, J. L. Unveiling the n→π* interactions in dipeptides. Commun. Chem. 2, 3 (2019).


  41. Newberry, R. W., Bartlett, G. J., VanVeller, B., Woolfson, D. N. & Raines, R. T. Signatures of n→π* interactions in proteins. Protein Sci. 23, 284–288 (2014).


  42. Hodges, J. A. & Raines, R. T. Energetics of an n→π* interaction that impacts protein structure. Org. Lett. 8, 4695–4697 (2006).


  43. Zhou, Y., Morais-Cabral, J. H., Kaufman, A. & MacKinnon, R. Chemistry of ion coordination and hydration revealed by a K+ channel–Fab complex at 2.0 Å resolution. Nature 414, 43–48 (2001).


  44. Bartlett, G. J., Choudhary, A., Raines, R. T. & Woolfson, D. N. n→π* interactions in proteins. Nat. Chem. Biol. 6, 615–620 (2010).


  45. dos Passos Gomes, G. & Alabugin, I. V. Drawing catalytic power from charge separation: stereoelectronic and zwitterionic assistance in the Au(I)-catalyzed Bergman cyclization. J. Am. Chem. Soc. 139, 3406–3416 (2017).


  46. Gomes, G. D. P., Vil’, V., Terent’ev, A. & Alabugin, I. V. Stereoelectronic source of the anomalous stability of bis-peroxides. Chem. Sci. 6, 6783–6791 (2015).


  47. Grabowski, S. J. Tetrel bond–σ-hole bond as a preliminary stage of the SN2 reaction. Phys. Chem. Chem. Phys. 16, 1824–1834 (2014).


  48. Sarazin, Y., Liu, B., Roisnel, T., Maron, L. & Carpentier, J.-F. Discrete, solvent-free alkaline-earth metal cations: metal···fluorine interactions and ROP catalytic activity. J. Am. Chem. Soc. 133, 9069–9087 (2011).


  49. Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 140022 (2014).


  50. Mardirossian, N. & Head-Gordon, M. ωB97M-V: a combinatorially optimized, range-separated hybrid, meta-GGA density functional with VV10 nonlocal correlation. J. Chem. Phys. 144, 214110 (2016).


  51. Shao, Y. et al. Advances in molecular quantum chemistry contained in the Q-Chem 4 program package. Mol. Phys. 113, 184–215 (2015).


  52. Glendening, E. D., Landis, C. R. & Weinhold, F. NBO 7.0: new vistas in localized and delocalized chemical bonding theory. J. Comput. Chem. https://doi.org/10.1002/jcc.25873 (2019).

  53. Ong, S. P. et al. Python Materials Genomics (pymatgen): a robust, open-source python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).


  54. Blau, S., Spotte-Smith, E. W. C., Wood, B., Dwaraknath, S. & Persson, K. Accurate, automated density functional theory for complex molecules using on-the-fly error correction. Preprint at https://doi.org/10.26434/chemrxiv.13076030.v1 (2020).

  55. Mathew, K. et al. Atomate: a high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).


  56. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems Vol. 32 (eds. Wallach, H. et al.) 8024–8035 (Curran Associates, Inc., 2019).

  57. Falcon, W. A. et al. PyTorch Lightning. GitHub https://github.com/PyTorchLightning/pytorch-lightning (2019).

  58. Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. Preprint at https://doi.org/10.48550/arXiv.1903.02428 (2019).

  59. Corso, G., Cavalleri, L., Beaini, D., Liò, P. & Veličković, P. Principal neighbourhood aggregation for graph nets. Preprint at https://doi.org/10.48550/arXiv.2004.05718 (2020).

  60. Li, G., Müller, M., Thabet, A. & Ghanem, B. DeepGCNs: can GCNs go as deep as CNNs? Preprint at https://doi.org/10.48550/arXiv.1904.03751 (2019).

  61. Godwin, J. et al. Simple GNN regularisation for 3D molecular property prediction & beyond. Preprint at https://doi.org/10.48550/arXiv.2106.07971 (2021).

  62. Veličković, P. et al. Graph attention networks. Preprint at https://doi.org/10.48550/arXiv.1710.10903 (2017).

  63. Cai, C. & Wang, Y. A note on over-smoothing for graph neural networks. Preprint at https://doi.org/10.48550/arXiv.2006.13318 (2020).

  64. Boiko, D. et al. Advancing molecular machine learned representations with stereoelectronics-infused molecular graphs. Zenodo https://doi.org/10.5281/zenodo.14393496 (2024).


Acknowledgements

We thank NSF ACCESS (project no. CHE220012), Google Cloud Platform and the NVIDIA Academic Hardware Grant Program (project titled ‘New molecular graph representations in joint feature spaces’) for computational resources. G.G. and D.B. acknowledge financial support from the National Science Foundation Center for Computer-Assisted Synthesis (grant no. 2202693) and a supporting seed grant from X, the moonshot factory (an Alphabet company). G.G. thanks Carnegie Mellon University (CMU) and the departments of chemistry and chemical engineering for startup support. G.G. thanks F. Weinhold (University of Wisconsin, Madison) for the development of NBO and for many discussions about the theory and software. S.M.B. acknowledges financial support from the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under US Department of Energy contract no. DE-AC02-05CH11231. Computational resources for the high-throughput virtual screening and dataset development were provided by the National Energy Research Scientific Computing Center (NERSC), a US Department of Energy Office of Science User Facility under contract no. DE-AC02-05CH11231, and by the Lawrencium computational cluster resource provided by the IT Division at the Lawrence Berkeley National Laboratory (supported by the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy under contract no. DE-AC02-05CH11231). We thank J. Kitchin (CMU Chemical Engineering) and O. Isayev (CMU Chemistry) for their constructive feedback. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the US Department of Energy, Alphabet (and its subsidiaries) or any of the other funding sources.

Author information


Contributions

D.A.B. designed the computational pipeline and implemented SIMG* prediction, the active learning process, the downstream task analysis and the first version of the large-molecule analysis. T.R. implemented the lone pair prediction model and performed the analysis of large-molecule predictions. B.S.-L. advised on the development of the machine learning pipeline and on software development. S.M.B. performed quantum-chemistry calculations and advised on the analysis of NBO data. G.G. designed the concept and performed preliminary studies. S.M.B. and G.G. supervised the project. D.A.B., T.R. and G.G. wrote this manuscript with input from all authors.

Corresponding authors

Correspondence to Samuel M. Blau or Gabe Gomes.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–12.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Boiko, D.A., Reschützegger, T., Sanchez-Lengeling, B. et al. Advancing molecular machine learning representations with stereoelectronics-infused molecular graphs. Nat Mach Intell 7, 771–781 (2025). https://doi.org/10.1038/s42256-025-01031-9


