Hessian QM9: A quantum chemistry database of molecular Hessians in implicit solvents

Williams, Nicholas J.; Kabalan, Lara; Stojanovic, Ljiljana; Zólyomi, Viktor; Pyzer-Knapp, Edward O.

doi:10.1038/s41597-024-04361-2

Download PDF

Data Descriptor
Open access
Published: 03 January 2025

Hessian QM9: A quantum chemistry database of molecular Hessians in implicit solvents

Scientific Data volume 12, Article number: 9 (2025) Cite this article

4403 Accesses
5 Citations
Metrics details

Subjects

Abstract

A significant challenge in computational chemistry is developing approximations that accelerate ab initio methods while preserving accuracy. Machine learning interatomic potentials (MLIPs) have emerged as a promising solution for constructing atomistic potentials that can be transferred across different molecular and crystalline systems. Most MLIPs are trained only on energies and forces in vacuum, while an improved description of the potential energy surface could be achieved by including the curvature of the potential energy surface. We present Hessian QM9, the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the ωB97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as water, tetrahydrofuran, and toluene using an implicit solvation model. To demonstrate the utility of this dataset, we show that incorporating second derivatives of the potential energy surface into the loss function of a MLIP significantly improves the prediction of vibrational frequencies in all solvent environments, thus making this dataset extremely useful for studying organic molecules in realistic solvent environments for experimental characterization.

Learning together: Towards foundation models for machine learning interatomic potentials with meta-learning

Article Open access 17 July 2024

QM9star, two Million DFT-computed Equilibrium Structures for Ions and Radicals with Atomic Information

Article Open access 21 October 2024

WS22 database, Wigner Sampling and geometry interpolation for configurationally diverse molecular datasets

Article Open access 15 February 2023

Background & Summary

Since the seminal work of Hohenberg and Kohn¹, density functional theory (DFT) has emerged as the most popular first principles method in quantum chemistry. The rise of DFT can be attributed to its remarkable speed as compared to traditional ab initio quantum chemistry methods. Despite its speed, and the advances in the computational efficiency of DFT including the development of linear scaling DFT², as well as advances in computing hardware, it is still infeasible to apply DFT to large system (>1000 atoms) and long time scales (>1ns). To overcome this limitation, machine learning interatomic potentials (MLIPs) have been developed to perform atomistic simulations at a precision approaching that of DFT but at a cost closer to that of classical force fields. MLIP codes such as ANI^3,4,5,6, CHGNet⁷, Nequip⁸, Allegro⁹, MACE¹⁰, FieldSchNet¹¹ NewtonNet¹² have gradually paved the way to an increasingly greater level of applicability across the chemical space under sufficiently diverse training data assumptions.

MLIP models heavily rely on high-quality and extensive reference data during their training process. The Chemical Space Project enumerated all feasible organic molecules up to 17 atoms leading to the creation of the GDB-17 databases, encompassing an impressive 166 billion molecules¹³. QM9 is the gold standard benchmark dataset where various MLIPs have been, comprising of 133,885 ground-state geometries containing up to nine heavy atoms from the GDB-17 database (H, C, N, O, F)¹⁴. QM9 provides ample information about each chemical species (energy, harmonic frequencies, dipole moments, polarizabilities, enthalpies, and free energies of atomization) where all properties were computed at the B3LYP/6-31G(2df,p) level of quantum chemistry. Despite the growing amount of open-sourced molecular data, few quantum chemical calculations are carried out in an implicit solvent environment^11,15. In this work, we have carried out geometry relaxations to the equilibrium ground state on the QM9 dataset in vacuum, water, tetrahydrofuran and toluene.

N. F. Schmitz et al. showed that higher derivatives are more informative and aid regression models significantly (Fig. 1a and b)¹⁶. More recently, S. Fang et al. presented an E(3)-equivariant message passing graph neural network for predicting the phonon modes of the periodic crystals and molecules by evaluating the second derivative matrices of the potential energy surface¹⁷. They showed that using higher order training data improved the energy models beyond the lower-order energy and force data. While the Hellmann-Feynman theorem provides an analytical method for calculating for first-order derivative of the energy eigenvalue, the second-order derivative must be calculated using finite difference methods^18,19. Carrying out a Taylor expansion of the potential energy surface to the second order, known as the harmonic approximation (Fig. 1a),

$$E({({\overrightarrow{x}}_{i})}_{\alpha })\approx {E}_{0}+\sum _{{({\overrightarrow{x}}_{i})}_{\alpha },{({\overrightarrow{x}}_{j})}_{\beta }}\frac{1}{2}\frac{{{\rm{\partial }}}^{2}E}{{\rm{\partial }}{({\overrightarrow{x}}_{i})}_{\alpha }{\rm{\partial }}{({\overrightarrow{x}}_{j})}_{\beta }},$$

(1)

where α and β are the Euclidean dimensions, i and j are atomic labels, E is the energy, and $\overrightarrow{x}$ is Euclidean coordinate vector. The equation of motion for atoms close the the equilibrium geometry, governed by the dynamical matrix,

$${D}_{\alpha ,\beta }(i,j)=\frac{1}{\sqrt{{m}_{i}{m}_{j}}}\sum \frac{{{\rm{\partial }}}^{2}E}{{\rm{\partial }}{({\overrightarrow{x}}_{i})}_{\alpha }{\rm{\partial }}{({\overrightarrow{x}}_{j})}_{\beta }},$$

(2)

where m is the atomic mass. Eigenvalue decomposition of the (N × 3 × N × 3) tensor, where N is the number of atoms in the system, results in vectors of normal vibrational modes, and a vector of squared quantized vibrational energies,

$$\sum _{j,\beta }{D}_{\alpha \beta }(i,j){e}_{\beta }^{n}(j)={[{\omega }^{n}]}^{2}{e}_{\beta }^{n}(j)$$

(3)

where ${e}_{\beta }^{n}$ is the eigenvector (or eigenbasis) which represent the principal axes along which the collective motion of the atoms occurs, and ω is the vibrational energy and can be correlated to experimental infrared/Raman spectroscopy. S. Fang et al. also illustrated that an E(3)-equivariant message passing neural network can be fine-tuned by integrating direct second-order Hessian data or vibrational modes into the training loss function, thereby substantially improving vibrational mode property prediction for small molecules and crystals¹⁷. Therefore, we have carried out numerical Hessian calculations for over 40,000 samples in each solvent environment to improve the description of the curvature of the potential energy surface at equilibrium. We validate the utility of the higher-order training data by fine-tuning an E(3)-equivariant message passing graph neural network to improve the prediction of molecular vibrational properties. We observed a refinement of stretching modes with a characteristic wavenumber > 400cm⁻¹ after fine-tuning.

Methods

DFT Calculations

All DFT calculations were carried out using the NWChem software²⁰. We employed the ωB97X²¹ functional and the 6-31G* basis set²² to create data compatible with the ANI-1/ANI-1x/ANI-2x datasets. The self-consistent field (SCF) cycle was deemed converged when the changes in total energy and density were less than 10⁻⁶eV. All molecules in the set are neutral with multiplicity equal to 1. The Mura-Knowles radial quadrature (with 49 radial points for heavy atoms, and 45 radial points for H) and Lebedev angular quadrature with 434 points for all elements were used in the integration. Structures from the QM9 dataset were optimized in vacuum and in three solvents of different polarities, tetrahydrofuran (THF) (ϵ_r = 7.6), toluene (ϵ_r = 2.4), and water (ϵ_r = 80.0). The continuum solvation model based on density (SMD) was used²³ to model the solvation effects. We used the default optimization criteria implemented in NWChem.

It is worth to note that there are advantages and disadvantages associated with the use of the 6-31G* basis set and the wB97X functional. Greater accuracy may be achieved with a larger basis, or the ability to gradually improve results in the future by using a split-valence basis such as Def2-SVP might have been introduced, and we could have used a functional with an improved dispersion model to better describe non-bonded interactions. However, the downside of the approach would have been increased computational expense, and not just because the individual calculations would have been more costly, but also because the Hessian dataset is best used in conjunction with a dataset of single point energies and forces on off-equilibrium geometries. We opted to create a dataset that is completely compatible with a number of existing datasets out there so that it could be combined with such existing datasets to train an accurate foundation model for organic compounds. In the future, similar datasets may be created by using better basis sets and/or density functionals, but it will be necessary to create both a Hessian database and a suitably large dataset of off-equilibrium geometries.

Molecule Selection

The Hessian matrices, vibrational frequencies and modes were computed for a subset of 41,645 molecular geometries in vacuum and three solvents, applying the finite differences method, as implemented in NWChem. These candidate samples were selected from the full QM9 dataset²⁴ of 133,885 using Uniform Manifold Approximation and Projection (UMAP)²⁵ for dimensionality reduction, paired with farthest point sampling to ensure diverse configurations, as illustrated in Fig. 2a (see supplementary for more details).

Dataset validation

Convergence testing on the Hessian generation was performed by assessing how the frequencies change as we increase or decrease the parameters for energy, density, and gradient convergence compared to the NWChem defaults. We found that the average change in the frequency was less than 1 cm⁻¹. Similarly we tested the sensitivity to the finite displacement of atoms. In order to remain within the harmonic regime but avoid numerical issues in the second derivatives that may stem from too small displacements, we considered the vales of 0.005, 0.01, and 0.02 in atomic units. The average change in frequency was less than 15 cm⁻¹ which is below the estimated 25 cm⁻¹ overestimation of vibrational frequencies with the dispersion corrected version of the ωB97X functional²⁶, hence we opted for the middle setting of 0.01 a.u. displacement for the Hessian generation. All convergence tests were conducted on a set of 10 molecules randomly selected from the QM9 database.

Data Records

The HessianQM9 dataset is openly available on Figshare²⁷. Cheminformatics data was stored in the Hugging Face dataset format²⁸. For each of the four solvent environments, the data is divided into separate datasets containing vibrational analysis of 41645 optimized geometries. Table 1 details the data for each of the samples, where labels are associated with the QM9 molecule labelling system given by Ramakrishnan et al.¹⁴. Analysis of the datasets is displayed in Fig. 2. We note that only molecules containing H, C, N, O were considered. This was because there were only 2163 molecules containing fluorine in the QM9 dataset which was not sufficient to build a good description of the chemical environment for fluorine atoms, and may have led to reduced overall precision of any models trained on our data.

Table 1 Table with labels, descriptions, units, and typing information for the Hessian QM9 dataset.

Full size table

Technical Validation

The MLIP implemented in this study is structurally equivalent to the NequIP model described by Batzner et al.⁸ (discussed in the supplementary information). Initially, the model was trained on energy and force data, achieving an average mean absolute error (MAE) of 42.44 cm⁻¹ for vibrational energy predictions on the validation set for energies between 4000-400 cm⁻¹. These vibrational energy predictions were obtained by twice applying autodifferentiation to the message-passing neural network. Figure 1c illustrates the MAE for defined energy ranges, demonstrating that the model’s performance in describing motions with characteristic wavenumbers > 400cm⁻¹ was significantly improved by fine-tuning using higher-order gradients, resulting in an MAE of 9.49 cm⁻¹. While fine-tuning significantly enhanced the accuracy of vibrational energy predictions for higher energy motions, such as stretching, it did not substantially improve the predictions for longer-range bending and torsional motions. In fact, we stumbled across a fundamental limitation in eigenvalue decomposition sensitivity to perturbations in the linear transformation matrix, and showed that small eigenvalues are inherently unstable compared to larger eigenvalues (discussed in the supplementary information).

Vibrations with energies below 400 cm⁻¹ typically correspond to large-amplitude motions in flexible molecules or torsional motions, which are often less informative for identifying specific functional groups in organic molecules. Higher energy vibrations (above 400 cm⁻¹) are more likely to be associated with stretching and bending modes of bonds within molecules, which provide more distinct and useful information for chemical analysis. Therefore, we can assume that the numerically calculated low-energy vibrational modes are not only inherently sensitive, but also not particularly useful for characterizing organic molecules.

An important question to consider is whether the use of numerical derivatives of forces for the generation of Hessians introduces any artifacts into our model. We would not expect any such artifacts to be any larger than the general inaccuracy of DFT vibrational frequencies in organic molecules which requires force constant scaling schemes to account for, however, to truly explore whether the predictions may be more accurate using analytical derivatives one would need to generate a new dataset using analytical derivatives. Similarly, alternative, smoother cavity models may likewise change the predictions, but to what extent would need to be tested by generating a control dataset with a smoothed cavity model. Both of these are beyond the scope of this work.

While ground-state Hessians are invaluable for improving predictions related to molecular vibrations and spectroscopy, transition state Hessians play a critical role in modeling reaction kinetics. Transition state Hessians capture the curvature of the potential energy surface near saddle points, providing accurate kinetic parameters for reaction rates. Incorporating TS Hessians into datasets such as T1x would enhance the accuracy of machine learning interatomic potentials in reproducing the dynamics of reactive systems^29,30.

Table 2 illustrates the MAE for the MLIP fine-tuned with Hessian data collected using implicit solvent models as detailed in the Methods Section. The results indicate a relatively good consistency across different solvent environments. Using an implicit solvent in DFT calculations adds significant computational cost, and this dataset highlights the potential for reducing these costs by employing a MLIP fine-tuned on a subset of expensive DFT calculations. Overall, our results affirm the effectiveness of higher-order gradient methods for improving vibrational energy predictions and provide insights into the computational trade-offs involved in more advanced model fine-tuning techniques.

Table 2 Mean absolute error for energy, forces, Hessian, and vibrational energy (4000-400 cm⁻¹) on the validation set for the e3x model fine-tuned with Hessian data in different implicit solvents.

Full size table

Code availability

The E(3)-equivariant message passing neural network model used to predict vibrational properties is available at https://github.com/google-research/e3x.

References

Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Physical review 136, B864 (1964).
Article ADS MathSciNet MATH Google Scholar
Skylaris, C.-K., Haynes, P. D., Mostofi, A. A. & Payne, M. C. Introducing onetep: Linear-scaling density functional simulations on parallel computers. The Journal of chemical physics122 (2005).
Gao, X., Ramezanghorbani, F., Isayev, O., Smith, J. S. & Roitberg, A. E. Torchani: A free and open source pytorch-based deep learning implementation of the ani neural network potentials. Journal of chemical information and modeling 60, 3408–3415 (2020).
Article CAS PubMed Google Scholar
Smith, J. S. et al. The ani-1ccx and ani-1x data sets, coupled-cluster and density functional theory properties for molecules. Scientific data 7, 134 (2020).
Article CAS PubMed PubMed Central MATH Google Scholar
Devereux, C. et al. Extending the applicability of the ani deep learning molecular potential to sulfur and halogens. Journal of Chemical Theory and Computation 16, 4192–4202 (2020).
Article CAS PubMed MATH Google Scholar
Smith, J. S., Isayev, O. & Roitberg, A. E. Ani-1: an extensible neural network potential with dft accuracy at force field computational cost. Chemical science 8, 3192–3203 (2017).
Article CAS PubMed PubMed Central Google Scholar
Deng, B. et al. CHGNet: Pretrained universal neural network potential for charge-informed atomistic modeling. Nature Machine Intelligence 5, 1031–1041 (2023).
Batzner, S. et al. E (3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature communications 13, 2453 (2022).
Article ADS CAS PubMed PubMed Central MATH Google Scholar
Musaelian, A. et al. Learning local equivariant representations for large-scale atomistic dynamics. Nature Communications 14, 579 (2023).
Article ADS CAS PubMed PubMed Central MATH Google Scholar
Batatia, I., Kovacs, D. P., Simm, G., Ortner, C. & Csányi, G. Mace: Higher order equivariant message passing neural networks for fast and accurate force fields. Advances in Neural Information Processing Systems 35, 11423–11436 (2022).
MATH Google Scholar
Gastegger, M., Schütt, K. T. & Müller, K.-R. Machine learning of solvent effects on molecular spectra and reactions. Chemical science 12, 11473–11483 (2021).
Article CAS PubMed PubMed Central MATH Google Scholar
Haghighatlari, M. et al. Newtonnet: a newtonian message passing network for deep learning of interatomic potentials and forces. Digital Discovery 1, 333–343 (2022).
Article CAS PubMed PubMed Central Google Scholar
Reymond, J.-L. The chemical space project. Accounts of Chemical Research 48, 722–730 (2015).
Article CAS PubMed MATH Google Scholar
Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data 1, 1–7 (2014).
Article MATH Google Scholar
Ward, L. et al. Graph-based approaches for predicting solvation energy in multiple solvents: open datasets and machine learning models. The Journal of Physical Chemistry A 125, 5990–5998 (2021).
Article ADS CAS PubMed MATH Google Scholar
Schmitz, N. F., Müller, K.-R. & Chmiela, S. Algorithmic differentiation for automated modeling of machine learned force fields. The Journal of Physical Chemistry Letters 13, 10183–10189 (2022).
Article CAS PubMed PubMed Central MATH Google Scholar
Fang, S., Geiger, M., Checkelsky, J. & Smidt, T. Phonon predictions with e(3)-equivariant graph neural networks. In AI for Accelerated Materials Design - NeurIPS 2023 Workshop (2023).
Feynman, R. P. Forces in molecules. Physical review 56, 340 (1939).
Article ADS CAS MATH Google Scholar
Komornicki, A. & Fitzgerald, G. Molecular gradients and hessians implemented in density functional theory. The Journal of chemical physics 98, 1398–1421 (1993).
Article ADS CAS MATH Google Scholar
Apra, E. et al. Nwchem: Past, present, and future. The Journal of chemical physics152 (2020).
Chai, J.-D. & Head-Gordon, M. Systematic optimization of long-range corrected hybrid density functionals. The Journal of chemical physics128 (2008).
Petersson, a et al. A complete basis set model chemistry. i. the total energies of closed-shell atoms and hydrides of the first-row elements. The Journal of chemical physics 89, 2193–2218 (1988).
Article ADS CAS MATH Google Scholar
Marenich, A. V., Cramer, C. J. & Truhlar, D. G. Universal solvation model based on solute electron density and on a continuum model of the solvent defined by the bulk dielectric constant and atomic surface tensions. The Journal of Physical Chemistry B 113, 6378–6396 (2009).
Article CAS PubMed MATH Google Scholar
Setianto, S., Panatarani, C., Singh, D. & Joni, I. M. Semi-empirical infrared spectra simulation of pyrene-like molecules insight for simple analysis of functionalization graphene quantum dots. Scientific Reports 13, 2282 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
Zapata Trujillo, J. C. & McKemmish, L. K. Meta-analysis of uniform scaling factors for harmonic frequency calculations. Wiley Interdisciplinary Reviews: Computational Molecular Science12 (2022).
Williams, N. J., Kabalan, L., Stojanovic, L., Zolyomi, V. & Pyzer-Knapp, E. O. Hessian qm9: A quantum chemistry database of molecular hessians in implicit solvents. Figshare https://doi.org/10.6084/m9.figshare.26363959 (2024).
Lhoest, Q. et al. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846 (2021).
Schreiner, M., Bhowmik, A., Vegge, T., Busk, J. & Winther, O. Transition1x-a dataset for building generalizable reactive machine learning potentials. Scientific Data 9, 779 (2022).
Article PubMed PubMed Central Google Scholar
Yuan, E. C.-Y. et al. Analytical ab initio hessian from a deep learning potential for transition state optimization. Nature Communications 15, 8865 (2024).
Article CAS PubMed PubMed Central MATH Google Scholar

Download references

Acknowledgements

This work was supported by the Hartree National Centre for Digital Innovation, a collaboration between STFC and IBM.

Author information

Authors and Affiliations

IBM Research, Hursley, SO21 2JN, UK
Nicholas J. Williams & Edward O. Pyzer-Knapp
Department of Materials, Imperial College London, Exhibition Road, London, SW7 2AZ, UK
Nicholas J. Williams
Hartree Centre, Science and Technology Facilities Council, Daresbury Laboratory, Daresbury, WA4 4AB, UK
Lara Kabalan, Ljiljana Stojanovic & Viktor Zólyomi

Authors

Nicholas J. Williams
View author publications
Search author on:PubMed Google Scholar
Lara Kabalan
View author publications
Search author on:PubMed Google Scholar
Ljiljana Stojanovic
View author publications
Search author on:PubMed Google Scholar
Viktor Zólyomi
View author publications
Search author on:PubMed Google Scholar
Edward O. Pyzer-Knapp
View author publications
Search author on:PubMed Google Scholar

Contributions

N.J.W. carried out sample selection from the QM9 dataset, as well as creating, training and validating the E(3)-equivariant graph neural network MLIP. V.Z. supervised the high-throughput generation of DFT data. N.J.W., L.K., L.S., and V.Z. generated the data. L.K. and L.S. performed convergence testing and data validation. E.P.K. helped to conceive the project, aided with algorithms for sample selection. N.J.W., L.K., L.S., and V.Z. contributed equally to preparing the manuscript.

Corresponding author

Correspondence to Nicholas J. Williams.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Williams, N.J., Kabalan, L., Stojanovic, L. et al. Hessian QM9: A quantum chemistry database of molecular Hessians in implicit solvents. Sci Data 12, 9 (2025). https://doi.org/10.1038/s41597-024-04361-2

Download citation

Received: 15 August 2024
Accepted: 20 December 2024
Published: 03 January 2025
DOI: https://doi.org/10.1038/s41597-024-04361-2