Background & Summary

Since the seminal work of Hohenberg and Kohn1, density functional theory (DFT) has emerged as the most popular first principles method in quantum chemistry. The rise of DFT can be attributed to its remarkable speed as compared to traditional ab initio quantum chemistry methods. Despite its speed, and the advances in the computational efficiency of DFT including the development of linear scaling DFT2, as well as advances in computing hardware, it is still infeasible to apply DFT to large system (>1000 atoms) and long time scales (>1ns). To overcome this limitation, machine learning interatomic potentials (MLIPs) have been developed to perform atomistic simulations at a precision approaching that of DFT but at a cost closer to that of classical force fields. MLIP codes such as ANI3,4,5,6, CHGNet7, Nequip8, Allegro9, MACE10, FieldSchNet11 NewtonNet12 have gradually paved the way to an increasingly greater level of applicability across the chemical space under sufficiently diverse training data assumptions.

MLIP models heavily rely on high-quality and extensive reference data during their training process. The Chemical Space Project enumerated all feasible organic molecules up to 17 atoms leading to the creation of the GDB-17 databases, encompassing an impressive 166 billion molecules13. QM9 is the gold standard benchmark dataset where various MLIPs have been, comprising of 133,885 ground-state geometries containing up to nine heavy atoms from the GDB-17 database (H, C, N, O, F)14. QM9 provides ample information about each chemical species (energy, harmonic frequencies, dipole moments, polarizabilities, enthalpies, and free energies of atomization) where all properties were computed at the B3LYP/6-31G(2df,p) level of quantum chemistry. Despite the growing amount of open-sourced molecular data, few quantum chemical calculations are carried out in an implicit solvent environment11,15. In this work, we have carried out geometry relaxations to the equilibrium ground state on the QM9 dataset in vacuum, water, tetrahydrofuran and toluene.

N. F. Schmitz et al. showed that higher derivatives are more informative and aid regression models significantly (Fig. 1a and b)16. More recently, S. Fang et al. presented an E(3)-equivariant message passing graph neural network for predicting the phonon modes of the periodic crystals and molecules by evaluating the second derivative matrices of the potential energy surface17. They showed that using higher order training data improved the energy models beyond the lower-order energy and force data. While the Hellmann-Feynman theorem provides an analytical method for calculating for first-order derivative of the energy eigenvalue, the second-order derivative must be calculated using finite difference methods18,19. Carrying out a Taylor expansion of the potential energy surface to the second order, known as the harmonic approximation (Fig. 1a),

$$E({({\overrightarrow{x}}_{i})}_{\alpha })\approx {E}_{0}+\sum _{{({\overrightarrow{x}}_{i})}_{\alpha },{({\overrightarrow{x}}_{j})}_{\beta }}\frac{1}{2}\frac{{{\rm{\partial }}}^{2}E}{{\rm{\partial }}{({\overrightarrow{x}}_{i})}_{\alpha }{\rm{\partial }}{({\overrightarrow{x}}_{j})}_{\beta }},$$
(1)

where α and β are the Euclidean dimensions, i and j are atomic labels, E is the energy, and \(\overrightarrow{x}\) is Euclidean coordinate vector. The equation of motion for atoms close the the equilibrium geometry, governed by the dynamical matrix,

$${D}_{\alpha ,\beta }(i,j)=\frac{1}{\sqrt{{m}_{i}{m}_{j}}}\sum \frac{{{\rm{\partial }}}^{2}E}{{\rm{\partial }}{({\overrightarrow{x}}_{i})}_{\alpha }{\rm{\partial }}{({\overrightarrow{x}}_{j})}_{\beta }},$$
(2)

where m is the atomic mass. Eigenvalue decomposition of the (N × 3 × N × 3) tensor, where N is the number of atoms in the system, results in vectors of normal vibrational modes, and a vector of squared quantized vibrational energies,

$$\sum _{j,\beta }{D}_{\alpha \beta }(i,j){e}_{\beta }^{n}(j)={[{\omega }^{n}]}^{2}{e}_{\beta }^{n}(j)$$
(3)

where \({e}_{\beta }^{n}\) is the eigenvector (or eigenbasis) which represent the principal axes along which the collective motion of the atoms occurs, and ω is the vibrational energy and can be correlated to experimental infrared/Raman spectroscopy. S. Fang et al. also illustrated that an E(3)-equivariant message passing neural network can be fine-tuned by integrating direct second-order Hessian data or vibrational modes into the training loss function, thereby substantially improving vibrational mode property prediction for small molecules and crystals17. Therefore, we have carried out numerical Hessian calculations for over 40,000 samples in each solvent environment to improve the description of the curvature of the potential energy surface at equilibrium. We validate the utility of the higher-order training data by fine-tuning an E(3)-equivariant message passing graph neural network to improve the prediction of molecular vibrational properties. We observed a refinement of stretching modes with a characteristic wavenumber > 400cm−1 after fine-tuning.

Fig. 1
figure 1

a) illustration of the global scalar/energy of a graph/molecule (left) and the vector/forces of the node/atom (right). Below is an illustration of the second-order derivative of the potential energy surface between two nodes/atoms giving a 3 × 3 Hessian matrix. (b) illustrates conditioning a model to a one-dimensional function (black dashed line), where the top plot uses the function value and its gradient to train the model (filled blue line), while the bottom plot also includes the Hessian to train the model (filled orange line) demonstrating an improved approximation with higher-order derivatives. (c) mean absolute error (MAE) between the predicted and calculated vibrational frequencies for a MLIP trained on energy and forces (blue) and a MLIP fine-tuned on Hessian data. The box-plot on the left shows the MAE for the full spectrum between 4000 − 400 cm−1, subsequent box-plots represent characteristic wavenumbers, where 3600 − 2800 cm−1 represents stretching vibrations for C-H, O-H and N-H bonds, 2000 − 1500 cm−1 represents stretching vibrations of shorter double bonds C=C and C=O, 1500 − 400 cm−1 represents the fingerprint region which contains a complex pattern of absorption bands that are specific to each molecule, 400 − 10 cm−1 represents longer range molecular bending and torsional motions.

Methods

DFT Calculations

All DFT calculations were carried out using the NWChem software20. We employed the ωB97X21 functional and the 6-31G* basis set22 to create data compatible with the ANI-1/ANI-1x/ANI-2x datasets. The self-consistent field (SCF) cycle was deemed converged when the changes in total energy and density were less than 10−6eV. All molecules in the set are neutral with multiplicity equal to 1. The Mura-Knowles radial quadrature (with 49 radial points for heavy atoms, and 45 radial points for H) and Lebedev angular quadrature with 434 points for all elements were used in the integration. Structures from the QM9 dataset were optimized in vacuum and in three solvents of different polarities, tetrahydrofuran (THF) (ϵr = 7.6), toluene (ϵr = 2.4), and water (ϵr = 80.0). The continuum solvation model based on density (SMD) was used23 to model the solvation effects. We used the default optimization criteria implemented in NWChem.

It is worth to note that there are advantages and disadvantages associated with the use of the 6-31G* basis set and the wB97X functional. Greater accuracy may be achieved with a larger basis, or the ability to gradually improve results in the future by using a split-valence basis such as Def2-SVP might have been introduced, and we could have used a functional with an improved dispersion model to better describe non-bonded interactions. However, the downside of the approach would have been increased computational expense, and not just because the individual calculations would have been more costly, but also because the Hessian dataset is best used in conjunction with a dataset of single point energies and forces on off-equilibrium geometries. We opted to create a dataset that is completely compatible with a number of existing datasets out there so that it could be combined with such existing datasets to train an accurate foundation model for organic compounds. In the future, similar datasets may be created by using better basis sets and/or density functionals, but it will be necessary to create both a Hessian database and a suitably large dataset of off-equilibrium geometries.

Molecule Selection

The Hessian matrices, vibrational frequencies and modes were computed for a subset of 41,645 molecular geometries in vacuum and three solvents, applying the finite differences method, as implemented in NWChem. These candidate samples were selected from the full QM9 dataset24 of 133,885 using Uniform Manifold Approximation and Projection (UMAP)25 for dimensionality reduction, paired with farthest point sampling to ensure diverse configurations, as illustrated in Fig. 2a (see supplementary for more details).

Fig. 2
figure 2

a) UMAP density plot of the QM9 dataset using SOAP descriptors where contour lines show the density of the whole dataset and the black dots show the 40,000 sampled data-points using farthest point sampling. (be) histograms of the energy differences, eV, from linear regression of atomic energies for sampled ground state configurations in different environments. The energy differences are calculated as deviations from a reference model where the atomic energies of hydrogen, carbon, nitrogen, and oxygen are approximately −16.67, −1035.81, −1489.65, and −2047.05 eV, respectively. In vacuum, (b) the mean energy difference of −0.05 eV and a standard deviation of 1.19 eV, in THF, (c) the mean energy difference is −0.44 eV with a standard 1.17 eV, in toluene, (d) the mean is −0.37 eV and the standard deviation is 1.19 eV, and in water, (e) the mean energy difference is −0.45 eV and the standard deviation is 1.11 eV.

Dataset validation

Convergence testing on the Hessian generation was performed by assessing how the frequencies change as we increase or decrease the parameters for energy, density, and gradient convergence compared to the NWChem defaults. We found that the average change in the frequency was less than 1 cm−1. Similarly we tested the sensitivity to the finite displacement of atoms. In order to remain within the harmonic regime but avoid numerical issues in the second derivatives that may stem from too small displacements, we considered the vales of 0.005, 0.01, and 0.02 in atomic units. The average change in frequency was less than 15 cm−1 which is below the estimated 25 cm−1 overestimation of vibrational frequencies with the dispersion corrected version of the ωB97X functional26, hence we opted for the middle setting of 0.01 a.u. displacement for the Hessian generation. All convergence tests were conducted on a set of 10 molecules randomly selected from the QM9 database.

Data Records

The HessianQM9 dataset is openly available on Figshare27. Cheminformatics data was stored in the Hugging Face dataset format28. For each of the four solvent environments, the data is divided into separate datasets containing vibrational analysis of 41645 optimized geometries. Table 1 details the data for each of the samples, where labels are associated with the QM9 molecule labelling system given by Ramakrishnan et al.14. Analysis of the datasets is displayed in Fig. 2. We note that only molecules containing H, C, N, O were considered. This was because there were only 2163 molecules containing fluorine in the QM9 dataset which was not sufficient to build a good description of the chemical environment for fluorine atoms, and may have led to reduced overall precision of any models trained on our data.

Table 1 Table with labels, descriptions, units, and typing information for the Hessian QM9 dataset.

Technical Validation

The MLIP implemented in this study is structurally equivalent to the NequIP model described by Batzner et al.8 (discussed in the supplementary information). Initially, the model was trained on energy and force data, achieving an average mean absolute error (MAE) of 42.44 cm−1 for vibrational energy predictions on the validation set for energies between 4000-400 cm−1. These vibrational energy predictions were obtained by twice applying autodifferentiation to the message-passing neural network. Figure 1c illustrates the MAE for defined energy ranges, demonstrating that the model’s performance in describing motions with characteristic wavenumbers > 400cm−1 was significantly improved by fine-tuning using higher-order gradients, resulting in an MAE of 9.49 cm−1. While fine-tuning significantly enhanced the accuracy of vibrational energy predictions for higher energy motions, such as stretching, it did not substantially improve the predictions for longer-range bending and torsional motions. In fact, we stumbled across a fundamental limitation in eigenvalue decomposition sensitivity to perturbations in the linear transformation matrix, and showed that small eigenvalues are inherently unstable compared to larger eigenvalues (discussed in the supplementary information).

Vibrations with energies below 400 cm−1 typically correspond to large-amplitude motions in flexible molecules or torsional motions, which are often less informative for identifying specific functional groups in organic molecules. Higher energy vibrations (above 400 cm−1) are more likely to be associated with stretching and bending modes of bonds within molecules, which provide more distinct and useful information for chemical analysis. Therefore, we can assume that the numerically calculated low-energy vibrational modes are not only inherently sensitive, but also not particularly useful for characterizing organic molecules.

An important question to consider is whether the use of numerical derivatives of forces for the generation of Hessians introduces any artifacts into our model. We would not expect any such artifacts to be any larger than the general inaccuracy of DFT vibrational frequencies in organic molecules which requires force constant scaling schemes to account for, however, to truly explore whether the predictions may be more accurate using analytical derivatives one would need to generate a new dataset using analytical derivatives. Similarly, alternative, smoother cavity models may likewise change the predictions, but to what extent would need to be tested by generating a control dataset with a smoothed cavity model. Both of these are beyond the scope of this work.

While ground-state Hessians are invaluable for improving predictions related to molecular vibrations and spectroscopy, transition state Hessians play a critical role in modeling reaction kinetics. Transition state Hessians capture the curvature of the potential energy surface near saddle points, providing accurate kinetic parameters for reaction rates. Incorporating TS Hessians into datasets such as T1x would enhance the accuracy of machine learning interatomic potentials in reproducing the dynamics of reactive systems29,30.

Table 2 illustrates the MAE for the MLIP fine-tuned with Hessian data collected using implicit solvent models as detailed in the Methods Section. The results indicate a relatively good consistency across different solvent environments. Using an implicit solvent in DFT calculations adds significant computational cost, and this dataset highlights the potential for reducing these costs by employing a MLIP fine-tuned on a subset of expensive DFT calculations. Overall, our results affirm the effectiveness of higher-order gradient methods for improving vibrational energy predictions and provide insights into the computational trade-offs involved in more advanced model fine-tuning techniques.

Table 2 Mean absolute error for energy, forces, Hessian, and vibrational energy (4000-400 cm−1) on the validation set for the e3x model fine-tuned with Hessian data in different implicit solvents.