Introduction

One of the most impactful applications of artificial intelligence methods to the field of materials science has been the introduction of machine learning interatomic potentials (MLIPs)1,2,3,4. These are by now capable of delivering energies and forces at the level of density functional theory (DFT), or beyond, at a computational cost that is often several orders of magnitude lower. As such, they are now accelerating or even replacing the expensive DFT calculations, truly enabling the in-silico design and development of complex materials.

Various representation methods for crystal structures (embedding techniques) have been proposed in the past years5,6,7,8,9. These methods encode crystal structure information into learnable features, thereby improving data efficiency for the models. Furthermore, new machine learning models and strategies were developed and improved. These advancements gained significant momentum with the introduction of message passing neural network frameworks10, which were later enhanced by incorporating continuous-filter convolutions for message passing11. Message passing addressed the issue of exponentially expanding descriptor sizes in earlier machine learning models, enabling the prediction of much larger and more complex systems.

The training of MLIPs has also been facilitated by the continuous accumulation of DFT calculations over the decades, and by the creation of comprehensive databases, such as the Materials Project12, the Open Quantum Materials Database13, Aflowlib14, Alexandria15, NOMAD16, etc. These databases contain materials with almost all chemical elements and in all types of crystal structures. They also provide a variety of computed properties, including total energies, forces, and stresses, not only for compounds at dynamical equilibrium, but also along geometry optimization or molecular dynamics paths.

Until recently, MLIPs were typically trained for specific chemical systems and were often limited to a narrow range of geometries and atomic arrangements. This paradigm shifted in 2019 with the introduction of the Materials Graph Network (MEGNet)17, a framework designed for universal machine learning in materials science. Universal MLIPs (uMLIPs) are foundational models capable of handling all chemistries and crystal structures. MEGNet already demonstrated relatively low prediction errors across a wide array of properties in both molecules and crystals. Its performance was significantly enhanced by incorporating atomic coordinates, lattice vectors in crystals, and 3-body interactions18, enabling uMLIPs to predict ground-state geometries with a mean absolute error of 0.035 eV/atom in the energy, when compared to DFT. Further advancements, such as the use of higher-order body messages, have resulted in models that are accurate, fast, and highly parallelizable19. Since then, there has been a surge of developments, with new and improved models being published at an almost monthly rate20,21,22,23,24.

In spite of the rapid progress of uMLIPs, challenges remain. Since these models are mostly trained and evaluated on existing datasets12,15,23, containing mainly equilibrium or near-equilibrium geometries, they struggle to reproduce metastable or highly distorted structures25. To resolve this problem, further information on off-equilibrium structures from molecular dynamics simulations can be used26. Alternatively, by gradually distorting the optimized geometries, one can step away from the minima of the potential energy surface27. Models trained on such augmented datasets show superior performance at predicting equilibrium structures and energies27. Moreover, compared to those trained without off-equilibrium data, models trained with augmented datasets perform better at predicting the first derivatives of the energy27. While multiple evaluations of uMLIPs can be found in the literature28,29,30, direct phonon prediction capabilities have not been comprehensively characterized.

Here we benchmark seven uMLIP models, specifically M3GNet, CHGNet, MACE-MP-0, SevenNet-0, MatterSim-v1, ORB, and eqV2-M, for the calculation of phonon properties. These properties are obtained from the second derivatives (i.e., the curvature) of the potential energy surface, and therefore sample a small neighborhood around the dynamically stable minima. Phonons are extremely important in materials science, as they are fundamental in determining the free energy (and therefore thermodynamic stability), dynamical stability, thermal properties, etc. We note that all seven models are also featured in the Matbench Discovery leaderboard17,19,20,21,23,25,27,31 (ranked 12th, 11th, 10th, 8th, 3rd, 2nd, and 1st, respectively, at the time of writing).

M3GNet17 is one of the pioneering uMLIPs and still remains a key model in the field. It employs three-body interactions and incorporates atomic positions, enabling the calculation of forces through the automatic differentiation of the neural network. CHGNet23 is another of the earlier models, but it still demonstrates excellent performance while having one of the smallest architectures with just over 400 thousand parameters. MACE-MP-019 utilizes the atomic cluster expansion32 as a local descriptor, reducing the number of necessary message-passing steps while maintaining efficiency. SevenNet-021, built upon NequIP8, focuses on parallelizing the message-passing process. This approach preserves NequIP’s data efficiency, accuracy, and equivariant character. MatterSim-v131 builds upon M3GNet, leveraging active learning and efficient sampling across the chemical space. Its goal is to enhance the accuracy of energy and force predictions over a broader range of scenarios while maintaining a straightforward architecture that is easy to fine-tune. The ORB model20 combines the smooth overlap of atomic positions6 with a graph network simulator33. Finally, eqV2-M27 uses the model developed in ref. 22, which utilizes equivariant transformers to achieve higher-order equivariant representations. An important detail to note is that the ORB and eqV2-M models predict forces as a separate output rather than deriving them as energy gradients, as the other five models do.

Results

Dataset and its properties

To benchmark phonon properties we use the dataset developed in the MDR database34. This dataset includes around 10,000 non-magnetic semiconductors, covering a wide range of elements across the periodic table. Moreover, the phonon calculations were performed with VASP, ensuring a high degree of compatibility with the training sets used in the construction of the uMLIPs. Unfortunately, this phonon dataset was originally constructed with the Perdew-Burke-Ernzerhof (PBE) for solids (PBEsol)35 approximation to the exchange-correlation functional. This is certainly a very reasonable choice, as the PBEsol functional exhibits superior structural36,37 and phonon38 properties when compared to the standard PBE39. However, as all uMLIPs were trained on PBE data, a direct comparison to PBEsol phonons can be ambiguous. To mitigate this problem, we recalculated the entire phonon dataset from ref. 34 with the PBE functional (see Methods). In the following, we not only present comparisons of uMLIP calculations with PBE data, but we also include the difference between PBE and PBEsol. This gives us an estimate of the variability of the results as a function of the approximation to the exchange-correlation functional, which we use as an absolute scale to assess the quality of the uMLIPs.

As illustrated in Fig. 1a, the dataset contains mostly ternary and quaternary compounds. Additionally, we observe that the majority of the compounds belong to the monoclinic and orthorhombic crystal systems, followed by approximately equal proportions of trigonal and tetragonal systems. Cubic systems are less common, with hexagonal systems representing the smallest proportion. Ultimately, these characteristics are inherited from the Materials Project database12 and the Inorganic Crystal Structure Database (ICSD)40. Finally, triclinic systems are absent from the MDR database, likely because of the extra computational cost that arises from the reduced symmetry.

Fig. 1: Summary of data diversity for the benchmarking dataset.

Distribution over (a) the number of different chemical elements per unit cell, (b) crystal systems, and (c) band gaps calculated with the PBE functional for all the materials in the dataset.

In Fig. 2 we plot the frequency of the chemical elements in the dataset. We can see that almost all of the periodic table is well represented (with a few exceptions such as Tc, which is radioactive, or Eu and Gd, for which VASP has convergence problems). We also observe a significant abundance of structures containing oxygen. However, certain compounds, such as those containing Mo and W, as well as the magnetic 3d elements (from V to Ni), are underrepresented. These biases in the MDR database34 are also, to some extent, inherited from the Materials Project database12, but should not be relevant for the benchmark we present here. Although the dataset is dominated by oxides, the band gaps of the whole set still cover a large range, as illustrated in Fig. 1c.

Fig. 2: Periodic tables showing the frequency of the chemical elements in the structures from the dataset.

Elements in gray are absent from the dataset.

Relative performances of uMLIPs

We start by discussing the errors in the geometry relaxations, as shown in Table 1 and Fig. 3. The “Failed” column in Table 1 indicates for how many systems a model failed to converge the forces to below 0.005 eV/Å. We can see that the CHGNet and MatterSim-v1 models appear to be the most reliable, with 0.09% and 0.10% unconverged structures, respectively. The M3GNet, SevenNet-0 and MACE-MP-0 models have a similar number of unconverged structures, while the ORB and eqV2-M models exhibit a much larger failure rate. The most unreliable model for this dataset is eqV2-M, for which 0.85% of the structural calculations were unable to converge. In general, there are two main reasons behind the failures: either the geometry optimization path explored regions of the potential energy surface where the uMLIP yielded unphysical forces, or there were high-frequency errors in the forces that prevented the relaxation algorithm from converging to the required precision. This latter reason is behind the very large failure rate for the two models where the forces are not the exact derivatives of the energy. CHGNet shows notably higher error in energy predictions, which is expected given that we did not apply the energy correction procedure typically used during CHGNet’s training.

Table 1 Summary of the errors for energy (E, in meV/atom) and volume (V, in Å³/atom) of the converged relaxation compared to PBE results
Fig. 3: Errors in the volume.

Violin plots of the errors in the volume of the unit cell per atom (∆Volume, in Å³/atom), relative to the PBE reference data.

Looking at Fig. 3 we see that, as expected, PBEsol leads to a contraction of the unit cell, correcting the underbinding that is typical of the PBE approximation. The large majority of the systems show a difference between the PBE and PBEsol volume per atom between 0 and −2 Å³/atom. All uMLIPs exhibit MAE(V) that are smaller than the mean absolute difference between PBE and PBEsol. Among them, the eqV2-M model emerges as the most accurate, closely followed by ORB. Indeed, these two uMLIPs show remarkable performances for the vast majority of the compounds in the dataset, with errors that are quite small in both absolute and relative terms. MatterSim-v1 and SevenNet-0 show solid performances, although with mean errors four times larger than the two best models. Finally, M3GNet, MACE-MP-0, and CHGNet have wider error distributions, with MAE in the range of 0.4–0.5 Å³/atom. These results confirm that both eqV2-M and ORB are the best models for geometry optimization, and that they can already be used to essentially replace DFT calculations for this task.

We now turn our attention to phonon-related properties. We chose to look at the maximum phonon frequency (reported in Kelvin, with 1 K = 0.695 cm⁻¹), the phonon density of states (DOS), the average sound velocity over the three acoustic branches, the vibrational entropy, the Helmholtz free energy, and the heat capacity at constant volume, the last three calculated at a temperature of 300 K. The maximum phonon frequency allows us to detect systematic errors in the prediction of the concavity of the potential energy surface, which is especially important as it is well known that some uMLIPs have the tendency to yield phonons that are too soft. The phonon DOS provides information regarding the general prediction of phonon modes with respect to frequency, while the sound velocity helps identify errors in the acoustic branches in the vicinity of Γ. It should be noted that for the phonon DOS we remove values below 0.1 states/THz. The vibrational entropy and the Helmholtz free energy are important properties as they are essential to determine thermodynamic stability and phase diagrams as a function of temperature. Finally, the heat capacity is an important thermal property that can be directly measured experimentally.
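As a quick check of the unit convention used for frequencies, the conversion factor between Kelvin and wavenumbers follows from k_B/(hc). The sketch below reproduces it from the exact SI values of the constants; the function names are illustrative and are not taken from any package used in this work.

```python
# Exact SI values of the physical constants (2019 redefinition).
K_B = 1.380649e-23       # Boltzmann constant, J/K
H = 6.62607015e-34       # Planck constant, J s
C_CM = 2.99792458e10     # speed of light, cm/s


def kelvin_to_wavenumber(theta_k):
    """Convert a frequency expressed as a temperature (K) to cm^-1."""
    return theta_k * K_B / (H * C_CM)


def kelvin_to_thz(theta_k):
    """Convert a frequency expressed as a temperature (K) to THz."""
    return theta_k * K_B / H * 1e-12


print(round(kelvin_to_wavenumber(1.0), 3))  # 0.695
print(round(kelvin_to_thz(1.0), 4))         # 0.0208
```

With this convention a maximum frequency of 1000 K corresponds to roughly 695 cm⁻¹, or about 20.8 THz.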

We note that the maximum phonon frequency was calculated from the values at the q-points commensurate with the supercell matrix, whereas the DOS and thermodynamic properties were obtained on a denser q-grid by applying Fourier interpolation (see Methods). However, as the q-grids are consistent across DFT and uMLIP calculations, the interpolation error should be systematic and should not affect the benchmark.

We aggregated the errors for all models in Table 2 and in Fig. 4. We first notice that the deviation between the PBE and PBEsol results is small but not negligible. This observation reinforces the necessity of using a consistent functional between the training and benchmarking stages. The difference between PBE and PBEsol exhibits a rather narrow distribution in all 6 properties, especially when compared to the MAE of most of the uMLIPs. There are also systematic differences: for example, the maximum phonon frequencies in PBEsol are higher than those with PBE, which can be understood by the contraction of the cell and subsequent hardening of the force constants. PBEsol also leads to larger values of the free energy (on average of the order of 10 kJ/mol), and to smaller values of the entropy and the heat capacity.

Table 2 Summary of the mean absolute errors (MAE) for the maximum phonon frequency (MAE(ωmax), in Kelvin where 1 K ≈ 0.695 cm⁻¹), the vibrational entropy (MAE(S), in J/K/mol), the Helmholtz free energy (MAE(F), in kJ/mol), the heat capacity at constant volume (MAE(CV), in J/K/mol), the phonon density of states (MAE(DOS)), and the average of the sound velocities (MAE(avg. vs))
Fig. 4

Violin plots of the errors in (a) the maximum phonon frequency, (b) the vibrational entropy, (c) the Helmholtz free energy, (d) the heat capacity, (e) the density of states and (f) the average sound velocity over the three acoustic branches, relative to the PBE reference data.

Based on the errors we can roughly classify the seven models into three categories. The first contains ORB and eqV2-M, which have very large errors in phonon-related properties (see Fig. 4). In fact, phonon frequencies are grossly underestimated, and are often even imaginary as we will see in the following.

In the second category we have, in increasing order of accuracy, M3GNet, CHGNet, MACE-MP-0 and SevenNet-0 (see Fig. 4). The errors of these models are on average considerably larger than the difference between PBE and PBEsol. Moreover, they all exhibit systematic errors, underestimating the phonon frequencies and the free energy, and overestimating the entropy and the heat capacity. Of the four models, the most accurate is clearly SevenNet-0, while the older M3GNet and CHGNet show the largest errors. In spite of the difference in topologies, these four models are all trained on the same dataset, so it is not surprising that their results are somewhat similar. This again demonstrates that training data is at least as important as the representation of the crystal structure or the topology of the model when developing a uMLIP.

Finally, MatterSim-v1 stands out as the most accurate uMLIP for the calculation of phonons. Not only does it not exhibit any strong systematic error, with all distributions essentially centered at zero, but the dispersion of the errors is also extremely small, leading to values of MAE considerably smaller than the difference between PBE and PBEsol. This indicates that MatterSim-v1 can be used to calculate phonon properties of semiconductors with an accuracy comparable to DFT codes, although at a very small fraction of the computational cost. It is very interesting to note that although MatterSim-v1 is based upon the simple M3GNet, its performance exceeds that of much more complicated models such as SevenNet-0 or eqV2-M that are based on equivariant networks. The key in this case is the scalability of M3GNet, which allows for an increase in the number of parameters and the efficient use of larger amounts of training data.

To have a better understanding of the general behavior of the uMLIPs, we plot in Fig. 5 the distribution of the maximum frequencies predicted. Most compounds have maximum frequencies in the range of 500–2000 K, with a few containing very light elements going up to 5500 K. The softening of the phonon frequencies by M3GNet, CHGNet, MACE-MP-0 and SevenNet-0 is evident, in particular for the first two. ORB and eqV2-M, on the other hand, exhibit completely distorted distributions peaking at zero, showing that the force constants obtained with these models are unphysical.

Fig. 5

Highest frequencies predicted for each structure for all models and from the original PBEsol MDR database.

Another important performance metric is dynamical stability, a crucial stability descriptor utilized by many high-throughput searches of inorganic materials41,42,43,44. A compound is dynamically stable when it is in a true minimum of the potential energy surface and not in a maximum or a saddle point. In practice, it is assured by the absence of imaginary phonon frequencies in the spectrum. Unfortunately, it is well known that numerical inaccuracies often lead to small imaginary frequencies close to the Γ-point. To avoid this problem, we consider a structure to be dynamically stable if frequencies are all real across the Brillouin zone except at Γ where we allow the three acoustic modes to have small imaginary frequencies (with a threshold of −50 K). This criterion was applied to all q-points commensurate with the supercell matrix (but not to the interpolated q-points).

The elements of the confusion matrix, when compared to the PBE, are listed in Table 3. Most compounds that are stable in the PBE are also stable in PBEsol, and vice-versa, with the differences coming mostly from the difficulty associated with small imaginary frequencies as mentioned above. MatterSim-v1 and MACE-MP-0 are the most reliable with a percentage of true positives at 95%. M3GNet, SevenNet-0 and CHGNet are somewhat less accurate, especially regarding the percentage of true positives. Finally, the eqV2-M and ORB models perform very poorly, with more than 80% of the unstable systems being false negatives.

Table 3 Normalized confusion matrix with true stable (TS, in %), false unstable (FU, in %), true unstable (TU, in %), false stable (FS, in %) of predicted dynamical stability for models using PBE data as reference
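For illustration, the four entries of such a confusion matrix can be computed from two lists of stability labels. The sketch below assumes that the normalization is over the total number of compounds; the text does not state the normalization explicitly, so this choice and the function name are assumptions.

```python
def stability_confusion(reference, predicted):
    """Return TS, FU, TU and FS as percentages of all compounds.

    reference and predicted are equally long lists of booleans, where True
    means "dynamically stable". Normalizing by the total number of compounds
    is an assumption of this sketch.
    """
    if not reference or len(reference) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    counts = {"TS": 0, "FU": 0, "TU": 0, "FS": 0}
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            counts["TS"] += 1      # stable in reference and in prediction
        elif ref and not pred:
            counts["FU"] += 1      # stable in reference, predicted unstable
        elif not ref and not pred:
            counts["TU"] += 1      # unstable in both
        else:
            counts["FS"] += 1      # unstable in reference, predicted stable
    n = len(reference)
    return {key: 100.0 * value / n for key, value in counts.items()}


pbe_labels = [True, True, True, False]
model_labels = [True, False, True, True]
print(stability_confusion(pbe_labels, model_labels))
# {'TS': 50.0, 'FU': 25.0, 'TU': 0.0, 'FS': 25.0}
```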

Discussion

We created a dataset that includes phonon properties of almost 10,000 semiconductors obtained with DFT. These calculations were performed with the PBE approximation, the same approximation employed in the datasets used for the training of uMLIPs. This allows us to benchmark, without ambiguities, phonon properties calculated with uMLIPs.

Regarding the equilibrium geometry, ORB and eqV2-M are extremely accurate and convincingly outperform all other models. This can be understood from the fact that the models output both the energy and the forces, and are trained on a very large dataset, leading to very small errors at equilibrium. Regarding phonons, however, the situation is completely different. ORB and eqV2-M yield very low quality phonons, often imaginary. We believe that the reason for this problem is that these models are non-conservative. In fact, contrary to all other models, in ORB and eqV2-M the forces are not calculated by performing the derivative of the energy with respect to the atomic positions, but are output directly by the network. This avoids the costly computational step of evaluating the derivatives through back-propagation, and the extra freedom allows for a more accurate prediction of energy and forces. Unfortunately, it also leads to inevitable errors, especially for the small displacements required for the calculation of phonons. This problematic behavior has also been reported and analyzed in ref. 45. The problem can be alleviated, but is far from resolved, by using larger displacements in the frozen-phonon workflow. Of course, this can lead to further problems, such as the overestimation of the anharmonic contributions.
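A one-dimensional toy model illustrates why a force error that does not shrink with the displacement is so damaging for finite-difference force constants, and why larger displacements reduce its impact. The numbers below are purely illustrative and are not taken from any of the models: the sketch assumes a harmonic spring and a worst-case force error of fixed magnitude with opposite sign on the two sides of equilibrium.

```python
K_TRUE = 5.0   # true force constant in eV/Å^2 (illustrative value)
EPS = 0.002    # magnitude of the force error in eV/Å (illustrative value)


def exact_force(x):
    """Force of a harmonic spring, the exact derivative of its energy."""
    return -K_TRUE * x


def noisy_force(x):
    """Harmonic force plus an error that does not vanish as x -> 0."""
    return -K_TRUE * x + (EPS if x > 0 else -EPS)


def force_constant_fd(force_fn, delta):
    """Central finite-difference estimate of k = -dF/dx at x = 0."""
    return -(force_fn(+delta) - force_fn(-delta)) / (2.0 * delta)


for delta in (0.01, 0.1):
    k_exact = force_constant_fd(exact_force, delta)
    k_noisy = force_constant_fd(noisy_force, delta)
    print(delta, round(k_exact, 6), round(k_noisy, 6))
# 0.01 5.0 4.8
# 0.1 5.0 4.98
```

In this toy setting the error in the force constant is EPS/delta, so it grows as the displacement shrinks. Real models are of course more complicated, and the sketch says nothing about the asymmetry of the force-constant matrix that non-conservative forces can also produce in more than one dimension.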

Phonon properties calculated with MatterSim-v1, and to a lesser extent SevenNet-0, are of very high quality. Other models fare somewhere in between, exhibiting both a larger dispersion of the errors, and systematic deviations with respect to the reference PBE values.

We should note that not only the performance of the models, but also their computational efficiency, should be taken into account when choosing a uMLIP for a specific application. Of the models tested here, M3GNet is by far the fastest, running on a single CPU core more efficiently than any of the other models on a full GPU. At the other extreme we have eqV2-M and MACE-MP-0, convincingly the slowest of the pack, while the rest of the models fall in between.

Our benchmark highlights the importance of considering specific optimization goals for individual metrics and understanding the trade-offs involved. Furthermore, it shows that uMLIPs are ready to be used not only for the calculation of geometries and energies, but also of response properties, which are essential for a variety of material applications. We hope our critical assessment of phonon properties will guide future training efforts and encourage the use of our dataset to further develop uMLIPs.

Methods

Ab initio dataset

To recalculate the MDR dataset34 with the PBE39 exchange-correlation functional, we used the code VASP46,47. We kept all parameters consistent with the MDR dataset, with the exception of the approximation to the exchange-correlation functional, which was changed from PBEsol to PBE. We followed the same workflow as MDR, but before the stringent geometry relaxation we applied a pre-relaxation step with energy and force convergence criteria of 10⁻⁷ eV/cell and 10⁻⁵ eV/Å, respectively. For the stringent relaxation step, in accordance with the MDR calculations, we used tighter energy and force convergence criteria of 10⁻⁸ eV/cell and 10⁻⁸ eV/Å, respectively. Next, the force constants were obtained by applying the finite displacement method as implemented in the PHONOPY Python package48,49.

uMLIP evaluation

For all the uMLIP models, we perform the geometry relaxation and force set calculations starting from the PBE geometry using the Atomic Simulation Environment (ASE)50. To keep the space group symmetry of the PBE structure, we employ the ASE symmetrizer FrechetCellFilter. The structure optimization is done using the fast inertial relaxation engine (FIRE)51, with the force convergence criterion set to 0.005 eV/Å for all models.
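To make the relaxation step more concrete, the following is a minimal pure-Python sketch of a FIRE-style minimizer with the same force threshold. It is not the ASE implementation used in this work: unit masses, a simple semi-implicit Euler integrator, fixed cell, and the default parameter values (commonly quoted for FIRE) are all assumptions, and the quadratic test potential is a toy stand-in for a uMLIP.

```python
import math


def fire_minimize(force_fn, x0, fmax=0.005, dt=0.1, dt_max=1.0, n_min=5,
                  f_inc=1.1, f_dec=0.5, alpha0=0.1, f_alpha=0.99,
                  max_steps=10000):
    """Relax coordinates x0 until every force component is below fmax."""
    x = list(x0)
    v = [0.0] * len(x)
    alpha, n_pos = alpha0, 0
    for step in range(max_steps):
        f = force_fn(x)
        if max(abs(fi) for fi in f) < fmax:
            return x, step, True
        if step > 0:
            power = sum(fi * vi for fi, vi in zip(f, v))
            if power > 0.0:
                # Moving downhill: steer the velocity towards the force.
                f_norm = math.sqrt(sum(fi * fi for fi in f))
                v_norm = math.sqrt(sum(vi * vi for vi in v))
                v = [(1.0 - alpha) * vi + alpha * v_norm * fi / f_norm
                     for vi, fi in zip(v, f)]
                n_pos += 1
                if n_pos > n_min:
                    dt = min(dt * f_inc, dt_max)
                    alpha *= f_alpha
            else:
                # Moving uphill: stop, shrink the time step, reset mixing.
                v = [0.0] * len(x)
                dt *= f_dec
                alpha, n_pos = alpha0, 0
        # Semi-implicit Euler update with unit masses (simplification).
        v = [vi + dt * fi for vi, fi in zip(v, f)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, max_steps, False


def bowl_forces(p):
    """Forces of an anisotropic quadratic bowl (toy potential)."""
    return [-1.0 * p[0], -10.0 * p[1]]


x_min, n_steps, converged = fire_minimize(bowl_forces, [1.0, 0.5], dt_max=0.3)
print(converged, n_steps)
```

When the forces contain high-frequency errors, the power test in the loop keeps triggering resets close to the minimum, which is one way a relaxation can fail to reach a tight force threshold.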

To calculate the thermodynamic properties, i.e. the vibrational entropy, the Helmholtz free energy, and the heat capacity, the phonon density of states is obtained by Fourier interpolation from the coarse calculated q-grid onto a denser 20 × 20 × 20 grid, as in the MDR database. We set a temperature of 300 K to compute the thermodynamic properties.
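For reference, the three thermodynamic quantities follow from the standard harmonic-oscillator expressions. The sketch below evaluates them for an explicit list of mode frequencies given in Kelvin, so that x = θ/T; the weighting over an interpolated q-grid or a DOS, and the per-mole normalization used in the benchmark, are not reproduced here, and the function name is illustrative.

```python
import math

R = 8.314462618  # molar gas constant, J/(K mol)


def harmonic_thermo(mode_temps_k, temperature=300.0):
    """Harmonic F (kJ/mol), S (J/K/mol) and Cv (J/K/mol) summed over modes.

    mode_temps_k holds positive phonon frequencies expressed as temperatures
    theta = hbar * omega / k_B, in K.
    """
    f = s = cv = 0.0
    for theta in mode_temps_k:
        x = theta / temperature
        log_term = math.log1p(-math.exp(-x))   # ln(1 - exp(-x))
        bose = 1.0 / math.expm1(x)             # Bose-Einstein occupation
        f += R * temperature * (0.5 * x + log_term)
        s += R * (x * bose - log_term)
        cv += R * x * x * math.exp(x) * bose * bose
    return f / 1000.0, s, cv


f_kj, s_j, cv_j = harmonic_thermo([500.0, 1200.0], temperature=300.0)
print(f_kj, s_j, cv_j)
```

Soft modes (small θ) lower F and raise S and Cv, which is consistent with the direction of the systematic errors reported for models that underestimate phonon frequencies.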

The phonon density of states has been calculated using the same grid as that employed for the thermal properties. Values below 0.1 states/THz in either the PBE reference or the model prediction were removed. The sound velocity is calculated using group velocities near the Γ point, which are calculated using small q-vectors oriented along each axis (x, y, and z). For each phonon branch, we extract the directional component of the group velocity corresponding to the axis along which the q-vector was oriented, specifically the xx, yy, and zz components. We then calculate the average of these directional components across all acoustic branches to obtain the average sound velocity.
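The averaging described above amounts to picking one Cartesian component per q-direction and taking the mean over the three acoustic branches. A minimal sketch follows; the input layout and the function name are hypothetical, and the components are used as given because the text does not say whether absolute values are taken.

```python
def average_sound_velocity(group_velocities):
    """Average the xx, yy and zz group-velocity components near Gamma.

    group_velocities[axis][branch] is the (vx, vy, vz) group velocity of one
    acoustic branch for a small q-vector along axis 0 (x), 1 (y) or 2 (z).
    """
    total, count = 0.0, 0
    for axis, branches in enumerate(group_velocities):
        for velocity in branches:
            total += velocity[axis]   # component along the q direction
            count += 1
    return total / count


example = [
    [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)],  # q along x
    [(0.0, 4.0, 0.0), (0.0, 5.0, 0.0), (0.0, 6.0, 0.0)],  # q along y
    [(0.0, 0.0, 7.0), (0.0, 0.0, 8.0), (0.0, 0.0, 9.0)],  # q along z
]
print(average_sound_velocity(example))  # 5.0
```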

All models considered in this paper are open source. In Table 4 we list their training set sizes, data sources, and the number of parameters.

Table 4 List of models with number of training data (Ntraining), data source, and number of parameters of the models (Nw)