Abstract
The use of high-dimensional regression techniques from machine learning has significantly improved the quantitative accuracy of interatomic potentials. Atomic simulations can now plausibly target quantitative predictions in a variety of settings, which has brought renewed interest in robust means to quantify uncertainties. In many practical settings where model complexity is constrained (e.g., due to performance considerations), misspecification — the inability of any one choice of model parameters to exactly match all training data — is a key contributor to errors that is often disregarded. Here, we employ a recent misspecification-aware regression technique to quantify parameter uncertainties, which is then propagated to a broad range of phase and defect properties in tungsten. The propagation is performed through both brute-force resampling and implicit Taylor expansion. The propagated misspecification uncertainties robustly quantify and bound errors on a broad range of material properties. We demonstrate application to recent foundational machine learning interatomic potentials, accurately predicting and bounding errors in MACE-MPA-0 energy predictions across the diverse materials project database.
Introduction
Atomic simulations such as molecular dynamics have long provided detailed insight into the nanoscale dynamics of material behavior that would otherwise be extremely difficult to access in experiment. For many years, these insights mostly took the form of mechanistic, qualitative information on the key nanoscale processes that dominate different material behaviors. The focus on qualitative interpretation rather than quantitative prediction was a consequence of strong limitations in the accuracy of interatomic potentials, whose simple functional forms could not, and could not be expected to, deliver quantitative agreement with ab initio reference calculations (typically obtained using Density Functional Theory, DFT) outside the small set of physical properties they were tailored to reproduce.
This outlook has gradually evolved with the introduction of machine learned interatomic potentials (MLIAPs)1,2,3,4,5,6, where expertly-crafted functional forms with a modest number of adjustable parameters were largely superseded by very flexible generic functional forms with a high number of adjustable parameters that can naturally capture very complex and non-intuitive chemical behavior. This increase in model complexity has been supported by a corresponding increase in available computing power7 and by improvements in the ability of electronic structure codes to exploit modern hardware such as GPUs8, making the generation of extremely large training sets increasingly accessible. Furthermore, while some ML representations of atomic environments can in principle be made complete, practical considerations related to inference speed often limit the complexity and expressivity of models that are useful in applications, e.g., by truncating expansions at finite cutoffs or body-order, by limiting the depth, width, or number of message passing layers in neural networks, or by ignoring physically-relevant state variables such as charge, oxidation state, magnetic order, etc. In other words, while the new generation of MLIAPs is much more accurate than conventional IAPs, none of the practical models in use today can be expected to provide a perfect representation of quantum mechanics. As the community increasingly seeks a more favorable compromise between accuracy and execution speed9, the consideration of model errors in uncertainty quantification (UQ) for MLIAPs becomes increasingly urgent.
While this work focuses primarily on misspecification errors, it is important to note that other sources of errors can be important in practice. While well-converged quantum calculations can typically be considered quasi-deterministic, and thus free of significant aleatoric errors, epistemic errors can be significant. Indeed, it is increasingly understood that while MLIAPs can deliver impressive accuracy, their flexibility and comparative lack of built-in physical constraints make the curation of the datasets used to train them a determining factor in their robustness. For example, MLIAPs can exhibit pathological behaviors, such as unstable dynamics, even when point-wise RMS and MAE errors on hold-out data are very low10,11. Similarly, narrowly curated datasets can achieve very high local accuracy on configurations that are sufficiently similar to their training data, but exhibit large errors on more diverse datasets, pointing to serious transferability challenges12,13. This shows the importance of not only considering the size of training datasets, but also their diversity. In the following, we employ an ultra-diverse dataset that was specifically constructed to maximize its coverage in feature space, thereby minimizing the effect of this type of epistemic uncertainty.
Because it is very difficult to a priori predict the types of configurations that will be encountered in simulations carried out by end-users, it is critical to provide robust, simple, and affordable methods to quantify the confidence in results produced by atomistic simulations based on MLIAPs. Uncertainty metrics can further be used to calibrate the IAPs themselves14 or to provide scoring functions that enable active learning approaches to dataset curation15,16,17,18. Uncertainty quantification (UQ) of MLIAPs has therefore been the subject of extensive prior studies, both for classical IAPs14,19,20,21 and more recent MLIAPs16,22,23,24,25,26,27.
The vast majority of existing UQ schemes (not just for MLIAPs) tacitly ignore uncertainties due to misspecification, or model imperfection, i.e., the idea that no one choice of model parameters can exactly match all training data. Misspecification affects both finite capacity models and deep learning approaches with finite training resources28,29. It is, however, known that the loss (an upper bound to the true generalization error30) is only sensitive to epistemic (data-dependent) or aleatoric (intrinsic) uncertainties. Commonly used loss-based schemes can therefore significantly underestimate parameter uncertainties and model errors.
We have recently introduced a misspecification-aware UQ framework to resolve this issue31, which is summarized below. In this paper, we use this framework to demonstrate the quantification of misspecification uncertainties on MLIAP parameters and their propagation32,33 to material properties of physical interest. We present extensive tests of property predictions against brute force DFT calculations on a diverse set of physical properties that were not explicitly included in the training data. Our main result is that propagated misspecification uncertainties provide an efficient and robust means to assign informative error bars on simulated material properties. We show that in all of the diverse test cases considered, the misspecification prediction bounds contain the “true” answer, i.e., that calculated a posteriori with the same DFT engine, and that the distribution of predictions offers a slightly conservative, but nonetheless generally quantitative representation of the actual errors.
Rigorously treating uncertainty requires a clear definition of the fitting task and sources of uncertainty, which are typically divided into three types: aleatoric, epistemic, and misspecification, with some debate persisting over exact definitions34. We provide the following definitions, following our and other previous work30,31:
- Aleatoric uncertainty arises when training data is a sample from a random process such that a single input value produces a distribution of output values. Aleatoric uncertainty vanishes for deterministic data, i.e., when the output distribution has vanishing width.
- Epistemic uncertainty arises due to finite data observations. When the number of observations N is much greater than the number of parameters P, i.e., in the underparametrized limit, the epistemic uncertainty vanishes, in a manner that can be rigorously defined using, e.g., PAC-Bayes concentration bounds30,31,35. Note that kernel models have N/P ≃ 1 by definition, and thus will always have finite epistemic uncertainty. However, polynomial models or finite-width neural networks have finite capacity and can be under- or over-parametrized depending on the amount of training data.
- Misspecification uncertainty arises when no one choice of model parameters can exactly fit the training data. While this is potentially challenging to determine precisely for probabilistic data, it is simpler to identify for the deterministic data of interest in this work: a finite training error for the global minimizer of the loss then indicates misspecification. Of course, such a global minimizer is not always easy to find in practice, except for convex problems. Misspecification is the dominant form of uncertainty when fitting underparameterized models to deterministic data, but is known to be ignored by the expected loss30,31.
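These distinctions can be made concrete with a minimal numerical sketch (illustrative only, not part of the paper's workflow): for deterministic data and a convex fitting problem, a finite residual at the global least-squares minimizer diagnoses misspecification, while a higher-capacity, well-specified model drives the residual toward zero.

```python
import numpy as np

# Deterministic "reference" data (no aleatoric noise): y = sin(x).
x = np.linspace(0.0, np.pi, 200)
y = np.sin(x)

# Underparametrized model: quadratic in x. The least-squares problem is
# convex, so lstsq returns the global minimizer of the loss.
A = np.vander(x, 3)  # columns: x^2, x, 1
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

# A finite residual at the global optimum signals misspecification:
# no parameter choice can drive the training error to zero.
rmse = np.sqrt(np.mean((A @ theta - y) ** 2))
print(f"training RMSE at global minimum (quadratic): {rmse:.4f}")

# A higher-capacity polynomial is effectively well-specified for this
# target, and its residual becomes negligible.
A9 = np.vander(x, 10)
theta9, *_ = np.linalg.lstsq(A9, y, rcond=None)
rmse9 = np.sqrt(np.mean((A9 @ theta9 - y) ** 2))
print(f"training RMSE at global minimum (degree 9):  {rmse9:.2e}")
```

The same logic underlies the claim in the text: for the convex, linear MLIAP fits used here, any finite training error can be confidently attributed to misspecification.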
In the present setting, MLIAPs are trained on large amounts of DFT data, namely a large set of atomic configurations X with corresponding ab initio total energies \({\mathcal{E}}({\bf{X}})\) and forces \({\bf{F}}=-{\boldsymbol{\nabla }}{\mathcal{E}}({\bf{X}})\). As ab initio hyperparameters, such as the choice of DFT exchange-correlation functional or k-point density, are fixed across all training data, the same input X always returns the same output \({\mathcal{E}}({\bf{X}})\), meaning we have a vanishingly small degree of aleatoric uncertainty. Whilst ab initio data energies will depend on the DFT hyperparameter choice, varying e.g., the k-point density will change all training data in a systematic, highly correlated fashion, and the corresponding best fit MLIAP parameters will also change. For the purpose of this work, the result of a DFT calculation for some consistent set of hyperparameters is considered to be the gold standard against which the models are to be evaluated. As detailed in “POPS-hypercube ansatz for linear models”, it can also be shown that under the above definitions, our datasets are sufficiently large that the fixed-capacity MLIAPs we employ here are indeed in the underparametrized regime, i.e., epistemic uncertainties can be expected to become negligible. Furthermore, since the models are linear, the optimization problem is convex and so finite training errors can be confidently assigned to misspecification.
Approaches for UQ on MLIAP predictions have differed based on the model architecture employed and the goal of the UQ task. For active learning schemes the primary goal is uncertainty qualification, i.e., a classification of whether an individual force or energy evaluation is trustworthy. If not, new ab initio reference data is typically required to either refine or directly replace model predictions15,16,17,18,36.
MLIAPs that rely on Gaussian process regression26,37,38 possess an intrinsic uncertainty metric through the posterior variance, which is not typically a quantitative prediction but is ideal for use in active learning schemes. However, to provide robust uncertainty quantification on physical properties of interest, the MLIAP uncertainty must be quantified and propagated to the results of any simulation. Uncertainty propagation is challenging due to the strong correlations inherent to most simulation data, whether, e.g., a trajectory average or a formation energy. As a result, it is in general challenging to propagate the uncertainty from Gaussian process regression; the most efficient approach is typically estimating or sampling uncertainty on MLIAP parameters, as propagation then reduces to repeating simulations with sampled parameters32,39 or evaluating gradients of simulation results with respect to potential parameters33,40.
MLIAPs based on neural networks (NN) typically employ ensemble (query-by-committee) approaches17,22,32,39,41,42,43,44,45,46,47, where multiple models are trained in different conditions (initial weights, hyperparameters, subsets of the training data). In many cases, the ensemble only adjusts a subset of model weights for computational efficiency46. The result is an effective sample of plausible parameter values, which can then be propagated to properties either through direct resampling (rerunning simulations) or reweighting schemes32. UQ metrics then follow from the statistical properties of ensemble predictions.
A key strength of the ensemble approach is its simplicity, requiring only a mild increase in training cost (especially if only a subset of model weights are ensembled46), which has led to broad adoption for gauging uncertainty in neural network models, both in atomic simulation32,39,46 and more widely47,48. However, ensemble approaches are a form of bagging predictor49, which is known to underestimate model errors50 for learning problems with weak aleatoric noise, as is the case for IAP models39. In practice, ensemble approaches typically require multimodal model loss functions to return appreciable ensemble variance, without any theoretical guarantees that ensemble methods produce robust or predictive bounds on errors. As a result, error predictions typically require some form of calibration32 to be quantitative, as they are typically overconfident39. We provide a demonstration of ensemble methods below. Within a framework of calibration-enabled error prediction, conformal UQ methods have also been applied to MLIAP errors27.
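The overconfidence of bagging in the low-noise, misspecified regime can be seen in a small sketch (synthetic data, not from the paper): a bootstrap ensemble of linear fits to deterministic quadratic data produces a spread that shrinks with dataset size, while the actual misspecification error does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Deterministic data from a quadratic truth, fit with a misspecified line.
x = np.linspace(-1.0, 1.0, 400)
y = x ** 2
A = np.column_stack([x, np.ones_like(x)])

def fit(idx):
    theta, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
    return theta

# Bagging ensemble: refit on bootstrap resamples of the training set.
thetas = np.array([fit(rng.integers(0, len(x), len(x)))
                   for _ in range(200)])

x_test = 0.9
preds = thetas @ np.array([x_test, 1.0])
actual_error = abs(np.mean(preds) - x_test ** 2)
ensemble_std = preds.std()

# With no aleatoric noise the bootstrap spread collapses with data size,
# while the misspecification error persists: the ensemble is overconfident.
print(f"actual error {actual_error:.3f} vs ensemble std {ensemble_std:.3f}")
```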
For conventional IAPs and MLIAPs that rely on linear ML architectures combined with strongly non-linear features, UQ approaches have traditionally relied on Bayesian regression51 to quantify parametric uncertainties14,19,20,52,53, which can be extended to Bayesian NNs54. In our recent work31, discussed in more detail below, we address a known shortcoming of all Bayesian regression, which minimizes some expected loss, irrespective of model architecture: the expected loss provably ignores uncertainty due to misspecification, or imperfection, where no one choice of model parameters can perfectly predict the training data. The vast majority of regression schemes target the loss and thus significantly underestimate parameter uncertainties (i.e., model errors) in the large-data, low-noise limit of interest for MLIAP fitting30,55,56. In this limit, misspecification errors dominate, leading to bias and underestimation of uncertainties30,31 if misspecification is ignored.
Whilst a small number of misspecification-aware Bayesian regression methods exist30,34,56,57,58,59, they are only numerically stable in the regime of appreciable aleatoric uncertainty, whilst MLIAP models are fit to near-deterministic electronic structure calculations with vanishing aleatoric error31. Our recent scheme31 is thus uniquely able to estimate misspecification uncertainties in this context.
This paper is organized as follows. “Results” presents an extensive characterization of the performance of our UQ approach on a broad set of tasks that relate to the prediction of perfect crystal and defect properties. Whilst most of the error propagation is achieved through resampling, i.e., repeating simulations with resampled parameters, we also test our recent implicit differentiation scheme33 in “Fast UQ propagation via implicit expansions”. The accuracy of this method demonstrates how misspecification uncertainties in the interatomic potential can be efficiently propagated to simulation results of interest in a multiscale modeling workflow. In anticipation of future work, we finally demonstrate how the approach can be used to predict and bound errors from recent foundational, or universal, message passing neural network MLIAPs60,61. “Methods” summarizes the UQ and ML methodologies leveraged in this work. Perspectives for the method are then discussed in “Discussion”.
Results
The procedure used to generate an ensemble of models, characterized by \({\pi }_{{\mathcal{H}}}^{* }\), a probability density over model parameters, is described in detail in “Methods”. The ability of the ensemble \({\pi }_{{\mathcal{H}}}^{* }\) to characterize the uncertainty on the predictions of the MLE model is assessed using a comprehensive validation suite of properties that are often of interest in practical applications of MLIAPs, including perfect crystal properties, defects, and energy barriers. Note that none of the validation properties were explicitly included in the training data, which was generated without any input from domain experts according to the procedure described in ref. 13, and therefore this can be considered an assessment of the UQ procedure on genuinely unseen test data.
Pointwise properties
In keeping with the traditional ML literature, the most common approach to characterizing the performance of MLIAPs is through point-wise error metrics measured on a testing set that is nominally independent of the training set. Predicting the distribution of errors incurred by the MLIAP is therefore a natural objective. Figure 1a reports the distribution of actual point-wise test errors of the MLE alongside the distribution of differences between the MLE and UQ ensemble models. The results demonstrate that the overall error distribution of the MLE is extremely well captured by the resampled POPS ensemble \({\pi }_{{\mathcal{H}}}^{* }\). This shows that the deviation between the predictions of the MLE and those of individual samples from the model ensemble provides a representative statistical estimate of the actual difference between the MLE and the (often unknown) exact reference value. The ensemble also provides excellent bounds on predictions: the maximal and minimal predictions of an ensemble of 500 models sampled from \({\pi }_{{\mathcal{H}}}^{* }\) fail to bound the actual reference energies and forces in only 2.2% and 3.0% of the cases, respectively. Furthermore, the bounds provided by the model ensemble capture very specific features of individual predictions. For example, in addition to capturing the generic increase in error with increasing energy or forces that results from the reweighting scheme used to train the MLIAP, “outlier” points with unusually large errors compared to their neighboring peers are very accurately captured (cf. the outlier points in Fig. 1).
Top row (a, b): UQ on energies; bottom row (c, d): UQ on forces. Left column (a, c): distribution of test errors for the MLE against the reference data (black) and from the \({\pi }_{{\mathcal{H}}}^{* }\) ansatz against the MLE (green). MAE: mean absolute error relative to the minimum loss solution; EV: envelope violation, the fraction of points lying outside of the max/min bound. Right column (b, d): parity plot of actual vs MLE-predicted energies. Shaded areas show the min/max range of predictions over all members of \({\pi }_{{\mathcal{H}}}^{* }\). In this case, 10,000 models were sampled.
This correlation between actual errors and predicted uncertainties can be seen in Fig. 2, where an estimate of pointwise error was obtained as half of the min/max range of predictions from models sampled from \({\pi }_{{\mathcal{H}}}^{* }\). As the results clearly show, the actual errors are directly correlated with the confidence interval predicted by the POPS approach.
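The two ensemble-derived diagnostics used throughout this section, the half-width of the min/max envelope as a pointwise error bar and the envelope violation (EV) fraction, can be sketched as follows (all arrays are synthetic stand-ins for ensemble predictions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ensemble predictions: (n_models, n_test), e.g., energies
# evaluated with parameters resampled from the misspecification posterior.
n_models, n_test = 500, 1000
truth = rng.normal(size=n_test)                          # reference values
mle = truth + 0.05 * rng.normal(size=n_test)             # MLE predictions
ens = mle + 0.15 * rng.normal(size=(n_models, n_test))   # ensemble spread

lo, hi = ens.min(axis=0), ens.max(axis=0)

# Pointwise error bar: half of the min/max envelope width.
error_bar = 0.5 * (hi - lo)

# Envelope violation (EV): fraction of reference values outside [lo, hi].
ev = np.mean((truth < lo) | (truth > hi))
print(f"mean error bar {error_bar.mean():.3f}, EV {ev:.1%}")
```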
These results clearly show that the UQ ensemble not only captures average error behavior, but closely resolves high uncertainty regions that result from particularly detrimental combinations of test point and intrinsic model limitations. The ability to confidently bound predictions is also a powerful feature that can be used to easily propagate worst-case scenarios to more complex quantities, as will be shown in the following.
It is important to note that, in the misspecified setting, the trained model will be a function of the specific distribution from which the training data was sampled even in the infinite data limit, in contrast to the well-specified case where the correct model can be recovered from different training distributions62. This dependence will, in general, lead to errors being larger in sparse regions than in dense regions63, a phenomenon called error amplification. This is broadly consistent with the behavior observed in Figs. 1 and 11 below, where sparser regions generally tend to show higher errors (although sparsity should ideally be assessed in feature space, not in the output space). Crucially, when this effect persists in the infinite data limit, it is better described as a form of data-distribution-dependent misspecification error than as a form of epistemic error.
Perfect crystal properties
A second key class of properties of direct interest to applications is the quantification of the stability of different crystal structures. Figure 3 and Tables 1–3 demonstrate that the UQ ensemble accurately captures the actual errors in formation energy, equilibrium volume, and bulk modulus over 13 different crystal structures that vary broadly in topology and unit cell sizes. These results are obtained by using atomistic configurations and simulation cells that were individually optimized under corresponding MLIAPs, in contrast to evaluating point-wise energies on the reference structures relaxed with DFT.
The results clearly show that the UQ ensemble accurately captures the uncertainty inherent to different phases, providing tightly distributed predictions where the actual errors are low and more diverse predictions where the actual errors are large, in addition to accurately bounding the actual predictions in all cases.
Tables 1–3 also show that the standard deviation of the UQ ensemble predictions provides a statistically representative indication of the magnitude of the actual errors: the mean ratio of the MLE error to the ensemble standard deviation is close to 1, except for the formation volume, where the ensemble overestimates the errors by about a factor of 2 on average.
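Propagating the parameter ensemble to equation-of-state properties amounts to re-extracting each property from each member's energy-volume curve. A minimal sketch (toy harmonic curves with invented parameters, not the paper's tungsten data) of how the equilibrium volume V0 and bulk modulus B = V0 E''(V0) inherit the ensemble spread:

```python
import numpy as np

rng = np.random.default_rng(3)

V = np.linspace(14.0, 18.0, 21)  # volumes per atom (illustrative units)

v0s, bs = [], []
for _ in range(100):  # 100 ensemble members
    v0_true = 16.0 + 0.1 * rng.normal()   # member-dependent minimum
    k = 1.0 + 0.05 * rng.normal()         # member-dependent curvature
    E = 0.5 * k * (V - v0_true) ** 2      # member's energy-volume curve

    a, b, _ = np.polyfit(V, E, 2)         # E ~ a V^2 + b V + c
    v0 = -b / (2.0 * a)                   # equilibrium volume
    v0s.append(v0)
    bs.append(v0 * 2.0 * a)               # B = V0 * d2E/dV2

print(f"V0 = {np.mean(v0s):.2f} +/- {np.std(v0s):.2f}")
print(f"B  = {np.mean(bs):.2f} +/- {np.std(bs):.2f}")
```

The ensemble standard deviations of V0 and B are then compared against the actual MLE errors, as in Tables 1–3.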
In all cases, the extreme values predicted by the ensemble bound the actual reference result, providing strong guarantees.
Furthermore, in addition to information regarding the absolute accuracy of the predictions, it is often desirable to establish whether MLIAPs can be expected to predict the relative ordering of certain properties across different phases, e.g., of the formation energy, which determines the most thermodynamically stable phase at low temperature. Figure 4a and b demonstrate that the distribution of Spearman rank correlation coefficients between the MLE and members of the UQ ensemble (blue histograms) provides representative estimates of the actual correlation between the MLE and the reference data (black vertical line): while most potentials agree with the MLE with regard to the ordering of the formation energies, the relative ordering of the equilibrium volumes shows a much broader distribution. In both cases, the Spearman correlation coefficient between the MLE and the reference is contained within a one-standard-deviation interval around the ensemble-to-MLE mean. This is a very desirable feature, as it enables the end-user to establish confidence in the accuracy of ranked comparisons without access to reference data.
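The rank-ordering diagnostic can be sketched as follows (hypothetical formation energies standing in for the 13 phases; the Spearman coefficient is computed directly from ranks to keep the example self-contained):

```python
import numpy as np

rng = np.random.default_rng(4)

def spearman(a, b):
    """Spearman rank correlation (no ties expected for continuous data)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical per-phase formation energies (eV/atom) for 13 structures:
# MLE values plus an ensemble of resampled predictions.
mle = np.sort(rng.uniform(0.0, 0.5, 13))
ensemble = mle + 0.02 * rng.normal(size=(500, 13))

# The distribution of rank correlations between the MLE and ensemble
# members gauges how confidently the model orders the phases, without
# any access to reference data.
rhos = np.array([spearman(mle, e) for e in ensemble])
print(f"Spearman rho: {rhos.mean():.2f} +/- {rhos.std():.2f}")
```

A narrow distribution near 1 indicates a robust ordering; a broad distribution flags orderings that should not be trusted.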
The panels report histograms of rank correlations between the MLE and ensemble models; black line: rank correlation between DFT and the MLE. The shaded area corresponds to a one-standard-deviation interval around the mean of the UQ ensemble. a Equilibrium formation energies of different phases; b equilibrium volumes of different phases; c SIA formation energies for different structures; d surface formation energies along different orientations.
Transformation pathways between crystal structures are also relevant to the analysis of phase transitions. A range of such continuous transformation paths is reported in Fig. 5. The MLE MLIAP closely reproduces reference DFT results for the four paths that were considered. In all cases, the distribution of predictions from the UQ ensemble is tightly concentrated, except for the orthorhombic bcc → bct → bcc path where the predictions in the bct region are somewhat broader. In all cases, the UQ ensemble bounds the reference DFT values while providing a representative quantification of the actual error incurred by the MLE MLIAP.
Phonons
Another key indicator of the thermodynamics and dynamics of crystal structures is provided by phonon dispersion relations, which are often prized because they can be correlated with scattering or spectroscopic experiments, and because they quantify the vibrational-entropy contribution to the thermodynamic stability of different crystal structures. Note that phonon properties derive from the diagonalization of energy Hessians or dynamical matrices and are therefore determined by second-order derivatives of the energy, which were not explicitly present in the training set.
Therefore, POPS ensembles were not explicitly constructed to match elements of the Hessian. Comparison of DFT and MLE results (cf. Fig. 6) shows that the MLIAP performs well at low frequencies, but significantly overestimates the vibrational density of states at high frequencies (cf. right panel), potentially reflecting the absence of Hessian training data. Correspondingly, the range of spectra predicted by the ensemble is also very broad, suggesting low confidence in the predictions. The UQ ensemble, however, still correctly bounds the reference spectrum across the whole range of wavevectors.
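How parameter spread propagates to a phonon envelope can be illustrated on the simplest possible case, a 1D monatomic chain with analytic dispersion ω(q) = 2√(k/m)|sin(qa/2)| (a toy analogue of the Hessian diagonalization used for the real crystal, with invented numbers):

```python
import numpy as np

rng = np.random.default_rng(5)

# 1D monatomic chain: each ensemble member carries a slightly different
# force constant k (a second derivative of the energy), mimicking the
# spread obtained from resampled potential parameters.
q = np.linspace(0.0, np.pi, 50)   # wavevector times lattice constant
m = 1.0

ks = 1.0 + 0.1 * rng.normal(size=200)   # resampled force constants
branches = np.array([2 * np.sqrt(k / m) * np.abs(np.sin(q / 2))
                     for k in ks])

lo, hi = branches.min(axis=0), branches.max(axis=0)
mle = 2 * np.abs(np.sin(q / 2))         # "MLE" branch, k = 1

# The min/max envelope bounds the MLE branch at every wavevector,
# with the spread growing toward the zone boundary.
print(f"relative spread at zone boundary: "
      f"{(hi[-1] - lo[-1]) / mle[-1]:.2f}")
```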
Defect energetics
Finally, due to their critical role in determining the mechanical properties of engineering materials, the energetics of defects are often key quantities used to train and validate potentials. We considered two classes of defects: self-interstitial atoms (SIAs) — which are particularly important to understand the behavior of materials under irradiation — and free surfaces. In both cases, formation energies were obtained self-consistently using the energy-minimizing structures predicted by each potential. The results are presented in Fig. 7 and Table 4. The energy scale for SIA formation is accurately captured by the MLE model, and the ensemble results bound the actual formation energies. The standard deviation of the UQ ensemble provides an excellent statistical representation of the actual error incurred by the MLE. In this case, the formation energies for 110 and OS variants are underestimated by the MLE, leading to a different predicted ordering of the relative defect stabilities. As shown in Fig. 4, the distribution of Spearman rank correlation coefficients between MLE and members of the UQ ensemble is also very broad, consistent with the observed ranking disagreement between MLE and reference values; the rank correlation coefficient between the reference and the MLE is found within one standard deviation of the mean of the correlation coefficients between MLE and ensemble.
Surfaces are another class of important planar defects that, e.g., control the shape of nanoparticles. Figure 8 and Table 5 demonstrate that the MLE MLIAP provides an adequate representation of the energies of different facets. In this case, the standard deviation of the ensemble prediction conservatively overestimates the actual errors by about a factor of 4, once again providing worst-case bounds that always include the actual reference value.
Figure 4 also shows that the ordering of surface energies is robustly captured by the MLE MLIAP, which is corroborated by the narrow distributions of rank correlation coefficients between MLE and members of the UQ ensemble. In these cases also, the distribution of Spearman coefficients is consistent with the very high correlation between the MLE and the reference.
Energy barriers
In addition to thermodynamics, an assessment of uncertainty of properties related to defect kinetics is often extremely desirable, especially since kinetic properties can be exponentially sensitive to transition barrier energetics. This makes it extremely important to avoid overly pessimistic UQ, since it can translate into exponentially large differences in predicted characteristic timescales. Furthermore, saddle points are computationally expensive to harvest in large numbers using reference quantum methods, which makes them potentially drastically underrepresented in most training sets for MLIAPs. Figure 9 reports on the performance of the UQ ensemble for the minimum energy pathway of a first neighbor vacancy hop in BCC W. The results show that the MLE overestimates the reference results by a significant margin (about 0.5 eV), but that the UQ ensemble offers a quantitatively appropriate estimation of the error on the energy barrier. Note that the minimum energy pathways were individually reconverged for each MLIAP, and not simply reevaluated along the reference minimum energy pathway.
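The exponential sensitivity mentioned above follows directly from the Arrhenius form: a barrier uncertainty dE maps to a multiplicative rate uncertainty exp(dE/kBT), so modest energy errors become very large timescale errors. A short numerical illustration (temperature and uncertainty values are illustrative, not from the paper):

```python
import math

kB = 8.617333262e-5  # Boltzmann constant, eV/K
T = 600.0            # temperature, K

# Multiplicative rate uncertainty implied by a barrier uncertainty dE.
factors = {dE: math.exp(dE / (kB * T)) for dE in (0.05, 0.1, 0.5)}
for dE, f in factors.items():
    print(f"barrier uncertainty {dE:4.2f} eV -> rate uncertain "
          f"by a factor of {f:9.1f}")
```

This is why an overly pessimistic barrier UQ, which would be harmless for thermodynamic quantities, can render predicted kinetics useless.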
Fast UQ propagation via implicit expansions
Many material properties, such as defect energetics and energy barriers, are calculated via local energy minimization. In principle, propagation of parameter uncertainty to these properties requires brute-force repetition of simulations, which quickly becomes infeasible as system size or the number of systems increases. In this section, we apply a recently introduced approach to assess the predictions from the UQ ensemble by employing the implicit differentiation of atomic minima33. The implicit derivative emerges by noting that a stationary atomic configuration X* is an implicit function of the potential parameters Θ. As shown in ref. 33, the implicit derivative of atomic configurations, \({\nabla }_{{\mathbf{\Theta }}}{{\bf{X}}}_{{\mathbf{\Theta }}}^{* }\), can be computed efficiently for linear-in-descriptor potentials. This enables the calculation of the change in stationary configurations, \(\Delta {{\bf{X}}}_{{\mathbf{\Theta }}}^{* }\), under relatively small potential perturbations, ΔΘ, without re-minimization of the system for each potential sample. This method is advantageous in scenarios where performing molecular statics calculations is expensive due to large system sizes or a large number of ensemble potentials.
Here, we apply the implicit approach to two UQ estimation cases presented above: (1) equilibrium volumes of BCC and HCP W phases and (2) minimum energy pathways for a first-neighbor vacancy hop in BCC W. For both cases, the implicit derivative of the equilibrium volume \({V}_{{\mathbf{\Theta }}}^{* }\), \({\nabla }_{{\mathbf{\Theta }}}{V}_{{\mathbf{\Theta }}}^{* }\), is sufficient for the predictions. More details of the implicit expansion method and various forms of truncation/approximation are given in ref. 33.
For UQ of the equilibrium volumes, we first compute the implicit derivatives \({\nabla }_{{\mathbf{\Theta }}}{V}_{{\mathbf{\Theta }}}^{* }\) at the BCC and HCP minima with the MLE potential. Then, for each potential sample from the UQ ensemble, we predict the BCC and HCP volume change \(\Delta {V}_{{\mathbf{\Theta }}}^{* }=\Delta {\mathbf{\Theta }}\cdot {\nabla }_{{\mathbf{\Theta }}}{V}_{{\mathbf{\Theta }}}^{* }\). The left panel of Fig. 10 shows the predicted BCC and HCP volume ratios vs. their true values, obtained by minimization for each potential in the UQ ensemble. For UQ of the minimum energy pathways, we perform the full calculation with the MLE potential and identify the initial and saddle-point configurations. We then compute the implicit volume derivative at the initial configuration and predict the energy change at the initial and saddle-point configurations using a Taylor expansion of the atomic energy with the implicit derivative (see ref. 33 for more details). The right panel of Fig. 10 shows the implicit derivative predictions of the energy barriers compared to the full pathway calculations.
In both cases, the implicit derivative technique provides predictions within less than 4% error. Since the goal of the POPS approach is to provide worst-case bounds for the quantities of interest, combining the UQ ensemble potentials with implicit derivative predictions yields a highly efficient scheme for UQ of molecular statics properties.
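The identity underlying the implicit derivative can be illustrated on a 1D analogue (a sketch with an invented energy function, not the paper's implementation): the minimizer x*(θ) of E(x; θ) = (x − θ)² + 0.1 x⁴ shifts under a parameter perturbation as dx*/dθ = −(∂²E/∂x∂θ)/(∂²E/∂x²), so the new minimum can be predicted without re-minimization.

```python
def minimize_E(theta, x=0.0):
    # Newton iterations on the gradient dE/dx = 2(x - theta) + 0.4 x^3;
    # E is strictly convex (d2E/dx2 = 2 + 1.2 x^2 > 0), so this converges.
    for _ in range(60):
        g = 2.0 * (x - theta) + 0.4 * x ** 3
        h = 2.0 + 1.2 * x ** 2
        x -= g / h
    return x

theta0, dtheta = 1.0, 0.05
x0 = minimize_E(theta0)

# Implicit derivative at x0: -(d2E/dx dtheta) / (d2E/dx2), with
# d2E/dx dtheta = -2 for this energy.
dx_dtheta = 2.0 / (2.0 + 1.2 * x0 ** 2)

x_pred = x0 + dx_dtheta * dtheta        # first-order Taylor prediction
x_true = minimize_E(theta0 + dtheta)    # brute-force re-minimization
print(f"predicted {x_pred:.5f} vs re-minimized {x_true:.5f}")
```

For small ΔΘ the Taylor prediction agrees with re-minimization to leading order, which is what makes sweeping hundreds of ensemble potentials affordable.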
Application to universal MLIAPs
Recent message-passing neural network (MPNN) models64,65 have shown an impressive ability to predict atomic energies and forces of diverse multi-species configurations across the periodic table60,61. There is thus significant interest in assessing the accuracy of these ‘universal’ MLIAPs (UMLIAPs), both to determine the uncertainty in predictions and to select optimal training configurations for fine-tuning, where additional training data is used to adjust a small subset of model parameters.
The per-atom energy prediction of UMLIAPs \({E}_{MPNN}^{i}\) is produced by a readout function64,65, which typically receives scalar-valued node features as messages from the MPNN. In the framework of this paper, we can therefore treat the scalar-valued input to the readout layer as per-atom descriptors Di, as in the MACE MPNN model64. To motivate forthcoming studies of misspecification-aware UQ and fine-tuning for UMLIAPs, we applied the POPS UQ scheme to a linear ‘corrector’ for the MACE-MPA-0 foundation model60, trained on the mptraj61 and sAlex66 datasets. Specifically, we consider a simple linear model added to the MACE-MPA-0 prediction, giving a loss function
where \({{\bf{D}}}_{i}\in {{\mathbb{R}}}^{256}\) is the MACE per-atom descriptor vector. We emphasize that the goal of the linear corrector formalism is not to improve the MACE-MPA-0 model but to capture the misspecification uncertainty, namely the model-form errors, which manifest as uncertain model parameters. In principle, every parameter of the model has a misspecification uncertainty. However, for UQ purposes, we are free to consider a subset of parameters fixed, in effect restricting the function space for propagation and thus typically overestimating the uncertainties on the remaining variable parameters. For MACE-MPA-0, we consider all parameters fixed apart from the linear component of the readout layer, which allows us to use POPS to determine parameter uncertainties, giving the same benefits for conservative error prediction and robust propagation of model-form errors to simulation results as demonstrated above for the SNAP form of MLIAPs. Similar results could be achieved using, e.g., any additive linear model, or by linearizing around the minimum loss solution. Study of the misspecification parameter covariance also gives an architecture-sensitive error measure that can be used to guide fine-tuning, a detailed study of which will be presented elsewhere.
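As an illustration of the corrector fit (a sketch, not the actual MACE-MPA-0 pipeline), the linear model can be trained by least squares on residual energies; the per-structure descriptor sums and residuals below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n_struct, d = 1000, 256  # structures; MACE readout-descriptor width

# Synthetic stand-ins for the per-structure descriptor sums Σ_i D_i and
# the residual energies E_ref − E_MACE (both would come from real models).
G = rng.normal(size=(n_struct, d))
residual = G @ rng.normal(scale=0.01, size=d) + rng.normal(scale=1e-4, size=n_struct)

# The linear corrector: fit residual ≈ G @ Θ by (here unweighted) least squares.
Theta, *_ = np.linalg.lstsq(G, residual, rcond=None)
rmse = np.sqrt(np.mean((residual - G @ Theta) ** 2))
```

In the misspecification-aware setting, the fitted Θ is the MLE around which the POPS parameter uncertainty is then constructed.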
We applied the POPS scheme to obtain a posterior distribution \({\pi }_{{\mathcal{H}}}^{* }({\boldsymbol{\Theta }})\) trained over energies from the mptraj dataset, including all 89 elements of mptraj with a 50:50 train:test split. As shown in Fig. 11, the ability of POPS to accurately predict test error distributions and bound worst-case errors seen for linear MLIAPs is maintained in application to linear correctors for UMLIAPs. We observe excellent coverage of the test error distribution over at least ± 4 standard deviations, with a small envelope violation of 1%. These preliminary results show that the general misspecification-aware framework introduced here can be applied to recent universal MLIAPs; future work will develop this approach both for uncertainty propagation and active learning workflows for UMLIAP fine-tuning.
a distribution of test errors for the MLE against the reference data (black) and from the \({\pi }_{{\mathcal{H}}}^{* }\) ansatz against the MLE (green); b parity plot of MLE predicted vs actual energies for a subset of points. MAE: mean absolute error relative to the minimum loss solution. EV: envelope violation, the fraction of points lying outside the max/min bound. Shaded areas show the min/max range of predictions over all members of \({\pi }_{{\mathcal{H}}}^{* }\).
Comparison with standard uncertainty ensembles
As discussed in the introduction, a popular approach to gauge MLIAP uncertainty is an ensemble bagging predictor49, where an ensemble of models is produced by subsampling the training set, in contrast to the POPS ensemble and hypercube approaches presented in this paper. As previously discussed31,50, there is no principled means of choosing the number of ensemble members, and the training cost in general scales with that number. Furthermore, whilst predictions of, e.g., energy errors from ensemble methods can in principle be improved with an ad hoc rescaling factor, there is no theoretical basis for expecting the resultant error bounds to be robust, i.e., for the ground truth to lie within these bounds. These theoretical limitations persist when propagating uncertainty and bounding quantities of interest. To illustrate the issues with bagging ensemble methods, we apply a simple 5-member ensemble for the POPS linear corrector discussed in the previous section, in addition to a model with a simple rescaling factor applied to energy predictions such that the mean training and predicted variances agree. As can be seen in Fig. 12, the unscaled ensemble significantly underestimates the test errors and has an extremely high envelope violation (EV) of 80%. Whilst applying a rescaling factor improves the predicted error histogram, as can be expected, the EV rate remains as high as 14%, which in practice means the predicted bounds are not reliable for property prediction. As a further demonstration, Fig. 13 shows the predicted vs actual error correlations as in Fig. 1, where it can be seen that even when compared to the rescaled ensemble, POPS shows a far superior error correlation. We emphasize that these behaviors persist when increasing the number of ensemble members, whilst the training cost is much higher than that of the POPS approach for NE ≥ 2.
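A minimal synthetic sketch of the bagging baseline, its envelope-violation metric, and one possible variance-matching rescaling (our illustrative choice of matching rule, not necessarily the exact procedure used here):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, NE = 2000, 50, 5  # data points, parameters, ensemble members

# Synthetic misspecified task: the linear model cannot capture the x² term.
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + 0.1 * X[:, 0] ** 2
train, test = np.arange(N // 2), np.arange(N // 2, N)

# Bagging: refit the model on bootstrap resamples of the training set.
preds = []
for _ in range(NE):
    idx = rng.choice(train, size=train.size, replace=True)
    theta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    preds.append(X[test] @ theta)
preds = np.stack(preds)  # shape (NE, n_test)

# Envelope violation: fraction of truths outside the ensemble min/max band.
lo, hi = preds.min(axis=0), preds.max(axis=0)
ev = np.mean((y[test] < lo) | (y[test] > hi))

# Ad hoc rescaling factor: match the mean predicted (ensemble) variance
# to the mean squared training error of the full-data fit.
theta_mle, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
train_mse = np.mean((y[train] - X[train] @ theta_mle) ** 2)
scale = np.sqrt(train_mse / preds.var(axis=0).mean())
```

Even on this toy problem the unscaled band is typically far too narrow, mirroring the high EV observed for the UMLIAP corrector above.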
These simple tests demonstrate how POPS provides an efficient and robust means to both predict test error distributions and, uniquely, bound test errors, which, as shown above, allows propagated materials properties to be bounded.
a using the POPS or b using the unscaled and rescaled ensemble linear correctors. As in Fig. 2, we take the predicted error to be half the min/max range across the ensemble, focusing on the error range up to 0.5 eV/atom for which sufficient statistics are available (at least five data points in each interval). As can be seen, even when compared to the rescaled ensemble, the POPS predicted errors show a much stronger correlation with the actual observed errors, in addition to the robust max/min bounds evidenced above.
Discussion
This paper has investigated uncertainty quantification for the predictions of machine learning interatomic potentials (MLIAPs), using a qSNAP potential for tungsten as a prototypical example. We demonstrated application of a novel Bayesian regression approach, POPS31, that is specifically designed for near-deterministic regression tasks when the aleatoric error is low (e.g., when reference quantum calculations are well converged) and training data is abundant, so that model misspecification errors dominate. The effect of this type of error is comparatively under-studied in the Bayesian regression literature, where the focus primarily lies on quantifying the effects of the lack of training data, but is essential to understand uncertainties in conditions typical of the development of modern MLIAPs. The method is extremely computationally efficient for the broad class of MLIAPs that can be expressed as linear combinations of very complex non-linear features, such as the ACE67 and SNAP potentials68,69, introducing a negligible additional cost to generate a statistically-representative ensemble of MLIAPs.
The ensemble of potential weights generated by the POPS approach proved extremely adept at quantitatively estimating uncertainties on both pointwise and complex quantities and at bounding worst-case errors. Through an extensive suite of validation tests commonly used to assess MLIAP quality for materials science, including static, dynamic, and kinetic properties of perfect crystals and defects, we demonstrated that robust uncertainty metrics can be reliably obtained at a low computational cost. This type of approach offers dramatic improvement in the quantitative understanding of uncertainties inherent to atomistic simulations in the ML era. We also demonstrated that the POPS framework can be applied to bound the error of non-linear models, specifically recent MPNN models60, through the use of a linear corrector, which will be developed further in future work. More generally, our study highlights the benefit of principled, misspecification-aware UQ techniques to systematically optimize accuracy/simulation rate tradeoffs, crucial to realize the predictive potential of data-driven models.
The current implementation of the POPS algorithm is restricted to linear parameter dependencies for training data with negligible aleatoric error. Epistemic errors are incorporated using standard Bayesian regression techniques31. Whilst the POPS approach can be extended to incorporate aleatoric terms, which will be the subject of a forthcoming publication, such terms are of limited relevance when fitting MLIAPs to ab initio data, which have very low aleatoric uncertainty, as discussed above. Extension of the POPS misspecification UQ approach to non-linear models is under active development and will be the subject of a forthcoming study.
Methods
Misspecification-aware Bayesian regression for MLIAP fitting
In the following, we demonstrate the effectiveness of a recently-introduced misspecification-aware UQ method to describe the uncertainties inherent to MLIAPs. To summarize the above, this method specifically targets the aforementioned regime where:
1. The reference data (here DFT energies and forces) is near-deterministic, i.e., it exhibits vanishing aleatoric errors;
2. The ML model is misspecified, i.e., no single choice of the free parameters can reproduce all reference data exactly;
3. The model is underparameterized, i.e., the amount of training data significantly exceeds the number of trainable parameters.
In the context of MLIAPs, condition 1 reflects the near-deterministic nature of well-converged quantum calculations, where repeated calculations with the same inputs result in the same output. While some MLIAP formalisms provide completeness guarantees in some limit, practical accuracy/computational cost tradeoffs9 commonly result in the use of misspecified models where conditions 2 and 3 are met. In this regime, uncertainties on the predictions derived from the MLIAPs do not stem from the intrinsically noisy nature of the data or from an insufficient amount of training data, but are dominated by the misspecified nature of the ML model when a large amount of training data is provided.
In the following, we will demonstrate that this approach provides (i) reliable estimates of point-wise energy and force errors, (ii) reliable bounds on maximal errors, and (iii) reliable error estimates on a large number of non-point-wise, complex properties (e.g., formation energies, energy barriers, etc.), which enables a thorough characterization of the uncertainties of MLIAP predictions at a very affordable computational cost. This enables a systematic evaluation of the predictability of simulation results that goes beyond what would be possible using point-wise average metrics alone.
POPS-hypercube ansatz for linear models
For completeness, this section summarizes the key details of our scheme to quantify misspecification uncertainties. We refer the reader to ref. 31 for a detailed presentation. An open source implementation, following the Scikit-learn linear_model API70, is available on GitHub at https://github.com/tomswinburne/POPSRegression.git
Our goal is to determine a posterior distribution π(Θ) of parameters for some MLIAP \({\mathcal{M}}({\bf{X}};{\mathbf{\Theta }})\), which aims to approximate some DFT ground truth \({\mathcal{E}}({\bf{X}})\). In the following derivation (but not in the numerical experiments that follow), we only consider energies for brevity; the extension to forces is trivial. From a Bayesian perspective, the near-deterministic nature of \({\mathcal{E}}\) is manifest in the sharp conditional distribution of the output Y (here a scalar energy) given an input X, reading
which limits to a delta function as ϵ → 0. Bayesian regression aims to find the distribution of model parameters π(Θ) to minimize the cross entropy between \({\rho }_{{\mathcal{E}}}({\rm{Y}}| {\bf{X}})\) and the model distribution, which writes
where dπ(Θ) = π(Θ)dΘ. The cross entropy between \({\rho }_{{\mathcal{M}}}({\rm{Y}}| {\bf{X}},\pi )\) and \({\rho }_{{\mathcal{E}}}({\rm{Y}}| {\bf{X}})\) is known as the generalization error, here \({\mathcal{G}}[\pi ]\), reading (see ref. 31 for a full derivation)
where 〈…〉 denotes an average over a formally infinite quantity of training data, potentially with a normalized positive weighting w(X). Minimization of \({\mathcal{G}}[\pi ]\) is extremely challenging due to the poor conditioning of the logarithmic term, and also does not have any means to incorporate epistemic (finite data) uncertainties. However, it is clear that unless a single value of Θ can produce perfect predictions, π(Θ) is required to have finite width, which is precisely the misspecification uncertainty we wish to estimate. As \({\mathcal{G}}[\pi ]\) is numerically intractable, the vast majority of regression techniques employ the Jensen inequality \(-\langle \ln x\rangle \le -\ln \langle x\rangle\) for convex functions to define \({\mathcal{L}}[\pi ]\), the expected loss or log likelihood through
It is clear that \({\mathcal{L}}[\pi ]\) is minimized by a sharp distribution around the global loss minimizer
such that \({\pi }_{{\mathcal{L}}}^{* }({\mathbf{\Theta }})=\delta ({\mathbf{\Theta }}-{{\mathbf{\Theta }}}^{* })\). This important result shows that loss minimization ignores misspecification uncertainties, which, as discussed above, are dominant for MLIAPs. The connection to Bayesian inference at finite data (i.e., with epistemic uncertainties) was made in ref. 71, using PAC-Bayes analysis35,72 to show that
where \({\sigma }_{N}^{2}({\mathbf{\Theta }})={\sum }_{i}{w}_{i}\parallel {\mathcal{M}}({{\bf{X}}}_{i};{\mathbf{\Theta }})-{\mathcal{E}}({{\bf{X}}}_{i}){\parallel }^{2}/N\) is the average squared error over the N training points, C is a constant31,71 and π0(Θ) is some prior distribution (which is taken as uniform in the following). It is simple to show that this upper bound is minimized by the well-known posterior from Bayesian inference
In the large data limit N → ∞, application of steepest descent recovers the sharp distribution \({\pi }_{{\mathcal{L}}}^{* }({\mathbf{\Theta }})\), again showing the inability to capture misspecification uncertainty.
To find an approximate minimizer of \({\mathcal{G}}\), our approach31 defines pointwise optimal parameter sets (POPS) for each training point X, being the set of all model parameters for which that particular training point is exactly matched, i.e., all Θ such that \({\mathcal{M}}({\bf{X}};{\mathbf{\Theta }})={\mathcal{E}}({\bf{X}})\) at X. In ref. 31, we show that any posterior distribution π that minimizes the generalization error must have mass in every POPS in the training set. For misspecified models, the mutual intersection of all POPS is empty, enforcing finite parameter uncertainty. Our POPS regression algorithm first finds the parameter \({{\boldsymbol{\Theta }}}_{{\bf{X}}}^{* }\) that minimizes the global loss conditional on lying in the POPS of X. This produces an ensemble of N parameter values clustered around the global loss minimizer \({{\mathbf{\Theta }}}_{{\mathcal{L}}}^{* }\). The final parameter posterior distribution \({\pi }_{{\mathcal{H}}}^{* }\) is then taken as a uniform distribution over the minimal hypercube \({\mathcal{H}}\) in parameter space that encompasses all of the N POPS-constrained loss minimizers. For a model of P parameters, the POPS-hypercube posterior can then be resampled for only \({\mathcal{O}}(P)\) computational effort and is thus a highly efficient means of capturing the dominant uncertainty in interatomic potentials trained on large datasets. Our open source implementation incurs a minimal overhead of around 2× over Bayesian ridge regression as implemented in scikit-learn’s linear_model.BayesianRidge.
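For a linear model, the POPS-constrained minimizers admit a closed form (the standard solution of an equality-constrained least-squares problem), and the hypercube follows from per-parameter min/max. The sketch below uses synthetic data and is a simplified view of the algorithm detailed in ref. 31:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 400, 10  # training points >> parameters: underparameterized regime

# Synthetic misspecified linear problem: no Θ matches all targets at once.
A = rng.normal(size=(N, P))
y = A @ rng.normal(size=P) + 0.05 * rng.normal(size=N)

# Global loss minimizer Θ* and Gram matrix G = AᵀA.
G = A.T @ A
theta_star = np.linalg.solve(G, A.T @ y)

# POPS-constrained minimizers: for each point x, the closed-form solution of
# min ||AΘ − y||² subject to a_x · Θ = y_x.
Ginv_At = np.linalg.solve(G, A.T)            # columns G⁻¹ a_xᵀ, shape (P, N)
gaps = y - A @ theta_star                    # pointwise residuals at Θ*
denoms = np.einsum('np,pn->n', A, Ginv_At)   # a_x G⁻¹ a_xᵀ for each x
pops = theta_star + (gaps / denoms)[:, None] * Ginv_At.T  # shape (N, P)

# Minimal axis-aligned hypercube enclosing all N POPS minimizers; the
# posterior ansatz is uniform over this box.
lo, hi = pops.min(axis=0), pops.max(axis=0)
```

Each row of `pops` reproduces its own training point exactly while remaining as close as possible, in loss, to the global minimizer.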
Interatomic potential training
We consider machine-learned interatomic potentials in the family of the Spectral Neighbor Analysis Potential (SNAP)68, more specifically of the quadratic SNAP (qSNAP) type69, here instantiated to describe tungsten (W). SNAP potentials describe the local environment around an atom i in terms of invariants of a spherical-harmonics expansion of the local atomic density, the so-called bispectrum components, denoted \(\{{B}_{k}^{i}\}\). Under qSNAP, the corresponding atomic energy is expanded to second order in the bispectrum components, i.e.,
where β and α are vectors and matrices of adjustable coefficients, respectively. For simplicity we collate the linear and quadratic terms into a single parameter vector Θ and descriptor vector Di, giving the atomic energy as
The total energy of a configuration of atoms is then defined as the sum of the per atom energies,
and the atomic forces as the gradient of Eq. (9) with respect to atomic coordinates.
Training a qSNAP model therefore corresponds to solving a (potentially weighted) linear least-squares problem with unknowns β and α, so as to minimize the squared (total) energy and (atomic) force residuals against reference quantum calculations. In the following, the total number of adjustable parameters (including the β vector and the α matrix) was 1596.
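The collation of linear and quadratic bispectrum terms into a single descriptor vector can be sketched as follows; the sizes are illustrative and the exact grouping of terms in the production fit may differ:

```python
import numpy as np

rng = np.random.default_rng(4)
K = 55  # bispectrum components per atom (a common truncation; illustrative)

def qsnap_descriptor(B):
    """Collate linear and unique quadratic bispectrum terms into one vector
    D_i, so the per-atom energy is a single dot product E_i = Θ · D_i."""
    iu = np.triu_indices(K)
    return np.concatenate([B, np.outer(B, B)[iu]])

natoms = 16
B_atoms = rng.normal(size=(natoms, K))         # stand-in bispectrum data
D = np.stack([qsnap_descriptor(B) for B in B_atoms])

Theta = rng.normal(size=K + K * (K + 1) // 2)  # collated [β, α] coefficients
E_atoms = D @ Theta                            # per-atom energies Θ · D_i
E_total = E_atoms.sum()                        # total configuration energy
```

With K = 55 this gives 55 + 1540 = 1595 descriptor components per atom; an additional constant (reference-energy) term would match the 1596 adjustable parameters quoted above, though that correspondence is our assumption.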
The reference dataset was obtained here using a diverse-by-construction generation technique introduced in ref. 12 and generalized in ref. 13. This method creates atomic configurations so as to specifically maximize the information entropy of the bispectrum component distribution, resulting in very broad coverage of feature space. The dataset considered here was introduced in ref. 13 and was rescaled to the interatomic spacing of tungsten. The data was partitioned into a training set containing 7000 energies and 122,853 force components and a testing set containing 3000 energies and 53,493 force components. This corresponds to a ratio N/P of training data to adjustable parameters of 77, which puts this problem deep in the underparameterized regime where misspecification errors are expected to dominate. As shown in a previous study31, conventional Bayesian ridge regression methods dramatically underestimate uncertainty at N/P ratios of a few tens.
Since the properties of lower energy structures are often the target of practical investigations, individual energies and forces were weighted to give more importance to near-equilibrium configurations, following:
with a = 2 eV and b = 50 eV/Å. The energy and force weights are normalized so that their respective sums over the training set are equal. Using these settings, the MAE training errors are 0.15 eV/atom and 0.66 eV/Å for energies and forces, respectively, clearly indicating that the potential is significantly misspecified, as these values exceed SCF convergence thresholds by at least two orders of magnitude. While these errors are high by the standards of modern MLIAPs, the aggregate values mask the fact that, due to the data reweighting scheme, errors are much lower for the low-energy, low-force configurations that are typically relevant in most applications. We also note that tungsten and other early transition metals are known to be particularly difficult to represent with MLIAPs because of their large density of states near the Fermi level, which makes the potential energy surface very sensitive to small perturbations and therefore much more difficult to learn than those of later transition metals, which are intrinsically smoother73. Furthermore, the potential considered here was not fine-tuned, nor were the hyperparameters (like a and b, the cutoff radius, etc.) optimized, as it was designed to assess the performance of the UQ procedure, not to serve as a production-optimal model.
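A sketch of the reweighting and normalization step, assuming for illustration a simple exponential down-weighting of high-energy and high-force data; the precise functional form used in this work may differ, and the data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = 2.0, 50.0  # energy (eV) and force (eV/Å) scales quoted in the text

# Synthetic training data: per-configuration energies above the minimum and
# force-component magnitudes (array sizes match the training set above).
E = rng.uniform(0.0, 10.0, size=7000)
F = rng.uniform(0.0, 200.0, size=122853)

# Assumed exponential down-weighting favoring near-equilibrium data
# (an illustrative choice, not the exact form used here).
wE = np.exp(-E / a)
wF = np.exp(-F / b)

# Normalize so the energy and force weight sums over the training set agree.
wE /= wE.sum()
wF /= wF.sum()
```

The normalization ensures neither data channel dominates the weighted least-squares loss simply by its count.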
UQ ensemble
The weighted least-squares solution will be referred to as the MLE solution. In a first stage, a loss-minimizing POPS ensemble \({\pi }_{E}^{* }\) containing 129,853 models was generated according to the procedure described in “POPS-hypercube ansatz for linear models”. The distribution of selected regression coefficients in \({\pi }_{E}^{* }\), reported in Fig. 14, shows strongly non-Gaussian behavior and the occasional presence of very fat, asymmetric tails (e.g., for Feature 60). Furthermore, as shown in Fig. 15, the coefficients over the ensemble are correlated with each other following a complex pattern that reflects the physical definition of the features, the product structure of Eq. (8) (which can be expected to introduce correlations between regression coefficients), and their relative importance in the regression task.
Unless otherwise noted, an ensemble \({\pi }_{{\mathcal{H}}}^{* }\) of 500 models was then uniformly resampled from the hypercube bounding \({\pi }_{E}^{* }\), a procedure previously shown to provide very good statistical error estimates at a small computational cost. The UQ ensemble results reported below are generated from \({\pi }_{{\mathcal{H}}}^{* }\).
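The resampling step itself is trivial to implement; a sketch with a random stand-in for the POPS ensemble (the real \({\pi }_{E}^{* }\) has 129,853 members):

```python
import numpy as np

rng = np.random.default_rng(6)
P = 1596  # parameter count of the qSNAP model

# Stand-in for the POPS-constrained minimizers in π_E* (random for brevity).
pops = rng.normal(size=(1000, P))

# Bounding hypercube: per-parameter min/max over the POPS ensemble, then an
# ensemble of 500 models drawn uniformly inside it, at O(P) cost per sample.
lo, hi = pops.min(axis=0), pops.max(axis=0)
uq_ensemble = rng.uniform(lo, hi, size=(500, P))
```

Each row of `uq_ensemble` is a full set of potential parameters that can be evaluated like any other fitted model.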
Data availability
The training data and fitting scripts used as part of this study are available at https://doi.org/10.5281/zenodo.15676956.
References
Behler, J. Perspective: Machine learning potentials for atomistic simulations. J. Chem. Phys. 145, 170901 (2016).
Deringer, V. L., Caro, M. A. & Csányi, G. Machine learning interatomic potentials as emerging tools for materials science. Adv. Mater. 31, 1902765 (2019).
Mishin, Y. Machine-learning interatomic potentials for materials science. Acta Mater. 214, 116980 (2021).
Goryaeva, A. M. et al. Efficient and transferable machine learning potentials for the simulation of crystal defects in bcc Fe and W. Phys. Rev. Mater. 5, 103803 (2021).
Lysogorskiy, Y. et al. Performant implementation of the atomic cluster expansion (PACE) and application to copper and silicon. npj Comput. Mater. 7, 1–12 (2021).
Mortazavi, B., Zhuang, X., Rabczuk, T. & Shapeev, A. V. Atomistic modeling of the mechanical properties: the rise of machine learning interatomic potentials. Mater. Horiz. 10, 1956–1968 (2023).
Alexander, F. et al. Exascale applications: skin in the game. Philos. Trans. R. Soc. A 378, 20190056 (2020).
Gavini, V. et al. Roadmap on electronic structure codes in the exascale era. Model. Simul. Mater. Sci. Eng. 31, 063301 (2023).
Xie, S. R., Rupp, M. & Hennig, R. G. Ultra-fast interpretable machine-learning potentials. npj Comput. Mater. 9, 162 (2023).
Fu, X. et al. Forces are not enough: Benchmark and critical evaluation for machine learning force fields with molecular simulations. arXiv preprint arXiv:2210.07237 (2023).
Stocker, S., Gasteiger, J., Becker, F., Günnemann, S. & Margraf, J. T. How robust are modern graph neural network potentials in long and hot molecular dynamics simulations? Mach. Learn.: Sci. Technol. 3, 045010 (2022).
Montes de Oca Zapiain, D. et al. Training data selection for accuracy and transferability of interatomic potentials. npj Comput. Mater. 8, https://doi.org/10.1038/s41524-022-00872-x (2022).
Subramanyam, A. & Perez, D. Information-entropy-driven generation of material-agnostic datasets for machine-learning interatomic potentials. npj Comput. Mater. 11, 218 (2025).
Hegde, A., Weiss, E., Windl, W., Najm, H. N. & Safta, C. A Bayesian calibration framework with embedded model error for model diagnostics. Int. J. Uncertainty Quantif. 14, 37−70 (2024).
Podryabinkin, E. V. & Shapeev, A. V. Active learning of linearly parametrized interatomic potentials. Comput. Mater. Sci. 140, 171–180 (2017).
Zaverkin, V. et al. Uncertainty-biased molecular dynamics for learning uniformly accurate interatomic potentials. npj Comput. Mater. 10, 83 (2024).
Smith, J. S. et al. Automated discovery of a robust interatomic potential for aluminum. Nat. Commun. 12, https://doi.org/10.1038/s41467-021-21376-0 (2021).
Kulichenko, M. et al. Uncertainty-driven dynamics for active learning of interatomic potentials. Nat. Comput. Sci. 3, 230–239 (2023).
Kurniawan, Y. et al. Bayesian, frequentist, and information geometric approaches to parametric uncertainty quantification of classical empirical interatomic potentials. J. Chem. Phys 156, 214103 (2022).
Hegde, A., Weiss, E., Windl, W., Najm, H. & Safta, C. Bayesian calibration of interatomic potentials for binary alloys. Comput. Mater. Sci. 214, 111660 (2022).
Longbottom, S. & Brommer, P. Uncertainty quantification for classical effective potentials: an extension to potfit. Model. Simul. Mater. Sci. Eng. 27, 044001 (2019).
Zhu, A., Batzner, S., Musaelian, A. & Kozinsky, B. Fast uncertainty estimates in deep learning interatomic potentials.J. Chem. Phys. 158, 164111 (2023).
Tan, A. R., Urata, S., Goldman, S., Dietschreit, J. C. & Gómez-Bombarelli, R. Single-model uncertainty quantification in neural network potentials does not consistently outperform model ensembles. npj Comput. Mater. 9, 225 (2023).
Hu, Y., Musielewicz, J., Ulissi, Z. W. & Medford, A. J. Robust and scalable uncertainty estimation with conformal prediction for machine-learned interatomic potentials. Mach. Learn.: Sci. Technol. 3, 045028 (2022).
Busk, J., Schmidt, M. N., Winther, O., Vegge, T. & Jørgensen, P. B. Graph neural network interatomic potential ensembles with calibrated aleatoric and epistemic uncertainty on energy and forces. Phys. Chem. Chem. Phys. 25, 25828–25837 (2023).
Bartók, A. P. et al. Improved uncertainty quantification for Gaussian process regression based interatomic potentials. arXiv preprint arXiv:2206.08744 (2022).
Best, I. R., Sullivan, T. J. & Kermode, J. R. Uncertainty quantification in atomistic simulations of silicon using interatomic potentials. J. Chem. Phys. 161, 064112 (2024).
Lahlou, S. et al. Deup: Direct epistemic uncertainty prediction. arXiv preprint arXiv:2102.08501 (2021).
Psaros, A. F., Meng, X., Zou, Z., Guo, L. & Karniadakis, G. E. Uncertainty quantification in scientific machine learning: Methods, metrics, and comparisons. J. Computational Phys. 477, 111902 (2023).
Masegosa, A. Learning under model misspecification: Applications to variational and ensemble methods. Adv. Neural Inf. Process. Syst. 33, 5479–5491 (2020).
Swinburne, T. & Perez, D. Parameter uncertainties for imperfect surrogate models in the low-noise regime. Mach. Learn.: Sci. Technol 6, 015008 (2025).
Imbalzano, G. et al. Uncertainty estimation for molecular dynamics and sampling. J. Chem. Phys. 154, 074102 (2021).
Maliyov, I., Grigorev, P. & Swinburne, T. D. Exploring parameter dependence of atomic minima with implicit differentiation. npj Comput Mater 11, 22 (2025).
Kato, Y., Tax, D. M. & Loog, M. A view on model misspecification in uncertainty quantification. In Artificial Intelligence and Machine Learning (eds Calders, T., Vens, C., Lijffijt, J. & Goethals, B.). (BNAIC/Benelearn, 2022).
Alquier, P. User-friendly introduction to PAC-Bayes bounds. Found Trends Mach Learn 17, 174–303 (2024).
Li, Z., Kermode, J. R. & De Vita, A. Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces. Phys. Rev. Lett. 114, 096405 (2015).
Bartók, A. P., Payne, M. C., Kondor, R. & Csányi, G. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett. 104, 136403 (2010).
Vandermause, J. et al. On-the-fly active learning of interpretable bayesian force fields for atomistic rare events. npj Comput. Mater. 6, 20 (2020).
Lu, S., Ghiringhelli, L. M., Carbogno, C., Wang, J. & Scheffler, M. On the uncertainty estimates of equivariant-neural-network-ensembles interatomic potentials. https://arxiv.org/abs/2309.00195 (2023).
Blondel, M. et al. Efficient and modular implicit differentiation. Adv. neural Inf. Process. Syst. 35, 5230–5242 (2022).
Seung, H. S., Opper, M. & Sompolinsky, H. Query by committee. In Proc. Annual Workshop on Computational Learning Theory, 287–294 (1992).
Artrith, N. & Behler, J. High-dimensional neural network potentials for metal surfaces: A prototype study for copper. Phys. Rev. B—Condens. Matter Mater. Phys. 85, 045439 (2012).
Zaverkin, V. & Kästner, J. Exploration of transferable and uniformly accurate neural network interatomic potentials using optimal experimental design. Mach. Learn.: Sci. Technol. 2, 035009 (2021).
Schran, C., Brezina, K. & Marsalek, O. Committee neural network potentials control generalization errors and enable active learning. J. Chem. Phys. 153, 104105 (2020).
Smith, J. S., Nebgen, B., Lubbers, N., Isayev, O. & Roitberg, A. E. Less is more: Sampling chemical space with active learning. J. Chem. Phys. 148, 241733 (2018).
Kellner, M. & Ceriotti, M. Uncertainty quantification by direct propagation of shallow ensembles. Mach. Learn.: Sci. Technol. 5, 035006 (2024).
Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inf. Process Syst 30, 6405–6464 (2017).
Tripathy, R. K. & Bilionis, I. Deep uq: Learning deep neural network surrogate models for high dimensional uncertainty quantification. J. Comput. Phys. 375, 565–588 (2018).
Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).
Kahle, L. & Zipoli, F. Quality of uncertainty estimates from neural network potential ensembles. Phys. Rev. E 105, 015311 (2022).
Bishop, C. M. & Tipping, M. E. Bayesian regression and classification. In Advances in learning theory: methods, models and applications, 267–285 (IOS Press, 2003).
Frederiksen, S. L., Jacobsen, K. W., Brown, K. S. & Sethna, J. P. Bayesian ensemble approach to error estimation of interatomic potentials. Phys. Rev. Lett. 93, 165501 (2004).
Williams, L., Sargsyan, K., Rohskopf, A. & Najm, H. N. Active learning for snap interatomic potentials via bayesian predictive uncertainty. Computational Mater. Sci. 242, 113074 (2024).
Goan, E. & Fookes, C. Bayesian neural networks: An introduction and survey. Case Studies in Applied Bayesian Data Science: CIRM Jean-Morlet Chair, Fall 2018, 45–87 (2020).
Bayarri, M. J. et al. A framework for validation of computer models. Technometrics 49, 138–154 (2007).
Sargsyan, K., Huan, X. & Najm, H. N. Embedded model error representation for bayesian model calibration. Int. J. Uncertain. Quantif. 9, 365–394 (2019).
Morningstar, W. R., Alemi, A. & Dillon, J. V. Pacm-bayes: Narrowing the empirical risk gap in the misspecified bayesian regime. In International Conference on Artificial Intelligence and Statistics, 8270–8298 (PMLR, 2022).
Kleijn, B. J. K. & van der Vaart, A. W. Misspecification in infinite-dimensional Bayesian statistics. Ann. Stat. 34, 837–877 (2006).
Kleijn, B. & van der Vaart, A. The Bernstein-Von-Mises theorem under misspecification. Electron. J. Stat. 6, 354–381 (2012).
Batatia, I. et al. A foundation model for atomistic materials chemistry. arXiv preprint arXiv:2401.00096 (2023).
Deng, B. et al. Chgnet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nat. Mach. Intell. 5, 1031–1041 (2023).
Ge, J., Tang, S., Fan, J., Ma, C. & Jin, C. Maximum likelihood estimation is all you need for well-specified covariate shift. arXiv preprint arXiv:2311.15961 (2023).
Amortila, P., Cao, T. & Krishnamurthy, A. Mitigating covariate shift in misspecified regression with applications to reinforcement learning. In The Thirty Seventh Annual Conference on Learning Theory, 130–160 (PMLR, 2024).
Batatia, I., Kovács, D. P., Simm, G. N., Ortner, C. & Csányi, G. Mace: Higher order equivariant message passing neural networks for fast and accurate force fields. Adv. Neural Inf. Process. Syst. 35, 11423−11436 (2022).
Bochkarev, A., Lysogorskiy, Y. & Drautz, R. Graph atomic cluster expansion for semilocal interactions beyond equivariant message passing. Phys. Rev. X 14, 021036 (2024).
Schmidt, J. et al. Machine-learning-assisted determination of the global zero-temperature phase diagram of materials. Adv. Mater. 35, 2210788 (2023).
Drautz, R. Atomic cluster expansion for accurate and transferable interatomic potentials. Phys. Rev. B. 99, 014104 (2019).
Thompson, A. P., Swiler, L. P., Trott, C. R., Foiles, S. M. & Tucker, G. J. Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials. J. Comput. Phys. 285, 316–330 (2015).
Wood, M. A. & Thompson, A. P. Extending the accuracy of the snap interatomic potential form. J. Chem. Phys. 148, 241721 (2018).
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Germain, P., Bach, F., Lacoste, A. & Lacoste-Julien, S. Pac-bayesian theory meets bayesian inference. Adv. Neural Inf. Process. Syst. 29, 1876–1884 (2016).
Hoeffding, W. Probability inequalities for sums of bounded random variables. Collect. Works Wassily Hoeffding 409–426 (1994).
Owen, C. J. et al. Complexity of many-body interactions in transition metals via machine-learned force fields from the tm23 data set. npj Comput. Mater. 10, 92 (2024).
Acknowledgements
We gratefully acknowledge useful discussions with Dr. Peter Hatton and the hospitality of the Institute for Pure and Applied Mathematics (IPAM) at UCLA and of the Institute for Mathematical and Statistical Innovation (IMSI) at the University of Chicago during the conception of this work. DP was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20220063DR. APAS acknowledges the support from the US Department of Energy through the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and through the G. T. Seaborg Institute under project number 20240478CR-GTS. TDS gratefully acknowledges support from ANR grants ANR-19-CE46-0006-1, ANR-23-CE46-0006-1, IDRIS allocation A0120913455, and, with IM, an Emergence@INP grant from the CNRS. Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001).
Author information
Contributions
D.P. and T.D.S. conceived the study. A.P.A.S. developed and implemented the MLIAP test suite. D.P. carried out the MLIAP training and the testing of the UQ method. I.M. and T.D.S. carried out the demonstration of UQ via implicit expansion. All authors contributed to the analysis of the results and to the writing of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Perez, D., Subramanyam, A.P.A., Maliyov, I. et al. Uncertainty quantification for misspecified machine learned interatomic potentials. npj Comput Mater 11, 263 (2025). https://doi.org/10.1038/s41524-025-01758-4