Reliable uncertainty estimates in deep learning with efficient Metropolis-Hastings algorithms

Schmal, Matthias; Mäder, Patrick

doi:10.1038/s41467-026-70015-z

Download PDF

Article
Open access
Published: 17 March 2026

Reliable uncertainty estimates in deep learning with efficient Metropolis-Hastings algorithms

Nature Communications volume 17, Article number: 2531 (2026) Cite this article

1729 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

Approaching problems with data-driven models often requires reliable uncertainty estimates. Bayesian neural networks can offer these for deep learning models. Without the knowledge to set informative prior distributions, sampling methods such as Hamiltonian Monte Carlo are a robust choice. However, these come with prohibitive computational costs. We study two ways to incorporate computationally light-weight Metropolis-Hastings acceptance steps into deep neural networks and stochastic gradient Hamiltonian Monte Carlo. The first method proposes noisy acceptance steps computed on batched training samples rather than the entire set during the simulation of the stochastic dynamics accepting a small minima preserving bias. The second method sacrifices bias-free sampling of Hamiltonian Monte Carlo in favor of stochastic gradient driven trajectories. While the first is analytically plausible, the second is inspired by the Hamiltonian ensemble concept. Prediction accuracy is improved by up to 5.8% over deterministic and by up to 4.3% over Bayesian approaches while still guaranteeing calibration of the predictions. We observe that sampling methods facilitate model predictions with merely a third of the ensemble while maintaining prediction accuracy. In conclusion, the methods combine efficiency and regularization of stochastic gradients, showing strong performance despite the sampling bias.

Replica exchange enhanced adaptively weighted stochastic gradient Langevin dynamics for Bayesian sampling and optimization

Article Open access 15 May 2025

Stochastic Gradient Descent-like relaxation is equivalent to Metropolis dynamics in discrete optimization and inference problems

Article Open access 21 May 2024

Data-driven approach to the deep learning of the dynamics of a non-integrable Hamiltonian system

Article Open access 02 July 2025

Introduction

Deep Learning has become a working horse of modern machine learning. In the presence of large datasets, state-of-the-art results are achieved in image classification, segmentation, reinforcement learning, language processing and many more¹. Throughout these applications the models match or outperform humans on the same tasks. However, deep neural networks lack uncertainty awareness^2,3. For applications in medicine⁴, autonomous driving⁵ and other dependable applications⁶ this is a serious drawback. A solution has the potential to unfold the field of high performing deep learning for tasks, that require information on reliability. In contrast to deep learning¹, Bayesian learning is well known for robust modeling in presence of noise and for uncertainty estimation based on parameter distributions^7,8. It is used in control or traditional statistical modeling and builds upon a well grounded theory⁸. The concept of Bayesian neural networks combines both methods to extract features with deep architectures and still acquire uncertainty estimations on the predictions. Even with simple uninformed prior distributions this enables complicated parameter posteriors⁹ in contrast to simple post-processing³ or Bayesian classifiers on top of a deep neural network^6,10,11,12. The key to Bayesian neural networks remains the inference. Since the posterior distribution is analytically intractable, an approximation is required to estimate the posterior probability density and increased attention is required for big data applications due to practical computational limits^2,6,13. The approximation is typically realized by either variational inference or Markov Chain Monte Carlo sampling. Variational inference uses a fixed parametric posterior that is optimized to minimize the Kullback-Leibler divergence to the true parameter distribution^2,14,15. This approach was first published as a Bayesian neural network and uses variational formulations of backpropagation. The main limitation remains the fixed parameter posterior, which limits the diversity in the Bayesian evaluation process. Markov Chain Monte Carlo methods¹⁶ sample from the parameter posterior, such that the collected samples allow to solve the predictive distribution with Monte Carlo integration using

$$p(y^{*}|x^{*}, S) = \int p(y^{*}|x^{*}, {{\boldsymbol{\theta}}}) \, p({{\boldsymbol{\theta}}} | S)\, {{\rm{d}}}{{\boldsymbol{\theta}}}.$$

(1)

Here, the predictive distribution on a new data point (y^*, x^*) is computed by marginalizing over the model parameters θ that are inferred based on the dataset $S={({y}_{j},{x}_{j})}_{J}$. While deep ensembles¹⁷ can be considered a primitive collection of optima approximation to p(θ∣S), Bayesian sampling methods converge towards the true posterior. While early Markov Chain Monte Carlo methods are very inefficient on high-dimensional problems, Hamiltonian Monte Carlo overcame these problems by utilization of a momentum variable. Its Stan package implementation is the current state-of-the-art^18,19,20. The problem for deep neural networks is the required gradient of the potential energy term

$${{{\rm{U}}}}({{{\boldsymbol{\theta }}}})=-{\sum}_{J}\log p({y}_{j}| {x}_{j},{{{\boldsymbol{\theta }}}})-\log p({{{\boldsymbol{\theta }}}}).$$

(2)

Every step requires gradients on the whole dataset S²¹. The natural solution is to turn to stochastic gradient driven dynamical systems that allow the inclusion of step-wise additional noise terms. These simulate first or second order Langevin-dynamics, which have the posterior as their stationary distribution. Typically, the dynamics

$$\left\{\begin{array}{l}{{{\rm{d}}}}{{{\boldsymbol{\theta }}}}={M}^{-1}{{{\bf{m}}}}\,{{{\rm{d}}}}t \hfill \\ {{{\rm{d}}}}{{{\bf{m}}}}=-\gamma {{{\bf{m}}}}\,{{{\rm{d}}}}t-{\nabla }_{{{{\boldsymbol{\theta }}}}}\tilde{{{{\rm{U}}}}}({{{\boldsymbol{\theta }}}})\,{{{\rm{d}}}}t+{M}^{1/2}\sqrt{2\gamma T}\,{{{\rm{d}}}}W\end{array}\right.$$

(3)

with $M\in {{\mathbb{R}}}^{d\times d}$ invertible and W a standard Wiener process with variance I are computed with

$${\nabla }_{{{{\boldsymbol{\theta }}}}}\tilde{{{{\rm{U}}}}}({{{\boldsymbol{\theta }}}})=-\frac{| S| }{| B| }{\sum}_{j\in B}{\nabla }_{{{{\boldsymbol{\theta }}}}}\log p({y}_{j}| {x}_{j},{{{\boldsymbol{\theta }}}})-{\nabla }_{{{{\boldsymbol{\theta }}}}}\log p({{{\boldsymbol{\theta }}}}).$$

(4)

This state of the dynamical system (3) consists of the model parameters and a momentum state m. We orientate our notation on^22,23,24 that covers typical stochastic gradient Langevin dynamics²⁵ and stochastic gradient Hamiltonian Monte Carlo²⁶. With T = 0, equation (3) becomes the time continuous version of stochastic gradient descent with momentum²³. This similarity allows to reuse prior knowledge on damping γ or the discrete step-size ϵ from classical deep learning trainings. The advantages of Hamiltonian Monte Carlo mainly depend on its Metropolis-Hastings acceptance step²⁷. This eliminates a bias caused by using the proposal points computed with the dynamical system as is. To use the advantages of acceptance steps in stochastic gradient sampling, we consider different implementations of the discrete dynamics of (3). Therefore, three methods are chosen and Table 1 gives an overview. For other discretizations we refer to Leimkuhler et al.²⁸ for a full overview. The algorithms differ in symmetry^24,29,30 as it is required for acceptance steps. The only method to apply acceptance steps in the stochastic gradient setting was AMAGOLD proposed in ref. ²⁹. This paper approaches to overcome the convergence problems on deep learning tasks with a new Metropolis-Hastings acceptance proposal. These acceptance steps stabilize stochastic dynamics and as they are essential for the success of Hamiltonian Monte Carlo we evaluate their success in Bayesian neural network inference. Acceptance steps are used in two methods that allow for efficient Metropolis-Hastings acceptance steps in stochastic gradient driven sampling of neural networks to make use of the robust and knowledge independent sampling qualities known from Hamiltonian Monte Carlo. Our key findings:

We propose generalized SGHMC, a new scalable Metropolis-Hastings corrected sampling algorithm that accepts simulation steps based on batch-evaluations^21,31. This method allows to run efficient step-wise acceptance while introducing an extrema preserving bias.
We propose Hamiltonian Trajectory Ensemble (HTE), an ensemble method based on stochastic gradient trajectories that does not converge to the true posterior, but uses acceptance steps in an ensemble setting.
We grid search hyperparameters for our algorihtms as well as all competitor SGHMC methods demonstrating the critical effect of the step size ϵ in any SGHMC simulation.
We demonstrate, that the size of an Bayesian ensemble is not decisive for its accuracy, but its calibration. Therefore, we run a greedy sub-sampling experiment tried to eliminate samples from the Bayesian ensembles.

Table 1 Overview on discretizations of Eq. (3)

Full size table

In essence, our methodology provides the full pipeline from fine-tuning the sampler to improving efficiency of test inference, demonstrating the potential of Bayesian ensembles in deep learning. Prediction accuracy and calibration scores are pushed beyond the limits of deterministic and variational Bayesian methods without the costs of classical Hamiltonian Monte Carlo.

Results

The results section is put ahead of the details on methods 4. While our evaluation demonstrates the behavior of stochastic gradient sampling methods on deep learning problems as well as the performance of the proposed methods, it also carefully revises existing methods and showing their comparison. All methods are chosen based on compatibility with deep learning. Further discussion follows in section “ Previous acceptance step approaches”. We start with an artificial sampling experiment from a multivariate normal, comparing the two methods proposed in this paper with the two direct competitors.

Synthetic benchmark

To create a neural network related benchmark for stochastic gradient sampling algorithms, we sample multivariate Gaussian distribution with diagonal covariance matrix. The algorithms run on the exact gradients of the negative log-likelihood disturbed by standard normally distributed noise n ~ N(0, 1). For the set of Gaussian distributions we sample the means from an exponential distribution Exp(0.1) and the variances from a uniform distribution Unif[0.1, 1]. This is close to the expected mean-field model in many variational approaches on Bayesian neural networks⁶. Similar to the following experiments on image datasets, the number of samples was limited to 1000 and the tuning was done by structural variations of the step size. The results are shown in Fig. 1. We observe increasing Kullback-Leibler divergence between the true and estimated distribution, with increasing number of dimensions. While AMAGOLD outperforms all competitors at dimension 10, the other acceptance methods start to outperform AMAGOLD and SGHMC from dimension 100 on. From dimension 1000 the Kullback-Leibler divergence stays constant for the proposed algorithms.

Exploration of trajectory parameters

The experiments on image datasets focus on two different neural network architectures, which cover aspects of modern deep learning architectures, but remain reasonable small to allow ablation study experiments. A total number of 50,000 epochs per model for the first experiment only made this necessary. We divide the ablation experiments in phases showing the considered parameter dependencies. An overview is given in Table 2. The first model is a LeNet³² architecture. It represents purely feed-forward architectures based on a feature extractor part with convolutional layers and a following dense classifier. Although larger architectures from this group of models can outperform the small models, its size allows us to compute the large-scale experiments in a reasonable amount of time. Since the effect of hyperparameters in deep learning tasks can be crutial for performance, we present this study for stochastic gradient Hamiltonian Monte Carlo. The costs made us choose the second model too, which represents models with residual skip connections. We use a WideResNet-10-2, which we simply refer to as ResNet throughout this section,³³. While the LeNet model has about 270,000 parameters, the ResNet has about 300,000. We train the LeNet model on the balanced EMNIST dataset³⁴ including 47 classes of MNIST-like images of different characters. The ResNet is trained on the very popular CIFAR-10 dataset³⁵. As in Izmailov et al.³⁶, the batch normalization layers are replaced with feature response normalization to keep the samples independent of each other³⁷. In both cases, the default train-test split is 95%/5% training/validation data leaving 107,160 train images for EMNIST and 47,500 for CIFAR-10.

Table 2 Phases of the hyperparameter study

Full size table

In absence of the true predictive distributions, uncertainty quantification is usually compared on performance and calibration metrics, which allow the comparison of the predicted mean and therefore with deterministic neural networks as well. We choose accuracy metric for the performance and expected calibration error (ECE) for calibration assessment as the most common ones for balanced classification tasks. The metrics are all computed based on test sets using the datasets default train/test split (CIFAR-10: 10,000; EMNIST: 18,800). Section “Bayesian neural network comparison” contains further comparison with other Bayesian inference algorithms applied to these neural networks.

With the similarity of the Euler-Maruyama discretization (cp. Eq. (3)) to stochastic gradient descent (SGD), we apply prior knowledge on momentum decay α, which is usually chosen from [0.9, 1.0). We restrict the choice of ϵ and γ to satisfy the constraint

$$(1-\epsilon \,\gamma )=\alpha .$$

(5)

For the equations in Table 1 only one free parameter remains: The step size ϵ. For very small step sizes we propose a cosine schedule as it was referenced in Table 2. Following the constraint, the step size schedule keeps the momentum variable constant and adapts both step size ϵ and damping parameter γ accordingly to increase ϵ. This method speeds up the convergence of burn-in, before the sampling phase starts.

While we concentrate on the algorithmic parameters from phases (2) on, we start with the weight decay parameter and get an overview on loss training curves, when we apply the Langevin dynamics (cp. Eq. (3)). The weight decay regularization appears similar to the regularization through the prior in the Gaussian case differing only in a scaling factor of the training set size. In the context of deep learning we will use the term weight decay from here on. In Fig. 2 we show an subset of the generated trajectories. For different step sizes, we observe the following influence of varying weight decay: Up to weight decays 2.0 and 0.02 for the LeNet and ResNet respectively, the trajectories are very alike. Above this threshold the loss remains considerably larger and we observe occasional instabilities as in the upper left corner (cp. Fig. 2). We choose the weight decay to be at least above the stability threshold to be able to converge to useful minima for the given classification problem. Since weight decay gives considerable performance gains on deterministic methods, we argue to not choose it too small either. In the neural network architectures analyzed in this paper, the instabilities observed with weight decays did not occur in the deterministic case, what sheds light on the importance choosing this parameter for the sampling process. During our experiments we shared the results of ref. ³⁶ that cold posteriors²³ are not necessary to get competitive results. Considering T as a parameter, we fix it to T = 1 throughout the following experiments (cp. Eq. (3)).

**Fig. 2: Hyperparameter exploration, Phase (1).**

For the following simulations we run 5 parallel sampling chains with a burn-in of 1000 epochs and collect 1000 models during sampling phase after 10 epochs each, starting at a random network initialization.

Figure 3 shows the results of phases (2) and (3). The figure shows the performance measures over the step size, while the momentum variable is fixed to 0.9. The red and blue lines show the results for baseline SGHMC and SGHMC including a curvature estimate similar to RMSprop³⁸. The two models are separated by the different line-types. We set the range of step sizes from small step sizes that show progress during burn-in until large step sizes, which caused divergence of the trajectory. Along the baseline methods, we compare both methods with our constant momentum learning rate schedule, observing top calibration scores in these simulations. For the accuracy metric, the baseline methods scores slightly better on the LeNet problem (0.1%) and a lot better on the ResNet experiment (2.8%). Most importantly, the plot shows the behavior of accuracy scores with increasing step size. In multiple cases, the top performance is achieved with the largest step size close to the stability threshold. In all cases, the step above the last shown point to increasing step size caused the dynamics to diverge, what we represented with nan. In most cases, the top performance in accuracy was also the best calibrated Bayesian ensemble.

**Fig. 3: Hyperparameter exploration, Phase (2/3).**

The results of Phases (4/5) have been moved to Appendix A. We ran the top experiment from the step size variations (cp. Fig. 3) checking whether it remains the top simulation in both curvature estimation scenarios. First, we changed the momentum value to 0.99 testing its influence on the optimized sampling. Second, in the last column the batch size is changed from 80 to 400. We observe small improvements towards the larger batch size, but the changes are negligible in comparison to the effect of changing the step size of the simulation. Still, for the following simulations, we keep the larger batch size.

The experiments so far demonstrate parameter dependencies in stochastic Langevin-dynamics applied to typical deep learning tasks. The following experiments include different integrators, comparing all main methods applicable on deep learning tasks. This includes the comparison of acceptance steps methods for more accurate Markov Chain Monte Carlo trajectories.

Ablation study

To understand the effects of the different components of the algorithms used throughout this paper, this section summarizes results on SGHMC chains with different integrators as well as different acceptance step algorithms. We observe multiple interesting results shown in Table 3. First, we observe the AMAGOLD algorithm to perform poorly regarding accuracy. The need to get an acceptance rate high enough to finish the simulation occasionally limits the possible step-size range. While we also tried to speed up the burn-in using our initial learning rate schedule, we report on the best results we achieved. Second, our proposed method for OBABO acceptance performs similar to its non-acceptance counterpart. While it considerably improves over AMAGOLD acceptance simulations, the question arises how much influence the unbiased trajectory has on the ensemble performance. The following section evaluates the difference by repeating simulations with different random seeds, but keeping the simulation parameters fixed. The effects of the stochastic proposal algorithm using the Hamiltonian acceptance step, results in performance boosts on EMNIST and CIFAR-10 classification. For the LeNet on EMNIST it reaches a plus of 0.7%, for the ResNet it reaches 5.2% over the Euler ensemble. At the same time, the calibration measures decreases in comparison to the best compatitors, but remains in the range of the Euler simulations.

Table 3 Phase (6): Studying different integrators and acceptance steps

Full size table

As we already discussed, the acceptance rate was one crutial factor to influence the performance metrics reported here. While there is often an optimal acceptance rate to balance exploration and exploitation, we faced some issues deciding on hyperparameters with the acceptance rate. Mostly, because the changes in step-size, that were required to change the acceptance rate where much larger than the changes we observed to change the accuracy by multiple percentage points. We decided to leave the acceptance rate aside for the tuning phase, although it was still important to finish simulations. The typical dynamics of the acceptance rate was starting at almost 100% at the beginning of burn-in and settling at a stable value for the rest of the simulation. We ended up with acceptance rates of about 50% for both our algorithms with the most promising hyperparameter configuration regarding the evaluation metrics. While this important quality of the sampling algorithms is mainly left aside for the proposed evaluation, we extend the discussion in Appendix C.

Bayesian neural network comparison

This section compares the results on sampling algorithms in this paper to different Bayesian neural network approaches. In particular, we compare against regular deep ensembles and optimization based variational inference approaches as well as deterministic trainings. We include two different methods for variational inference on Bayesian neural networks: The Bayes by backpropagation approach^2,39 extended by the flipout method⁴⁰ and the variational online Gauss Newton (VOGN) method proposed in ref. ⁴¹ and proven to be competitive in ref. ⁴². We compare based on accuracy and expected calibration error again, but also report maximum calibration error and the number of epochs to account for the computational costs independent of the hardware setup. For an improved overview on the field MC-dropout, stochastic weight averaged Gaussian (SWAG) and Cyclical learning rate schedulers are included^10,12,43. In this section, we repeat the simulations 5 times with different random seeds to report on statistical variation, which was not possible for all ablation experiments.

In addition to the hyperparameter tuning study, we add a regression experiment of timeseries forcasting for this final comparison. We experiment with the chaotic Lorenz’96 differential equation system, which is a common benchmark for timeseries forcasting problems. The system is simulated with 5 states, of which 50 steps are used for intializing the model and 50 are predicted for computing the loss. The evaluation predicts 150 timesteps ahead, on which we compare maximum absolute error (MAE), mean squared error (MSE) and the R² score. The model is a one layer LSTM with ReLU activation with a state-space of 100. Although this problem was not included in the hyperparameter study, we observe similar results, what demonstrates the generality of the results. Comparing the results in Table 4, we see the top results are achieved with the Bayesian sampling algorithms. Still, we observe differences between the problems: While the ensemble created from different model initializations does consistently outperform the deterministic baseline, it sets the strongest baseline in the timeseries forcasting case. The HTE algorithm does only match this baseline, while the other Bayesian inference algorithms could not reach this performance. Also, the two classification problems appear to be quite different. While the EMNIST classification benefits from all sampling algorithms equally, the CIFAR-10 classification mainly benefits from the HTE mechanism. We approach an explanation in the discussion. The results confirm a performance increase of Markov Chain Monte Carlo algorithms over variational inference based methods on the considered problems including the Bayesian neural networks. The costs for this performance remains the increased computational burden in comparison to variational inference methods. Although the proposed sampling methods considerably reduce the costs in comparison to Hamiltonian Monte Carlo experiments as in ref. ³⁶, the number of epochs against the deterministic training is increased considerably, as we show in the column reporting the number of epochs these algorithms ran on average. These costs need to be balanced against the performance and calibration increase from Table 4.

Table 4 Comparison of Bayesian neural network approaches showing the mean and standard deviation of 5 seeded repetitions

Full size table

We also report on out of distribution detection on two additional datasets. For the EMNIST classification we compute the predictive entropy for the train dataset of Fashion MNIST⁴⁴. For the CIFAR-10 classification we use the test split of the Street View House Numbers⁴⁵ dataset. The results are visualized in Fig. 4 by showing the sorted entropy scores over the out of distribution dataset. The shaded area shows the standard deviation over the five random seed repetitions from Table 4. The optimal curve would be returning the maximum entropy for all datapoints, so the higher the curve, the better the out-of-distribution detection. We observe HTE to outperform its competitors, although not being comparably calibrated. The GSGHMC curve is at least partly above the SGHMC baseline, but does not meet the HTE line.

**Fig. 4: Entropy-based out of distribution detection.**

Ensemble size reduction

The ensembles of 1000 members make the evaluation and prediction process considerably more costly than running the inference of a single model. That raises the question how many models need to be kept to maintain the performance achieved with the large ensemble, but reducing the cost of evaluation. Therefore, we run a simple greedy selection on the ensembles, keeping only samples that appear to be essential for a given metric not to be impaired. We visualize the results in Fig. 5. The number of remaining ensemble members is displayed on the y-axis after the reduction of ensembles from the prior section. The variance is computed both over the different random seeds and 5 greedy repetitions with different ordering each. We can see that the Accuracy and Calibration have quite different requirements with regards to the ensemble size. While preserving the accuracy requires less than 50% of the ensemble, the calibration requires almost all of it. While there are methods to compress Bayesian ensembles in a principled manner⁴⁶, we wanted to demonstrate, that the Bayesian sampling is not necessarily the last step before having a model ready for production. The advances in accuracy are less costly to preserve than the diversity needed for calibration. That applying a reduction technique can still be favorable is further discussed in Appendix F.

Discussion

In this study, we demonstrate the effective application of Metropolis-Hastings acceptance steps in stochastic gradient Hamiltonian Monte Carlo for image classification using deep learning models. The proposed methods achieve stable sampling chains, enabling the construction of ensembles that exhibit superior prediction performance compared to deterministic and variational approaches.

Unlike prior work on Metropolis-Hastings in large dataset settings^{43,47,48,49,50,51}, our methods support an exploratory parameterization of the sampling chain with an fixed computational budget. Notably, on both datasets, this parameterization leads to substantial accuracy improvements, attributed to the critical relationship between step size and predictive performance.

Our results systematically compare previously reported sampling algorithms with a comprehensive exploration of parameter settings, ensuring a fair evaluation framework that, to our knowledge, has not been addressed in the existing literature. The findings align with trends observed in Hamiltonian Monte Carlo methods applied to CIFAR-10 but achieve computational feasibility for a broader research community. For instance, Izmailov et al.³⁶ required a 512-GPU cluster, whereas our implementation ran efficiently on a single GPU. While the computational costs of sampling methods increase over deterministic and variational methods, this study demonstrates their potential in non-convex optimization.

Overall, this work compares a range of posterior approximation methods for Bayesian neural networks, establishing a robust baseline that facilitates direct comparisons and advances research in this domain.

Hyperparameter exploration was the initial focus of our experiments. Consistent with findings on deterministic neural networks, the choice of learning rate and regularization parameters plays a critical role in the performance and generalization of trained models^52,53. We extend this observation to the stochastic gradient sampling case, noting a strong dependence of model accuracy on the step size used in trajectory simulations across all experiments. We stress again, that there was no relation between acceptance rates and the actual performance, which we mainly analyzed in terms of accuracy. This is further discussed in Appendix C.

A common hypothesis regarding the quality of minima associated with good generalization properties relates to their extend. Large learning rates often lead optimization algorithms to converge to wide and large minima, which are thought to promote better generalization. We propose that this principle may also apply to sampling algorithms, where larger step sizes might bias the sampling chain toward wide minima, potentially enhancing the generalization of the resulting ensemble. Our experiments aimed on taking this advantage to the Bayesian ensembles, potentially creating additional divergence to an optimal sampling regarding the distribution estimation.

Another approach to explain the importance of exploration with large learning rates can be driven by loss-landscape visualizations⁵⁴. As we show in the appendix the two problems we analyzed represent very different local structure. (cp. Supplementary Fig. D3). While the ResNet has many close minima, on the LeNet problem only one minimum is visible in the chosen range. This could be a reason for the smaller effects with large learning rates on the LeNet model as well as the effects, when using the cosine step-size schedule. The cosine schedule experiments naturally focused on local exploration and falls short on the experiments without the schedule in terms of accuracy (cp. Fig. 3). The results on the timeseries prediction task support the hypothesis, that ensembles on timeseries models also benefit from exploration of different minima, which would be similar to the ResNet case apparant from Table 4.

The results highlighting the critical role of step-size selection reveal a practical challenge: identifying optimal step sizes for sampling chains is computationally expensive. The choice on the step-size can be simplified with analytical observations in the convex-case^55,56,57: For Hamiltonian Monte Carlo (HMC), a common heuristic involves setting the step size to the smallest expected standard deviation²¹ for example. However, the absence of interpretability for parameters in black-box models complicates the application of these results and heuristics. Furthermore, the non-convex nature of deep learning problems raises fundamental questions about the meaning of standard deviation in the posterior distribution. Comparing the results on the sampling chains to deterministic and variational deep learning, we observe a similar role of tuning the step-size. This was not expected, since the sampling should average out the need to find the exact minimum.

For non-convex problems, convergence analyses such as those in ref. ⁵⁸ impose a constraint restricting sampling to the region within a ball. This approach encounters limitations in the context of deep neural networks due to the lack of prior knowledge about parameter range. Advancing research in this area could enable to focus computational effort to regions of the likelihood landscape that are more promising, as opposed to the current state-of-the-art approach, which relies on uninformed exploration. Currently, the exploration of multiple minima was found to improve prediction results¹⁴, but the interaction with the data and model complexity remains unknown. Our convergence analysis on a small benchmark problem showed rapid convergence for medium and large step sizes and multi modal distributions (cp. Appendix G). In this analysis, the proposed GSGHMC and HTE methods show improved mixing over their competitors.

Our ensemble reduction experiment represents an initial step toward addressing this challenge. As anticipated, we observed an over-representation of redundant information within ensembles comprising hundreds of models, enabling significant reductions in ensemble size without substantial performance loss regarding accuracy. In the randomly sampled stochastic gradient Hamiltonian Monte Carlo ensembles, we hypothesize that the extraction of informative features is dominated by the most descriptive and widely relevant features. However, a considerable proportion of noisy features persists, hindering further reductions in ensemble size. The calibration benefit however, is not preserved, when we reduce the ensemble in an uninformed manner. This could be caused by the slightly varying features that cause the diversity contributing to improved calibration, as evidenced by comparisons of the expected calibration error. The behavior of the Hamiltonian trajectory ensembles suggests similarities to pure Hamiltonian Monte Carlo, which converges effectively to minima between momentum resampling steps and the log-likelihood as well as parametrization can positively and negativly influence the behavior of the total ensemble.

We were surprised by the minimal impact of the RMSprop curvature estimate⁵⁹ and variations in integrators observed during our experiments. While the Euler discretization led to a slight reduction in calibration, the OBABO and skew-symmetric integrators did not exhibit a significant independent effect. We suggest, the curvature estimate might be less effective in an optimization driven by gradients and independent noise causing no increase in exploration and barely effecting the influence of the step-size.

In a broader context, the hyperparameter search highlights the necessity of exploring the parameter space and exposes costs and problems during this search. We explicitly refer to step-size regions close to the divergence barrier. If we facilitate exploration, we find improvements in accuracy and calibration for the problems analyzed in this paper. However, the prediction performance depends on the hyperparameters such that the algorithms do not unfold their potential when choosing small step-sizes. These results are achieved with models comparably small for the deep learning community. The behavior on larger models is therefore hard to predict. At first, small models on complex datasets seem to benefit a lot from the diversity in large Bayesian ensembles. On the contrary, large models in relation to the complexity of the dataset have been observed to benefit not quite as much in variational settings³⁹. As each model can only predict based on a fixed weighted selection of visual features, we argue that diversity in these features through multiple models should help to improve the robustness in large models also. With our sampling methods, these models can be part of different minima and therefore include new aspects for classification tasks. What remains an open question is whether there are inference methods, which guarantee the required diversity. While the HTE ensemble considerably improved the accuracy, it could not improve the calibration much. The generalized stochastic gradient Hamiltonian Monte Carlo method on the other hand could not equally improve accuracy. An approach to combine the two methods is not possible as is, but might be topic for future research.

Although this paper extensively searched through relevant parameters, some simplifications have been necessary that could not be included. One was the choice of the prior. Choosing a meaningful prior is difficult in the black-box setting of deep learning. Although different studies pointed out advantageous prior choices using other distributions than the typical Gaussian,^60,61, these have not yet been transferred to variational approaches. Heavy tails on the distribution are standing out as a beneficial choice for the sampling algorithms evaluated. Since the Gaussian assumption allows to connect the prior to an L₂ regularization effect as well as keeps the comparison to variational methods fair, we decided to keep this constraint. Another limitation was our constraint on the relation between friction and step size of the discretization. Although other papers applied similar simplifications²⁶, an increased variability of simulation parameters could show even further improved results or stabilize the dynamics with varying hyperparameters. Using knowledge from deterministic trainings of deep neural networks remains a time and cost efficient way of setting a hyperparameter prior and finally allows to compare the ensembles to deterministic trainings and use them for our understanding of deep learning in total. The extend of these simplification remains an active field of research. Furthermore, the number of samples was chosen arbitrarily to 1000 in most cases. To evaluate the effects of considerably larger and smaller sampling size can hint on the effects of knowing a single minimum very precise versus a reduced training cost. Further applications of acceptance rates might be automatic parameter tuning algorithms such as NUTS¹⁸ or methods to distinguish aleatoric and epistemic uncertainty influences⁶² for active learning tasks. A thorough understanding of the mechanisms is the key to know, when the additional cost of sampling methods can give additional performance benefits.

We presented two efficient sampling algorithms for the inference of Bayesian neural networks. Based on an extensive hyperparameter study, we demonstrated the performance benefits achievable with both methods. The main advantages distribute between the methods. GSGHMC emerged as the top-performing algorithm in terms of calibration among SGHMC-based approaches, HTE achieved the largest accuracy gain. Although HTE does not converge to the true posterior, it allowed to create an algorithm with strong mixing properties focusing on exploration of the unknown posterior. Both algorithms achieved significant performance improvements, with benefits of 4.3% and 6.0% accuracy over state-of-the-art Bayesian sampling methods. These results raise intriguing questions about the interplay between feature variation across local minima and the model properties that contribute to improved calibration. Further research in this direction could deepen our understanding of how to optimize Bayesian neural networks for both accuracy and reliability.

Methods

To introduce the concepts of Metropolis-Hastings acceptance we review the key concepts in the following and introduce the background to the two sampling algorithms, which have been evaluated in this paper. Afterwards, we introduce the generalized stochastic gradient Hamiltonian Monte Carlo algorithm and the Hamiltonian trajectory ensemble.

Errors in Markov Chain Monte Carlo estimation

There are multiple relevant errors that occur during the numeric approximation of the posterior probability distribution with Markov Chain Monte Carlo²⁸. However, a moment matching at time step t does not exactly gather the error the user of a sampling method is interested in. For Markov Chains the long term error is most relevant, to compute correct Monte Carlo expectations. Therefore, the representation of p(θ∣S) through the ensemble is what matters. It consists of an error caused by discretization (cp. Eq. (3)), which is proportional to the step size. Also, there is an error due to the limited number of samples collected during sampling phase. These errors are often referred to as sampling bias and sampling error, respectively. The error or bias caused by the discretization can not be reduced by running the simulation for a longer time. On the other hand, a reduction of the step size reduces the mixing of stored Markov Chain Monte Carlo samples. Similar to a bias-variance trade-off, these two errors balance each other. Simulations with a constant number of steps do either precisely simulate the dynamics or gather more samples from a larger parameter space to represent the posterior accurately.

Previous acceptance step approaches

The sampling bias of Markov Chain Monte Carlo simulation can be reduced by considering the end points of trajectories proposal points z_p, which are accepted or rejected in a Metropolis-Hastings manner²⁷. We introduce z as an auxiliary state, which summarizes the concatenation of θ and m, in case the sampler utilizes a momentum state. The transition probability from initial value z₀ = (θ₀, m₀) to z_p = (θ_p, m_p) is required to fulfill t(z₀ → z_p) = t(z_p → z₀) for an unique stationary distribution, which is called detailed balance condition. The Metropolis-Hastings acceptance probability takes the form

$$r({z}_{0},{z}_{p})=\min \left\{1,\frac{\pi ({z}_{p})}{\pi ({z}_{0})}\frac{t({z}_{p}\to {z}_{0})}{t({z}_{0}\to {z}_{p})}\right\}.$$

(6)

We observe that non reversible Markov Chain Monte Carlo chains directly cause acceptance rates of zero. This applies in particular to Euler-Maruyama²⁹. If the backward transition probability is not zero, the acceptance step guarantees reversibility by design, if the momentum is completely resampled. An advantage of acceptance steps is the stabilizing effect on the dynamics. Since stochastic gradients can prevent convergence for integrators of any order⁶³, the application of acceptance steps allows to run more robust sampling chains. In the following, we present the direct competitor approach of stochastic gradient acceptance steps for sampling and its limitations in the context of large models. Zhang et al.²⁹ propose AMAGOLD, an accumulating strategy for r(z₀, z_p) using the observation that we can accept proposal points based on the product of probability densities from all state transitions between z₀ and z_p. With the log-probability this becomes a sum, which we can compute step-by-step in the minibatch-setting with little extra costs. They use the skew-symmetric integrator from Table 1 and build a skew-reversible chain, which has non-zero acceptance rates (cp. Alg. 2). The algorithm allows to explore the posterior based on stochastic gradients, but requires U(θ) for accepting the proposal state. We run the algorithm on our test cases in all configurations tested for any other method as in Table 3 with the results in Fig. 3. In combination with Bayesian neural networks we observe small acceptance rates below 10%, if the step size is kept in a reasonable range that guarantees the optimization to move forward. To get into acceptance rates of about 30% we had to reduce the trajectory length to 2 epochs and keep the step size at 10⁻⁴, which was with factor 100 considerably smaller than without the AMAGOLD acceptance steps. The results from section “Ablation study” shows the problem of efficient acceptance steps was not overcome for Bayesian neural networks. The current solution even limits prediction accuracy. While AMAGOLD was evaluated on deep neural network cases, other competitors could not be evaluated on Bayesian deep learning problems in the literature. Efficient Metropolis-Hastings acceptance steps for Markov Chain Monte Carlo methods have also been approached in refs. ^{47,48,49,50,51}. While the application to Bayesian deep learning is not in the focus of these studies, some problems occur evaluating them on the highly non-convex log-likelihoods in deep learning. The most prominent problem are variable batch-sizes, which slow down the inference of Bayesian neural networks considerably. These appear in refs. ^47,48,49,51. The work of ref. ⁵⁰ even requires occasional resampling on the data, which also limits the speed of each step and cannot be implemented with a fixed computational budget. Additionally, the batch-size is a relevant regularization hyperparameter. If it is not fixed, the knowledge transfer from regular neural network trainings is quite limited. Therefore, these methods have not been tested on any Bayesian deep learning problems yet. Together with the methological limits, these methods were excluded from the study in this paper, similar to previous studies on sampling algorithms for Bayesian deep learning³⁶. We approach an algorithm that fixes the required batch-size for the gradients and acceptance step fixing the computational budget and therefore being especially compatible with deep neural networks. To do so, we rather orientate on partial momentum updates in contrast to the majority of these methods, which replace an acceptance step in an momentum resampling method. Therefore, we choose to call the method generalized SGHMC in accordance with generalized HMC.

Our proposed Metropolis-Hastings acceptance methods

Our two approaches aim to facilitate an application perspective. The first proposed method uses the similarity between classical SGHMC and Generalized HMC to construct an unbiased SGHMC algorithm. The second method is motivated from the long trajectory length parametrizations from HMC simulation experiments as in ref. ³⁶. We ask ourselves, whether the posterior influenced by stochastic gradients might actually cause beneficial regularization, as it is well known for large models⁶⁴. Our HMC-like method samples from this posterior and preserves the exploration properties of original HMC, while losing the exact posterior sampling property. This section presents both algorithms, which have been experimentally analyzed.

Generalized SGHMC (GSGHMC)

The generalized SGHMC methodology uses step wise acceptance steps and is therefore composed of an batch-acceptance algorithm and an integrator, that does not result in an acceptance rate of zero. A batch-acceptance algorithm with a conservation of minima property is proposed in ref. ⁶⁵. They introduce a clever notation to analyze the properties of their algorithm, which we will also use, but find the acceptance rates to be relatively small throughout their experiments. Our algorithm bases on their idea, fixing the acceptance rate by using a reversible integrator, such as OBABO. The preservation of local optima property of the algorithm is required in our case to explore the extrema of the true posterior exactly. In ref. ⁶⁵ the minibatch acceptance setting is considered by an state space extension with a binary random vector τ ∈ {0, 1}^∣J∣ such that B = {j: τ(j) = 1, j ∈ J}. This allows to reformulate $\tilde{{{{\rm{U}}}}}({{{\boldsymbol{\theta }}}})$ to

$${\tilde{{{{\rm{U}}}}}}_{\tau }({{{\boldsymbol{\theta }}}})=-\frac{| S| }{| B| }{\sum}_{j\in J}{\tau }_{j}\log p({y}_{j}| {x}_{j},{{{\boldsymbol{\theta }}}})-\log p({{{\boldsymbol{\theta }}}}).$$

(7)

Since τ is independent of θ, the target distribution of the augmented system is an exponential distribution based on $\widetilde{U}$ multiplied by the uniform distribution over I_τ = {τ: ∑_j∈Jτ_j = ∣B∣}, which we denote by Unif_B(τ):

$$\widetilde{\pi }(z)\propto {e}^{-{c}_{n}{\tilde{{{{\rm{U}}}}}}_{\tau }({{{\boldsymbol{\theta }}}})}\,{e}^{-{{{\rm{K}}}}({{{\bf{m}}}})}{{{{\rm{Unif}}}}}_{B}(\tau ).$$

(8)

The additional distribution over the momentum remains independent of θ and τ. With the target distribution $\widetilde{\pi }(z)$, we can construct an Markov Chain Monte Carlo chain. Starting with equation (6), we plug in the augmented target distribution and observe that the distribution on τ does not change from initial to proposal state and does not influence the relation. The acceptance probability becomes

$$ r(({z}_{0},{\tau }_{0}),({z}_{p},{\tau }_{p}))=\\ =\min \left\{1,\frac{\widetilde{\pi }({z}_{p})}{\widetilde{\pi }({z}_{0})}\frac{t(({z}_{p},{\tau }_{p})\to ({z}_{0},{\tau }_{0}))}{t(({z}_{0},{\tau }_{0})\to ({z}_{p},{\tau }_{p}))}\right\}\\ =\min \left\{1,\frac{t({z}_{p}\to {z}_{0})}{t({z}_{0}\to {z}_{p})}\frac{\exp (-{c}_{n}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p}}({{{{\boldsymbol{\theta }}}}}_{p}))\exp (-{{{\rm{K}}}}({{{{\bf{m}}}}}_{p}))}{\exp (-{c}_{n}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{0}}({{{{\boldsymbol{\theta }}}}}_{0}))\exp (-{{{\rm{K}}}}({{{{\bf{m}}}}}_{0}))}\right\},$$

(9)

where each step is accepted or rejected based on a second minibatch that is selected with τ_p. To reconstruct the posterior distribution π(θ), we have to marginalize over τ. Using the independence of the distribution on m, we directly compute the distribution on θ,

$$\widetilde{\pi }({{{\boldsymbol{\theta }}}})\propto \pi {({{{\boldsymbol{\theta }}}})}^{{c}_{n}}{\left(\begin{array}{c}n\\ | B| \end{array}\right)}^{-1}{\sum}_{\tau \in {I}_{\tau }}{e}^{{c}_{n}(-{\tilde{{{{\rm{U}}}}}}_{\tau }({{{\boldsymbol{\theta }}}})-{{{\rm{U}}}}({{{\boldsymbol{\theta }}}}))},$$

(10)

which slightly differs from the distribution in ref. ⁶⁵ due to our different definition of U. The parameter c_n balances the tempering of the posterior distribution against resulting acceptance rates and overall bias. While in the proposing paper c_n = 20 appears as a generally good choice to trade off acceptance rate against tempering effect⁶⁵, we tried to stay at c_n = 1 during our simulation. By slightly increasing the step size in comparison to the SGHMC simulation without the acceptance step, our experiments ran with acceptance rates of 50.0% ± 3.5% both in the benchmark and the neural network case. We decided to keep c_n fixed at this value that our algorithm results in an unbiased chain, approximating the exact posterior. The independence of m guarantees the convergence results from ref. ⁶⁵, which are shortly reviewed in appendix E.

The acceptance rate of a sampling algorithm often decides, whether it can be applied on large neural networks. We discuss some observations based on equation (9) that what to expect from our GSGHMC algorithm. Therefore, we take the logarithm of the acceptance probability and plug in the transition probability for OBABO as in ref. ²⁴

$$\log \frac{t({z}_{p}\to {z}_{0})}{t({z}_{0}\to {z}_{p})}+\log \frac{\exp (-{c}_{n}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p}}({{{{\boldsymbol{\theta }}}}}_{p}))\exp (-{{{\rm{U}}}}({{{{\bf{m}}}}}_{p}))}{\exp (-{c}_{n}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{0}}({{{{\boldsymbol{\theta }}}}}_{0}))\exp (-{{{\rm{K}}}}({{{{\bf{m}}}}}_{0}))}$$

(11)

$$=-({{{\rm{K}}}}({{{{\bf{m}}}}}_{0})-{{{\rm{K}}}}({{{{\bf{m}}}}}_{p})+{{{\rm{K}}}}({{{{\bf{m}}}}}_{3/4})-{{{\rm{K}}}}({{{{\bf{m}}}}}_{1/4}))$$

(12)

$$\,\,-{c}_{n}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p}}({{{{\boldsymbol{\theta }}}}}_{p})+{c}_{n}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{0}}({{{{\boldsymbol{\theta }}}}}_{0})-{{{\rm{K}}}}({{{{\bf{m}}}}}_{p})+{{{\rm{K}}}}({{{{\bf{m}}}}}_{0})$$

(13)

$$={c}_{n}\left.({\tilde{{{{\rm{U}}}}}}_{{\tau }_{0}}({{{{\boldsymbol{\theta }}}}}_{0})-{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p}}({{{{\boldsymbol{\theta }}}}}_{p}))\right)-{{{\rm{K}}}}({{{{\bf{m}}}}}_{3/4})+{{{\rm{K}}}}({{{{\bf{m}}}}}_{1/4}).$$

(14)

This equation demonstrates the effect of c_n as it increases the acceptance rate, as long as the loss improves. The K terms depend on the gradients. We can write K(m_3/4) as a function of m_1/4 such that the quadratic term of m_1/4 is eliminated. Introducing ${\nabla }_{{{{\boldsymbol{\theta }}}}}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p},{\tau }_{0}}^{\,{{{\rm{sum}}}}}({{{{\boldsymbol{\theta }}}}}_{p},{{{{\boldsymbol{\theta }}}}}_{0})=({\nabla }_{{{{\boldsymbol{\theta }}}}}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p}}({{{{\boldsymbol{\theta }}}}}_{p})+{\nabla }_{{{{\boldsymbol{\theta }}}}}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{0}}({{{{\boldsymbol{\theta }}}}}_{0}))$, the K-part reduces to

$$\begin{array}{rc}-{{{\rm{K}}}}({{{{\bf{m}}}}}_{3/4}) &+{{{\rm{K}}}}({{{{\bf{m}}}}}_{1/4})=\epsilon \,{{{{\bf{m}}}}}_{1/4}^{\top }{\nabla }_{{{{\boldsymbol{\theta }}}}}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p},{\tau }_{0}}^{\,{{{\rm{sum}}}}}({{{{\boldsymbol{\theta }}}}}_{p},{{{{\boldsymbol{\theta }}}}}_{0})\\ & -\frac{{\epsilon }^{2}}{4}{\nabla }_{{{{\boldsymbol{\theta }}}}}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p},{\tau }_{0}}^{\,{{{\rm{sum}}}}}{({{{{\boldsymbol{\theta }}}}}_{p},{{{{\boldsymbol{\theta }}}}}_{0})}^{\top }{\nabla }_{{{{\boldsymbol{\theta }}}}}{\tilde{{{{\rm{U}}}}}}_{{\tau }_{p},{\tau }_{0}}^{\,{{{\rm{sum}}}}}({{{{\boldsymbol{\theta }}}}}_{p},{{{{\boldsymbol{\theta }}}}}_{0}).\end{array}$$

The first term in this equation shows as long the series of gradients agree on the direction as they are captured in m_1/4, the first term is large and dominates the second term, which is scaled by the square of the small step size. This behavior is valuable as a method to prevent U-turns in the parameter trajectory was proposed earlier in ref. ¹³. In case of unstable gradients and small changes in the values of U, the acceptance behavior will mostly depend on the random choice.

Algorithm 1

Hamiltonian Trajectory Ensemble (HTE)

HTE(N, U, K, θ₀, t_L, ϵ, M, integrator)

${{{\mathcal{S}}}}=\{\}$ ▹Posterior samples

θ = θ₀ ▹Initialize position

while $| {{{\mathcal{S}}}}| < N$ do

m ~ N(0, M) ▹Resample momentum

z = {θ, m} ▹Store initial value

for t = 0 to t_L do

integrator.step$({{{\boldsymbol{\theta }}}},{{{\bf{m}}}},\epsilon,M,\tilde{{{{\rm{U}}}}},T=0)$

end for

$r=\min \{1,\exp ({{{\rm{U}}}}(z({{{\boldsymbol{\theta }}}}))-{{{\rm{U}}}}({{{\boldsymbol{\theta }}}})-{{{\rm{K}}}}({{{\bf{m}}}})+{{{\rm{K}}}}(z({{{\bf{m}}}})))\}$

ν ~ Unif(0, 1)

if r > ν then

${{{\mathcal{S}}}}={{{\mathcal{S}}}}\cup \{{{{\boldsymbol{\theta }}}}\}$ ▹Accepted!

else

θ = z(θ) ▹Rejected!

end if

end while

Hamiltonian trajectory ensemble (HTE)

The second method we propose is motivated by the results of³⁶ on Hamiltonian Monte Carlo. To achieve overarching performance improvements, the Hamiltonian dynamics are simulated over a long trajectory length up to a thousand steps, using effectively gradient descent with momentum between the momentum resets, as this is what the trajectory simulation of HMC does. Although the integrators from Table 1 with stochastic gradients cannot simulate the same trajectory as the one followed by the true Hamiltonian dynamics, we know from optimization literature that the risk will reduce similarly to a gradient descent optimization (cp.⁶⁴), which is performed when simulating the Hamiltonian dynamics in between the momentum resampling steps. Therefore, we decided to try reproducing the HMC performance benefits with an OBABO stochastic gradient trajectory, still applying the same acceptance used in HMC. While this introduces an uncontrolled bias as it was discussed in ref. ⁶⁶, the divergence of the trajectories is the main concern. The aim of our method is to accept only trajectories, that stayed close to the actual Hamiltonian trajetory and are therefore excepted by the Metropolis-Hastings step. The OBABO integator is used with T = 0, effectively simulating a leapfrog discretization²¹ with an additional momentum decay (cp.⁶⁷). A summary of the program realization is given in Alg. 1. We use a program-like notation, where the formulas of the integrator steps are implemented in a step method, which is applied to the variables in place for every trajectory step as given in Table 1. Evaluation of U refers to equation (2) and K to the quadratic kinetic energy term depending on m and M. The notation z( ⋅ ) refers to the choice of θ or m from z.

Our primary objective using this naive approach was to sample from the same likelihood, which is optimized during deep neural network training. Through the usage of stochastic gradients small minima are hidden and the likelihood gets biased towards minima that facilitate generalization⁶⁴. Taking the acceptance probability for the GSGHMC algorithm in formula (14), the momentum terms do differ by the gradient noise. Therefore, in contrast to Hamiltonian Monte Carlo, the Kinetic energy terms do not eliminate each other. With a quadratic kinetic energy term, there is a sum of kinetic energy terms of the noise as well as scalar product terms between the momentum and the gradient noise that we explicitly ignore and that will deformate the likelihood depending on the energy in the noise. While this was argued in ref. ⁶⁶ to eventually cause acceptance rates of zero, we could not empirically validate this for the gradient noise evaluated in this paper. While this does not eliminate the bias of this method, it explores a regularized log-likelihood. Applying the algorithm on neural networks, we showed to conserve benefits from regularized neural network training.

The algorithm showed surprising robustness in the neural network setting, when the dimension of the target distribution increases, a property that is observed for regular HMC²¹ too. Since this method is not a Bayesian inference, but an ensemble method, we decided to use the term Hamiltonian trajectory ensemble (HTE). Advantages over regular ensemble methods from the machine learning community⁶⁸ are generated through a speed up, since it is not required to retrain the complete model in different settings. By running the trajectory length after an initial burn in creates large ensembles considerably faster than regular ensemble methods, saving time proportional to the ensemble size.

Data availability

No data was collected for this work. The EMNIST and CIFAR-10 datasets are publicly available as well as the Fashion MNIST and SVHN datasets. The Lorenz’96 simulation is a deterministic ode solution, that can be repeated with the published code.

Code availability

The source code for all experiments is available on CodeOcean. The Pytorch library⁶⁹ is used to construct the models and compute the gradients.

References

Alzubaidi, L. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8, 53 (2021).
Article PubMed PubMed Central Google Scholar
Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. in Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research (eds Bach, F., Blei, D.) Vol. 37, pp. 1613–1622 (PMLR, Lille, France, 2015).
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q. On calibration of modern neural networks. in International Conference on Machine Learning (2017).
Suganyadevi, S., Seethalakshmi, V. & Balasamy, K. A review on deep learning in medical image analysis. Int. J. Multimed. Inf. Retr. 11, 19–38 (2022).
Article CAS PubMed Google Scholar
Muhammad, K., Ullah, A., Lloret, J., Ser, J. D. & Albuquerque, V. H. C. Deep learning for safe autonomous driving: Current challenges and future directions. IEEE Trans. Intell. Transport. Syst. 22, 4316–4336 (2021).
Article Google Scholar
Gawlikowski, J. et al. A survey of uncertainty in deep neural networks. Artif. Intell. Rev. 56, 1513–1589 (2023).
Article Google Scholar
Barber, D. Bayesian Reasoning and Machine Learning (Cambridge University Press, Cambridge, 2012).
Gelman, A. et al. Bayesian Data Analysis 3rd edn (Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, New York, 2013).
Sharma, M., Farquhar, S., Nalisnick, E., Rainforth, T. Do Bayesian neural networks need to be fully stochastic? in Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research (eds Ruiz, F., Dy, J., Meent, J.-W.) Vol. 206, pp. 7694–7722 (PMLR, Valencia, Spain, 2023).
Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. in Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, (eds Balcan, M.F., Weinberger, K.Q.) Vol. 48, pp. 1050–1059 (PMLR, New York, 2016).
Malinin, A., Gales, M. Predictive uncertainty estimation via prior networks. in Advances in Neural Information Processing Systems (eds Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R.) Vol. 31 (Curran Associates, Inc., Montreal, Canada, 2018).
Maddox, W. J., Garipov, T., Izmailov, P., Vetrov, D., Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. in Proceedings of the 33rd International Conference on Neural Information Processing Systems (Vancouver, Canada, 2019).
Jospin, L. V., Laga, H., Boussaid, F., Buntine, W. & Bennamoun, M. Hands-on Bayesian neural networks—a tutorial for deep learning users. IEEE Comput. Intell. Mag. 17, 29–48 (2022).
Article Google Scholar
Fort, S., Hu, H., Lakshminarayanan, B. Deep ensembles: a loss landscape perspective. Preprint at https://doi.org/10.48550/arXiv.1912.02757 (2019).
Khan, M. E. & Rue, H. The Bayesian learning rule. J. Mach. Learn. Res. 24, 1–46 (2023).
MathSciNet CAS Google Scholar
Brooks, S., Gelman, A., Jones, G. & Meng, X.-L. Handbook of Markov Chain Monte Carlo (Chapman and Hall/CRC, New York, 2011).
Abdar, M. et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 76, 243–297 (2021).
Article Google Scholar
Hoffman, M. D. & Gelman, A. The no-u-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1351–1381 (2014).
MathSciNet Google Scholar
Carpenter, B. et al. Stan: a probabilistic programming language. J. Stat. Softw. 76, 1–32 (2017).
Article PubMed PubMed Central Google Scholar
Betancourt, M. A Conceptual Introduction to Hamiltonian Monte Carlo Preprint at https://doi.org/10.48550/arXiv.1701.02434 (2018).
Neal, R.M. 5. MCMC Using Hamiltonian Dynamics (CRC Press, New York, 2011).
Nemeth, C. & Fearnhead, P. Stochastic gradient Markov chain Monte Carlo. J. Am. Stat. Assoc. 116, 433–450 (2021).
Article MathSciNet CAS Google Scholar
Wenzel, F. et al. How good is the Bayes posterior in deep neural networks really? in Proceedings of the 37th International Conference on Machine Learning (ICML’20. JMLR.org, Online, 2020).
Garriga-Alonso, A. & Fortuin, V. Exact Langevin Dynamics with Stochastic Gradients. Preprint at https://doi.org/10.48550/2102.01691 (2021).
Welling, M., Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. in Proceedings of the 28th International Conference on International Conference on Machine Learning (ICML’11, pp. 681–688. Omnipress, Bellevue, Washington, USA, 2011).
Chen, T., Fox, E.B., Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. in Proceedings of Machine Learning Research (ICML, 2014).
Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970).
Article MathSciNet Google Scholar
Leimkuhler, B., Matthews, C. Molecular dynamics: with deterministic and stochastic numerical methods. Interdisciplinary Applied Mathematics (Springer, Switzerland, 2015).
Zhang, R., Cooper, A.F., Sa, C.D. Amagold: Amortized metropolis adjustment for efficient stochastic gradient MCMC. in Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, (eds Chiappa, S., Calandra, R.) vol. 108, pp. 2142–2152 (PMLR, Palermo, Italy, 2020).
Leimkuhler, B., Matthews, C. & Vlaar, T. Partitioned integrators for thermodynamic parameterization of neural networks. Found. Data Sci. 1, 457–489 (2019).
Article MathSciNet Google Scholar
Horowitz, A. M. A generalized guided Monte Carlo algorithm. Phys. Lett. B 268, 247–252 (1991).
Article ADS Google Scholar
Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Article ADS Google Scholar
Zagoruyko, S., Komodakis, N. Wide residual networks. in Proceedings of the British Machine Vision Conference, BMVC 2016. (eds Wilson, R.C., Hancock, E. R., Smith, W. A. P.) (BMVA Press, York, UK, 2016).
Cohen, G., Afshar, S., Tapson, J., Schaik, A.V. Emnist: Extending MNIST to handwritten letters. 2017 International Joint Conference on Neural Networks (IJCNN) (2017).
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical Report at https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (2009).
Izmailov, P., Vikram, S., Hoffman, M. D., Wilson, A. G. G. What are Bayesian neural network posteriors really like? in Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, (eds Meila, M., Zhang, T.) Vol. 139, pp. 4629–4640 (PMLR, Online, 2021).
Singh, S., Krishnan, A. Filter response normalization layer: eliminating batch dependence in the training of deep neural networks. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020).
Li, C., Chen, C., Carlson, D., Carin, L. Preconditioned stochastic gradient langevin dynamics for deep neural networks. Proceedings of the AAAI Conference on Artificial Intelligence 30(1) (2016).
Krishnan, R., Subedar, M., Tickoo, O. Specifying weight priors in Bayesian deep neural networks with empirical Bayes. in AAAI Conference on Artificial Intelligence, pp. 4477–4484 (2020).
Wen, Y., Vicol, P., Ba, J., Tran, D. & Grosse, R. Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches. International Conference on Learning Representations (2018).
Khan, M. et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. in Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research (eds Dy, J., Krause, A.) Vol. 80, pp. 2611–2620 (PMLR, Stockholm, Sweden, 2018).
Osawa, K. et al. Practical Deep Learning with Bayesian Principles (Curran Associates Inc., Red Hook, NY, USA, 2019).
Zhang, R., Li, C., Zhang, J., Chen, C., Wilson, A.G. Cyclical stochastic gradient MCMC for Bayesian deep learning. in International Conference on Learning Representations (2020).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Preprint at https://arxiv.org/abs/1708.07747 (2017).
Netzer, Y. et al. Reading digits in natural images with unsupervised feature learning. in NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011 (2011).
Korattikara Balan, A., Rathod, V., Murphy, K.P., Welling, M. Bayesian dark knowledge. in Advances in Neural Information Processing Systems (eds Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R.) Vol. 28. (Curran Associates, Inc., Montreal, Canada, 2015).
Korattikara, A., Chen, Y., Welling, M. Austerity in MCMC land: cutting the Metropolis-Hastings budget. in Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, (eds Xing, E. P., Jebara, T.) Vol. 32, pp. 181–189 (PMLR, Bejing, China, 2014).
Bardenet, R., Doucet, A., Holmes, C.: Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. in Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, (Xing, E.P., Jebara, T.) Vol. 32, pp. 405–413 (PMLR, Bejing, China, 2014).
Maclaurin, D., Adams, R. P. Firefly Monte Carlo: Exact MCMC with subsets of data. in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), pp. 4289–4295 (2015).
Seita, D., Pan, X., Chen, H., Canny, J. An efficient minibatch acceptance test for Metropolis-Hastings. in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18 pp. 5359–5363 (International Joint Conferences on Artificial Intelligence Organization, Stockholm, Sweden, 2018).
Zhang, R., Cooper, A. F., De Sa, C. M. Asymptotically optimal exact minibatch Metropolis-Hastings. in Advances in Neural Information Processing Systems (eds Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H.) Vol. 33, pp. 19500–19510 (Curran Associates, Inc., Vancouver, Canada, 2020).
Yang, L. & Shami, A. On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415, 295–316 (2020).
Article Google Scholar
Ren, P. et al. A comprehensive survey of neural architecture search: challenges and solutions. ACM Comput. Surv. 54, (4) 1–34 (2021).
Li, H., Xu, Z., Taylor, G., Studer, C., Goldstein, T. Visualizing the loss landscape of neural nets. in Neural Information Processing Systems (2018).
Cheng, X., Chatterji, N. S., Bartlett, P. L., Jordan, M. I. Underdamped Langevin MCMC: a non-asymptotic analysis. in Proceedings of the 31st Conference on Learning Theory. Proceedings of Machine Learning Research, Vol. 75, pp. 300–323 (PMLR, San Diego, USA, 2018).
Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. B (Stat. Methodol.) 79, 51–676 (2017).
MathSciNet Google Scholar
Durmus, A. & Moulines Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab. 27, 1551–1587 (2017).
Article MathSciNet Google Scholar
Cheng, X., Chatterji, N. S., Abbasi-Yadkori, Y., Bartlett, P. L. & Jordan, M. I. Sharp convergence rates for Langevin dynamics in the nonconvex setting. Preprint at https://arxiv.org/abs/1805.01648 (2020).
Graves, A. Generating Sequences With Recurrent Neural Networks. Preprint at https://arxiv.org/abs/1308.0850 (2014).
Tran, B.-H., Rossi, S., Milios, D. & Filippone, M. All you need is a good functional prior for Bayesian deep learning. J. Mach. Learn. Res. 23, 1–56 (2022).
MathSciNet CAS Google Scholar
Fortuin, V. et al. Bayesian neural network priors revisited. in International Conference on Learning Representations (ICLR) (2022).
Depeweg, S., Hernandez-Lobato, J.-M., Doshi-Velez, F., Udluft, S. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. in Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research (eds Dy, J., Krause, A.) Vol. 80, pp. 1184–1193 (PMLR, Stockholm, Sweden, 2018).
Franzese, G., Milios, D., Filippone, M., Michiardi, P. Revisiting the effects of stochasticity for Hamiltonian samplers. in Proceedings of the 39th International Conference on Machine Learning. Proceedings of Machine Learning Research Vol. 162, pp. 6744–6778 (PMLR, Baltimore, 2022).
Grohs, P., Kutyniok, G. Mathematical Aspects of Deep Learning (Cambridge University Press, Cambridge, 2022).
Wu, T.-Y., Wang, Y. X. R. & Wong, W. H. Mini-batch Metropolis-Hastings with reversible SGLD proposal. J. Am. Stat. Assoc. 117, 386–394 (2022).
Article MathSciNet CAS Google Scholar
Betancourt, M. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. in Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, (eds Bach, F., Blei, D.) Vol. 37, pp. 533–540 (PMLR, Lille, France, 2015).
Sutskever, I., Martens, J., Dahl, G., Hinton, G. On the importance of initialization and momentum in deep learning. in Proceedings of the 30th International Conference on Machine Learning. ICML’13, vol. 28, pp. 1139–1147 (JMLR.org, Atlanta, Georgia, USA, 2013).
Dietterich, T.G. Ensemble methods in machine learning. in Multiple Classifier Systems, pp. 1–15 (Springer, Berlin, Heidelberg, 2000).
Ansel, J. et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. in 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24) (ACM, New York, NY, USA, 2024).

Download references

Acknowledgements

Thanks to the Carl Zeiss Stiftung for funding this research under P2022-08-006.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Data-intensive Systems and Visualization Group, Technische Universität Ilmenau, Ilmenau, Thüringen, Germany
Matthias Schmal & Patrick Mäder
Faculty of Biological Sciences, Friedrich Schiller University, Jena, Thüringen, Germany
Patrick Mäder
German Centre for Integrative Biodiversity Research—iDiv, Leipzig, Saxony, Germany
Patrick Mäder

Authors

Matthias Schmal
View author publications
Search author on:PubMed Google Scholar
Patrick Mäder
View author publications
Search author on:PubMed Google Scholar

Contributions

M.S. performed the theoretical analysis, the coding realization and experiments. P.M. supervised the work.

Corresponding author

Correspondence to Matthias Schmal.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications Communications thanks the anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Infomation (download PDF )

Transparent Peer Review file (download PDF )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Schmal, M., Mäder, P. Reliable uncertainty estimates in deep learning with efficient Metropolis-Hastings algorithms. Nat Commun 17, 2531 (2026). https://doi.org/10.1038/s41467-026-70015-z

Download citation

Received: 05 February 2025
Accepted: 13 February 2026
Published: 17 March 2026
Version of record: 17 March 2026
DOI: https://doi.org/10.1038/s41467-026-70015-z

Subjects

Abstract

Similar content being viewed by others

Replica exchange enhanced adaptively weighted stochastic gradient Langevin dynamics for Bayesian sampling and optimization

Stochastic Gradient Descent-like relaxation is equivalent to Metropolis dynamics in discrete optimization and inference problems

Data-driven approach to the deep learning of the dynamics of a non-integrable Hamiltonian system

Introduction

Results

Synthetic benchmark

Exploration of trajectory parameters

Ablation study

Bayesian neural network comparison

Ensemble size reduction

Discussion

Methods

Errors in Markov Chain Monte Carlo estimation

Previous acceptance step approaches

Our proposed Metropolis-Hastings acceptance methods

Generalized SGHMC (GSGHMC)

Algorithm 1

Hamiltonian trajectory ensemble (HTE)

Data availability

Code availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Supplementary Infomation (download PDF )

Transparent Peer Review file (download PDF )

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links