Introduction

The field of uncertainty quantification and deep learning has established a solid theoretical foundation through hybrid approaches such as scalable variational Gaussian processes, neural network-Gaussian process correspondences, and deep Gaussian processes1,2,3,4,5,6. While these hybrid methods have achieved significant success in their respective application domains, the widespread reliance on attention mechanisms in modern sequence modeling and multimodal tasks makes principled uncertainty quantification in Transformer architectures an important engineering requirement7,8. However, the systematic engineering integration of these mature Bayesian techniques with modern Transformer architectures still faces concrete technical challenges, particularly in designing effective uncertainty propagation pathways within attention mechanisms, implementing probabilistic distribution composition under residual connections, and addressing the decoupling and fusion of epistemic and aleatoric uncertainty while maintaining mathematical rigor9,10,11,12.

In the probabilistic modeling of attention mechanisms, works such as variational attention mechanisms13 and stochastic attention weights14 have preliminarily explored uncertainty representations of attention scores, laying the foundation for combining attention mechanisms with probabilistic reasoning. Addressing uncertainty propagation in deep networks, Monte Carlo Dropout15 achieves approximate sampling of parameter uncertainty by maintaining dropout activation during inference, while Bayesian Deep Learning methods16,17 establish probabilistic distribution representations at the weight level through variational inference frameworks. In the domain of uncertainty decoupling and fusion, Deep Ensemble18 effectively separates epistemic and aleatoric uncertainty through model ensembling, Sensoy et al. (2018)19 provides a unified framework for uncertainty quantification based on evidence theory, and multi-task uncertainty modeling20 makes important contributions in covariance construction. Although these works have achieved significant progress in their respective technical directions, the systematic engineering integration of these technologies in Transformer architectures, particularly in maintaining the balance between computational feasibility and theoretical consistency, still lacks a unified solution framework.

In the theoretical construction of probabilistic attention mechanisms, variational information bottleneck theory21 provides a principled foundation for information-theoretic interpretation of attention weights, while kernelized attention mechanisms22 achieve linear complexity approximation through the FAVOR+ algorithm. Gaussian Process Attention23 directly embeds Gaussian process priors into attention computation. In the technical evolution of fusing variational inference with deep architectures, Structured Variational Inference24 achieves more precise posterior approximation through structured priors, Multiplicative Normalizing Flows25 provides more expressive variational families for complex posterior distribution modeling, while Concrete Dropout15 and Variational Dropout26 establish bridges between theory and practice in adaptive regularization. The fusion of covariance modeling with multi-task learning reflects more sophisticated mathematical treatment: Multi-Task Gaussian Processes27 establish rigorous mathematical representations of inter-task correlations, Tensor-Structured Gaussian Processes28 achieve efficient computation of high-dimensional covariances through Kronecker decomposition, and Neural Module Networks29 and Modular Meta-Learning30 provide composable uncertainty modeling approaches in modular architecture design. Although these deep technical developments have reached considerable theoretical depth in their respective subfields, there is currently no unified engineering framework to systematically integrate these advanced technologies, particularly for implementing end-to-end Bayesian inference capabilities in the dominant Transformer architecture.

To address this systematic engineering integration challenge, this research proposes the Residual Bayesian Attention (RBA) framework, which achieves organic fusion of Bayesian inference with Transformer architectures through three tightly coupled core components. Specifically, the Bayesian feedforward layer establishes a differentiable propagation mechanism for parameter-level uncertainty through reparameterization tricks and Delta method approximation. The multi-layer residual Bayesian attention directly embeds radial basis function kernels into attention score computation and introduces Beta distribution-modeled adaptive residual weights to enable uncertainty accumulation propagation in deep networks. The Bayesian covariance construction module generates covariance matrix representations that satisfy Gaussian process mathematical requirements through outer product operations and eigenvalue correction techniques. This design operates synergistically under a unified variational Bayesian optimization framework, maintaining the parallel computation advantages of Transformers while achieving principled separation of epistemic and aleatoric uncertainty, providing end-to-end Bayesian inference capabilities for modern sequence modeling tasks.

In comprehensive validation across six benchmark datasets covering different application domains and data characteristics, RBA demonstrates stable and competitive performance: achieving a coefficient of determination of 0.972 and good calibration quality (ECE = 0.1877) in engineering optimization tasks, maintaining prediction accuracy of 0.920 while controlling the Prediction Interval Normalized Average Width to 0.180 in time series forecasting, and reaching 96.38% Prediction Interval Coverage Probability in spatial modeling tasks while effectively handling geospatial dependencies. Notably, RBA exhibits consistent advantages in uncertainty calibration, maintaining good consistency between prediction confidence and actual accuracy across various data complexity levels—a characteristic of practical value for decision support in real applications. Although complex physical system modeling still presents challenges2, this objective delineation of applicability boundaries further validates the objectivity and credibility of the method evaluation. Therefore, RBA, as a systematic engineering integration framework for Bayesian inference and Transformer architectures, contributes an engineering solution with clearly defined technical boundaries for probabilistic reasoning implementation in deep sequence modeling.

Methods

RBA construction

The Residual Bayesian Attention (RBA) algorithm represents an innovative model that deeply integrates the Bayesian inference of Gaussian processes with the residual architecture of Transformers. The entire algorithmic construction adheres to the design principles of “Bayesian consistency + residual information propagation + multi-layer uncertainty accumulation.”

As illustrated in Fig. 1, the core framework comprises three components: Bayesian feedforward layers, multi-layer residual-connected Bayesian attention, and Bayesian covariance construction. Specifically, the Residual Bayesian Attention (RBA) architecture achieves essential fusion between deep learning and Gaussian processes through three tightly coupled core components. The Bayesian feedforward layer constitutes the probabilistic foundation of the entire architecture, transforming deterministic parameters in traditional neural networks into probability distributions characterized by means and variances, implementing differentiable stochastic sampling through reparameterization tricks, and quantifying epistemic uncertainty at the parameter level using variational inference frameworks, thereby providing a probabilistic feature representation foundation for subsequent components. The multi-layer residual Bayesian attention mechanism builds upon this foundation to construct a deep probabilistic inference framework, directly embedding Gaussian process kernel function concepts into attention score computations, enabling each attention head to correspond to an independent Gaussian process prior, learning data correlations across different scales and patterns through multi-head parallel processing, while introducing Bayesian residual connection mechanisms to address gradient vanishing problems in deep networks and achieving effective propagation and accumulation of uncertainty in the network depth direction through hierarchical variational inference. 
The Bayesian covariance construction component is responsible for transforming the outputs of the aforementioned multi-layer attention mechanisms into covariance matrices that satisfy the mathematical requirements of Gaussian processes, generating fundamental covariance structures through outer product operations, separately modeling epistemic components arising from parameter uncertainty and aleatoric components from intrinsic data randomness, ensuring matrix positive definiteness through eigenvalue correction, and supporting tensor product extensions for multi-task scenarios. The three components undergo collaborative optimization within a unified variational Bayesian framework, forming an end-to-end learning system that possesses both the powerful representational capabilities of deep neural networks and maintains the rigorous uncertainty quantification characteristics of Gaussian processes, achieving organic unification of data-driven feature learning and probabilistic inference.

Fig. 1
figure 1

Residual Bayesian Attention (RBA) architecture

Bayesian feedforward layer

The Bayesian feedforward layer transforms deterministic neural network parameters into variational posterior distributions (Eq. 7) through a hierarchical Bayesian model (Eq. 6), achieving explicit modeling and differentiable propagation of parameter-level epistemic uncertainty. Equations (8)-(9) implement differentiable parameter sampling, decoupling randomness from variational parameters through the reparameterization trick to ensure numerical stability of backpropagation. Equations (13)-(26) establish a complete uncertainty propagation pathway, from expectation-variance computation of linear transformations (Eqs. 15–17) to Delta method approximation of activation functions (Eqs. 11–14), and further to distribution composition of residual connections (Eq. 24), achieving analytical propagation of inter-layer uncertainty. Equations (27)-(33) construct the Evidence Lower Bound (ELBO) framework, balancing reconstruction error with prior constraints through KL divergence regularization, and establishing Jacobian matrix representation for uncertainty propagation (Eq. 33), laying the theoretical foundation for subsequent probabilistic modeling of attention mechanisms.

The Bayesian feedforward layer process is as follows.

The first layer weights can be expressed as formulas (1) and (2).

$${W_1}\sim N({\mu _{{W_1}}},{\Sigma _{{W_1}}})$$
(1)
$${b_1}\sim N({\mu _{{b_1}}},{\Sigma _{{b_1}}})$$
(2)

The second layer of weight priors can be expressed as formulas (3) and (4).

$${b_2}\sim N({\mu _{{b_2}}},{\Sigma _{{b_2}}})$$
(3)
$${W_2}\sim N({\mu _{{W_2}}},{\Sigma _{{W_2}}})$$
(4)

Among them, the covariance matrix is a diagonal matrix defined by formula (5).

$${\Sigma _{{W_i}}}={\text{diag}}(\sigma _{{{W_i}}}^{2}),\quad {\Sigma _{{b_i}}}={\text{diag}}(\sigma _{{{b_i}}}^{2})$$
(5)

The hierarchical Bayesian model31 is defined by formula (6).

$$p(\theta )=p({W_1})p({b_1})p({W_2})p({b_2})$$
(6)

Where \(\theta =\{ {W_1},{b_1},{W_2},{b_2}\}\) denotes all parameters.

For difficult-to-handle true posteriors, use variational posteriors defined by formula (7).

$$q(\theta )=q({W_1})q({b_1})q({W_2})q({b_2})$$
(7)

Each factor is \(q({W_i})=N({\mu _{q,{W_i}}},diag(\sigma _{{q,{W_i}}}^{2}))\), \(q({b_i})=N({\mu _{q,{b_i}}},diag(\sigma _{{q,{b_i}}}^{2}))\); the sampling process from the variational posterior32 is represented by Eqs. (8) and (9).

$$W_{i}^{{(s)}}={\mu _{q,{W_i}}}+{\sigma _{q,{W_i}}} \odot \epsilon _{{{W_i}}}^{{(s)}}$$
(8)
$$b_{i}^{{(s)}}={\mu _{q,{b_i}}}+{\sigma _{q,{b_i}}} \odot \epsilon _{{{b_i}}}^{{(s)}}$$
(9)

Among them, \(\epsilon _{{{W_i}}}^{{(s)}},\epsilon _{{{b_i}}}^{{(s)}}\sim N(0,I)\). The GELU33 function is defined as formula (10).

$${\text{GELU}}(z)=z\Phi (z)=z\int_{{ - \infty }}^{z} {\frac{1}{{\sqrt {2\pi } }}} {e^{ - {t^2}/2}}dt$$
(10)

The approximate form is \({\text{GELU}}(z) \approx \frac{z}{2}\left( {1+\tanh \left( {\sqrt {\frac{2}{\pi }} \left( {z+0.044715{z^3}} \right)} \right)} \right)\).

For random variables \(Z\sim N({\mu _Z},\sigma _{Z}^{2})\) and differentiable functions \(g( \cdot )\), the Delta method provides the approximation in formula (11).

$$g(Z) \approx N\left( {g({\mu _Z}),{{\left( {\frac{{dg}}{{dz}}{|_{z={\mu _Z}}}} \right)}^2}\sigma _{Z}^{2}} \right)$$
(11)

The derivative of GELU is defined by formula (12).

$$\frac{{d{\text{GELU}}}}{{dz}}=\Phi (z)+z\phi (z)$$
(12)

Among them, \(\Phi (z)=\frac{1}{2}\left( {1+{\text{erf}}\left( {\frac{z}{{\sqrt 2 }}} \right)} \right)\) is the standard normal CDF34, and \(\phi (z)=\frac{1}{{\sqrt {2\pi } }}{e^{ - {z^2}/2}}\) is the standard normal PDF.

For input \({H_1}\sim N({\mu _{{H_1}}},{\Sigma _{{H_1}}})\), the moments of the activation output are given by formulas (13)–(14).

$$E[{\text{GELU}}({H_1})]={\text{GELU}}({\mu _{{H_1}}})$$
(13)
$${\text{Var}}[{\text{GELU}}({H_1})]={\left( {\frac{{d{\text{GELU}}}}{{dh}}{|_{h={\mu _{{H_1}}}}}} \right)^2} \odot {\text{diag}}({\Sigma _{{H_1}}})$$
(14)
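As a concrete illustration, the reparameterized sampling of Eqs. (8)-(9) and the Delta-method moments of Eqs. (10)-(14) can be sketched in NumPy. This is a minimal sketch with illustrative function names, not the authors' implementation:

```python
import math
import numpy as np

def gelu(z):
    # tanh approximation of GELU (Eq. 10)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def gelu_prime(z):
    # exact derivative of GELU: Phi(z) + z * phi(z) (Eq. 12)
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / np.sqrt(2.0)))
    phi = np.exp(-z**2 / 2.0) / np.sqrt(2.0 * np.pi)
    return Phi + z * phi

def sample_weights(mu, sigma, rng):
    # reparameterization trick (Eqs. 8-9): W = mu + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def delta_gelu(mu_h, var_h):
    # Delta-method moments of GELU(H) (Eqs. 13-14)
    mean = gelu(mu_h)
    var = gelu_prime(mu_h)**2 * var_h
    return mean, var
```

At \(\mu_H = 0\) the derivative equals \(\Phi(0) = 0.5\), so a unit input variance propagates to an output variance of 0.25, matching Eq. (14).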

Bayesian forward propagation proceeds through the feedforward network, with the first-layer linear transformation defined by formula (15).

$$H_{1}^{{(s)}}=XW_{1}^{{(s)}}+b_{1}^{{(s)}}$$
(15)

The expected value of the first-layer output distribution is expressed by formula (16). 

$$E[{H_1}]=XE[{W_1}]+E[{b_1}]=X{\mu _{q,{W_1}}}+{\mu _{q,{b_1}}}$$
(16)

Variance (diagonal approximation)35 is expressed by formula (17). 

$${\text{Var}}[{H_{1,ij}}]=\sum\limits_{{k=1}}^{d} {X_{{ik}}^{2}} \sigma _{{q,{W_1},kj}}^{2}+\sigma _{{q,{b_1},j}}^{2}$$
(17)

The distribution after activation can be expressed by formula (18).

$$H_{1}^{{{\text{act}}}}={\text{GELU}}({H_1})$$
(18)

Applying the Delta method36 yields the activated distributions of Eqs. (19) and (20).

$$E[H_{1}^{{{\text{act}}}}]={\text{GELU}}(E[{H_1}])$$
(19)
$${\text{Var}}[H_{1}^{{{\text{act}}}}]={\left( {\frac{{d{\text{GELU}}}}{{dh}}{|_{h=E[{H_1}]}}} \right)^2} \odot {\text{Var}}[{H_1}]$$
(20)

The second-layer linear transformation is expressed by formula (21).

$${Y^{(s)}}=H_{1}^{{{\text{act}}(s)}}W_{2}^{{(s)}}+b_{2}^{{(s)}}$$
(21)

Expectation is defined as formula (22).

$$E[Y]=E[H_{1}^{{{\text{act}}}}]E[{W_2}]+E[{b_2}]$$
(22)

The variance is given by formula (23).

$${\text{Var}}[{Y_{ij}}]=\sum\limits_{{k=1}}^{{{d_{ff}}}} E [H_{1}^{{{\text{act}}}}]_{{ik}}^{2}\sigma _{{q,{W_2},kj}}^{2}+{\text{Var}}{[H_{1}^{{{\text{act}}}}]_{ik}}{({\mu _{q,{W_2},kj}})^2}+\sigma _{{q,{b_2},j}}^{2}$$
(23)
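The moment propagation of Eqs. (16)-(17) and (22)-(23) reduces to a single diagonal-approximation rule for a Bayesian linear layer; a minimal NumPy sketch (illustrative names; setting the input variance to zero recovers the first-layer case of Eqs. 16-17):

```python
import numpy as np

def bayes_linear_moments(mean_in, var_in, mu_W, var_W, mu_b, var_b):
    # E[H] = E[X] mu_W + mu_b (Eqs. 16 and 22)
    mean_out = mean_in @ mu_W + mu_b
    # Var[H_ij] = sum_k E[X]_ik^2 var_W_kj + Var[X]_ik mu_W_kj^2 + var_b_j
    # (diagonal approximation, Eqs. 17 and 23)
    var_out = (mean_in**2) @ var_W + var_in @ (mu_W**2) + var_b
    return mean_out, var_out
```

With deterministic inputs and weights, the rule collapses to an ordinary affine transform with zero output variance, which is a useful sanity check.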

A Bayesian model with residual connections is constructed as follows. Let the input of the layer be \({X_l}\sim N({\mu _{{X_l}}},{\Sigma _{{X_l}}})\), the output of the feedforward network be \({F_l}\sim N({\mu _{{F_l}}},{\Sigma _{{F_l}}})\), and the residual connection be \({X_{l+1}}={X_l}+{F_l}\).

Since the linear combination of normal distributions is still a normal distribution, we obtain formula (24).

$${X_{l+1}}\sim N({\mu _{{X_l}}}+{\mu _{{F_l}}},{\Sigma _{{X_l}}}+{\Sigma _{{F_l}}})$$
(24)

For the layer normalization transformation, the output is given by formula (25).

$$Y=\gamma \odot \frac{{X - {\mu _X}}}{{{\sigma _X}}}+\beta$$
(25)

Among them, \({\mu _X}=\frac{1}{d}\sum\limits_{{i=1}}^{d} {{X_i}} ,\quad \sigma _{X}^{2}=\frac{1}{d}\sum\limits_{{i=1}}^{d} {{{({X_i} - {\mu _X})}^2}}\).

The approximate transformation of uncertainty is defined by formula (26).

$${\text{Var}}[{Y_i}] \approx {\left( {\frac{{{\gamma _i}}}{{{\sigma _X}}}} \right)^2}{\text{Var}}[{X_i}]$$
(26)

The evidence lower bound (ELBO)37 of the variational objective function is defined by formula (27).

$$\mathcal{L}={E_{q(\theta )}}[\log p(Y|X,\theta )] - KL[q(\theta )||p(\theta )]$$
(27)

The error term of the reconstruction is defined as formula (28).

$${E_{q(\theta )}}[\log p(Y|X,\theta )]={E_{q(\theta )}}\left[ { - \frac{1}{{2{\sigma ^2}}}||Y - f(X;\theta )|{|^2}} \right]$$
(28)

Introducing the KL divergence term38, each parameter is defined by formula (29).

$$KL[q({W_i})||p({W_i})]=\frac{1}{2}\left[ {{\text{tr}}(\Sigma _{{p,{W_i}}}^{{ - 1}}{\Sigma _{q,{W_i}}})+{{({\mu _{p,{W_i}}} - {\mu _{q,{W_i}}})}^T}\Sigma _{{p,{W_i}}}^{{ - 1}}({\mu _{p,{W_i}}} - {\mu _{q,{W_i}}}) - k+\log \frac{{|{\Sigma _{p,{W_i}}}|}}{{|{\Sigma _{q,{W_i}}}|}}} \right]$$
(29)
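For the diagonal covariances used throughout this layer, the KL term of Eq. (29) simplifies to a per-element sum; a minimal NumPy sketch with illustrative names:

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    # Eq. (29) specialized to diagonal covariances:
    # KL = 0.5 * sum( var_q/var_p + (mu_p - mu_q)^2/var_p - 1 + log(var_p/var_q) )
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q)**2 / var_p
                        - 1.0 + np.log(var_p / var_q))
```

The term vanishes when the posterior matches the prior, which is the behavior the ELBO regularizer in Eq. (27) relies on.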

Using the reparameterization technique39, the gradient update is given by Eqs. (30) and (31).

$${\nabla _{{\mu _q}}}\mathcal{L}={E_\epsilon }[{\nabla _{{\mu _q}}}f(X;{\mu _q}+{\sigma _q} \odot \epsilon )]$$
(30)
$${\nabla _{{\sigma _q}}}\mathcal{L}={E_\epsilon }[\epsilon \odot {\nabla _{{\mu _q}}}f(X;{\mu _q}+{\sigma _q} \odot \epsilon )]$$
(31)

Given the input distribution, the output of the Bayesian feedforward layer is given by Eq. (32).

$$(Y,{\Sigma _Y})={\text{BayesianFF}}(X,{\Sigma _X};q(\theta ))$$
(32)

End-to-end uncertainty propagation can be determined by formula (33).

$${\Sigma _Y}={J_f}({\mu _X}){\Sigma _X}{J_f}{({\mu _X})^T}+E[{\nabla _\theta }f({\mu _X};\theta ){\Sigma _\theta }{\nabla _\theta }f{({\mu _X};\theta )^T}]$$
(33)

Among them, \({J_f}({\mu _X})\) is the Jacobian matrix40 of the function with respect to the input, and \({\Sigma _\theta }\) is the covariance matrix of the parameters.

Multi-layer residual bayesian attention

Multi-layer Residual Bayesian Attention transforms traditional similarity measures into probabilistic correlation modeling by directly embedding Gaussian process kernel functions into attention score computation (Eqs. 41–42), where each attention head corresponds to independent Gaussian process priors (Eqs. 34–40) to learn multi-scale data correlation structures. In terms of uncertainty propagation in deep networks, while traditional residual connections address the vanishing gradient problem, the Bayesian residual connection mechanism achieves cumulative variance propagation (Eq. 52) and hierarchical variational inference (Eq. 51) through Beta distribution-modeled adaptive weights (Eqs. 46–47) and Bayesian extension of Layer Normalization (Eqs. 48–50), combined with the parameter-level uncertainty foundation established in Sect. 2.2.1, ensuring cumulative propagation of uncertainty in the depth direction of the network. The entire framework integrates parameter uncertainty from feedforward layers with correlation uncertainty from attention mechanisms through unified variational lower bound optimization (Eqs. 53–56), forming a complete deep probabilistic inference system.

The complete mathematical derivation process of the multi-layer residual Bayesian attention41 is as follows. The prior definitions of the query, key, value, and output projection weights in the l-th layer are given by formulas (34) to (37).

$$W_{Q}^{{(l)}}\sim N({\mu _{W_{Q}^{{(l)}}}},{\Sigma _{W_{Q}^{{(l)}}}})$$
(34)
$$W_{K}^{{(l)}}\sim N({\mu _{W_{K}^{{(l)}}}},{\Sigma _{W_{K}^{{(l)}}}})$$
(35)
$$W_{V}^{{(l)}}\sim N({\mu _{W_{V}^{{(l)}}}},{\Sigma _{W_{V}^{{(l)}}}})$$
(36)
$$W_{O}^{{(l)}}\sim N({\mu _{W_{O}^{{(l)}}}},{\Sigma _{W_{O}^{{(l)}}}})$$
(37)

The kernel function parameters of the h-th attention head are defined by formulas (38)–(40).

$$\ell _{h}^{{(l)}}\sim {\text{Gamma}}({\alpha _\ell },{\beta _\ell })$$
(38)
$$\sigma _{f}^{{(l),h}}\sim {\text{Gamma}}({\alpha _f},{\beta _f})$$
(39)
$$\sigma _{n}^{{(l),h}}\sim {\text{Gamma}}({\alpha _n},{\beta _n})$$
(40)

Among them, \(\ell _{h}^{{(l)}}\) is the length scale, \(\sigma _{f}^{{(l),h}}\) is the signal variance, and \(\sigma _{n}^{{(l),h}}\) is the noise variance.

The RBF kernel function is defined by formula (41).

$$k({x_i},{x_j})={(\sigma _{f}^{{(l),h}})^2}\exp \left( { - \frac{{\parallel{x_i} - {x_j}\parallel^{2}}}{{2{{(\ell _{h}^{{(l)}})}^2}}}} \right)$$
(41)

The attention score of the l-th layer and h-th head is expressed by formula (42).

$$S_{{ij}}^{{(l),h}}=\frac{{Q_{i}^{{(l),h}} \cdot K_{j}^{{(l),h}}}}{{\sqrt {{d_k}} }}+k({x_i},{x_j})$$
(42)

Softmax normalization is expressed as formula (43).

$$A_{{ij}}^{{(l),h}}=\frac{{\exp (S_{{ij}}^{{(l),h}})}}{{\sum\limits_{{k=1}}^{n} {\exp } (S_{{ik}}^{{(l),h}})}}$$
(43)

The output of each head is given by formula (44).

$${\text{hea}}{{\text{d}}^{(l),h}}=\sum\limits_{{j=1}}^{n} {A_{{ij}}^{{(l),h}}} V_{j}^{{(l),h}}$$
(44)

Multi-head concatenation can be represented by formula (45).

$${\text{MultiHea}}{{\text{d}}^{(l)}}={\text{Concat}}({\text{hea}}{{\text{d}}^{(l),1}}, \ldots ,{\text{hea}}{{\text{d}}^{(l),H}})W_{O}^{{(l)}}$$
(45)
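The kernel-augmented attention of Eqs. (41)-(45) can be sketched for a single head as follows (NumPy, illustrative names; multi-head concatenation and the output projection of Eq. 45 are omitted for brevity):

```python
import numpy as np

def rbf_kernel(X, sigma_f, length_scale):
    # Eq. (41): RBF kernel over the inputs
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return sigma_f**2 * np.exp(-d2 / (2.0 * length_scale**2))

def softmax(S):
    # Eq. (43): row-wise softmax with the usual max-shift for stability
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def rbf_attention_head(Q, K, V, X, sigma_f=1.0, length_scale=1.0):
    d_k = Q.shape[-1]
    # Eq. (42): scaled dot-product score plus the RBF kernel term
    S = Q @ K.T / np.sqrt(d_k) + rbf_kernel(X, sigma_f, length_scale)
    A = softmax(S)
    return A @ V  # Eq. (44)
```

Each row of the normalized attention matrix sums to one, so the head output remains a convex combination of the value vectors even with the additive kernel term.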

The residual connection weight priors are defined by Eqs. (46) and (47).

$$\beta _{{{\text{att}}}}^{{(l)}}\sim {\text{Beta}}({\gamma _3},{\gamma _4})$$
(46)
$$\alpha _{{{\text{res}}}}^{{(l)}}\sim {\text{Beta}}({\gamma _1},{\gamma _2})$$
(47)

The residual update equation is extended to a Bayesian form based on Layer Normalization, with the normalization statistics given by formulas (48)–(49).

$${\mu _{{\text{norm}}}}=\frac{1}{{{d_h}}}\sum\limits_{{i=1}}^{{{d_h}}} {{x_i}}$$
(48)
$$\sigma _{{{\text{norm}}}}^{2}=\frac{1}{{{d_h}}}\sum\limits_{{i=1}}^{{{d_h}}} {{{({x_i} - {\mu _{{\text{norm}}}})}^2}} +\epsilon$$
(49)

The residual connection is determined by formula (50).

$${H^{(l+1)}}=\alpha _{{{\text{res}}}}^{{(l)}} \cdot {\text{LayerNorm}}({H^{(l)}})+\beta _{{{\text{att}}}}^{{(l)}} \cdot {\text{MultiHea}}{{\text{d}}^{(l)}}$$
(50)

The variational posterior of the layer is given by formula (51).

$$q({\theta ^{(l)}})=\prod\limits_{{p \in \{ {\text{Q,K,V,O}}\} }} q (W_{p}^{{(l)}})\prod\limits_{{h=1}}^{H} q (\ell _{h}^{{(l)}})q(\sigma _{f}^{{(l),h}})q(\sigma _{n}^{{(l),h}})$$
(51)

Cumulative variance propagation is given by Eq. (52).

$${\text{Var}}({H^{(l+1)}})={(\alpha _{{{\text{res}}}}^{{(l)}})^2}{\text{Var}}({H^{(l)}})+{(\beta _{{{\text{att}}}}^{{(l)}})^2}{\text{Var}}({\text{MultiHea}}{{\text{d}}^{(l)}})$$
(52)
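Equations (50) and (52) together define how means and variances combine across a Bayesian residual connection; a minimal sketch (LayerNorm omitted for brevity, and the Beta-distributed weights treated as point values, which is an illustrative simplification):

```python
import numpy as np

def bayes_residual_update(h_mean, h_var, att_mean, att_var, alpha_res, beta_att):
    # Eq. (50): weighted residual combination of the means
    new_mean = alpha_res * h_mean + beta_att * att_mean
    # Eq. (52): cumulative variance propagation under independence
    new_var = alpha_res**2 * h_var + beta_att**2 * att_var
    return new_mean, new_var
```

Setting \(\alpha_{\text{res}} = 1\) and \(\beta_{\text{att}} = 0\) reduces the update to an identity map, which is the expected degenerate case.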

Introducing KL divergence regularization, the variational lower bound objective function becomes Eq. (53).

$$\mathcal{L}={{\mathbb{E}}_{q(\theta )}}[\log p(y|X,\theta )] - \sum\limits_{{l=1}}^{L} {KL} [q({\theta ^{(l)}})\parallel p({\theta ^{(l)}})]$$
(53)
$${\text{KL}}[q({\theta ^{(l)}})\parallel p({\theta ^{(l)}})]=\sum\limits_{{p \in \{ {\text{Q,K,V,O}}\} }} {{\text{KL}}} [q(W_{p}^{{(l)}})\parallel p(W_{p}^{{(l)}})]+\sum\limits_{{h=1}}^{H} {\sum\limits_{{\phi \in \{ \ell ,{\sigma _f},{\sigma _n}\} }} {{\text{KL}}} } [q(\phi _{h}^{{(l)}})\parallel p(\phi _{h}^{{(l)}})]$$

The complete model posterior representation is given by Eq. (54).

$$p(\Theta |X,y) \propto p(y|X,\Theta )\prod\limits_{{l=1}}^{L} p ({\theta ^{(l)}})$$
(54)

Among them, \(\Theta =\{ {\theta ^{(1)}}, \ldots ,{\theta ^{(L)}}\}\) is for all layer parameters.

The variational approximation is given by formula (55).

$$q(\Theta )=\prod\limits_{{l=1}}^{L} q ({\theta ^{(l)}})$$
(55)

The optimal variational parameter is obtained by maximizing the variational lower bound in Eq. (56).

$${\phi ^*}=\arg {\hbox{max} _\phi }\mathcal{L}(\phi )$$
(56)

Bayesian covariance construction

Bayesian covariance construction transforms the deep probabilistic inference outputs from Sect. 2.2.2 into covariance matrices that satisfy Gaussian process mathematical requirements, constructing basic covariance structures through outer product operations (Eqs. 57–58) to achieve mapping from network representations to probabilistic kernel functions. The covariance enhancement mechanism separately models epistemic uncertainty and aleatoric uncertainty: epistemic uncertainty through Jacobian matrix computation of parameter propagation (Eq. 60), and aleatoric uncertainty through independent variance prediction network modeling (Eqs. 61–63), with their fusion forming a complete uncertainty matrix (Eqs. 59, 65). The mathematical correction process ensures the mathematical validity of covariance matrices through symmetrization processing (Eq. 66), eigenvalue decomposition (Eq. 67), and positive definiteness correction (Eqs. 68–69). Multi-task extensions adopt tensor product decomposition structures (Eqs. 70–73), decoupling data correlations from task correlations in modeling. Covariance hyperparameter learning achieves adaptive adjustment of global scaling factors through variational posterior updates (Eqs. 74–78), while covariance quality assessment ensures numerical stability through condition number monitoring (Eq. 79), rank deficiency detection (Eq. 80), and Frobenius norm conditions (Eq. 81), ultimately establishing a complete mapping from network outputs to Gaussian process prediction covariance (Eqs. 83–85).

The Bayesian covariance42 construction process is as follows.

The final layer output is projected according to formula (57).

$${Z^{(L)}}={H^{(L)}}{W_{{\text{proj}}}}+{b_{{\text{proj}}}}$$
(57)

Among them,

$${W_{{\text{proj}}}}\sim N({\mu _{{W_{{\text{proj}}}}}},{\Sigma _{{W_{{\text{proj}}}}}})$$
$${b_{{\text{proj}}}}\sim N({\mu _{{b_{{\text{proj}}}}}},{\Sigma _{{b_{{\text{proj}}}}}})$$
$${Z^{(L)}} \in {{\mathbb{R}}^{n \times 1}}$$

The covariance matrix is constructed as formula (58).

$${K_{{\text{learned}}}}={Z^{(L)}}{({Z^{(L)}})^T}$$
(58)

Bayesian uncertainty enhancement is defined as formula (59).

$${K_{{\text{bayes}}}}={K_{{\text{learned}}}}+{\Sigma _{{\text{uncertainty}}}}$$
(59)

Among them, the uncertainty matrix is \({\Sigma _{{\text{uncertainty}}}}={\text{diag}}(\sigma _{{{\text{epistemic}}}}^{2})+{U_{{\text{aleatoric}}}}\).

The parameter uncertainty propagation is expressed by formula (60)

$$\sigma _{{{\text{epistemic}}}}^{2}({x_i})=\sum\limits_{{l=1}}^{L} {\sum\limits_{{p \in \{ {\text{Q,K,V,O}}\} }} {{\text{Tr}}} } [\frac{{\partial {H^{(l)}}}}{{\partial W_{p}^{{(l)}}}}{\Sigma _{q,W_{p}^{{(l)}}}}{(\frac{{\partial {H^{(l)}}}}{{\partial W_{p}^{{(l)}}}})^T}]$$
(60)

Heteroscedastic modeling43 is given by formula (61).

$$\sigma _{{{\text{aleatoric}}}}^{2}({x_i})=\exp ({f_{{\text{var}}}}({x_i}))$$
(61)

Among them, \({f_{{\text{var}}}}\) is an independent variance prediction network, represented by formula (62).

$${f_{{\text{var}}}}({x_i})=H_{i}^{{(L)}}{W_{{\text{var}}}}+{b_{{\text{var}}}}$$
(62)

The variance network parameter priors are given by formulas (63) and (64).

$${b_{{\text{var}}}}\sim N({\mu _{{b_{{\text{var}}}}}},{\Sigma _{{b_{{\text{var}}}}}})$$
(63)
$${W_{{\text{var}}}}\sim N({\mu _{{W_{{\text{var}}}}}},{\Sigma _{{W_{{\text{var}}}}}})$$
(64)

The aleatoric uncertainty matrix is given by formula (65).

$${U_{{\text{aleatoric}}}}={\text{diag}}(\sigma _{{{\text{aleatoric}}}}^{2}({x_1}), \ldots ,\sigma _{{{\text{aleatoric}}}}^{2}({x_n}))$$
(65)

Symmetrization treatment44 is given by formula (66).

$${K_{{\text{sym}}}}=\frac{1}{2}({K_{{\text{bayes}}}}+K_{{{\text{bayes}}}}^{T})$$
(66)

Eigenvalue decomposition45 is given by formula (67).

$${K_{{\text{sym}}}}=U\Lambda {U^T}$$
(67)

Positive definiteness correction is given by formula (68).

$$\tilde {\Lambda }=\hbox{max} (\Lambda ,\epsilon I)$$
(68)

Among them, \(\epsilon>0\) is the numerical stability parameter. The corrected covariance matrix is given by formula (69).

$${K_{{\text{final}}}}=U\tilde {\Lambda }{U^T}$$
(69)
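The correction pipeline of Eqs. (66)-(69) can be written directly with a symmetric eigendecomposition; a minimal NumPy sketch with an illustrative function name:

```python
import numpy as np

def make_positive_definite(K, eps=1e-6):
    K_sym = 0.5 * (K + K.T)             # Eq. (66): symmetrization
    lam, U = np.linalg.eigh(K_sym)      # Eq. (67): eigenvalue decomposition
    lam_clipped = np.maximum(lam, eps)  # Eq. (68): clip eigenvalues at eps
    return (U * lam_clipped) @ U.T      # Eq. (69): reassemble the matrix
```

The output is symmetric by construction and its smallest eigenvalue is at least \(\epsilon\), so downstream Cholesky factorizations and solves remain well-posed.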

The task covariance matrix is expressed by formula (70).

$${K_{{\text{task}}}} \in {{\mathbb{R}}^{T \times T}}$$
(70)

Among them, T is the number of tasks, and \({K_{{\text{task}}}}\) obeys an inverse Wishart prior, which can be expressed by formula (71).

$${K_{{\text{task}}}}\sim {\text{IW}}(\Psi ,\nu )$$
(71)

The complete multi-task covariance is defined by the Kronecker product structure in formula (72).

$${K_{{\text{multi}}}}={K_{{\text{final}}}} \otimes {K_{{\text{task}}}}$$
(72)

When tasks are independent, a block-diagonal approximation is used, defined by formula (73).

$${K_{{\text{multi}}}}={\text{blockdiag}}(K_{{{\text{final}}}}^{{(1)}}, \ldots ,K_{{{\text{final}}}}^{{(T)}})$$
(73)
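The multi-task structures of Eqs. (72)-(73) map onto a Kronecker product and a block-diagonal assembly respectively; a minimal NumPy sketch with illustrative names:

```python
import numpy as np

def kron_multitask(K_final, K_task):
    # Eq. (72): Kronecker-structured multi-task covariance
    return np.kron(K_final, K_task)

def blockdiag_multitask(K_list):
    # Eq. (73): block-diagonal approximation for independent tasks,
    # with one data covariance block per task
    n = sum(K.shape[0] for K in K_list)
    out = np.zeros((n, n))
    i = 0
    for K in K_list:
        m = K.shape[0]
        out[i:i + m, i:i + m] = K
        i += m
    return out
```

The Kronecker form preserves positive semidefiniteness of its factors, while the block-diagonal form avoids the \(nT \times nT\) dense product when inter-task correlations are negligible.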

The covariance hyperparameter learning process is as follows.

The global scaling factor is defined by formula (74).

$${\tau ^2}\sim {\text{Gamma}}({\alpha _\tau },{\beta _\tau })$$
(74)

The final covariance matrix is defined by formula (75).

$$K={\tau ^2}{K_{{\text{final}}}}+\sigma _{n}^{2}I$$
(75)

The variational posterior of the covariance parameter is updated according to formula (76).

$$q({\tau ^2})={\text{Gamma}}({\tilde {\alpha }_\tau },{\tilde {\beta }_\tau })$$
(76)

The update equation is defined by formulas (77) and (78).

$${\tilde {\alpha }_\tau }={\alpha _\tau }+\frac{n}{2}$$
(77)
$${\tilde {\beta }_\tau }={\beta _\tau }+\frac{1}{2}{\text{Tr}}[K_{{{\text{final}}}}^{{ - 1}}(y - \mu ){(y - \mu )^T}]$$
(78)

The process for assessing the quality of the covariance is as follows.

Condition number monitoring is performed using formula (79).

$$\kappa (K)=\frac{{{\lambda _{\hbox{max} }}(K)}}{{{\lambda _{\hbox{min} }}(K)}}$$
(79)

Rank deficiency detection is given by formula (80).

$${\text{rank}}(K)=\sum\limits_{{i=1}}^{n} {\mathbf{1}} [{\lambda _i}>{\epsilon _{{\text{rank}}}}]$$
(80)

The Frobenius norm condition46 is given by formula (81).

$$\left\| K \right\|_{F}=\sqrt {\sum\limits_{{i,j}} {K_{{ij}}^{2}} }$$
(81)

The logarithmic determinant of the covariance matrix is given by formula (82).

$$\log |K|=\sum\limits_{{i=1}}^{n} {\log } ({\lambda _i})$$
(82)
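The diagnostics of Eqs. (79)-(82) all follow from a single symmetric eigendecomposition; a minimal NumPy sketch (illustrative name; the condition number and log-determinant assume K is positive definite):

```python
import numpy as np

def covariance_diagnostics(K, eps_rank=1e-10):
    lam = np.linalg.eigvalsh(K)
    kappa = lam.max() / lam.min()       # Eq. (79): condition number
    rank = int(np.sum(lam > eps_rank))  # Eq. (80): numerical rank
    fro = np.sqrt(np.sum(K**2))         # Eq. (81): Frobenius norm
    logdet = np.sum(np.log(lam))        # Eq. (82): log-determinant
    return kappa, rank, fro, logdet
```

Computing the log-determinant from eigenvalues rather than from `np.linalg.det` avoids overflow and underflow for large, nearly singular covariance matrices.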

The Bayesian updating rule is as follows.

The likelihood function is given by formula (83).

$$p(y|X,K)=N(y;\mu ,K)$$
(83)

The posterior covariance is given by formula (84).

$${K_{{\text{post}}}}={({K^{ - 1}}+{\Phi ^T}\Sigma _{y}^{{ - 1}}\Phi )^{ - 1}}$$
(84)

where \(\Phi\) is the feature mapping matrix, and \({\Sigma _y}\) is the observation noise covariance. The test point prediction covariance is given by formula (85).

$${\text{Cov}}[{f_*}|X,y,{X_*}]={K_{**}} - {K_{*X}}K_{{XX}}^{{ - 1}}{K_{X*}}$$
(85)

Among them: \({K_{**}}=K({X_*},{X_*})\) is the covariance between test points, \({K_{*X}}=K({X_*},X)\) is the cross-covariance between test and training points, and \({K_{XX}}=K(X,X)\) is the covariance of training points.
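The predictive covariance of Eq. (85) is the standard Gaussian process computation; a minimal NumPy sketch (a small jitter term is added for numerical stability, which is an implementation choice rather than part of Eq. 85):

```python
import numpy as np

def gp_predictive_cov(K_xx, K_sx, K_ss, jitter=1e-8):
    # Eq. (85): Cov[f*] = K** - K*X (K_XX)^(-1) K_X*
    # K_xx: training covariance, K_sx: test-train cross-covariance,
    # K_ss: test covariance; jitter regularizes the linear solve
    n = K_xx.shape[0]
    sol = np.linalg.solve(K_xx + jitter * np.eye(n), K_sx.T)
    return K_ss - K_sx @ sol
```

When the test points coincide with the training points, the predictive covariance collapses toward zero, reflecting that the noise-free GP interpolates observed data exactly.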

Model experiments

This experiment conducts testing on six datasets, with specific dataset information referenced in Tables 1, 2, 3, 4, 5 and 6. The following metrics are primarily selected to evaluate experimental results (Eqs. 86–91): R² measures the proportion of target variable variance explained by the model, reflecting the overall accuracy of predictions.

$${R^2}=1 - \frac{{\sum {{{({y_i} - {{\hat {y}}_i})}^2}} }}{{\sum {{{({y_i} - \bar {y})}^2}} }}$$
(86)

Where \({y_i}\) represents the true values (actual observed data), \({\hat {y}_i}\) represents the model predicted values, and \(\bar {y}\) represents the mean of true values. The Prediction Interval Coverage Probability (PICP) metric is introduced to evaluate the reliability of prediction intervals47, which reflects the probability that actual observed values fall within the upper and lower bounds of the prediction intervals.

$${\text{PICP}}=\frac{1}{{{N_t}}}\sum\limits_{{i=1}}^{{{N_t}}} {\kappa _{i}^{{(\alpha )}}}$$
(87)

Where \({N_t}\) is the number of prediction samples, \(\kappa\) is a boolean variable, and \(\alpha\) is the confidence level parameter. The Prediction Interval Normalized Average Width (PINAW) metric is introduced to reflect the conservativeness of predictions48,49,50. Purely pursuing reliability results in excessively wide prediction intervals, which fail to provide effective uncertainty information for predicted values and lose decision-making value.

$${\text{PINAW}}=\frac{1}{{{N_t}R}}\sum\limits_{{i=1}}^{{{N_t}}} {\left[ {\tilde {U}_{i}^{{(\alpha )}}({x_i}) - \tilde {L}_{i}^{{(\alpha )}}({x_i})} \right]}$$
(88)

Where \({N_t}\) is the number of prediction samples, R is the range of prediction target values used for normalizing the average width, \(\tilde {U}_{i}^{{(\alpha )}}({x_i})\) is the upper bound of the prediction interval for the i-th sample, \(\tilde {L}_{i}^{{(\alpha )}}({x_i})\) is the lower bound of the prediction interval for the i-th sample, and \(\alpha\) is the confidence level parameter. Expected Calibration Error (ECE) is an important metric for measuring model calibration performance51,52, used to evaluate the degree of consistency between the model’s predicted confidence and actual accuracy9.

$$\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N}\,\big|\text{acc}(b) - \text{conf}(b)\big|$$
(89)

Where binning refers to dividing the continuous confidence range into several discrete intervals, B represents the total number of bins, \(n_b\) represents the number of samples in the b-th bin, N represents the total number of samples, acc(b) represents the empirical accuracy of samples in the b-th bin, and conf(b) represents the average predicted confidence of samples in the b-th bin. Continuous Ranked Probability Score (CRPS) is an important metric for evaluating the quality of probabilistic predictions53.

$$\text{CRPS}(F, y) = \int \left(F(x) - \mathbf{1}_{\{x \geqslant y\}}\right)^2 dx$$
(90)

Where \(F(x)\) represents the cumulative distribution function (CDF) predicted by the model54, which describes the probability that a random variable is less than or equal to a certain value x. y is the observed true value, i.e., the actual outcome that occurred. x is the integration variable, ranging over all possible values. The indicator function \(\mathbf{1}_{\{x \geqslant y\}}\) is a step function: when x is greater than or equal to the true value y, the function value is 1; when x is less than y, the function value is 0.

Area Under the Sparsification Error curve (AUSE) measures the degree of consistency between uncertainty scores and true errors, i.e., the extent to which uncertainty estimates can reflect the model’s true mistakes55.

AUSE is not given a formal closed-form definition in the official TorchUncertainty documentation, but based on its implementation it can be expressed as Eq. (91).

$$\text{AUSE} = \int_0^1 \text{SE}(f)\, df$$
(91)

Where \({\text{SE}}(f)\) is the sparsification error after removing a fraction f of high uncertainty samples, and \(f \in [0,1]\) represents the proportion of removed samples.
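To make the definitions in Eqs. (86)–(91) concrete, the sketch below implements each metric with NumPy. This is our illustrative code, not the paper's implementation: variable names are assumptions, and CRPS is estimated from predictive samples via the standard energy-score identity rather than by direct integration of the squared CDF difference.

```python
import numpy as np

def r2(y, yhat):
    # Eq. (86): variance explained by the predictions
    return 1.0 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def picp(y, lower, upper):
    # Eq. (87): fraction of observations inside their prediction interval
    return ((y >= lower) & (y <= upper)).mean()

def pinaw(y, lower, upper):
    # Eq. (88): mean interval width, normalized by the target range R
    return (upper - lower).mean() / (y.max() - y.min())

def ece(conf, correct, n_bins=10):
    # Eq. (89): coverage-weighted |accuracy - confidence| gap over B bins
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for b in range(n_bins):
        mask = (conf > edges[b]) & (conf <= edges[b + 1])
        if mask.any():
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err

def crps_from_samples(samples, y):
    # Eq. (90) via the energy-score identity:
    # CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, with X, X' ~ F i.i.d.
    s = np.asarray(samples, dtype=float)
    return np.abs(s - y).mean() - 0.5 * np.abs(s[:, None] - s[None, :]).mean()

def ause(errors, uncertainty):
    # Eq. (91): mean sparsification error SE(f) over removal fractions f,
    # comparing uncertainty-ordered removal against the oracle ordering
    by_unc = errors[np.argsort(-uncertainty)]
    oracle = errors[np.argsort(-errors)]
    n = len(errors)
    return np.mean([by_unc[k:].mean() - oracle[k:].mean() for k in range(n)])
```

A perfectly calibrated uncertainty ranking gives AUSE of exactly zero, since the uncertainty-ordered curve then coincides with the oracle curve.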

This research introduces Deep Ensemble, Monte Carlo Dropout (MC Dropout), Bayesian Deep Learning (BDL), and Gaussian Process (GP) methods for comparison with RBA, with conceptual explanations as follows.

The Deep Ensemble method is based on statistical ensemble theory, constructing predictive ensemble models by training multiple independently initialized deep neural networks. This method follows the theoretical framework of the Condorcet jury theorem, utilizing weight distribution differences and loss trajectory diversity among models to achieve marginal likelihood estimation of predictive distributions. In practice, this method employs negative log-likelihood as the loss function, enabling individual networks to simultaneously predict mean and variance, obtaining well-calibrated uncertainty estimates through ensemble inference.
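The two ingredients described above — per-network negative log-likelihood training and ensemble combination of mean/variance heads — can be sketched as follows. This is a minimal illustration under the assumption that each of the M networks outputs a Gaussian mean and variance; the function names are ours:

```python
import numpy as np

def gaussian_nll(y, mu, var):
    # Heteroscedastic negative log-likelihood used to train each member
    # (constant terms dropped); var must be kept positive, e.g. via softplus
    return 0.5 * (np.log(var) + (y - mu) ** 2 / var).mean()

def ensemble_predict(mus, vars_):
    # Moment-match the uniform mixture of M Gaussian predictions:
    # the spread of the member means (epistemic) adds to the mean
    # predicted variance (aleatoric)
    mu = mus.mean(axis=0)
    var = (vars_ + mus ** 2).mean(axis=0) - mu ** 2
    return mu, var
```

The variance decomposition in `ensemble_predict` is what lets the ensemble separate epistemic from aleatoric uncertainty: disagreement among members inflates the first term even when each member is individually confident.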

Monte Carlo Dropout is built on the theoretical foundation of variational inference, reinterpreting dropout regularization as approximate Bayesian inference in deep Gaussian processes. This method achieves Monte Carlo sampling approximation of weight posterior distributions by maintaining dropout activation during the inference phase. Theoretically, this technique transforms deterministic neural networks into stochastic models, generating predictive distributions through multiple forward passes to quantify the model’s epistemic uncertainty. Its theoretical advantage lies in achieving Bayesian neural network approximation without modifying existing network architectures, though the quality of its uncertainty estimates is significantly affected by dropout rates and weight regularization parameters.
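The test-time procedure described above can be sketched in NumPy for a hypothetical one-hidden-layer regressor (the weights `W1`, `W2` and the dropout placement are illustrative placeholders, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, W2, p=0.1, T=100):
    # Keep dropout active at test time: each forward pass samples a fresh
    # Bernoulli mask, so the T passes are Monte Carlo draws from the
    # approximate weight posterior.
    preds = []
    for _ in range(T):
        h = np.maximum(0.0, x @ W1)          # hidden ReLU layer
        mask = rng.random(h.shape) >= p      # dropout mask
        h = h * mask / (1.0 - p)             # inverted-dropout scaling
        preds.append(h @ W2)
    preds = np.stack(preds)
    # Predictive mean and (epistemic) variance across stochastic passes
    return preds.mean(axis=0), preds.var(axis=0)
```

Note that both the dropout rate `p` and the number of passes `T` directly shape the variance estimate, which is the sensitivity the paragraph above refers to.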

The Bayesian Deep Learning (BDL) framework combines Bayesian statistical principles with deep learning architectures, achieving probabilistic representation of parameters by imposing prior distributions on neural network weights and updating posterior distributions using observed data. This method can distinguish and quantify two types of uncertainty: aleatoric uncertainty (reflecting the inherent randomness of observation noise) and epistemic uncertainty (reflecting insufficient knowledge of model parameters). The theoretical foundation of BDL stems from Bayes’ theorem, approximating intractable posterior distributions through variational inference or Markov Chain Monte Carlo methods, providing a principled uncertainty quantification framework for safety-critical applications.
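The variational-inference machinery mentioned above can be sketched for a single mean-field Gaussian weight vector. This is a hedged illustration under the common N(0, 1) prior assumption; the names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weight(mu, rho):
    # Reparameterization trick: w = mu + softplus(rho) * eps, eps ~ N(0, 1),
    # keeping the variational posterior differentiable in (mu, rho)
    sigma = np.log1p(np.exp(rho))
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, rho):
    # Closed-form KL(q(w) || N(0, I)) term of the ELBO objective
    sigma = np.log1p(np.exp(rho))
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))
```

Training then minimizes the data negative log-likelihood (evaluated on sampled weights) plus this KL term, which is the variational approximation to the intractable posterior that the paragraph describes.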

Gaussian Process (GP) constitutes a non-parametric Bayesian method that defines probability distributions over function spaces to achieve regression and classification tasks. The theoretical foundation of GP is built on random process theory, where the joint distribution of any finite number of function values follows a multivariate Gaussian distribution. This method encodes prior knowledge through kernel functions and obtains posterior predictive distributions by combining observed data using the Bayesian inference framework. The core advantage of GP lies in its analytical tractability, directly providing closed-form solutions for predictive mean and variance without requiring additional uncertainty quantification steps. In uncertainty quantification applications, GP is widely used for surrogate modeling, Bayesian optimization, and sensitivity analysis tasks.
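The closed-form posterior described above can be sketched for one-dimensional inputs with an RBF kernel (a minimal illustration; the noise level and length-scale are assumed hyperparameters):

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel: encodes the smoothness prior
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2, ls=1.0):
    # Closed-form GP regression posterior mean and variance at x_test
    K = rbf(x_train, x_train, ls) + noise * np.eye(len(x_train))
    Ks = rbf(x_test, x_train, ls)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = rbf(x_test, x_test, ls) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)
```

The analytic tractability noted above is visible here: mean and variance come out of two linear solves, with no sampling or ensembling step.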

Table 1 Energy efficiency Dataset
Table 2 Household_power_timeseries
Table 3 California housing Dataset
Table 4 Student_Performance Dataset
Table 5 Power_Plant Dataset
Table 6 Yacht_Hydrodynamics Dataset

This study first uses a lower-dimensional dataset Energy Efficiency, which belongs to small data and contains engineering features. As shown in the experimental results in Fig. 2: (a) RBA achieved a prediction accuracy of approximately 0.972, surpassing the other three methods, including Bayesian Deep Learning (BDL) at 0.656, Gaussian Process at 0.859, and MC Dropout at 0.741. This performance difference is further validated in the analysis in (b), where RBA achieved a relatively low calibration error of 0.1877. The temporal analysis in (c) demonstrates the dynamic behavioral characteristics of each method in uncertainty quantification. RBA and Gaussian Process maintained close to or exceeding 90% target coverage probability at most sampling points, demonstrating good uncertainty calibration capability. In the evaluation in (d), RBA’s distribution concentrates in lower numerical intervals, indicating that its predicted probability distributions have higher consistency with true observed values. In contrast, BDL and MC Dropout show higher CRPS values, meaning their probabilistic predictions exhibit greater deviations. Although Gaussian Process performs stably in certain cases, the variability in its CRPS distribution indicates its limitations in handling complex data structures.

Fig. 2
figure 2

Energy Efficiency experimental results. (c) PICP experimental results, (d) CRPS experimental results

To validate RBA’s generalization capability, this study further introduces the larger-scale time series prediction regression task dataset in Table 2 to explore the dataset characteristics that RBA optimally adapts to. Figure 3 shows that the RBA model has a coefficient of determination of 0.920, root mean square error of 0.23 kW, and correlation coefficient of 0.963. From the prediction-actual value scatter plot observation, RBA’s data points are distributed around the ideal prediction line.

Figure 4 shows that RBA’s residual mean is 0.030 kW, standard deviation is 0.236 kW, and trend line slope is −0.029, indicating relatively small systematic bias between residuals and predicted values. The BDL model has an uncertainty coverage probability of 0.948, but relatively low prediction accuracy with R² of 0.820 and RMSE of 0.35 kW, and the residual distribution exhibits heteroscedastic characteristics. Gaussian Process has R² of 0.902, RMSE of 0.26 kW, and CRPS score of 0.096. The MC Dropout method has R² of 0.842 and uncertainty coverage probability of 0.873, showing relatively weaker performance across multiple evaluation metrics.

Figure 5 shows that the RBA model performs stably in capturing power consumption variation patterns, with uncertainty bandwidth indicator PINAW of 0.180. When handling power peaks, RBA maintains relatively small prediction errors and moderate uncertainty ranges. Different models show variations in prediction performance across different time periods, with RBA demonstrating relatively consistent performance in maintaining the balance between prediction accuracy and uncertainty quantification. Experimental results indicate that RBA’s residual attention mechanism has a certain role in improving prediction performance and uncertainty estimation.

Fig. 3
figure 3

Household_power_timeseries prediction accuracy experiment results

Fig. 4
figure 4

Household_power_timeseries results of residual analysis experiment

Fig. 5
figure 5

Household_power_timeseries time series forecasting experiment results

Based on the Table 3 dataset, this study conducted both parallel and ablation experiments to explore the contribution of each RBA component. Figure 6 shows point estimation accuracy, where the RBA method has a root mean square error of 0.5089 and coefficient of determination of 0.8027; the MC Dropout method has corresponding indicators of 0.5131 and 0.7994 respectively; the BDL method has 0.5179 and 0.7957. In geospatial visualization, all three methods can capture the basic spatial distribution characteristics of real house price data, with prediction results showing similarity to true values in spatial patterns.

Figure 7 shows uncertainty calibration evaluation, where the RBA method has an Expected Calibration Error of 0.0660, Prediction Interval Coverage Probability reaching 0.9638, and Continuous Ranked Probability Score of 0.2898. From spatial distribution characteristics observation, RBA method’s uncertainty estimation presents specific patterns in geographic distribution, with high uncertainty regions mainly distributed in geographic locations with sparse data or dramatic house price changes.

Figure 8 shows performance differences among models in handling spatial dependency. Introducing Moran’s I spatial autocorrelation analysis can quantitatively evaluate each model’s capability to capture spatial structure. The RBA model’s Global Moran’s I value is 0.2211 with Z-Score of 37.250, indicating that its residuals still exhibit moderate spatial clustering. The MC Dropout model’s Moran’s I value rises to 0.2519 with Z-Score of 42.438, showing stronger spatial autocorrelation, meaning this model systematically overestimates or underestimates house prices in certain geographic regions. The BDL model performs worst, with Moran’s I value reaching 0.2728 and Z-Score of 45.963, indicating its residuals have the strongest spatial clustering characteristics. From spatial distribution patterns, all three models show significant positive spatial autocorrelation (red regions) in the San Francisco Bay Area (longitude − 122° to −121°, latitude 37.5° to 38.5°) and Los Angeles area (longitude − 118° to −117°, latitude 33.5° to 34.5°), indicating that prediction errors in these high house price regions exhibit clustered distribution.
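For reference, Global Moran's I on model residuals can be computed as below. This is a hedged sketch using a binary distance-band weight matrix; that weighting scheme is our assumption, as the paper does not specify the one used:

```python
import numpy as np

def morans_i(residuals, coords, band=1.0):
    # Global Moran's I: positive values indicate spatial clustering of
    # over-/under-prediction, as in the Bay Area and Los Angeles regions.
    z = residuals - residuals.mean()
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((d > 0) & (d <= band)).astype(float)   # binary distance-band weights
    n = len(residuals)
    return (n / w.sum()) * (w * np.outer(z, z)).sum() / (z ** 2).sum()
```

Perfectly alternating residuals along a line yield I = −1 (dispersion), while residuals clustered by sign push I toward +1, matching the interpretation of the values reported above.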

Based on the ablation experiment results analysis using the California housing dataset (Tables 7 and 8), experimental data reveals differential impacts of different Bayesian components on model performance. From regression performance indicators, the configuration removing the Bayesian feedforward network shows relative advantages across three core indicators, with RMSE of 0.6204, MAE of 0.4306, and R² reaching 0.7062. In contrast, the configuration removing the Bayesian attention mechanism shows obvious performance degradation, with RMSE rising to 0.6516 and R² declining to 0.6760. The configuration removing the Bayesian covariance component lies between the two, showing moderate performance.

Cross-validation results further supplement the evaluation dimension of model stability. From the coefficient of variation perspective, the configuration removing Bayesian covariance exhibits higher stability with CV coefficient of 0.0111, while the configuration removing Bayesian attention shows relatively larger performance fluctuations with CV coefficient reaching 0.0207. This difference indicates significant distinctions in how different components affect model generalization capability.

Experimental results indicate that the Bayesian attention mechanism plays an important functional role in the model architecture, with its absence leading to dual degradation in both performance and stability. The Bayesian covariance component primarily contributes to the stability dimension, having a positive effect on the model’s cross-dataset generalization.

Fig. 6
figure 6

California Housing Dataset experimental results on spatial mapping relationships

Fig. 7
figure 7

California Housing Dataset experimental results of uncertainty analysis

Fig. 8
figure 8

California Housing Dataset results of the spatially dependent experiment. (a) RBA experimental results, (b) MC Dropout experimental results, (c) BDL experimental results

Table 7 Core regression performance metrics
Table 8 Cross-validation robustness analysis

To improve the universality of experimental results, this study introduces the Deep Ensemble algorithm to repeat experiments on new datasets, conducting in-depth experimental testing using three models (MC Dropout, Deep Ensemble, RBA) on two strong nonlinear datasets (Student Performance Dataset, Power Plant Dataset) and one nonlinear dataset with physical property coupling (Yacht Hydrodynamics Dataset). Each dataset is divided according to an 80% training, 16% validation, 4% testing ratio. Data preprocessing adopts standardized scaling, ensuring features and target variables have zero mean and unit variance, eliminating the impact of dimensional differences on model training. This experiment is based on the PyTorch framework, accelerated using CUDA 11.8.

The experimental results in Fig. 9 are as follows: (a) The R² performance plot reveals dramatic fluctuation characteristics among the three methods, with the RBA method showing substantial variation from 61.5% to 73.8%, while MC Dropout and Deep Ensemble oscillate in ranges of 59.8%−67.2% and 58.3%−71.4% respectively. This high inter-trial variability reflects the inherent complexity and nonlinear characteristics of student performance data. However, in terms of prediction interval coverage probability, the RBA method maintains stable coverage of 85.3%−95.2%, approaching the theoretical target value of 95%, while MC Dropout only reaches a suboptimal level of 69.8%−78.5%, and Deep Ensemble shows good calibration at 89.7%−94.1%. Meanwhile, the PINAW indicator shows that RBA achieves a balance between coverage and precision within an interval width of 41.2%−51.3%, superior to Deep Ensemble’s range of 47.8%−56.2%.

(b) shows distinctly different performance characteristics, with extreme volatility in R² values being particularly prominent. Each method oscillates dramatically between peaks of 84.8%−85.4% and valleys of 84.3%−84.9%. This “sawtooth” pattern suggests possible periodic disturbances or systematic noise in power plant operational data. Both RBA and Deep Ensemble maintain excellent coverage rates above 95% on the PICP indicator, while MC Dropout remains at an insufficient level of 88.7%−91.2%, indicating systematic underfitting problems when processing industrial process data. PINAW results further confirm that the RBA method achieves the narrowest prediction intervals within the range of 22.1%−27.8%, showing approximately 15% efficiency improvement compared to Deep Ensemble’s 23.4%−29.7%.

(c) presents the most extreme performance differentiation, with R² indicators showing negative values. The RBA method fluctuates dramatically between −27.3% and 3.2%, MC Dropout ranges from −22.1% to 1.8%, while Deep Ensemble shows persistent negative values in the range of −18.5% to −14.2%. This widespread negative R² phenomenon indicates that complex fluid-structure interactions in yacht resistance prediction tasks exceed the expressive capabilities of current model architectures. However, paradoxically, Deep Ensemble shows the highest PICP coverage of 82.1%−98.3% in uncertainty quantification indicators, but this high coverage is achieved at the cost of extremely wide PINAW of 60.3%−87.4%, indicating that it compensates for severe prediction accuracy deficiencies through overly conservative interval estimation. In contrast, RBA controls PINAW within a narrower range of 40.1%−43.2% while maintaining reasonable coverage, demonstrating a more balanced uncertainty quantification strategy.

Fig. 9
figure 9

Deep validation of experimental results. (a). Student_Performance Dataset, (b). Power_Plant Dataset, (c). Yacht_Hydrodynamics Dataset

Figure 10 illustrates the reasons for the generally poor performance of models on the Yacht_Hydrodynamics Dataset. As the Froude number increases, the flow field undergoes a complete physical transition from viscosity-dominated displacement flow (Fr = 0.16) to a viscous-wave coupling transitional state (Fr = 0.32), and then to a wave resistance-dominated semi-planing regime (Fr = 0.44). Each stage corresponds to different governing equations and boundary conditions: at low speeds, it follows the Navier-Stokes equations for viscous flow; at high speeds, it requires consideration of potential flow theory under nonlinear free surface boundary conditions; while the transition region involves complex phenomena such as turbulent transition and flow separation. This multi-scale, multi-physics strongly coupled nonlinear characteristic manifests mathematically as discontinuous mapping relationships in high-dimensional parameter space, which fundamentally contradicts the smooth mapping mechanisms assumed by neural network architectures. This leads to inherent difficulties such as gradient vanishing, local optima, and limited generalization capability when RBA, Monte Carlo Dropout, and Deep Ensemble methods handle such marine hydrodynamic problems with distinct physical mechanism transitions.

The complexity of these physical mechanisms is further confirmed in Fig. 11, where all six key hull design parameters exhibit highly nonlinear relationships with residual resistance. Forward positioning of the longitudinal center of buoyancy triggers wave interference patterns leading to exponential resistance growth; the optimal slenderness ratio of the prismatic coefficient around 0.565 reflects the sensitive balance between pressure distribution and flow separation; the length-displacement ratio and beam-draft ratio embody the competing effects between hull fineness and stability requirements; while the Froude number exhibits the inherent cubic scaling characteristics of wave-making physics, establishing its status as the dominant parameter in resistance prediction. These complex geometric-hydrodynamic coupling relationships cannot be effectively captured through traditional machine learning’s linear or simple nonlinear mapping, essentially requiring physics-informed modeling methods that incorporate naval architecture principles to accurately describe the multi-parameter interaction mechanisms in yacht resistance prediction.

Figure 12 presents the Froude number complexity analysis, clearly identifying three distinctly different operational domains: the viscous force-dominated displacement regime, the mixed physical characteristics semi-displacement transition region, and the planing regime where wave-making resistance exhibits dramatic cubic scaling. The discontinuous behavioral transitions that occur when hull speed crosses critical thresholds violate the continuity assumptions of standard statistical modeling. Residual analysis shows that although incorporating physical scaling laws achieves significant improvements, the complex interactions between hull shape factors, dynamic trim effects, and Reynolds number dependencies across operational regimes still generate substantial unexplained variance.

Fig. 10
figure 10

Evolution of Hydrodynamic Flow Fields in Yachts. (a). Low-speed flow field diagram, (Fr = 0.16), (b). Medium-speed flow field diagram, (Fr = 0.32), (c). High-speed flow field diagram, (Fr = 0.44)

Fig. 11
figure 11

Feature-Target relationship analysis

Fig. 12
figure 12

Froude number complexity analysis

Based on the analysis of Figs. 10, 11 and 12, 50 independent repeatability experiments were conducted to compare the lateral differentiation characteristics of the models. As shown in Fig. 13, (a) indicates that RBA and MC Dropout achieved R² values of 11.47% ± 14.24% and 9.66% ± 14.73%, respectively. Although the absolute values are not high, considering the physical complexity of yacht resistance prediction, these results are still within an acceptable range. However, the Deep Ensemble method exhibited extremely anomalous negative R² values (−359.88% ± 141.45%), a phenomenon indicating that this method experienced severe model failure when handling the specific physical constraints and nonlinear characteristics of yacht hydrodynamic data.

Figure 13 (b) shows that RBA and MC Dropout methods achieved ECE values of 0.441 ± 0.069 and 0.427 ± 0.058, respectively, which are statistically equivalent, while the Deep Ensemble method’s ECE value reached 0.667 ± 0.051, significantly higher than the former two, indicating that this method suffers from a mismatch between prediction confidence and actual accuracy in modeling complex nonlinear systems such as yacht hydrodynamics.

Figure 13 (c) demonstrates that RBA AUSE values are primarily concentrated in the range of 5.65 ± 1.28, the Monte Carlo Dropout method achieved a mean of 5.97 ± 1.38, while the Deep Ensemble method’s AUSE value reached as high as 25.35 ± 6.20, indicating systematic overestimation problems in its uncertainty estimation for this specific hydrodynamic task.

Fig. 13
figure 13

Yacht_Hydrodynamics Dataset in-depth validation of experimental results. (a) R² performance, (b) ECE performance, (c) AUSE performance

Discussion

From the systematic analysis of experimental results, a performance gradient phenomenon of profound significance can be observed: from RBA’s excellent performance on structured engineering data (Energy Efficiency dataset) to stable performance on time series data (Household Power dataset), and then to challenging performance on complex physical systems (Yacht Hydrodynamics dataset). This performance gradient is not simply a matter of method superiority or inferiority, but reveals the fundamental compatibility relationship between RBA’s architectural assumptions and different data generating mechanisms. Specifically, when the underlying data structure follows relatively smooth nonlinear mappings and learnable statistical dependencies exist between features, RBA’s Bayesian feedforward layers can effectively model parameter uncertainty, the multi-layer residual attention mechanism can capture relevant feature interactions, and the covariance construction module can generate mathematically valid uncertainty representations. However, when the system transitions to multi-physics strongly coupled regime transition scenarios, the data’s generating process fundamentally violates the smoothness assumptions that neural network architectures rely upon, causing all gradient-based optimization methods to face similar predicaments.

More in-depth mechanistic analysis reveals the non-trivial synergistic mechanisms among RBA’s three components: ablation experiments show that removal of the Bayesian attention mechanism has the most significant impact on performance, a finding that points to an important architectural insight—in deep probabilistic models, the attention mechanism not only serves the function of feature selection, but more importantly plays a key role as an uncertainty propagation pathway. By directly embedding RBF kernels into attention score computation, RBA achieves a paradigm shift from deterministic similarity measures to probabilistic correlation modeling. This design enables the model to perform principled uncertainty quantification at the feature level, rather than merely post-hoc estimation at the output level. Meanwhile, Bayesian residual connections, through adaptive weighting mechanisms modeled by Beta distributions, achieve inter-layer propagation and accumulation of uncertainty. This design philosophy embodies a profound understanding of unified modeling of information flow and uncertainty flow in deep networks.

From a comparative methodology perspective, RBA’s advantages over established baselines such as Monte Carlo Dropout and Deep Ensemble essentially stem from the fundamental difference between its end-to-end uncertainty modeling approach and these methods’ approximation-based strategies. MC Dropout approximates weight uncertainty through stochastic regularization, but its uncertainty quality is highly dependent on the specific configuration of dropout rate and network architecture; Deep Ensemble, while capable of providing robust uncertainty estimates, is limited in scalability due to computational overhead and potential diversity-accuracy trade-offs inherent in its ensemble nature. In comparison, RBA achieves a better balance between computational efficiency and theoretical rigor through architectural integration, although this balance still has inherent limitations when facing extremely high complexity tasks.

Particularly worthy of deep consideration is that the modeling challenges of complex physical systems like Yacht Hydrodynamics provide important empirical insights for the entire uncertainty quantification field: when system behavior is governed by multiple, competing physical mechanisms with clear regime boundaries, purely data-driven approaches, regardless of their architectural sophistication, face fundamental expressivity constraints. This observation not only objectively delineates the technical boundaries of current deep learning methods, but also provides clear guidance for future research directions—truly solving such problems requires establishing deeper integration between domain-specific physical priors and general-purpose learning algorithms, which extends far beyond the scope of pure architectural innovation. Therefore, RBA’s value lies not only in its practical utility as a working solution, but more importantly in the methodological foundation and empirical benchmarks it provides for understanding and designing next-generation physics-informed probabilistic models.

Conclusion

This study presents the Residual Bayesian Attention (RBA) framework, which integrates Bayesian inference with Transformer architecture through three coupled components: Bayesian feedforward layers, multi-layer residual Bayesian attention, and Bayesian covariance construction. The framework achieves end-to-end probabilistic modeling via variational inference optimization.

Experimental evaluation across six benchmark datasets reveals performance characteristics that depend on data structure and complexity. On structured engineering and time series tasks, RBA demonstrated competitive performance with R² ranging from 0.920 to 0.972 and reasonable calibration quality. Ablation experiments confirm that the Bayesian attention mechanism contributes substantively to both accuracy and stability. However, when applied to complex physical systems such as yacht hydrodynamics involving regime transitions across flow regimes, RBA exhibited R² values ranging from − 27.3% to 3.2%, similar to difficulties encountered by Monte Carlo Dropout and Deep Ensemble methods.

The comparative analysis indicates that RBA provides advantages in prediction interval calibration on medium-scale structured datasets, particularly in maintaining consistency between predicted confidence and actual accuracy. The framework’s uncertainty decomposition offers interpretability for decision-support applications, though computational overhead and hyperparameter sensitivity remain practical considerations. These findings suggest that RBA represents a viable solution for uncertainty quantification in scenarios with learnable statistical dependencies and moderate nonlinearity. Future research directions for RBA involve incorporating domain-specific physical priors to improve the model’s generalization performance in complex physical systems.