A variational framework for residual-based adaptivity in neural PDE solvers and operator learning

Toscano, Juan Diego; Chen, Daniel T.; Ooomen, Vivek; Darbon, Jérôme; Karniadakis, George Em

doi:10.1038/s44387-026-00084-4

Article
Open access
Published: 07 March 2026

Juan Diego Toscano¹^na1,
Daniel T. Chen¹^na1,
Vivek Ooomen²,
Jérôme Darbon¹ &
…
George Em Karniadakis^1,3

npj Artificial Intelligence volume 2, Article number: 32 (2026)

Abstract

Residual-based adaptive strategies are widely used in scientific machine learning yet remain largely heuristic. We introduce a variational framework that formalizes these methods through convex transformations of the residual, where different transformations correspond to distinct objective functionals. For instance, exponential weights target uniform error minimization, while linear weights recover quadratic error minimization. This perspective reveals adaptive weighting as a means of selecting sampling distributions that optimize a primal objective, directly linking discretization choices to error metrics. This principled approach yields three key benefits: it enables systematic design of adaptive schemes, reduces discretization error by lowering estimator variance, and enhances learning dynamics by improving gradient signal-to-noise ratio. Extending the framework to operator learning, we demonstrate substantial performance gains across diverse optimizers and architectures. Our results provide a theoretical perspective for residual-based adaptivity and establish a foundation for principled discretization and training.

Introduction

Scientific machine learning (SciML) has emerged as a powerful alternative to traditional numerical methods for solving partial differential equations (PDEs). Here, we consider two of the main approaches in SciML. The first, which includes Physics-Informed Neural Networks (PINNs)¹ and their variants, focuses on function approximation, where a representation model is trained to satisfy the governing equations of a specific problem². The second is operator learning^3,4, where a model learns the underlying solution operator itself, allowing it to generate solutions for new boundary conditions, source terms, or parameters almost instantaneously.

At their core, SciML models employ parameterized functions with strong approximation capabilities^5,6 to represent the solution of a PDE or a solution map from parameters and boundary data to solutions of PDEs. The problem is thus reduced to an optimization task of finding the parameters that minimize a certain loss function. For example, the loss function for physics-informed methods is usually composed of the PDE residuals and the mismatch with observational data. Compared to traditional numerical methods, this optimization-centric approach provides significant flexibility, for example, for solving inverse problems that incorporate sparse data or lack prescribed boundary conditions, or for high-dimensional problems⁷. However, there is no free lunch: these parameter optimization problems in SciML are generally high-dimensional and non-convex, making models susceptible to converging to poor local minima. Consequently, tackling the optimization has drawn significant research attention, with efforts including the development of specialized optimizers^8,9,10 and methods that simplify the optimization task by explicitly encoding physical constraints, such as the exact imposition of boundary conditions¹¹.

Among the various approaches, one of the most prominent strategies, which addresses both optimization and discretization errors, is to modify the loss function itself. This is typically achieved through adaptive sampling^12,13 and weighting schemes^14,15,16 that dynamically adjust the training process to focus on regions of the domain that are more difficult to learn. Adaptive sampling methods achieve this by concentrating collocation points in areas where the PDE residual is high¹³, while adaptive weighting methods assign larger local weights to the same important regions. The strategies for determining these weights are diverse, ranging from direct residual-based schemes to more complex adversarial or augmented Lagrangian formulations^14,17,18.

Due to their simplicity and efficiency, weighting/sampling methods based on the residuals are particularly popular^{16,19,20,21,22,23,24,25,26}, as they do not require specialized architectures or additional parameters. Two such examples are residual-based attention (RBA)¹⁵ and residual-based adaptive distribution (RAD)¹³ for residual-based weighting and sampling strategies, respectively. At first glance, these heuristic strategies are conceptually related to importance sampling; however, we note an essential difference. Standard importance sampling reweights the sample to produce an unbiased estimator of the desired functional with less variance, whereas the schemes used in SciML estimate an entirely different functional.

In this work, we propose a mathematical framework for interpreting and designing new adaptive sampling/weighting schemes. Leveraging variational formulas of general statistical divergences, we show that minimizing the norms (e.g., L² or L^∞) of the residual can be written in a dual form that naturally involves sampling adaptively from distributions tilted by a factor depending on the current residual. Estimates from these new distributions can be realized through either direct sampling or importance weights. We refer to the multipliers obtained from this variational approach as variational residual-based attention (vRBA).

Our framework provides a compelling interpretation for RBA and RAD and shows how the choice of residual-based weights/distribution influences the primal minimization objective. We extend these methods to operator learning with a hybrid strategy employing importance sampling over the function space and importance weighting over the spatial domain, which can be seamlessly integrated into architectures like FNO and DeepONet. The framework yields a twofold benefit: it lowers the discretization error by reducing the variance of the loss estimator, and it improves learning dynamics by enhancing the signal-to-noise ratio of the gradients, leading to faster convergence. Finally, we demonstrate the efficacy of vRBA across a range of challenging PINN and operator learning tasks. Our empirical results show that using vRBA is pivotal to achieving lower errors, providing significant improvements even when paired with state-of-the-art second-order optimizers⁹ or specialized architectures like TC-UNet²⁷.

Results

A central goal in scientific machine learning (SciML) is to train a parameterized model, u(x; θ) for parameters θ in some parameter space ${\mathcal{T}}$, to solve problems ranging from function approximation to satisfying physical laws described by a differential operator, ${\mathcal{F}}$. The performance of such a model is quantified by a residual function, r(x; θ), which measures the local error at each point x in the domain Ω. The ultimate objective is to find optimal parameters θ^* that minimize this residual across the entire domain by optimizing a loss function ${\mathcal{L}}$, typically formulated as the L² minimization of the residual over the uniform distribution, i.e.,

$$\mathop{\min }\limits_{\theta \in {\mathcal{T}}}{\mathcal{L}}(\theta )={\int }_{\Omega }| r(x;\theta ){| }^{2}p(dx).$$

(1)

While effective for simple problems, this L²-based objective is often insufficient for complex PDEs involving multi-scale physics or singularities¹⁴. A suggestive reason comes from the perspective of PDE theory: classical solutions to PDEs are in the space of continuously-differentiable functions endowed with the supremum norm. Therefore, minimizing the maximum residual, rather than the average, can be desirable for capturing strong solutions²⁸, i.e.,

$$\mathop{\min }\limits_{\theta }\left\{\mathop{\max \,}\limits_{x\in \Omega }r(x,\theta )\right\}.$$

(2)

This objective can also be (superfluously) expressed as an optimization over the space of probability measures, where the inner maximum is achieved by a measure concentrated at the point of highest error, which reads

$$\mathop{\min }\limits_{\theta \in {\mathcal{T}}}\left\{\mathop{\max }\limits_{q\in {\mathcal{P}}(\Omega )}{\int }_{\Omega }r(x,\theta )q(dx)\,\right\}.$$

(3)

For each θ, the optimizer of the inner optimization problem is ${q}^{* }={\delta }_{\{{x}^{* }(\theta )\}}$ where ${x}^{* }(\theta )=\arg \mathop{\max }\limits_{\Omega }r(\cdot ,\theta )$, i.e., the point of highest error. (The notation δ_{x} here refers to the Dirac measure at point x, which is the measure that satisfies ∫_Ωf(y)δ_{x}(dy) = f(x) for all bounded, measurable $f:\Omega \to {\mathbb{R}}$.)

vRBA: a generative framework for residual-based adaptive scheme

We introduce a variational framework that solves a certain regularized version of (3) using Φ-divergences²⁹. Rather than fixing the metric a priori, the choice of regularization informs the residual-based adaptive scheme, which we show to correspond to specific error norms (e.g., L^∞, variance) in a certain primal formulation. For a chosen potential function $\Phi :{\mathbb{R}}\to {\mathbb{R}}$, the proposed optimization reads

$$\mathop{\sup }\limits_{\epsilon > 0}\mathop{\min }\limits_{\theta \in {\mathcal{T}}}\mathop{\sup }\limits_{q\in {\mathcal{P}}(\Omega )}\left\{{\int }_{\Omega }r(x;\theta )q(dx)-\epsilon {{\bf{D}}}_{{\Phi }^{* }}(q| p)\right\},$$

(4)

where ${{\bf{D}}}_{{\Phi }^{* }}$ is the statistical divergence associated with the convex conjugate Φ^* of Φ. This formulation offers a compelling interpretation for adaptive weighting and sampling^{12,13,14,15,16}. The approach builds upon classical variational techniques commonly used in the theory of large deviations^30,31,32. See the supplementary information (SI) for further theoretical details.

Our proposed algorithm for solving (4) follows an intuitive alternating optimization scheme. At each iteration k, the method updates the sampling distribution q, the model parameters θ, and the regularizer ϵ as follows:

$$\left\{\begin{array}{l}{q}^{k+1}(x)\propto {\Phi }^{\prime }\left(\frac{r(x;{\theta }^{k})}{{\epsilon }^{k}}\right)p(x),\\ {\theta }^{k+1}\leftarrow \mathrm{Minimize}\,{\int }_{\Omega }r(x;\theta )\,{q}^{k+1}(dx),\\ {\epsilon }^{k+1}\leftarrow \mathrm{Annealing}(r(\cdot ;{\theta }^{k}),\Phi ).\end{array}\right.$$

(5)

The regularization—the main novelty of the framework—is characterized by the choice of sampling distribution q, which is optimal in light of the (generalized) Gibbs variational formula. The Annealing step slowly decreases ϵ → 0, recovering (2). Depending on the choice of potential—in fact, for all choices besides the exponential—a normalization condition would be needed (see (20) and surrounding discussion in the Methods section for more details). Table 1 compiles a list of potentials used in the paper and their corresponding tilt q. See the Methods section for implementation details or SI for theoretical discussions.
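The tilting step in (5) can be illustrated on a finite set of collocation points. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `tilted_distribution`, `exp_prime`, and `quad_prime` are hypothetical, the residuals are assumed nonnegative (e.g., |r| or r²), p is assumed uniform, and the weights are simply normalized to sum to one instead of using the normalization condition discussed in the Methods section.

```python
import math

def exp_prime(s):
    """Derivative of the exponential potential Phi(s) = exp(s)."""
    return math.exp(s)

def quad_prime(s):
    """Derivative of the quadratic potential Phi(s) = s**2 + 1."""
    return 2.0 * s

def tilted_distribution(residuals, phi_prime, eps, p=None):
    """Discrete analogue of q(x) ∝ Phi'(r(x) / eps) * p(x).

    Assumption: residuals are nonnegative and the result is normalized
    to sum to one. For very small eps the exponential can overflow; a
    log-sum-exp formulation would be needed in practice.
    """
    n = len(residuals)
    base = [1.0 / n] * n if p is None else list(p)  # uniform p by default
    unnormalized = [phi_prime(r / eps) * b for r, b in zip(residuals, base)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Made-up nonnegative residuals at four collocation points.
residuals = [0.1, 0.2, 0.9, 0.4]

q_exp_warm = tilted_distribution(residuals, exp_prime, eps=1.0)
q_exp_cold = tilted_distribution(residuals, exp_prime, eps=0.05)
q_quad = tilted_distribution(residuals, quad_prime, eps=1.0)
```

With these toy values, lowering ϵ makes the exponential tilt concentrate on the point with the largest residual, whereas the quadratic potential gives weights proportional to the residual for any ϵ.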

Table 1 Summary of the seven generated adaptive schemes

The power of this framework lies in its generative capability: by simply choosing the potential function Φ, we can generate and interpret new adaptive approaches. For example, we recovered existing heuristics: the Exponential potential Φ(r) = e^r targets L^∞ minimization and recovers softmax-based attention mechanisms similar to those used in transformers³³ and uncertainty quantification³⁴; on the other hand, the quadratic potential Φ(r) = r² + 1 targets variance minimization and recovers linear weighting schemes like RAD¹³ and RBA¹⁵. In summary, vRBA collects residual-based adaptive schemes into a single interpretable framework that facilitates the design and comparison of numerical schemes.
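As a worked instance of the exponential case, assume that for Φ(r) = e^r the divergence in (4) reduces to the Kullback-Leibler divergence KL(q|p) (the precise correspondence is given in the SI). Under that assumption, and for bounded, measurable r, the classical Gibbs variational formula evaluates the inner supremum in closed form:

$$\mathop{\sup }\limits_{q\in {\mathcal{P}}(\Omega )}\left\{{\int }_{\Omega }r(x;\theta )q(dx)-\epsilon \,{\rm{KL}}(q| p)\right\}=\epsilon \log {\int }_{\Omega }{e}^{r(x;\theta )/\epsilon }p(dx),\qquad {q}^{* }(dx)=\frac{{e}^{r(x;\theta )/\epsilon }}{{\int }_{\Omega }{e}^{r(y;\theta )/\epsilon }p(dy)}p(dx).$$

The maximizer q^* is a softmax-type tilt of p, which matches the first line of (5) with Φ′(s) = e^s. By the Laplace principle, $\epsilon \log {\int }_{\Omega }{e}^{r/\epsilon }dp$ converges to the essential supremum of r under p as ϵ → 0, which is consistent with the annealing step recovering (2).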

Extension to operator learning: a hybrid adaptivity strategy

Neural Operators (NOs) are a class of models designed to learn the solution map to a PDE. More specifically, they approximate the solution operator G_θ from an input function, such as a source term f, to a corresponding solution u^3,4. Alternatively, they can be formulated as propagators that evolve a solution in time by mapping an initial state u(t₀) to a future state u(t₀ + Δt) for some Δt > 0.

The residual for this task, R(v, x; θ), measures the pointwise error between the model’s prediction G_θ(v)(x) and the true solution ${\mathcal{G}}[v](x)$ for each spatial point x ∈ Ω. The learning task is now defined over the product of the function space ${\mathcal{X}}$ and the spatiotemporal domain Ω_Y. Directly optimizing the objective ${{\mathbb{E}}}_{{q}^{k}}[R]$ is challenging because a single adaptive scheme is ill-suited for this heterogeneous product space. We address this by disintegrating the Radon-Nikodym derivative, which allows us to rewrite the expectation as the nested integral

$${{\mathbb{E}}}_{{q}^{k}}[R]={\int }_{{\mathcal{X}}}\left[{\int }_{{\Omega }_{Y}}R(v,x)\frac{d{q}^{k}}{dp}(x| v)p(dx)\right]\frac{d{q}^{k}}{dp}(v)p(dv).$$

(6)

The outer integral over the function space ${\mathcal{X}}$ is approximated via importance sampling, where we draw functions v from an adaptive distribution defined by the marginal derivative $\frac{d{q}^{k}}{dp}(v)$. Concurrently, the inner integral over the spatiotemporal domain Ω_Y is approximated via importance weighting, where weights Λ_i,j are derived from the conditional $\frac{d{q}^{k}}{dp}(x| {v}_{j})$. While this formulation fully accommodates unstructured meshes or point clouds on variable geometries, this division is practically motivated by the constraints of specific architectures. In particular, importance weighting is beneficial for operator learning schemes that are restricted to fixed spatial grids. Moreover, the mathematical framework of vRBA requires little of the underlying domain, enabling operator learning applications in infinite-dimensional function spaces.
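The split in (6) can be sketched on a small residual table. This is an illustrative sketch only: the exponential tilt, the uniform base measure, and the names `hybrid_weights` and `hybrid_loss` are assumptions made here, and the residual values are made up.

```python
import math
import random

def hybrid_weights(R, eps):
    """Split a joint exponential tilt over (function, point) pairs.

    R[j][i] is the nonnegative residual of input function j at grid point i.
    Returns (sample_prob, weights): sample_prob[j] is the probability of
    drawing function j (marginal), and weights[j][i] is the conditional
    density ratio at grid point i given function j.
    """
    tilt = [[math.exp(r / eps) for r in row] for row in R]
    # Grid average: discrete analogue of integrating the tilt over p(dx).
    row_mean = [sum(row) / len(row) for row in tilt]
    total = sum(row_mean)
    sample_prob = [m / total for m in row_mean]
    weights = [[t / m for t in row] for row, m in zip(tilt, row_mean)]
    return sample_prob, weights

def hybrid_loss(R, sample_prob, weights, batch_size, rng):
    """Sample functions from the marginal, then weight their grid residuals."""
    drawn = rng.choices(range(len(R)), weights=sample_prob, k=batch_size)
    per_function = [
        sum(w * r for w, r in zip(weights[j], R[j])) / len(R[j]) for j in drawn
    ]
    return sum(per_function) / batch_size

# Three input functions evaluated on a fixed grid of three points (toy data).
R = [[0.1, 0.3, 0.2], [0.8, 0.1, 0.6], [0.2, 0.2, 0.1]]
sample_prob, weights = hybrid_weights(R, eps=0.5)
loss = hybrid_loss(R, sample_prob, weights, batch_size=8, rng=random.Random(0))
```

In this sketch the weights for each function average to one over the grid, so the per-point weighting redistributes emphasis within a function while the sampling probabilities decide how often each function is visited.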

vRBA accelerates convergence, achieves higher accuracy and reduces error accumulation

We evaluate the performance of the vRBA framework in both Physics-Informed Neural Network (PINN) and operator learning settings. To demonstrate the generative flexibility of the framework, we test seven distinct schemes: the standard uniform baseline (Φ(r) = r) and six adaptive potentials derived from the proposed framework described in Table 1.

For the PINN benchmarks, we solve the forward problem for the Allen-Cahn, Burgers’, and Korteweg-De Vries (KdV) equations. Crucially, we investigate two distinct regimes for the Burgers’ equation: a standard case (ν = 0.01/π) and a “hard” case (ν = 1/1000) featuring extremely sharp gradients, which allows us to evaluate the method’s performance on quasi-discontinuous solutions. To ensure a fair comparison with the current development of PINNs, both our baseline and vRBA-enhanced models incorporate Fourier features to improve expressive capabilities and encode periodic boundary conditions³⁵. Furthermore, to demonstrate that the benefits of vRBA are complementary to advanced optimization techniques, we evaluate its performance on the Allen-Cahn equation using both first-order and state-of-the-art second-order optimizers, such as SSBroyden⁹.

Similarly, we evaluate vRBA’s performance in the operator learning setting across two primary scenarios. The first scenario involves learning a direct mapping from an input function to the corresponding solution. We apply this approach to the Bubble Growth Dynamics using a DeepONet. To explicitly assess the framework’s capability to handle weak solutions with pure discontinuities, we introduce the Sod Shock Tube (SST) benchmark using an SVD-DeepONet³⁶. The second, more challenging scenario involves learning an operator that propagates a solution forward in time. This approach is often implemented autoregressively, where the model’s output at one timestep becomes the input for the next. We investigate this recursive setting by solving the Navier-Stokes (Kolmogorov flow) and Wave equations with FNO⁴ and TC-UNet²⁷ architectures, respectively. The governing PDE formulations and specific implementation details for all benchmarks are provided in the Methods section.
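The autoregressive setting described above can be sketched with a scalar toy problem. Everything here is hypothetical: `rollout`, the propagators `true_step` and `model_step`, and their coefficients are illustrative stand-ins, not the FNO or TC-UNet models used in the benchmarks.

```python
def rollout(step, u0, n_steps):
    """Apply a one-step propagator autoregressively and return all states."""
    states = []
    u = u0
    for _ in range(n_steps):
        u = step(u)  # the output of one step becomes the next input
        states.append(u)
    return states

def true_step(u):
    """Toy reference propagator: the state decays by a factor 0.9 per step."""
    return 0.9 * u

def model_step(u):
    """Toy 'learned' propagator with a small one-step error."""
    return 0.909 * u

truth = rollout(true_step, 1.0, 20)
pred = rollout(model_step, 1.0, 20)
rel_errors = [abs(p - t) / abs(t) for p, t in zip(pred, truth)]
```

Even with a one-step relative error of about 1%, the relative error grows with every step of the rollout in this toy setting, which is why the rate of error accumulation matters for autoregressive models.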

Figure 1 and Tables 2 and 3 demonstrate that vRBA consistently accelerates convergence and improves accuracy across all benchmarks compared to the baseline. While all adaptive potentials generally outperform uniform sampling, the optimal choice of potential correlates with the physical characteristics of the solution. For problems with smooth but complex structures like the Allen-Cahn and KdV equations, the Exponential potential (Φ(r) = e^r) tends to yield the best performance. This is most dramatic in the KdV benchmark, where the baseline model fails completely, while the Exponential vRBA model achieves high accuracy. In contrast, for the “hard” Burgers’ case (ν = 1/1000), where the solution develops steep gradients akin to a shock, the Quadratic potential (Φ(r) = r²) outperforms the exponential variants, suggesting that variance reduction may be more stable than aggressive L^∞ minimization in the presence of extremely sharp transitions. Finally, in the Sod Shock Tube problem, which features a true discontinuity, the Dual-KL potential ($\Phi (r)\approx \log r$) achieves the lowest error, highlighting the benefit of asymmetric or logarithmic penalties for weak solutions.

Additionally, we compare the vRBA performance against other state-of-the-art PINN methods. We focus on the Allen-Cahn and Burgers’ equations, which have been considered challenging benchmarks since the introduction of PINNs due to their sharp transitions and complex dynamics. Furthermore, we introduce the Korteweg-De Vries (KdV) equation as an additional “stress test”: a problem known to cause failure in vanilla PINNs, yet possessing a smooth exact solution that enables precise error quantification. Table 4 demonstrates that combining the SSBroyden optimizer with vRBA yields the most accurate result among the compared methods across all benchmarks.

**Fig. 1: vRBA accelerates convergence and improves accuracy across all benchmarks.**

Table 2 Performance of the vRBA framework across PIML benchmarks

Table 3 Performance of the vRBA framework across operator-learning benchmarks

Table 4 State-of-the-art comparison of relative L² errors for the Allen-Cahn, Burgers’ and KdV equations

Notably, our results reveal that vRBA is beneficial regardless of the optimization strategy. Even when coupled with the standard first-order Adam optimizer, vRBA ($\Phi ={e}^{{r}^{2}}$) significantly outperforms not only Adam-based baselines but also recent methods that rely on L-BFGS and complex architectures.

Table 4 highlights that one of the high-performance methods with Adam combines RBA with complex architectural enhancements. Moreover, the previous state-of-the-art results were achieved using SSBroyden and RAD. As pointed out in the previous sections, RAD and RBA can be identified as specific variations of vRBA with Φ(r) = r² and specific smoothing configurations. This underscores our main conclusion: the combination of a powerful optimizer with a principled adaptive sampling strategy is the essential factor for achieving the highest accuracy.

Figure 1B indicates that vRBA significantly accelerates convergence for all operator learning tasks. For the direct mapping scenario, Table 3 shows that vRBA reduces the final error for the Bubble Growth Dynamics problem by more than an order of magnitude. While the overall relative L² error improvements for the autoregressive Navier-Stokes (FNO) and Wave Equation (TC-UNet) tasks appear more modest, this single, aggregated metric does not capture the full performance gain. As their rate of error accumulation governs the long-term performance of these models, Fig. 1C, D reveals an additional advantage: the vRBA-enhanced models exhibit significantly slower error growth and a much smaller standard deviation across the test trajectories, indicating more robust and generalizable long-term predictions.

vRBA captures fine details and promotes uniform error distribution

In addition to the overall error reduction documented in the previous section, a key advantage of vRBA is its ability to capture fine-scale solution features. This improved capability can be analyzed from the theoretical perspective of our framework. When an exponential potential (Φ(r) = e^r) is used, the framework attempts to minimize the L^∞-norm in the primal formulation, which pressures the model to suppress the largest residuals. On the other hand, when a quadratic potential (Φ(r) = r²) is used, the primal objective corresponds to variance minimization, forcing the model to address high-magnitude outliers. Both mechanisms compel the model to fit the entire solution domain more evenly, leading to a more uniform error distribution.

This effect is observed empirically in Figs. 2 and 3. For the PINN benchmarks (Fig. 2), the baseline model’s error is highly concentrated along specific structures, whereas vRBA produces a more spatially uniform error distribution. This redistribution is particularly effective for the KdV and low-viscosity Burgers’ (ν = 1/1000) benchmarks. In these cases, the baseline error spikes uncontrollably at shock fronts and solitons, whereas vRBA helps the optimizer improve model performance across these sharp transitions.

**Fig. 2: vRBA promotes uniform error distributions and captures fine-scale solution features in PIML benchmarks.**

**Fig. 3: vRBA captures fine-scale solution features and sharp discontinuities in operator learning benchmarks.**

This advantage is also evident in the operator learning context (Fig. 3), where the baseline DeepONet fails to resolve high-frequency oscillations in the Bubble Growth Dynamics problem or the sharp discontinuities in the Sod Shock Tube (SST) benchmark. In the SST case specifically, vRBA captures the shock profile without the severe artifacts seen in the baseline, despite the solution spanning multiple orders of magnitude.

This improved capability for resolving complex features is further demonstrated in the autoregressive operator learning tasks, as shown in Fig. 4. For both the Navier-Stokes (FNO) and Wave Equation (TC-UNet) benchmarks, the reference solutions involve intricate, evolving structures. The baseline models fail to track these details over time, leading to a rapid accumulation of structured, high-magnitude error. In contrast, the vRBA-enhanced models maintain a much lower pointwise error throughout the simulations by successfully capturing the fine-scale dynamics at each timestep, thereby preventing the compounding of errors.

**Fig. 4: vRBA accurately captures fine-scale solution features in complex, evolving systems.**

vRBA reduces discretization error via variance reduction

The total error of a neural network-based PDE solver can be decomposed into approximation, optimization, and discretization components³⁷. The discretization error arises when approximating the continuous loss integral with a finite sum via Monte Carlo sampling. More concretely, for any bounded, measurable functional $r:\Omega \to {\mathbb{R}}$ such as the residual or its gradient, we can estimate the mean ${{\mathbb{E}}}_{q}[r]$ with n samples by defining an unbiased estimator

$${\widehat{r}}^{n}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}r({X}_{i})\,{\text{where}}\,{({X}_{i})}_{i=1}^{n}\mathop{ \sim }\limits^{{\rm{i}}.{\rm{i}}.{\rm{d}}.}q.$$

(7)

Alternatively, if one can only sample from the distribution p, importance sampling weights can be used, assuming absolute continuity of q with respect to p, and the estimator now reads

$${\widehat{r}}^{n}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\frac{dq}{dp}({X}_{i})r({X}_{i})\,{\text{where}}\,{({X}_{i})}_{i=1}^{n}\mathop{ \sim }\limits^{{\rm{i}}.{\rm{i}}.{\rm{d}}.}p.$$

(8)

In the context of vRBA (and other residual-based adaptive schemes), the distribution q^k at the k-th iteration has a normalizing constant in the Radon-Nikodym derivative, $\frac{d{q}^{k}}{dp}$, which introduces a bias. However, these Monte Carlo estimators are nonetheless consistent and converge according to the central limit theorem

$$\sqrt{n}\left({\widehat{r}}^{n}-{{\mathbb{E}}}_{{q}^{k}}[r]\right)\to {\text{Normal}}\left(0,{{\text{Var}}}_{p}\left[r\frac{d{q}^{k}}{dp}\right]\right)\,{\text{in distribution as}}\,n\to \infty .$$

(9)

This implies that the discretization error (i.e., $| {\widehat{r}}^{n}-{{\mathbb{E}}}_{{q}^{k}}[r]|$) vanishes at a rate of ${\mathcal{O}}(1/\sqrt{n})$, with a magnitude that scales directly with the variance of the estimator. Therefore, if the variance of the reweighted estimator is smaller than that of the standard estimator, such that

$${{\text{Var}}}_{q}[r]={{\text{Var}}}_{p}\left[r\frac{d{q}^{k}}{dp}\right]\le {{\text{Var}}}_{p}[r],$$

(10)

then vRBA actively reduces the discretization error at each training step.
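The two estimators (7) and (8) can be compared on a finite toy domain. This is a sketch under simplifying assumptions: the residual values are made up, p is uniform, q uses a linear tilt (q ∝ r p), and the density ratio dq/dp is computed exactly, which is only feasible in such a toy setting. All variable names are illustrative.

```python
import random

rng = random.Random(0)

residuals = [0.1, 0.2, 0.9, 0.4]             # toy residuals on a 4-point domain
n_points = len(residuals)
p = [1.0 / n_points] * n_points              # uniform base distribution
total = sum(residuals)
q = [r / total for r in residuals]           # linear tilt: q proportional to r * p
ratio = [qi / pi for qi, pi in zip(q, p)]    # dq/dp, known exactly here

# Exact target E_q[r], available because the domain is finite.
exact = sum(qi * ri for qi, ri in zip(q, residuals))

n = 20000

# Estimator (7): draw points from q and average the residual.
idx_q = rng.choices(range(n_points), weights=q, k=n)
direct = sum(residuals[i] for i in idx_q) / n

# Estimator (8): draw points from p and reweight each sample by dq/dp.
idx_p = rng.choices(range(n_points), weights=p, k=n)
weighted = sum(ratio[i] * residuals[i] for i in idx_p) / n
```

Both estimates target the same quantity and approach `exact` as n grows. Which of the two has the smaller variance depends on the residual and on the tilt, so this sketch makes no claim about variance reduction.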

Proving such variance reduction is difficult at this level of generality, and we resort to numerical demonstration. Figure 5 shows the evolution of the residual variance during training for both the PINN benchmarks (Fig. 5A) and the operator learning tasks (Fig. 5B). A consistent trend is immediately apparent across all six problems: the vRBA-enhanced models achieve a residual variance that is several orders of magnitude lower than their corresponding baseline models. This result suggests vRBA may effectively induce variance reduction.

**Fig. 5: vRBA reduces the variance and infinity norm of the residuals across benchmarks.**

Furthermore, Fig. 5 tracks the evolution of the infinity norm ∥r∥_∞ in the bottom rows. These results demonstrate that vRBA generally achieves lower ∥r∥_∞ values alongside reduced variance, indicating that the method may effectively suppress extreme outliers and does not merely smooth the average error at the expense of worst-case residuals.

vRBA improves the learning dynamics

Due to the linearity of the gradient operator, the overall update direction, ${\widehat{p}}_{k}$, is equivalent to the average of gradients over different regions of the domain (see Proposition in SI). During training, conflicting gradient directions from different parts of the data coexist. If these component gradients are not well-aligned, the average direction—the direction of update—diminishes, leading to slow convergence or stagnation.

We formalize the above description. From any given partition j, the update direction ${\widehat{p}}_{j}^{k}={\widehat{p}}^{k}+{\epsilon }_{j}^{k}$ can be decomposed into two components: the “signal” ${\widehat{p}}^{k}$, the true update direction from the continuous loss, and the “noise” ${\epsilon }_{j}^{k}$, the deviation of the partition’s gradient from that signal. The Signal-to-Noise Ratio (SNR) measures the ratio of the magnitude of the signal to the root mean square (RMS) magnitude of the noise^38,39,40, which reads

$${\text{SNR}}=\frac{\parallel {{\mathbb{E}}}_{{B}_{j}}[{\widehat{p}}_{j}^{k}]{\parallel }_{2}}{\sqrt{{\text{Tr}}({\text{Var}}_{{B}_{j}}[{\widehat{p}}_{j}^{k}])}}.$$

(11)

The expectation and variance are taken over random partitions B_j of the full dataset. A high SNR indicates that the gradient from any given partition is a faithful representation of the true update direction, leading to a confident and effective optimization step. The full derivation showing how the denominator corresponds to the noise term is provided in the SI.
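A direct reading of (11) on a handful of partition gradients is sketched below. The function name `gradient_snr`, the use of the population variance (dividing by the number of partitions), and the toy gradient values are assumptions made for illustration.

```python
import math

def gradient_snr(partition_grads):
    """SNR = ||mean gradient||_2 / sqrt(trace of the gradient covariance).

    partition_grads is a list of gradient vectors, one per data partition.
    The variance is the population variance over partitions (an assumption).
    """
    m = len(partition_grads)
    d = len(partition_grads[0])
    mean = [sum(g[k] for g in partition_grads) / m for k in range(d)]
    signal = math.sqrt(sum(c * c for c in mean))
    # Trace of the covariance: sum of per-coordinate variances.
    trace_var = sum(
        sum((g[k] - mean[k]) ** 2 for g in partition_grads) / m for k in range(d)
    )
    if trace_var == 0.0:
        return math.inf  # all partitions agree exactly
    return signal / math.sqrt(trace_var)

# Toy 2D gradients: partitions that mostly agree versus partitions that conflict.
aligned = [[1.0, 0.0], [1.0, 0.2], [1.0, -0.2]]
conflicting = [[1.0, 0.0], [-1.0, 0.2], [0.1, -0.2]]

snr_aligned = gradient_snr(aligned)
snr_conflicting = gradient_snr(conflicting)
```

In this toy case the aligned partitions give a much higher SNR than the conflicting ones, mirroring the description of a confident versus a noisy update direction.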

The evolution of the SNR, shown in Fig. 6 alongside the generalization error (measured as the relative L² error on the test set), reveals three distinct stages of learning: fitting, transition, and diffusion. This phased progression has been observed across diverse applications such as function approximation, PINNs, and operator learning^40,41,42, and it provides a metric to quantify how vRBA improves the training dynamics.

**Fig. 6: vRBA improves learning dynamics by accelerating the transition to the productive diffusion phase.**

The first stage is the fitting phase, characterized by a high but decreasing SNR. Initially, when the model’s predictions are poor, a consensus on the update direction is easily found. As the training error reduces, the optimal direction becomes less clear, causing the SNR to drop. This behavior is clearly observed in the Allen-Cahn, Burgers’, and DeepONet experiments, and to some extent in the Baseline TC-UNet. The FNO model, however, appears to bypass this phase, with its generalization error improving almost immediately, likely due to the strong structural priors of its Fourier-based architecture.

Next, the model enters the transition phase, a low-SNR exploratory stage where the optimizer searches for an effective update direction to minimize the loss across the entire domain. This phase is noisy, with different data partitions pointing in conflicting directions, and models that get trapped here fail to converge. We observe this phase in the Allen-Cahn, Burgers’, and DeepONet results. For instance, the second-order Allen-Cahn model is temporarily trapped in this phase during its initial 5000 iterations of first-order pre-training. Again, this phase is less distinct for the FNO and TC-UNet, possibly due to their complex architectures.

After the exploratory phase, a successful model enters the productive diffusion phase, marked by a sharp jump in the SNR that correlates with a rapid decrease in generalization error. A key finding is that vRBA significantly accelerates the entry into this stage. This is most evident in the first-order Allen-Cahn benchmark: the baseline model is stuck in the transition phase for nearly 50,000 iterations, while the vRBA models enter diffusion in just 10,000. This provides a mechanistic explanation for claims that standard PINNs cannot solve this problem; often, the models are simply not trained long enough to escape the prolonged transition phase, a challenge vRBA mitigates.

The dynamics of the second-order optimizer also show important distinctions. For the Allen-Cahn problem, which was trapped in the transition phase during pre-training, the switch to the SSBroyden optimizer triggers a sharp initial increase in the SNR as the model finds the optimal direction, followed by a decrease as it converges. In contrast, the Burgers’ model was already in a productive diffusion phase, so the switch to the more efficient optimizer simply causes the SNR to drop.

Finally, the phenomenon of model saturation is particularly evident in the operator learning benchmarks. For both the FNO and TC-UNet models, the late-stage decay in the SNR corresponds precisely with the stagnation of the generalization error. This indicates that the model has “lost the signal”, supporting the use of the SNR as a powerful diagnostic metric for the entire learning process.

Discussion

Residual-based adaptive strategies are widely used in scientific machine learning to accelerate convergence and improve accuracy. Yet a mathematical framework for describing and generating these adaptive schemes has been lacking. In this work, we introduce a unifying variational framework that formalizes these methods by leveraging the duality of statistical divergences, specifically the Laplace principle and the Gibbs variational formula (and generalizations thereof). We show, for a wide class of residual-based adaptive schemes, that optimization under adaptively tilted distributions can be interpreted as dual to the minimization of a certain primal objective that depends on the tilting “potential” function Φ.

Crucially, the power of this framework, which we term vRBA, lies in its generative capability: the choice of the convex potential function Φ dictates the nature of the adaptive scheme. Our analysis demonstrates that previously distinct reweighting methods are, in fact, specific instantiations of this single theoretical perspective. For instance, selecting an exponential potential (Φ(r) = e^r) targets the minimization of the L^∞-norm, recovering exponential distributions akin to attention mechanisms³³ but with an adaptive temperature parameter. Similarly, selecting a quadratic potential (Φ(r) = r² + 1) targets the minimization of the residual variance, recovering the linear weighting schemes utilized in Residual-Based Attention (RBA)¹⁵ and Residual-Based Adaptive Distribution (RAD)¹³. This unification bridges the conceptual gap between sampling and weighting strategies, which are traditionally treated as distinct optimization heuristics. Our analysis reveals that they are alternative computational approaches—one via resampling, the other via importance weighting—of the same underlying variational objective. Beyond unifying existing methods, this framework offers a generative recipe for designing new adaptive schemes tailored to specific physical regimes, such as using logarithmic potentials for sharp transitions or super-exponential potentials for aggressive error suppression. Furthermore, we extend this framework to Operator Learning, demonstrating that these variational principles are generic and apply equally to learning mappings between infinite-dimensional function spaces.

The implications of this variational perspective, however, extend even further. For instance, the formulation proposed in²⁵ can be recovered using quadratic potentials and constructing the distribution using a locally averaged residual. Our framework also enables us to interpret other advanced heuristics. Methods that balance the residual decay rate¹⁶, for example, can be viewed as replacing the simple temporal smoothing of the EMA with a more sophisticated, history-aware mechanism for computing the adaptive distribution q^k. More broadly, while our analysis focuses on deriving the optimal distribution q analytically, the dual problem also admits an alternative solution strategy: learning the distribution q directly. This insight reframes self-adaptive methods (techniques with learnable weights or auxiliary networks^14,17) as implicitly learning this optimal biasing distribution. Our framework thus justifies these pioneering approaches, unifying them under the same variational principles.

Complementing this theoretical foundation, our empirical results indicate that the vRBA framework significantly improves model accuracy across all tested scenarios. For all our benchmarks, this improvement manifests as an enhanced ability to capture fine-scale solution features, leading to a more uniform error distribution. Crucially, this advantage holds for models trained with both standard first-order and state-of-the-art second-order optimizers, confirming that vRBA is a complementary enhancement to advanced optimization techniques. In the operator learning setting, particularly for autoregressive tasks using advanced architectures like FNO and TC-UNet, vRBA achieves a significant reduction in the error accumulation rate, leading to more robust and reliable long-term predictions.

We attribute vRBA’s empirical success to two underlying mechanisms. First, the residual-based adaptation lowers the discretization error by reducing the variance of the loss estimates. At each step, the gradients computed are stable and faithful to the true, continuous loss objective. Consequently, the optimizer can focus on minimizing the actual objective function rather than navigating a noisy loss landscape induced by a high-variance estimator. Second, our results indicate that vRBA improves the learning dynamics as quantified by the Signal-to-Noise Ratio (SNR) of the back-propagated gradients. The SNR measures the reliability of the gradient signal relative to the stochastic noise and is a well-studied metric in the context of stochastic training³⁸. Our empirical analysis reveals that the vRBA-enhanced models consistently exhibit a higher SNR than their baselines across all analyzed benchmarks. Higher SNR correlates with improved training outcomes: vRBA models show significantly shorter transition phases while entering the diffusion phase faster. This effect is particularly pronounced in the PINN and DeepONet experiments.

Despite these advantages, the application of our framework is subject to certain constraints. While vRBA theoretically provides a rich supply of weighting functions by choosing different convex potentials, it requires these potentials to be both convex and increasing. In general, we observe that the framework works robustly for the majority of potentials. However, there are specific cases that can be too aggressive, such as super-exponential potentials (e.g., $\Phi (r)={e}^{{r}^{2}}$). This behavior is expected, as these potentials are designed to be extremely strong, imposing severe penalties on high residuals which can be destabilizing in sensitive regimes. Nevertheless, as illustrated in Fig. 1, valid potentials generally outperform the baselines. Furthermore, not all valid potentials yield an exact, closed-form solution for the annealing parameter. As demonstrated in our examples, closed-form solutions are readily available for the L^p family (which includes the linear weights of RBA) and the exponential potentials derived here. However, for other convex potentials, the optimal parameter may not be analytically invertible. While this is not a fundamental bottleneck, since the convexity of the problem allows the parameter to be computed efficiently using Newton’s method with negligible computational overhead, it does necessitate additional implementation effort compared to the closed-form cases.

Another limitation concerns the stochastic nature of the computed weights and the nuances of their application. Similar to adaptive optimizers like Adam⁴³, our method relies on memory terms, specifically an Exponential Moving Average (EMA), to ensure stability in the presence of stochastic noise. Consequently, the adaptive weights are not applied directly or sampled in their raw form. The necessity of this memory term varies by application: as shown in SI, for PINNs trained with robust second-order optimizers, the method can often be applied directly. However, for Operator Learning, the EMA is indispensable both for stability and to efficiently compute the aggregate scores needed to resample within the function space. While this introduces an additional hyperparameter (the decay rate), our sensitivity analysis demonstrates that the model is highly robust, consistently outperforming baselines with decay rates ranging from 0.4 to 0.99. Furthermore, as shown in SI, applying the derived weighting function directly yields significant performance gains. However, we observe that for exponential potentials targeting the L^∞-norm, using a small convex combination of the adaptive weights further improves performance. In contrast, for quadratic and logarithmic potentials, this interpolation yields no additional benefit, and direct application is optimal. This distinction is expected: the L^∞ optimization is inherently aggressive, even in its dual variational form, and thus benefits from the regularization provided by the convex combination, whereas the variance-reduction objectives are inherently more stable.

Finally, while the SNR analysis provides insights into the learning dynamics for most cases, the behavior of the two operator learning cases, FNO and TC-UNet, presents a more complex scenario. These models exhibit distinct dynamics where the generalization error decreases almost from the start of training, apparently bypassing the distinct fitting and transition phases observed in simpler models. We postulate that this discrepancy arises from the interplay between the SNR and the model’s geometric complexity. Previous studies found an inverse relationship: the geometric complexity tends to increase precisely when the SNR decreases⁴². This suggests that the low-SNR transition phase is an active period where the model increases its representational capabilities to resolve conflicting gradients. We speculate that because FNO and TC-UNet are sophisticated architectures endowed with strong structural priors, their initial states may be complex enough to facilitate immediate generalization, thereby bypassing these initial low-SNR stages. However, unlike in function approximation, where discrete Dirichlet energy, as introduced in⁴⁴, serves as a quantifiable metric for geometric complexity, a clear proxy for this property in operator learning is not currently established. As such, fully characterizing these dynamics remains an open question for future work.

Methods

Variational residual-based attention methods

The potential function $\Phi :{{\mathbb{R}}}_{+}\to {{\mathbb{R}}}_{+}$ is a crucial parameter for vRBA. It determines the form of the sampling distribution and (consequently) the corresponding primal problem. We restrict Φ to be a non-negative, convex, increasing, superlinear function; we show two examples and give detailed calculations in the coming sections. The training process starts by sampling N i.i.d. random variables ${\{{X}_{i}\}}_{i=1}^{N}$ uniformly from the spatial domain Ω and calculating the corresponding residuals r(X_i). The general method then involves three steps per iteration:

1.
updating the tilted distribution q, which is generally proportional to the derivative (or sub-differential) ${\Phi }^{{\prime} }$;
2.
updating the model parameters θ via a line search method (first- or second-order);
3.
updating the temperature parameter ϵ using an annealing scheme which can depend on Φ.

We elaborate on each step below, providing the implementation specifics for the examples shown in the Results section.

Update the tilted distribution

While the generalized Gibbs variational principle does not typically yield a closed-form update rule for the distribution q, for a fixed potential Φ, one can obtain the optimal tilt by the appropriate annealing schedules (to be discussed in the coming subsection). Under this assumption, the optimal sampling distribution for the next iteration q^k+1 is given by

$${q}^{k+1}(x)\propto {\Phi }^{{\prime} }\left(\frac{r(x;{\theta }^{k})}{{\epsilon }^{k}}\right),$$

(12)

and we evaluate only at the collocation points ${\{{X}_{i}\}}_{i=1}^{N}$. The above form holds for all choices of potential functions, although the normalization differs. Only when Φ(r) = e^r can we immediately deduce that

$${q}^{k+1}(x)=\frac{\exp \left(\frac{r(x;{\theta }^{k})}{{\epsilon }^{k}}\right)}{{\int }_{\Omega }\exp \left(\frac{r(y;{\theta }^{k})}{{\epsilon }^{k}}\right)dy}.$$

(13)

For other choices of Φ, normalization will be enforced via the choice of ϵ^k (the annealing parameter) instead, i.e., the proportionality constant in (12) is one.
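As a minimal sketch (not the authors' implementation), the discrete analogue of Eqs. (12) and (13) on a batch of collocation points can be written in plain Python. The function names `tilted_exponential` and `tilted_from_derivative` are hypothetical, the max-subtraction is a standard numerical safeguard assumed here, and the explicit division by the sum stands in for the normalization that the text enforces through ϵ^k.

```python
import math

def tilted_exponential(residuals, eps):
    """Discrete form of Eq. (13): softmax of r_i / eps over the collocation points."""
    scaled = [r / eps for r in residuals]
    shift = max(scaled)  # subtracting the max avoids overflow; it cancels in the ratio
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def tilted_from_derivative(residuals, eps, dphi):
    """Discrete form of Eq. (12): q_i proportional to Phi'(r_i / eps)."""
    raw = [dphi(r / eps) for r in residuals]
    total = sum(raw)  # explicit normalization (assumption; the text normalizes via eps)
    return [v / total for v in raw]

residuals = [0.1, 0.2, 0.4, 0.8]
q_exp = tilted_exponential(residuals, eps=0.4)
# Quadratic potential Phi(r) = r^2 + 1 has Phi'(u) = 2u, which gives linear weights.
q_lin = tilted_from_derivative(residuals, eps=0.4, dphi=lambda u: 2.0 * u)
```

For the quadratic potential the resulting weights are simply proportional to the residuals, whereas the exponential potential concentrates mass more strongly on the largest residual as ϵ decreases.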

For both models, the collocation points {X_i} are randomly sampled and the empirical target distribution q^k+1 can be prone to spurious fluctuations. To promote stability, especially when using fast-growing potentials like Φ(r) = e^r, we smooth the target distribution over time using an exponential moving average (EMA). Furthermore, we introduce an additional smoothing mechanism by interpolating between the adaptive distribution q^k+1 and the base uniform distribution p. The combined update rule for the importance weights reads

$${\lambda }_{i}^{k+1}=\gamma {\lambda }_{i}^{k}+{\eta }^{* }(\phi {q}^{k+1}({X}_{i},{\theta }^{k})+(1-\phi ){p}_{u}({X}_{i}))$$

(14)

where γ ∈ [0, 1) is a memory term and η^* is a learning rate. The parameter ϕ ∈ [0, 1] controls the degree of adaptivity; ϕ = 1 corresponds to the fully adaptive case from which the original method is recovered, while smaller values of ϕ increase stability by biasing the distribution towards uniformity. For stability reasons, which become particularly important for second-order methods, we have found that normalizing the learning rate as ${\eta }^{* }=\eta /{\max }_{\Omega }{q}^{k+1}$ is beneficial. The resulting vector λ^k+1 can be interpreted as the smoothed, unnormalized distribution that guides the optimization. While it incorporates information from the optimal distribution q^k+1, it is not itself a probability mass function as it does not necessarily sum to one.
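The update in Eq. (14) can be sketched as follows. This is an illustrative example only: the function name `ema_weight_update`, the default values of γ and η, and the choice of the uniform p.m.f. 1/N on the batch for p_u are assumptions made for the sketch.

```python
def ema_weight_update(lam, q_next, gamma=0.9, eta=0.1, phi=1.0):
    """One step of Eq. (14) on N collocation points.

    lam    : current smoothed weights lambda_i^k
    q_next : tilted p.m.f. q^{k+1}(X_i)
    phi    : adaptivity level; phi = 1 is fully adaptive, phi = 0 is uniform
    """
    n = len(lam)
    p_uniform = 1.0 / n            # base uniform p.m.f. on the batch (assumption)
    eta_star = eta / max(q_next)   # normalized learning rate eta* = eta / max q
    return [
        gamma * l + eta_star * (phi * q + (1.0 - phi) * p_uniform)
        for l, q in zip(lam, q_next)
    ]

lam = [0.0, 0.0, 0.0, 0.0]
q = [0.1, 0.2, 0.3, 0.4]
for _ in range(200):               # q is frozen here only to expose the fixed point
    lam = ema_weight_update(lam, q, gamma=0.9, eta=0.1, phi=1.0)
```

With q held fixed, the weights approach η^* q/(1 − γ), so the largest weight tends to η/(1 − γ) and the vector does not sum to one, in line with the remark that λ is an unnormalized distribution.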

Update the model parameters

Once the smoothed importance weights $\{{\lambda }_{i}^{k+1}\}$ have been computed, they can be used to formulate the loss function in two primary ways: by guiding a resampling process or by directly weighting the residuals.

Adaptive Sampling

In this approach, the weights are first normalized to recover a smooth probability mass function (p.m.f.) over the discrete domain Ω

$${\overline{q}}_{i}^{k+1}=\frac{{\lambda }_{i}^{k+1}}{{\sum }_{j}{\lambda }_{j}^{k+1}}$$

(15)

This new distribution $\overline{q}$ is then used to resample a new set of training points ${\{{X}_{i}\}}_{i=1}^{N}$ from the full collocation set Ω. By focusing the sampling on high-importance regions, the loss can be computed as a standard, unweighted mean-squared error on this new, more challenging set of points

$${\mathcal{L}}({\theta }^{k+1})=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}r{({X}_{i},{\theta }^{k+1})}^{2}.$$

(16)

Notably, this framework can recover methods similar to residual-based adaptive sampling¹³ by setting the potential to Φ(x) = x² + 1 and the EMA parameters to η^* = 1 and γ = 0.
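The resampling route of Eqs. (15) and (16) can be sketched with the standard library alone. The toy collocation set, the toy residual, and the helper names below are hypothetical, and sampling with replacement is an assumption of this sketch.

```python
import random

def normalize(lam):
    """Eq. (15): turn smoothed weights into a p.m.f."""
    total = sum(lam)
    return [l / total for l in lam]

def resample(points, lam, n, rng):
    """Draw n training points (with replacement) from the collocation set according to q-bar."""
    q_bar = normalize(lam)
    return rng.choices(points, weights=q_bar, k=n)

def mse_loss(residual_fn, batch):
    """Eq. (16): unweighted mean-squared residual on the resampled batch."""
    return sum(residual_fn(x) ** 2 for x in batch) / len(batch)

rng = random.Random(0)
points = [i / 99 for i in range(100)]            # toy 1D collocation set on [0, 1]
residual_fn = lambda x: x                        # toy residual that grows with x
lam = [residual_fn(x) + 1e-3 for x in points]    # stand-in for the smoothed weights
batch = resample(points, lam, n=2000, rng=rng)
```

Because the batch is drawn preferentially from high-residual regions, the unweighted loss on the resampled batch is larger than the same loss on the uniform set, which is the sense in which the new points are more challenging.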

Importance Weighting

Alternatively, the weights can be used directly in an importance weighting scheme. In this case, the training points ${\{{X}_{i}\}}_{i=1}^{N}$ are sampled uniformly from Ω. The weights $\{{\lambda }_{i}^{k+1}\}$ are then applied directly to the residuals within the loss calculation, creating a weighted objective that reads

$${\mathcal{L}}({\theta }^{k+1})=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[{\lambda }_{i}^{k+1}r({X}_{i},{\theta }^{k+1})]}^{2}.$$

(17)

One might worry that squaring the weights and residual departs from the procedure outlined previously. By an application of Jensen’s inequality, one can show that squaring the residual (and the weights, when applicable) corresponds to solving a strictly stronger problem, with the added benefit of differentiability.

This framework is also general enough to recover other popular methods. For example, with the potential Φ(x) = x² + 1 and specific choices of EMA parameters, it is possible to recover the traditional residual-based adaptive weights¹⁵ and their variations^42,45.
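A small numerical illustration of the weighted objective in Eq. (17) is given below. The helper names and the toy values are hypothetical; the weights are treated as fixed numbers in this sketch.

```python
def weighted_loss(lam, residuals):
    """Eq. (17): mean of squared weighted residuals."""
    n = len(residuals)
    return sum((l * r) ** 2 for l, r in zip(lam, residuals)) / n

def largest_share(lam, residuals):
    """Fraction of the loss contributed by the point with the largest residual."""
    terms = [(l * r) ** 2 for l, r in zip(lam, residuals)]
    return terms[residuals.index(max(residuals))] / sum(terms)

residuals = [0.1, 0.2, 0.4, 0.8]
uniform = [1.0] * 4
linear = [r / max(residuals) for r in residuals]   # weights proportional to the residual
loss_uniform = weighted_loss(uniform, residuals)
loss_linear = weighted_loss(linear, residuals)
```

Compared with uniform weights, the linear weights shift a larger share of the objective onto the point with the largest residual.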

Once the objective function ${\mathcal{L}}({\theta }^{k+1})$ is calculated, the model parameters are optimized using a line search algorithm of the form

$${\theta }^{k+1}={\theta }^{k}+{\alpha }^{k}{p}^{k}$$

(18)

where α^k is the step size, and ${p}^{k}=-{H}_{k}{\nabla }_{\theta }{\mathcal{L}}({\theta }^{k})$ is the update direction which depends on the gradient of the objective function and some symmetric matrix H_k⁹.

Update the regularization parameter

As alluded to before, the choice of the annealing schedule depends on the choice of potential Φ. There are two cases to be discussed, and the latter has two further subcases.

Case I: Exponential Potential (Φ(x) = e^x)

For this choice, there are no requirements on the choice of ϵ thanks to the duality between entropy and free energy. The particular choice we implemented reads

$${\epsilon }^{k+1}=\frac{c\,{\max }_{\Omega }r(x;{\theta }^{k})}{\log (2+k)}.$$

(19)

This schedule has several advantages. Using the maximum residual, ${\max }_{\Omega }| r|$, in the numerator provides a dynamic, problem-dependent characteristic scale for the temperature ϵ. This helps to normalize the magnitude of the residuals relative to the magnitude of the solution itself. The logarithmic term in the denominator ensures a slow and stable decay, such that ϵ^k → 0 as k → ∞, which gradually sharpens the distribution’s focus on the largest residuals. In the context of simulated annealing, logarithmic decay is sufficiently slow to guarantee global convergence [ref. ⁴⁶, Theorem 1].
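The schedule in Eq. (19) can be sketched in a few lines. The function name is hypothetical and the constant c = 1 is an assumption, since the text does not fix its value.

```python
import math

def annealed_temperature(residuals, k, c=1.0):
    """Eq. (19): eps^{k+1} = c * max_i r_i / log(2 + k)."""
    return c * max(residuals) / math.log(2 + k)

residuals = [0.1, 0.2, 0.4, 0.8]
schedule = [annealed_temperature(residuals, k) for k in (0, 10, 1000, 100000)]
```

With the residuals held fixed, the temperature decreases monotonically but only logarithmically in k, which is the slow decay described above.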

Case II: general potentials

For this case, the optimality of the distribution q holds under the normalization condition

$${\int }_{\Omega }{\Phi }^{{\prime} }\left(\frac{r(x;\theta )}{\epsilon }\right)p(dx)=1.$$

(20)

For this case, the constraint is satisfied by choosing ϵ^k to be the normalizing constant. There are two further cases.

1. There are cases when the choice of ϵ can be computed analytically, for example, the polynomials Φ(r) = ∣r∣^p + c for any p ∈ (1, ∞) and $c\in {\mathbb{R}}$. In this case, we can see that

$${\int }_{\Omega }{\Phi }^{{\prime} }\left(\frac{r(x;\theta )}{\epsilon }\right)p(dx)=\frac{p}{{\epsilon }^{p-1}}{\int }_{\Omega }| r(x;\theta ){| }^{p-1}p(dx)=1,$$

(21)

so

$$\epsilon ={\left(p{\int }_{\Omega }| r(x;\theta ){| }^{p-1}p(dx)\right)}^{1/(p-1)}.$$

(22)
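As a hedged check of the closed-form case, the sketch below solves the normalization condition (20) directly for Φ(r) = ∣r∣^p + c, using Φ′(u) = p u^{p−1} for u ≥ 0 and the empirical (uniform) measure on a batch in place of the integral. The function names are hypothetical.

```python
def epsilon_lp(residuals, p):
    """Solve mean_i Phi'(r_i / eps) = 1 for Phi(r) = |r|^p + c, where Phi'(u) = p * u^(p-1)."""
    n = len(residuals)
    moment = sum(abs(r) ** (p - 1) for r in residuals) / n   # batch estimate of the integral
    return (p * moment) ** (1.0 / (p - 1))

def normalization(residuals, eps, p):
    """Left-hand side of Eq. (20) under the empirical (uniform) measure."""
    n = len(residuals)
    return sum(p * (abs(r) / eps) ** (p - 1) for r in residuals) / n

residuals = [0.1, 0.2, 0.4, 0.8]
eps_2 = epsilon_lp(residuals, p=2)   # quadratic potential: eps = 2 * mean(r)
eps_3 = epsilon_lp(residuals, p=3)
```

For p = 2 the parameter reduces to twice the mean residual, so the tilted weights 2r/ϵ are the residuals divided by their mean.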

2. In many other cases, e.g., $\Phi (r)=\cosh (r)$, $\Phi (r)={e}^{{r}^{2}}$, or $\Phi (r)=(1+r)\log (1+r)-r$, analytic computation might be challenging. We calculate ϵ^k dynamically at each training step using a Newton-Raphson solver. To ensure stability and speed, we derive robust initial guesses based on the asymptotic behavior of each potential: ${\epsilon }_{0}\approx {r}_{\max }/{\text{ln}}(2N)$ for $\cosh (r)$, ${\epsilon }_{0}\approx {r}_{\max }/\sqrt{{\text{ln}}N}$ for ${e}^{{r}^{2}}$, and ${\epsilon }_{0}\approx \overline{r}/(e-1)$ for the logarithmic potential. Due to the strict convexity of these functions and the accuracy of the initialization, the solver typically converges to machine precision in a few iterations (see Algorithm 1). This dynamic adjustment induces minimal computational overhead relative to the full backpropagation step. In particular, for all the analyzed examples, the inclusion of vRBA adds less than 10% computational overhead relative to the vanilla models.

Algorithm 1

Newton-Raphson Solver for ϵ^k

Require: Residual batch r; Potential Φ; Max iterations M = 20; Tolerance δ = 10⁻⁸.

Ensure: Optimal scaling parameter ϵ.

1: Compute statistics: ${r}_{\max }\Leftarrow \max (r)$, $\overline{r}\Leftarrow mean(r)$, N ⇐ length(r).

2: Initialization (Asymptotic Approximation):

3: if $\Phi (r)=\cosh (r)$ then

4: $\epsilon \Leftarrow {r}_{\max }/\ln (2N)$

5: else if $\Phi (r)={e}^{{r}^{2}}$ then

6: $\epsilon \Leftarrow {r}_{\max }/\sqrt{\ln (N)}$

7: else if $\Phi (r)=(1+r)\log (1+r)-r$ then

8: $\epsilon \Leftarrow \overline{r}/(e-1)$

9: end if

10: $\epsilon \Leftarrow \max (\epsilon ,\delta )$

11: Newton-Raphson Loop:

12: for j = 1…M do

13: u ⇐ r/ϵ

14: Compute constraint residual: $F(\epsilon )\Leftarrow \frac{1}{N}\sum {\Phi }^{{\prime} }(u)-1$

15: Compute gradient: ${F}^{{\prime} }(\epsilon )\Leftarrow \frac{1}{N}\sum {\Phi }^{{\prime\prime} }(u)\cdot (-u/\epsilon )$

16: Update: $\epsilon \Leftarrow \epsilon -F(\epsilon )/({F}^{{\prime} }(\epsilon )-\delta )$

17: Clamp: $\epsilon \Leftarrow \max (\epsilon ,\delta )$

18: end for

19: return ϵ
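Algorithm 1 translates almost line by line into the following plain-Python sketch. The dictionary keys, function names, and the toy residual batch are hypothetical; the derivatives Φ′ and Φ″ are worked out by hand for the three potentials named in the algorithm. Like the pseudocode, the sketch assumes a batch with N > 1 and positive residuals.

```python
import math

# (Phi', Phi'', initial guess) for the three potentials handled in Algorithm 1.
POTENTIALS = {
    "cosh": (math.sinh, math.cosh,
             lambda r_max, r_mean, n: r_max / math.log(2 * n)),
    "exp_sq": (lambda u: 2 * u * math.exp(u * u),
               lambda u: (2 + 4 * u * u) * math.exp(u * u),
               lambda r_max, r_mean, n: r_max / math.sqrt(math.log(n))),
    "log": (lambda u: math.log1p(u),
            lambda u: 1.0 / (1.0 + u),
            lambda r_max, r_mean, n: r_mean / (math.e - 1)),
}

def constraint(residuals, eps, potential):
    """F(eps) = mean_i Phi'(r_i / eps) - 1, the residual of Eq. (20)."""
    dphi = POTENTIALS[potential][0]
    return sum(dphi(r / eps) for r in residuals) / len(residuals) - 1.0

def solve_epsilon(residuals, potential, max_iter=20, delta=1e-8):
    """Newton-Raphson iteration of Algorithm 1 (steps 10 to 18)."""
    dphi, d2phi, guess = POTENTIALS[potential]
    n = len(residuals)
    eps = max(guess(max(residuals), sum(residuals) / n, n), delta)
    for _ in range(max_iter):
        u = [r / eps for r in residuals]
        f = sum(dphi(v) for v in u) / n - 1.0
        df = sum(d2phi(v) * (-v / eps) for v in u) / n   # dF/d(eps), always negative
        eps = max(eps - f / (df - delta), delta)         # update, then clamp
    return eps

residuals = [0.1, 0.2, 0.4, 0.8]
eps_values = {name: solve_epsilon(residuals, name) for name in POTENTIALS}
```

Because each map ϵ ↦ Φ′(r/ϵ) is convex and decreasing for these potentials, the Newton iterates approach the root monotonically after at most one step, which is consistent with the fast convergence reported above.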

Physics-informed neural networks (PINNs)

In the Physics-Informed Neural Network (PINN) framework, the goal is to approximate the solution $\bar{u}(x)$ of a PDE or ODE using a parameterized representation model, u(x; θ). The training objective is to minimize a loss function composed of several residual terms that enforce the problem’s physical and data constraints.

The primary component is the PDE residual, which measures how well the model satisfies the governing equations. It is defined using a differential operator ${\mathcal{F}}$ as:

$$r(x):=| {\mathcal{F}}[u(\cdot ;\theta )]| .$$

(23)

Additionally, the model must match any available data, which may include boundary conditions, initial conditions, or sparse observations from a known function $\bar{u}(x)$. This is enforced through a data-fit residual, defined as the pointwise error:

$$r(x):=| \bar{u}(x)-u(x;\theta )| .$$

(24)

The total loss function, ${\mathcal{L}}$, is a weighted sum of the individual loss terms computed from these residuals. For a problem with governing equations (E), boundary/initial conditions (B), and observational data (D), the total loss is:

$${\mathcal{L}}={m}_{E}{{\mathcal{L}}}_{E}+{m}_{B}{{\mathcal{L}}}_{B}+{m}_{D}{{\mathcal{L}}}_{D},$$

(25)

where m_E, m_B, and m_D are global weights that balance the contribution of each term. The individual loss functions (${{\mathcal{L}}}_{E}$, ${{\mathcal{L}}}_{B}$, ${{\mathcal{L}}}_{D}$) are each computed as described in equation (16) for the adaptive sampling approach or as in equation (17) for the importance weighting. To ensure the update directions induced by the different loss components are balanced, we employ the self-scaling mechanism presented in⁴².

Global weights

Notice that for first-order optimizers such as Adam, the update direction for PINNs (i.e., equation (25)) is given by:

$${p}^{k}=-{m}_{E}{\nabla }_{\theta }{{\mathcal{L}}}_{E}({\theta }^{k})-{m}_{B}{\nabla }_{\theta }{{\mathcal{L}}}_{B}({\theta }^{k})-{m}_{D}{\nabla }_{\theta }{{\mathcal{L}}}_{D}({\theta }^{k}),$$

(26)

where ${\nabla }_{\theta }{{\mathcal{L}}}_{E}$, ${\nabla }_{\theta }{{\mathcal{L}}}_{B}$, and ${\nabla }_{\theta }{{\mathcal{L}}}_{D}$ are the loss gradients which can be represented as high-dimensional vectors defining directions to minimize their respective loss terms. Notice that if the gradient magnitudes are imbalanced, one direction will dominate, which may lead to poor convergence. To address this challenge, we propose modifying the magnitude of the individual directions by scaling their respective global weights. In particular, we fix m_E and update the remaining global weights using the rule:

$${m}_{B}^{k}=\alpha {m}_{B}^{k-1}+(1-\alpha ){m}_{E}\frac{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}{||{\nabla }_{\theta }{{\mathcal{L}}}_{B}||},$$

(27)

$${m}_{D}^{k}=\alpha {m}_{D}^{k-1}+(1-\alpha ){m}_{E}\frac{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}{||{\nabla }_{\theta }{{\mathcal{L}}}_{D}||},$$

(28)

where α ∈ [0, 1] is a stabilization parameter⁴⁷. This formulation computes the iteration-wise average ratio between gradients, enabling normalized scaling. Substituting the resulting averages, ${m}_{B}\approx {m}_{E}||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||/||{\nabla }_{\theta }{{\mathcal{L}}}_{B}||$ and likewise for m_D, into equation (26) allows us to define, on average, a balanced update direction ${\widehat{p}}^{k}$:

$${\widehat{p}}^{k}\approx -{m}_{E}||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||\,\left[\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{E}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}+\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{B}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{B}||}+\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{D}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{D}||}\right].$$

(29)

Under this approach, all loss components have balanced magnitudes, allowing each optimization step to minimize all terms effectively.
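The balancing rule can be illustrated on toy gradient vectors. This is a sketch under stated assumptions: the function names, the toy gradients, and the value α = 0.9 are hypothetical, and `m_ref` plays the role of the fixed weight m_E, following line 25 of Algorithm 2.

```python
import math

def norm(v):
    """Euclidean norm of a gradient stored as a flat list."""
    return math.sqrt(sum(x * x for x in v))

def update_global_weight(m_prev, grad_ref, grad_other, alpha=0.9, m_ref=1.0):
    """EMA of the gradient-norm ratio used to rescale one loss component."""
    ratio = norm(grad_ref) / norm(grad_other)
    return alpha * m_prev + (1.0 - alpha) * m_ref * ratio

grad_E = [3.0, 4.0]      # toy PDE-loss gradient, norm 5
grad_B = [0.0, 0.05]     # toy boundary-loss gradient, norm 0.05
m_B = 1.0
for _ in range(300):     # gradients are frozen here only to expose the EMA fixed point
    m_B = update_global_weight(m_B, grad_E, grad_B)
scaled_B = [m_B * g for g in grad_B]
```

Once the moving average has settled, the rescaled boundary gradient has the same norm as the PDE gradient, so neither direction dominates the update.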

A detailed description of the proposed method is presented in Algorithm 2. For the second-order experiments, we follow the general methodology of⁹, which uses the SSBroyden optimizer after 5k Adam pre-training iterations. The crucial modification in our work is that the sampling distribution is generated by our vRBA framework, rather than the standard RAD formulation used in the reference.

Algorithm 2

vRBA for PINNs

Require: Representation model ${\mathcal{M}}$; Training points X_B, X_D, X_E; Optimizer parameters lr; vRBA parameters $\eta ,{\lambda }_{max0},{\lambda }_{cap},{\alpha }_{g},{m}_{E},{\gamma }_{g}$; Iterations per stage N_stage; Total iterations N_train; Boolean flags adaptive_weights, adaptive_distribution.

Ensure: Optimized network parameters θ.

1: Initialize network parameters θ⁰.

2: Initialize weights ${\lambda }_{\alpha ,i}^{0}\Leftarrow 0.1{\lambda }_{max0}$ for each loss component α ∈ {B, D, E}.

3: Initialize sampling p.m.f. ${\overline{q}}_{\alpha }$ to be uniform for each α.

4: k ⇐ 0.

5: while k < N_train do

6: ${\lambda }_{max}\Leftarrow \min ({\lambda }_{max0}+k/{N}_{stage},{\lambda }_{cap})$

7: γ^k ⇐ 1 − η/λ_max

8: for α ∈ {B, D, E} do

9: if adaptive_distribution then

10: Update sampling p.m.f: ${\overline{q}}_{\alpha }^{k}\Leftarrow {{\boldsymbol{\lambda }}}_{\alpha }^{k}/\sum ({{\boldsymbol{\lambda }}}_{\alpha }^{k})$

11: end if

12: Sample batch ${X}_{\alpha }^{k} \sim {\overline{q}}_{\alpha }^{k}$ from X_α.

13: Compute predictions: ${u}_{\alpha ,i}\Leftarrow {\mathcal{M}}({\theta }^{k},{x}_{\alpha ,i}^{k})$ for each ${x}_{\alpha ,i}^{k}\in {X}_{\alpha }^{k}$.

14: Compute residuals ${r}_{\alpha ,i}^{k}$ using equations (24) or (23).

15: Update tilted distribution ${q}_{\alpha ,i}^{k}$ using equation (12).

16: Apply EMA: ${\lambda }_{\alpha ,i}^{k+1}\Leftarrow {\gamma }^{k}{\lambda }_{\alpha ,i}^{k}+{\eta }^{* }{q}_{\alpha ,i}^{k}$.

17: if adaptive_weights then

18: Compute loss term: ${{\mathcal{L}}}_{\alpha }^{k}\Leftarrow \frac{1}{| {X}_{\alpha }^{k}| }{\sum }_{i}{({\lambda }_{\alpha ,i}^{k+1}{r}_{\alpha ,i}^{k})}^{2}$.

19: else

20: Compute loss term: ${{\mathcal{L}}}_{\alpha }^{k}\Leftarrow \frac{1}{| {X}_{\alpha }^{k}| }{\sum }_{i}{({r}_{\alpha ,i}^{k})}^{2}$.

21: end if

22: Compute gradient ${\nabla }_{\theta }{{\mathcal{L}}}_{\alpha }^{k}$.

23: Update average gradient: $\parallel {\nabla }_{\theta }{\overline{{\mathcal{L}}}}_{\alpha }^{k}\parallel \Leftarrow {\gamma }_{g}\parallel {\nabla }_{\theta }{\overline{{\mathcal{L}}}}_{\alpha }^{k-1}\parallel +(1-{\gamma }_{g})\parallel {\nabla }_{\theta }{{\mathcal{L}}}_{\alpha }^{k}\parallel$.

24: end for

25: Update global weight: ${m}_{D}^{k+1}\Leftarrow {\alpha }_{g}{m}_{D}^{k}+(1-{\alpha }_{g}){m}_{E}\frac{| | {\nabla }_{\theta }{{\overline{{\mathcal{L}}}}_{E}}^{k}| | }{| | {\nabla }_{\theta }{{\overline{{\mathcal{L}}}}_{D}}^{k}| | }$.

26: Define total update direction: ${p}^{k}\Leftarrow -{m}_{E}{\nabla }_{\theta }{{\mathcal{L}}}_{E}^{k}-{m}_{B}{\nabla }_{\theta }{{\mathcal{L}}}_{B}^{k}-{m}_{D}^{k+1}{\nabla }_{\theta }{{\mathcal{L}}}_{D}^{k}$.

27: Update parameters: θ^k+1 ⇐ θ^k + lr ⋅ p^k.

28: k ⇐ k + 1.

29: end while

Benchmarks

Allen-Cahn

The Allen-Cahn equation is a widely recognized benchmark in PINNs due to its challenging characteristics. The 1D Allen-Cahn PDE is defined as:

$$\frac{\partial u}{\partial t}=k\frac{{\partial }^{2}u}{\partial {x}^{2}}-5u({u}^{2}-1),$$

(30)

where k = 10⁻⁴. The problem is further defined by the following initial and periodic boundary conditions:

$$u(0,x)={x}^{2}\cos (\pi x),\,\forall x\in [-1,1],$$

(31)

Burgers Equation

The Burgers’ equation is defined as

$${u}_{t}+u{u}_{x}=\nu {u}_{xx},$$

(32)

where u represents the velocity field and ν is the dynamic viscosity. In this study we consider two separate cases where $\nu =\frac{1}{100\pi }$ and $\nu =\frac{1}{1000}$. The initial and boundary conditions are described as follows

$$u(0,x)=-\sin (\pi x),\,\forall x\in \Omega ,$$

(33)

$$u(t,-1)=u(t,1)=0,\,\forall t\ge 0,$$

(34)

defined over the domain Ω = (−1, 1) × (0, 1) and periodic boundary conditions in x.

Korteweg-De Vries (KdV)

The Korteweg-De Vries (KdV) equation is a canonical model for shallow water waves and serves as a rigorous benchmark for PINNs due to the presence of third-order spatial derivatives and nonlinear soliton interactions. The PDE is defined as:

$$\frac{\partial u}{\partial t}+6u\frac{\partial u}{\partial x}+\frac{{\partial }^{3}u}{\partial {x}^{3}}=0,$$

(35)

defined over the spatio-temporal domain x ∈ [0, 20] and t ∈ [0, 5]. The problem is closed by the following initial and boundary conditions:

$$\begin{array}{ll}u(0,x)={g}_{0}(x), & \forall x\in [0,20],\\ u(t,0)={g}_{1}(t),\quad u(t,20)={g}_{2}(t),\quad \frac{\partial u}{\partial x}(t,20)={g}_{3}(t), & \forall t\in [0,5],\end{array}$$

(36)

where the boundary functions g₀, g₁, g₂, and g₃ are derived by evaluating the analytical solution at the domain boundaries. The exact solution, describing the interaction of two solitons, is given by:

$$u(x,t)=\frac{2({c}_{1}-{c}_{2})\left[{c}_{1}{\cosh }^{2}\left(\frac{\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)+{c}_{2}{\sinh }^{2}\left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}}{2}\right)\right]}{{\left[\left(\sqrt{{c}_{1}}-\sqrt{{c}_{2}}\right)\cosh \left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}+\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)+\left(\sqrt{{c}_{1}}+\sqrt{{c}_{2}}\right)\cosh \left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}-\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)\right]}^{2}},$$

(37)

where ζ_i = x − c_it − x_i for i = 1, 2. The specific parameters for this benchmark are set to c₁ = 6.0, c₂ = 2.0, x₁ = − 2.0, and x₂ = 2.0.
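The exact solution (37) is straightforward to evaluate, and a finite-difference check offers a quick sanity test that it satisfies Eq. (35). The sketch below is illustrative only: the function names, the step size h, and the sampling of x at t = 1 are choices made here and are not taken from the benchmark setup.

```python
import math

C1, C2, X1, X2 = 6.0, 2.0, -2.0, 2.0   # benchmark parameters from the text

def kdv_two_soliton(x, t):
    """Exact two-soliton solution of Eq. (35), evaluated from Eq. (37)."""
    a = math.sqrt(C1) * (x - C1 * t - X1) / 2.0
    b = math.sqrt(C2) * (x - C2 * t - X2) / 2.0
    num = 2.0 * (C1 - C2) * (C1 * math.cosh(b) ** 2 + C2 * math.sinh(a) ** 2)
    den = ((math.sqrt(C1) - math.sqrt(C2)) * math.cosh(a + b)
           + (math.sqrt(C1) + math.sqrt(C2)) * math.cosh(a - b)) ** 2
    return num / den

def kdv_residual(x, t, h=1e-3):
    """Central finite-difference estimate of u_t + 6 u u_x + u_xxx."""
    u = kdv_two_soliton
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xxx = (u(x + 2 * h, t) - 2 * u(x + h, t)
             + 2 * u(x - h, t) - u(x - 2 * h, t)) / (2 * h ** 3)
    return u_t + 6.0 * u(x, t) * u_x + u_xxx

# Largest finite-difference residual over x in [0, 20] at t = 1.
worst = max(abs(kdv_residual(0.5 * i, 1.0)) for i in range(41))
```

The same function can be evaluated on the boundaries to produce the data g₀, g₁, g₂ and, after differentiation in x, g₃.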

Operator learning

Let ${\mathcal{X}}$ be a space of functions over a domain ${\Omega }_{X}\subset {{\mathbb{R}}}^{{d}_{x}}$, and ${\mathcal{Y}}$ be a space of functions over ${\Omega }_{Y}\subset {{\mathbb{R}}}^{{d}_{y}}$. The operator of interest is

$${\mathcal{G}}:{\mathcal{X}}\ni v\,\mapsto {\mathcal{G}}[v]\in {\mathcal{Y}}.$$

The goal is to learn a parametric model G_θ that approximates ${\mathcal{G}}$. The residual $R:{\mathcal{X}}\times {\Omega }_{Y}\to {{\mathbb{R}}}^{+}$ for this task is defined as the absolute difference between the operator’s prediction and the true solution and reads

$$R(v,x;\theta )=| {G}_{\theta }(v)(x)-{\mathcal{G}}[v](x)| ,$$

(38)

where $v\in {\mathcal{X}}$ is an input function and ${\mathcal{G}}[v]$ is the corresponding true output function evaluated at a point x ∈ Ω_Y. The training data consists of N_func input-output function pairs, ${\{{v}_{j},{\mathcal{G}}[{v}_{j}]\}}_{j=1}^{{N}_{func}}$, where each output function ${\mathcal{G}}[{v}_{j}]$ is evaluated at N discrete points ${\{{x}_{i}\}}_{i=1}^{N}$. The standard loss is an average over both the function instances and the spatial points:

$${\mathcal{L}}(\theta )=\frac{1}{{N}_{{\text{func}}}}\mathop{\sum }\limits_{j=1}^{{N}_{{\text{func}}}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[R({v}_{j},{x}_{i};\theta )]}^{2}$$

(39)

A single importance sampling or weighting scheme is ill-suited for this problem due to the two distinct levels of discretization (in function space and spatial domains). To address this, we propose a mixed strategy: importance weighting is used for the spatial points within each function, while adaptive sampling is used for the functions themselves. This is motivated by the fact that many neural operators have a fixed spatial discretization, making weighting a natural fit, while the function space offers more flexibility for sampling.

The loss function for a batch of b_u functions is updated as follows

$${\mathcal{L}}(\theta )=\frac{1}{{b}_{u}}\mathop{\sum }\limits_{j=1}^{{b}_{u}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[{\Lambda }_{i,j}R({v}_{j},{x}_{i};\theta )]}^{2},$$

(40)

where the functions ${\{{v}_{j}\}}_{j=1}^{{b}_{u}}$ are sampled from the full set of training functions. The term Λ is a matrix of importance weights, where Λ_i,j corresponds to point x_i for function v_j. These weights are constructed from a target p.m.f. matrix Q^k constructed based on the choice of potential. For instance when Φ(x) = e^x, ${Q}^{k}\in {{\rm{{\mathbb{R}}}}}^{N\times {N}_{{\text{func}}}}$ is defined recursively as follows

$${Q}_{i,j}^{k+1}({\theta }^{k})=\frac{\exp \left(\frac{R({v}_{j},{x}_{i};{\theta }^{k})}{{\epsilon }^{k}}\right)}{{\sum }_{\ell =1}^{N}\exp \left(\frac{R({v}_{j},{x}_{\ell };{\theta }^{k})}{{\epsilon }^{k}}\right)}.$$

(41)

Note that each column of the matrix Q (for a fixed function j) is a p.m.f. over the spatial points, focusing attention on high-residual regions for that specific function. The weights are then smoothed over time with an EMA

$${\Lambda }_{i,j}^{k+1}=\gamma {\Lambda }_{i,j}^{k}+{\eta }^{* }{Q}_{i,j}^{k+1}({\theta }^{k}).$$

(42)

As in the previous case, we can set the learning rate for stability, for example, by normalizing it as ${\eta }^{* }=\eta /{\max }_{i}{Q}_{i,j}$, where the maximum is taken over the spatial points of function j. Note that this choice of η^* achieves a normalization per function which is consistent with our two-level discretization. This EMA formulation has the useful property of keeping the weights bounded. As described in¹⁵, the update rule ensures that the weights are constrained to the interval ${\Lambda }_{i,j}\in (0,\frac{{\eta }^{* }}{1-\gamma })$, which aids in stabilizing the training process.
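Equations (41) and (42) can be sketched on a small residual matrix stored as nested lists. The function names and the toy values are hypothetical. The sketch reads the per-function normalization as a maximum over the spatial points i for each function j, which is an interpretation, and the max-subtraction inside the softmax is an assumed numerical safeguard.

```python
import math

def column_softmax(R, eps):
    """Eq. (41): for each function j, a softmax over spatial points i of R[i][j] / eps."""
    n_pts, n_fn = len(R), len(R[0])
    Q = [[0.0] * n_fn for _ in range(n_pts)]
    for j in range(n_fn):
        col = [R[i][j] / eps for i in range(n_pts)]
        shift = max(col)                       # cancels in the ratio, avoids overflow
        e = [math.exp(c - shift) for c in col]
        z = sum(e)
        for i in range(n_pts):
            Q[i][j] = e[i] / z
    return Q

def ema_update(Lam, Q, gamma, eta):
    """Eq. (42) with one normalized learning rate eta* per function (column)."""
    n_pts, n_fn = len(Q), len(Q[0])
    out = [[0.0] * n_fn for _ in range(n_pts)]
    for j in range(n_fn):
        eta_star = eta / max(Q[i][j] for i in range(n_pts))
        for i in range(n_pts):
            out[i][j] = gamma * Lam[i][j] + eta_star * Q[i][j]
    return out

R = [[0.1, 1.0], [0.2, 2.0], [0.4, 4.0]]       # 3 spatial points, 2 functions
Q = column_softmax(R, eps=1.0)
Lam = [[0.0, 0.0] for _ in range(3)]
for _ in range(100):                            # Q is frozen only to expose the bound
    Lam = ema_update(Lam, Q, gamma=0.9, eta=0.05)
```

In this toy run every weight stays below η/(1 − γ) = 0.5, and the point with the largest residual in each column approaches that bound.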

A key advantage of this framework is that, if η ≠ 1 − γ, we can leverage the imbalance on learned spatial weights, Λ_i,j, to construct a sampling distribution over the functions themselves. The intuition is that functions with higher overall residuals will naturally accumulate larger Λ values over time. Therefore, we propose the following approach to create a function-level sampling distribution. First, we compute an aggregated importance score s_j for each function by summing its spatial weights

$${s}_{j}=\mathop{\sum }\limits_{i=1}^{N}{\Lambda }_{i,j}.$$

(43)

These scores are then normalized to create a p.m.f. over the function space:

$${\overline{q}}_{j}=\frac{{s}_{j}}{{\sum }_{\ell =1}^{{N}_{func}}{s}_{\ell }}.$$

(44)

This distribution $\overline{q}$ can then be used to sample the most informative functions v_j for the next training batch. A detailed description of the proposed method is given in Algorithm 3.
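As a small illustration of Eqs. (43)–(44), the sketch below aggregates a weight matrix into a function-level p.m.f. and draws a batch from it. The helper name `function_pmf`, the synthetic weights, and the choice to sample without replacement are assumptions made for the example.

```python
import numpy as np

def function_pmf(lam):
    """lam has shape (N, n_func); returns a p.m.f. over the n_func functions."""
    s = lam.sum(axis=0)      # aggregated score s_j, Eq. (43)
    return s / s.sum()       # normalized scores, Eq. (44)

rng = np.random.default_rng(0)
lam = rng.uniform(0.1, 1.0, size=(64, 10))   # synthetic spatial weights
lam[:, 3] *= 5.0                             # pretend function 3 has large residuals
q_bar = function_pmf(lam)

# Draw a batch of function indices; function 3 is the most likely pick.
batch = rng.choice(lam.shape[1], size=4, replace=False, p=q_bar)
print(q_bar.round(3), batch)
```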

Algorithm 3

vRBA for Operator Learning

Require: Neural Operator G_θ; Training data ${\{{v}_{j},{u}_{j}\}}_{j=1}^{{N}_{func}}$; Optimizer parameters lr; vRBA parameters $\eta ,{\lambda }_{max0},{\lambda }_{cap},\gamma$; Batch size b_u; Update frequency N_update; Total iterations N_train.

Ensure: Optimized network parameters θ.

1: Initialize network parameters θ⁰.

2: Initialize weights ${\Lambda }_{i,j}^{0}\Leftarrow 0.1{\lambda }_{max0}$ for all i, j.

3: Initialize function sampling p.m.f. ${\overline{q}}_{j}^{0}\Leftarrow 1/{N}_{func}$ for all j.

4: for k ⇐ 0 to N_train − 1 do

5: ${\lambda }_{max}\Leftarrow \min ({\lambda }_{cap},{\lambda }_{max0}+k/{N}_{stage})$

6: γ^k ⇐ 1 − η/λ_max

7: Sample a batch of b_u function indices ${{\mathcal{J}}}_{k} \sim {\overline{q}}^{k}$.

8: Compute the batch residuals: ${R}_{i,j}^{k}\Leftarrow {G}_{{\theta }^{k}}({v}_{j})({x}_{i})-{u}_{j}({x}_{i})$ for $i\in \{1..N\},j\in {{\mathcal{J}}}_{k}$.

9: Update target distribution ${Q}_{i,j}^{k+1}$ using $| {R}_{i,j}^{k}|$ (via Eq. (12)).

10: Update weights via EMA: ${\Lambda }_{i,j}^{k+1}\Leftarrow {\gamma }^{k}{\Lambda }_{i,j}^{k}+{\eta }^{* }{Q}_{i,j}^{k+1}$ for $j\in {{\mathcal{J}}}_{k}$.

11: Compute weighted loss for the batch: ${{\mathcal{L}}}^{k}\Leftarrow \frac{1}{{b}_{u}N}{\sum }_{j\in {{\mathcal{J}}}_{k}}{\sum }_{i=1}^{N}{[{\Lambda }_{i,j}^{k+1}{R}_{i,j}^{k}]}^{2}$.

12: Compute gradient of the loss: ${g}^{k}\Leftarrow {\nabla }_{\theta }{{\mathcal{L}}}^{k}{| }_{\theta ={\theta }^{k}}$.

13: Update parameters: ${\theta }^{k+1}\Leftarrow {\theta }^{k}-lr\cdot {g}^{k}$.

14: if $k\,(mod\,\,{N}_{update})==0$ then

15: Aggregate scores: ${s}_{j}^{k+1}\Leftarrow {\sum }_{i=1}^{N}{\Lambda }_{i,j}^{k+1}$ for j = 1. . N_func.

16: Normalize to form new p.m.f.: ${\overline{q}}_{j}^{k+1}\Leftarrow {s}_{j}^{k+1}/{\sum }_{\ell =1}^{{N}_{func}}{s}_{\ell }^{k+1}$.

17: end if

18: end for
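The loop in Algorithm 3 can be sketched on a toy problem. The example below is a hedged illustration, not the released code: the "operator" is a linear map `V @ theta` so that the gradient is available in closed form, the weights are treated as constants when differentiating, the temperature `eps` is held fixed, the exponential potential is used for Q, and η^* is normalized by the per-function maximum of Q. The function name `vrba_operator_toy` and all default values are assumptions.

```python
import numpy as np

def vrba_operator_toy(V, U, lr=0.05, eta=0.01, lam_max0=1.0, lam_cap=10.0,
                      n_stage=200, b_u=8, n_update=10, n_train=500, eps=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n_func, m = V.shape
    N = U.shape[1]
    theta = np.zeros((m, N))                          # step 1
    lam = np.full((n_func, N), 0.1 * lam_max0)        # step 2 (stored as functions x points)
    q_bar = np.full(n_func, 1.0 / n_func)             # step 3
    for k in range(n_train):
        lam_max = min(lam_cap, lam_max0 + k / n_stage)            # step 5
        gamma = 1.0 - eta / lam_max                               # step 6
        J = rng.choice(n_func, size=b_u, replace=False, p=q_bar)  # step 7
        R = V[J] @ theta - U[J]                                   # step 8, shape (b_u, N)
        z = np.abs(R) / eps
        Q = np.exp(z - z.max(axis=1, keepdims=True))
        Q /= Q.sum(axis=1, keepdims=True)                         # step 9
        eta_star = eta / Q.max(axis=1, keepdims=True)
        lam[J] = gamma * lam[J] + eta_star * Q                    # step 10
        W = lam[J]
        loss = np.mean((W * R) ** 2)                              # step 11 (for monitoring)
        grad = 2.0 * V[J].T @ (W ** 2 * R) / (b_u * N)            # step 12, weights held fixed
        theta -= lr * grad                                        # step 13
        if k % n_update == 0:                                     # steps 14-16
            s = lam.sum(axis=1)
            q_bar = s / s.sum()
    return theta, lam, q_bar

# Synthetic data: 40 input functions with 5 sensors, outputs on 16 points.
rng = np.random.default_rng(1)
V = rng.normal(size=(40, 5))
theta_true = rng.normal(size=(5, 16))
U = V @ theta_true
theta, lam, q_bar = vrba_operator_toy(V, U)
print(np.linalg.norm(theta - theta_true), np.linalg.norm(theta_true))
```

In a real neural operator the closed-form gradient would be replaced by automatic differentiation, and the plain gradient step by the optimizer of choice.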

DeepONet

DeepONet consists of two networks: a trunk network and a branch network. The trunk network encodes spatial coordinates and learns a basis in the target function space, while the branch network maps the input function, evaluated at a fixed set of sensors, to coefficients that project onto this learned basis. The resulting dot product yields the output function at each spatial location. This design is rooted in the universal approximation theorem for operators and enables expressive and efficient modeling of nonlinear operators. DeepONet and its variants are widely applied in mechanics, high-speed flows³⁶, materials science, and multi-phase flows⁴⁸.
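The branch-trunk dot product described above can be sketched as follows. This is a minimal NumPy forward pass with untrained random weights; the helper names (`init_mlp`, `mlp`, `deeponet`), layer sizes, and the $\tanh$ activation are assumptions chosen for brevity.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random dense layers; sizes = [in, hidden, ..., out]."""
    return [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b                    # linear output layer

def deeponet(branch, trunk, v_sensors, x):
    coeffs = mlp(branch, v_sensors)     # (n_func, p): coefficients per input function
    basis = mlp(trunk, x)               # (n_points, p): learned basis at each location
    return coeffs @ basis.T             # (n_func, n_points): dot product over p

rng = np.random.default_rng(0)
branch = init_mlp([20, 32, 8], rng)     # 20 sensors -> p = 8 coefficients
trunk = init_mlp([1, 32, 8], rng)       # 1D coordinate -> p = 8 basis values
v = rng.normal(size=(3, 20))            # 3 input functions sampled at the sensors
x = np.linspace(0.0, 1.0, 50)[:, None]
out = deeponet(branch, trunk, v, x)
print(out.shape)
```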

SVD-DeepONet

To address the challenges of modeling discontinuous solutions such as shocks, a two-step training strategy⁴⁹ is often employed to enhance the standard DeepONet architecture. In this approach, the trunk network is trained first to extract a basis, which is then orthonormalized using QR factorization or Singular Value Decomposition (SVD). While QR factorization ensures orthonormality, SVD is frequently preferred because it provides a unique solution and generates a hierarchical set of orthonormal basis functions that allow for physical interpretation of the flow features. Once this optimized basis is established, the branch network is trained in the second stage to map input parameters to the corresponding coefficients. This modification significantly improves the network’s accuracy, efficiency, and robustness, particularly when solving Riemann problems with extreme pressure ratios³⁶.
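The orthonormalization step can be illustrated with NumPy. In this sketch the matrix `Phi` is a stand-in for the outputs of a trained trunk network (here simply monomials), and the target profile is synthetic; both are assumptions for the example. It shows the two factorizations mentioned above and the projection of a target onto the SVD basis.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)
Phi = np.stack([x ** k for k in range(6)], axis=1)   # stand-in trunk basis, (n_points, p)

# SVD: columns of U_svd are orthonormal and ordered by singular value.
U_svd, S, Vt = np.linalg.svd(Phi, full_matrices=False)

# QR: columns of Q_qr are orthonormal but carry no hierarchical ordering.
Q_qr, R_qr = np.linalg.qr(Phi)

# Coefficients of a target function in the SVD basis (least-squares projection).
target = np.tanh(20.0 * (x - 0.5))                   # synthetic steep profile
coeffs = U_svd.T @ target
recon = U_svd @ coeffs
print(S, np.linalg.norm(target - recon))
```

In the second training stage, coefficients of this kind would serve as regression targets for the branch network.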

FNO

FNOs learn solution operators by leveraging spectral convolutions in the Fourier domain⁴. The input function is first lifted to a high-dimensional latent space through pointwise linear transformations. A Fourier transform is applied to these lifted features, enabling convolutional operations to be performed as multiplications in frequency space. High-frequency modes are typically truncated to enforce smoothness, reduce overfitting, and improve training dynamics. The result is then transformed back to physical space via the inverse Fourier transform and projected to the target dimension. The global receptive field of FNOs makes them particularly effective for modeling long-range dependencies in solutions to PDEs, as demonstrated in applications such as weather forecasts, porous media flows, and turbulence.
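A one-dimensional spectral convolution with mode truncation can be sketched as below. This is an illustrative NumPy version, not the reference FNO code: the function name `spectral_conv_1d`, the weight layout, and the test signals are assumptions.

```python
import numpy as np

def spectral_conv_1d(h, weights, n_modes):
    """h: (n_points, c_in) real features; weights: (n_modes, c_in, c_out) complex."""
    h_hat = np.fft.rfft(h, axis=0)                              # to frequency space
    out_hat = np.zeros((h_hat.shape[0], weights.shape[2]), dtype=complex)
    # Multiply the lowest n_modes by learned weights; higher modes are dropped.
    out_hat[:n_modes] = np.einsum("kc,kcd->kd", h_hat[:n_modes], weights)
    return np.fft.irfft(out_hat, n=h.shape[0], axis=0)          # back to physical space

n = 64
x = np.arange(n) / n
identity = np.ones((8, 1, 1), dtype=complex)                    # pass-through weights, 8 modes
low = np.sin(2 * np.pi * 3 * x)[:, None]                        # mode 3: kept
high = np.sin(2 * np.pi * 20 * x)[:, None]                      # mode 20: truncated
low_out = spectral_conv_1d(low, identity, n_modes=8)
high_out = spectral_conv_1d(high, identity, n_modes=8)
print(np.abs(low_out - low).max(), np.abs(high_out).max())
```

With pass-through weights the layer acts as a low-pass filter, which makes the effect of truncation easy to see; in a trained FNO the weights are learned and the layer is combined with the lifting and projection maps.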

TC-UNet

Unlike FNOs, TC-UNet²⁷ operates entirely in physical space using local convolutions. The architecture is based on a UNet, a hierarchical fully convolutional neural network that captures multiscale features through successive downsampling and upsampling. TC-UNet uses time conditioning via feature-wise linear modulation (FiLM)⁵⁰, applied at each level of the hierarchy. This allows the model to adaptively modulate intermediate features based on the time coordinate input, enabling accurate modeling of spatiotemporal dynamics. TC-UNet or UNet-based architectures are particularly well-suited for problems characterized by sharp gradients³⁶ or fine-scale structures⁵¹ and are, in general, more robust to spectral bias⁵² compared to other neural operator architectures.
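Feature-wise linear modulation amounts to a per-channel scale and shift that depend on the conditioning input. The sketch below uses a single linear map from the time coordinate to the scale and shift, which is a simplifying assumption; the function name `film` and all shapes are hypothetical.

```python
import numpy as np

def film(features, t, W_gamma, b_gamma, W_beta, b_beta):
    """features: (channels, H, W); t: scalar time; returns modulated features."""
    t_emb = np.array([t])
    gamma = t_emb @ W_gamma + b_gamma          # per-channel scale, shape (channels,)
    beta = t_emb @ W_beta + b_beta             # per-channel shift, shape (channels,)
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
channels = 4
features = rng.normal(size=(channels, 8, 8))
W_gamma = rng.normal(size=(1, channels))
W_beta = rng.normal(size=(1, channels))
out = film(features, 0.5, W_gamma, np.ones(channels), W_beta, np.zeros(channels))
print(out.shape)
```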

Benchmarks

Bubble growth dynamics

We study the dynamics of a single gas bubble in an incompressible liquid governed by the Rayleigh-Plesset (R-P) equation⁴⁸, a nonlinear ordinary differential equation describing the evolution of the bubble radius R(t) under a time-varying pressure field P_∞(t). Under isothermal assumptions and negligible temperature variations, the simplified linearized R-P equation reads

$$-\frac{\Delta p(t)}{{\rho }_{L}}={R}_{0}\frac{{d}^{2}r}{d{t}^{2}}+\frac{4{\nu }_{L}}{{R}_{0}}\frac{dr}{dt}+\frac{1}{{\rho }_{L}{R}_{0}}\left(3{P}_{G0}-\frac{2\gamma }{{R}_{0}}\right)r(t),$$

(45)

where r(t) = R(t) − R₀ is the deviation from the initial bubble radius R₀, ρ_L is the liquid density, ν_L is the kinematic viscosity, γ is the surface tension, and P_G0 is the initial gas pressure inside the bubble.

We generate a dataset by numerically solving equation (45) for 1000 independent realizations of the forcing function Δp(t), which is constructed as a product of a Gaussian random field and a smooth ramp function, following the procedure in⁴⁸. Specifically, the pressure field is modeled as

$$\Delta p(t)=g(t)s(t),\,g(t) \sim {\mathcal{GP}}(\mu ,{\sigma }^{2}k({t}_{1},{t}_{2})),$$

where k(t₁, t₂) is a squared exponential kernel with correlation length ℓ, and s(t) is a smooth ramp used to induce a sharp initial pressure drop.
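One way to draw such a forcing is sketched below. The squared exponential kernel follows the description above, while the ramp `tanh(t / ramp_width)`, the parameter values, and the function name `sample_forcing` are placeholders chosen for illustration; the exact ramp and parameters follow⁴⁸ and are not reproduced here.

```python
import numpy as np

def sample_forcing(t, mu, sigma, ell, ramp_width, rng):
    """Draw g ~ GP(mu, sigma^2 k) on the grid t and multiply by a smooth ramp."""
    d = t[:, None] - t[None, :]
    K = sigma ** 2 * np.exp(-0.5 * (d / ell) ** 2)    # squared exponential kernel
    w, E = np.linalg.eigh(K)                          # symmetric eigendecomposition
    w = np.clip(w, 0.0, None)                         # remove tiny negative round-off
    g = mu + E @ (np.sqrt(w) * rng.standard_normal(t.size))
    s = np.tanh(t / ramp_width)                       # hypothetical ramp with s(0) = 0
    return g * s

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
dp = sample_forcing(t, mu=0.0, sigma=1.0, ell=0.2, ramp_width=0.05, rng=rng)
print(dp[:5])
```

The eigendecomposition is used instead of a Cholesky factorization because squared exponential kernel matrices are often numerically singular on fine grids.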

The data were split into training, validation, and testing subsets in the ratio 80:10:10. Each simulation yields a trajectory of the bubble radius R(t), sampled over a fixed time window with initial condition R(0) = R₀, $\dot{R}(0)=0$. All simulations assume periodic boundary conditions and are performed with parameters corresponding to the physical properties of water at room temperature.

High-pressure Sod shock tube

We consider the one-dimensional Riemann problem governed by the compressible Euler equations of gas dynamics. This system describes the conservation of mass, momentum, and energy in an inviscid flow and is given by the hyperbolic system

$$\frac{\partial U}{\partial t}+\frac{\partial F(U)}{\partial x}=0,\,x\in [0,L],\,t > 0,$$

(46)

where the vector of conservative variables U and the flux vector F(U) are defined as

$$U=\left(\begin{array}{c}\rho \\ \rho u\\ \rho E\end{array}\right),\,F(U)=\left(\begin{array}{l}\rho u\\ \rho {u}^{2}+p\\ u(\rho E+p)\end{array}\right).$$

Here, ρ denotes the fluid density, u the velocity, and p the pressure. The total energy E is related to the pressure by the equation of state for an ideal gas, $\rho E=\frac{p}{\gamma -1}+\frac{1}{2}\rho {u}^{2}$, with the specific heat ratio set to γ = 1.4.

The system is subject to discontinuous initial conditions consisting of two constant states, U_L and U_R, separated by a diaphragm at x = x_c. In this study, we specifically focus on the High-Pressure Ratio (HPR) regime, characterized by extreme pressure jumps (up to a ratio of 10¹⁰) across the discontinuity¹. The dataset is generated by varying the initial left-state pressure p_L while keeping the right-state parameters fixed, utilizing an exact Riemann solver to provide the ground truth solutions at a final time t_f.

While the system involves three primitive variables, we restrict the neural operator to map the initial pressure parameter p_L exclusively to the density field ρ(x, t_f). The density profile is particularly challenging and representative as it uniquely exhibits all three fundamental wave structures inherent to the Riemann problem: the rarefaction wave, the contact discontinuity, and the shock wave.
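For concreteness, the conservative variables, the flux, and the ideal-gas equation of state defined above can be written as two short functions. The function names are hypothetical, and the state used in the example is the classic low-pressure Sod left state (ρ, u, p) = (1, 0, 1), chosen only for illustration; it is not one of the high-pressure states of this benchmark.

```python
import numpy as np

GAMMA = 1.4  # specific heat ratio

def conservative(rho, u, p, gamma=GAMMA):
    """Primitive (rho, u, p) -> conservative U = (rho, rho*u, rho*E)."""
    rhoE = p / (gamma - 1.0) + 0.5 * rho * u ** 2
    return np.array([rho, rho * u, rhoE])

def flux(U, gamma=GAMMA):
    """Euler flux F(U) = (rho*u, rho*u^2 + p, u*(rho*E + p))."""
    rho, mom, rhoE = U
    u = mom / rho
    p = (gamma - 1.0) * (rhoE - 0.5 * rho * u ** 2)   # invert the equation of state
    return np.array([mom, mom * u + p, u * (rhoE + p)])

U_left = conservative(rho=1.0, u=0.0, p=1.0)
F_left = flux(U_left)
print(U_left, F_left)
```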

Navier-Stokes equations: Kolmogorov flow

We consider the two-dimensional unsteady Navier-Stokes equations in vorticity formulation, modeling an incompressible, viscous fluid on the periodic domain (x, y) ∈ (0, 2π)². The system is driven by a Kolmogorov-type external forcing, as previously studied in⁵³, and is governed by:

$$\left\{\begin{array}{l}{\partial }_{t}\omega +{\boldsymbol{u}}\cdot \nabla \omega =\nu \Delta \omega +f(x,y),\\ \nabla \cdot {\boldsymbol{u}}=0,\\ \omega (x,y,0)={\omega }_{0}(x,y),\end{array}\right.$$

(47)

with viscosity ν = 10⁻³, and the source term defined as

$$f(x,y)=\chi (\sin (2\pi (x+y))+\cos (2\pi (x+y))),$$

(48)

where χ = 0.1. The Laplacian Δ acts in two spatial dimensions, ω denotes the vorticity, and u is the velocity.

Initial conditions ω₀(x, y) are sampled from a Gaussian random field ${\mathcal{N}}(0,C)$ with zero mean and covariance operator $C={7}^{3/2}{(-\Delta +49I)}^{-5/2}$. To generate the data, we employ a Fourier-based pseudo-spectral solver introduced in⁴. The simulation output consists of 1000 spatiotemporal vorticity realizations, each on a 512 × 512 spatial grid, subsequently downsampled to 128 × 128 for downstream learning tasks.
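A spectral sampler for such a field can be sketched by filtering white noise with the square root of the covariance eigenvalues. The example below assumes a 2π-periodic grid, so that −Δ has eigenvalues |k|² for integer wavenumbers k; the function name `sample_grf` is hypothetical, and the overall amplitude depends on the discrete normalization convention, which is not specified here and may differ from the solver in⁴.

```python
import numpy as np

def sample_grf(n, rng):
    """Sample an n x n periodic field with spectrum 7^{3/2} (|k|^2 + 49)^{-5/2}."""
    k = np.fft.fftfreq(n, d=1.0 / n)                 # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    eig = 7.0 ** 1.5 * (kx ** 2 + ky ** 2 + 49.0) ** (-2.5)
    noise_hat = np.fft.fft2(rng.standard_normal((n, n)))
    # Filtering real white noise with a symmetric real filter gives a real field;
    # .real only discards round-off in the imaginary part.
    return np.fft.ifft2(np.sqrt(eig) * noise_hat).real

w0 = sample_grf(64, np.random.default_rng(0))
print(w0.shape, w0.std())
```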

We partition the dataset into training, validation, and testing subsets in an 80:10:10 ratio. A neural operator model ${\mathcal{G}}$ is trained to predict the evolution of the vorticity field by learning the mapping from the initial condition at t = 0 to the solution over the interval t ∈ (0, 50].

For the 2D Navier-Stokes problem, we train a Fourier Neural Operator (FNO) to learn the mapping from an initial vorticity field ω₀(x, y) to the full spatiotemporal solution ω(x, y, t).

Wave equation

We investigate the propagation of acoustic waves governed by the linear wave equation in heterogeneous media. In 2D, the governing equation is given by:

$$\left\{\begin{array}{lc}{\partial }_{t}^{2}u({\boldsymbol{x}},t)={c}^{2}({\boldsymbol{x}})\Delta u({\boldsymbol{x}},t), & {\boldsymbol{x}}\in {[0,\pi ]}^{2},\,t\in [0,2],\\ u({\boldsymbol{x}},0)={u}_{0}({\boldsymbol{x}}), & \begin{array}{cl}\,{\partial }_{t}u({\boldsymbol{x}},0)=0, & {\boldsymbol{x}}\in {[0,\pi ]}^{2},\end{array}\end{array}\right.$$

(49)

where u(x, t) represents the acoustic pressure at spatial location x = (x, y), c(x) is the spatially varying wave speed, and Δ denotes the Laplacian operator. We assume fully reflective (homogeneous Dirichlet) boundary conditions throughout the domain.

For the spatially varying wave speed, we set $c(x,y)=1+\sin (x)\sin (y)$. The initial pressure profile u₀(x) is modeled as a localized Gaussian source centered at a point x_c, i.e.,

$${u}_{0}({\boldsymbol{x}})=\exp \left(-\frac{\parallel {\boldsymbol{x}}-{{\boldsymbol{x}}}_{c}{\parallel }^{2}}{10}\right),$$

with x_c sampled randomly on the spatial grid. We solve this system numerically using a second-order finite difference method on a grid with a spatial resolution of 64 × 64 and generate 1000 simulations corresponding to different realizations of u₀. The dataset is partitioned into training, validation, and test sets in the ratio 80:10:10. We train a neural operator ${\mathcal{G}}$ to learn the mapping u(x, 0) ↦ u(x, t) for all t ∈ (0, 2].
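A second-order finite difference solver of the kind described above can be sketched with a standard leapfrog scheme. The time step, the source location, and the helper names are assumptions for this example; the actual discretization details used to build the dataset are not specified beyond the grid size and the order of accuracy.

```python
import numpy as np

def laplacian(u, dx):
    """Five-point Laplacian on interior nodes; boundary rows and columns stay zero."""
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
                       - 4.0 * u[1:-1, 1:-1]) / dx ** 2
    return out

def solve_wave(u0, c, dx, dt, n_steps):
    """Leapfrog time stepping with zero initial velocity and Dirichlet boundaries."""
    u_prev = u0.copy()
    u_prev[0, :] = u_prev[-1, :] = 0.0
    u_prev[:, 0] = u_prev[:, -1] = 0.0
    # First step uses du/dt(0) = 0: u^1 = u^0 + 0.5 dt^2 c^2 Lap(u^0).
    u = u_prev + 0.5 * dt ** 2 * c ** 2 * laplacian(u_prev, dx)
    for _ in range(n_steps - 1):
        u_prev, u = u, 2.0 * u - u_prev + dt ** 2 * c ** 2 * laplacian(u, dx)
    return u

n = 64
x = np.linspace(0.0, np.pi, n)
X, Y = np.meshgrid(x, x, indexing="ij")
c = 1.0 + np.sin(X) * np.sin(Y)                       # wave speed from the text
xc, yc = np.pi / 2, np.pi / 2                         # example source location
u0 = np.exp(-((X - xc) ** 2 + (Y - yc) ** 2) / 10.0)
dx = x[1] - x[0]
dt = 0.01                                             # below dx / (sqrt(2) * max c)
u_T = solve_wave(u0, c, dx, dt, n_steps=200)          # solution at t = 2
print(np.abs(u_T).max())
```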

In our final operator learning example, we train a Time-Conditioned U-Net (TC-UNet) to learn the solution operator for the 2D wave equation, mapping an initial pressure profile u₀(x) to the full wave propagation over time u(x, t).

vRBA hyperparameter selection

This section details the selection of the hyperparameters introduced by the vRBA framework. The present formulation does not require extensive problem-specific tuning; in fact, all hyperparameters discussed below were held constant across every benchmark presented in this paper. Unless otherwise stated, we utilize a consistent set of vRBA hyperparameters: an Exponential Moving Average (EMA) memory of γ = 0.999, an EMA learning rate of η = 0.01, and a smoothing factor of ϕ = 1.0.

Annealing schedule

For the exponential potential, the convex duality between entropy and free energy warrants any choice of annealing schedule. We observe the parallel between vRBA and simulated annealing, which is provably convergent under sufficiently slow temperature decay, e.g., [ref. ⁴⁶, Theorem 1]. Inspired by such theoretical results, our annealing schedule (see Eq. (19)) follows a logarithmic decay in the number of iterations scaled by a universal constant, which we took to be one (i.e., c = 1) in all our examples.

For the general potential-dependent case, there are generically two cases. For potentials such as $\cosh (r)$, ${e}^{{r}^{2}}$, and $(1+r)\log (1+r)-r$ where the appropriate ϵ is not easily found via analytic calculations, the annealing parameter is determined dynamically at each iteration by solving the optimality condition via Newton’s method. On the other hand, for the polynomial (r^p, for p > 1) potentials, the formulation remains effectively constant throughout training. Consequently, these parameters are governed by theoretical or numerical optimality conditions rather than manual hyperparameter selection.

EMA learning rate

Here, the values of λ are bounded in the interval (0, η^*/(1 − γ)), implying a maximum importance score of ${\lambda }_{\max }={\eta }^{* }/(1-\gamma )$. These parameters were originally introduced in the RBA framework¹⁵, and we utilize the exact same values established in that work. We note that the use of Exponential Moving Averages (EMA) to stabilize stochastic estimates is a standard practice in machine learning, most notably in the Adam optimizer⁴³, where it is essential for convergence stability. Subsequent analyses in related studies, such as the sensitivity analysis performed for standard RBA (Φ(r) = r²) in KKANs⁴², have investigated the effect of ${\lambda }_{\max }$ on convergence. That study demonstrated that while initializing with a low bound (${\lambda }_{\max }\approx 1$) yields suboptimal results due to insufficient attention, and overly aggressive bounds (${\lambda }_{\max } > 20$) can slightly degrade performance, there exists a broad, robust plateau of optimal performance for ${\lambda }_{\max }\in [5,20]$. The configuration used in this paper targets ${\lambda }_{\max }\approx 10$, which lies in this optimal regime. Thus, the selection of η^* is not a new free parameter requiring tuning, but is a fixed value inherited from previous empirical evidence.
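As a quick check of this bound, and assuming the normalization of η^* is such that the increment η^*Q_i,j never exceeds η, the default values η = 0.01 and γ = 0.999 give

$${\lambda }_{\max }=\frac{\eta }{1-\gamma }=\frac{0.01}{1-0.999}=10,$$

which matches the value targeted in this paper.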

EMA memory parameter

Similar to the learning rate, we inherit the memory parameter from the previous studies that introduced the RBA framework¹⁵. To further validate the robustness of the proposed method, we performed a sensitivity analysis in the Supplementary Information. Our results indicate that our framework is quite robust to this value. For operator learning, γ cannot be zero since we need to collect the λ scores to sample the function space (see Eq. (43)); however, our results indicate that γ ∈ [0.4, 0.999] outperforms the baseline. On the other hand, for PINNs with second-order optimizers, it is even possible to train without any memory, significantly outperforming the baselines. Nevertheless, the values used in this study, inherited from previous studies, lead to the best results in our benchmarks.

Stabilization parameter

The stabilization parameter ϕ governs the convex combination of the adaptive distribution q and a uniform prior p_u. We primarily utilized pure adaptivity by setting ϕ = 1.0 for all PIML examples, while adopting a slight regularization for Operator Learning benchmarks. A sensitivity analysis provided in the SI confirms that the method is robust to this choice, as vRBA consistently outperforms the baseline even in the absence of regularization. Notably, the inclusion of the uniform prior yields performance gains specifically for potentials targeting the L^∞ norm, such as the exponential variants, by mitigating their aggressive nature; for variance-minimizing potentials, the sensitivity to ϕ is negligible.
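The convex combination can be written in one line. The sketch below assumes that ϕ multiplies the adaptive distribution, which is consistent with ϕ = 1.0 corresponding to pure adaptivity; the function name `stabilized_pmf` and the example values are hypothetical.

```python
import numpy as np

def stabilized_pmf(q, phi):
    """Blend an adaptive p.m.f. q with the uniform prior: phi*q + (1 - phi)*p_u."""
    p_u = np.full_like(q, 1.0 / q.size)
    return phi * q + (1.0 - phi) * p_u

q = np.array([0.7, 0.2, 0.05, 0.05])    # a sharply peaked adaptive distribution
print(stabilized_pmf(q, 1.0))           # pure adaptivity: returns q unchanged
print(stabilized_pmf(q, 0.9))           # slight regularization toward uniform
```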

SSBroyden optimizer

This section details our custom JAX implementation of the Self-Scaled Broyden (SSBroyden) optimizer, which was used for all second-order optimization experiments. The original method, proposed by Urbán et al.⁹, relies on modified SciPy routines that are CPU-bound and not directly portable to a JAX-native, GPU-accelerated workflow.

Our implementation preserves the core SSBroyden update logic, which dynamically computes scaling and updating parameters. However, the line search portion of the algorithm required a complete rewrite. Due to the absence of SciPy’s advanced line search routines in JAX SciPy, we developed a custom three-stage fallback line search mechanism to promote robust convergence. This procedure creates a cascade of attempts with progressively stricter Wolfe conditions, starting with a loose curvature parameter (c₂ = 0.9) and tightening it (c₂ = 0.8, then c₂ = 0.5) only upon failure. This adaptation was essential for ensuring the optimizer could consistently make progress on the challenging loss landscapes of the problems studied.
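The cascade over c₂ = 0.9, 0.8, 0.5 can be sketched generically as follows. This is not the authors' JAX routine: the inner search is a textbook bisection for the weak Wolfe conditions, written in NumPy, and the names `wolfe_bisection` and `cascade_line_search`, the value of c₁, and the iteration limit are all assumptions.

```python
import numpy as np

def wolfe_bisection(f, grad, x, p, c1, c2, max_iter=30):
    """Bisection search for a step length satisfying the weak Wolfe conditions."""
    f0, slope0 = f(x), grad(x) @ p
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + alpha * p) > f0 + c1 * alpha * slope0:   # sufficient decrease fails
            hi = alpha
        elif grad(x + alpha * p) @ p < c2 * slope0:       # curvature condition fails
            lo = alpha
        else:
            return alpha                                  # both conditions hold
        alpha = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
    return None                                           # signal failure to the caller

def cascade_line_search(f, grad, x, p, c1=1e-4, c2_stages=(0.9, 0.8, 0.5)):
    """Try each curvature parameter in turn; move to the next only on failure."""
    for c2 in c2_stages:
        alpha = wolfe_bisection(f, grad, x, p, c1, c2)
        if alpha is not None:
            return alpha, c2
    return None, None

f = lambda x: 0.5 * x @ x                                 # simple quadratic test function
grad = lambda x: x
x0 = np.array([3.0, 4.0])
alpha, c2_used = cascade_line_search(f, grad, x0, -grad(x0))
print(alpha, c2_used)
```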

Implementation details

Physics-Informed Neural Networks (PINNs)

We evaluate two optimization strategies for PINNs.

First-order optimization

For the Allen-Cahn equation, we adopt the network architecture and self-scaling strategy from⁴², utilizing a 6-layer network (H = 64) with $\tanh$ activations and Fourier Feature embeddings. Training proceeds for 3 × 10⁵ Adam iterations (lr = 10⁻³) with vRBA applied as an importance weighting scheme.

Second-order optimization

For Allen-Cahn and Burgers’ equations, we follow the methodology of⁹, employing a 3-layer network (H = 30) with periodic encodings. Training consists of 5000 initialization iterations (Adam) followed by 60,000 iterations using the SSBroyden optimizer. Here, vRBA is applied via importance sampling with a resampling frequency of 100 iterations.

Operator learning

For operator learning tasks, we employ a hybrid vRBA strategy: importance weighting is applied to the spatial domain, while importance sampling is applied to the function space. Detailed listings of all network dimensions, learning rates, and specific coefficient settings are provided in the SI.

DeepONet

For Bubble Growth Dynamics, we use a DeepONet with 4 hidden layers (H = 100, GELU activation) for both branch and trunk networks, optimized via Adam.

SVD-DeepONet

For the Sod-Shock tube problem, we utilize the SVD-DeepONet architecture from³⁶ with adaptive activation functions⁵⁴, comprising 6 hidden layers (H = 100, $\tanh$ activation) for both branch and trunk.

FNO and TC-UNet

For the Navier-Stokes and wave equation benchmarks, we adopt the exact Fourier Neural Operator (FNO) and Time-Conditioned U-Net (TC-UNet) architectures and code provided by²⁷ to ensure direct comparability.

Data availability

To support reproducibility, all data and code for this study are publicly available on Zenodo at https://zenodo.org/records/18089934. Additionally, to ease accessibility and provide detailed usage instructions, the codebase has also been made available via a public GitHub repository at https://github.com/jdtoscano94/NABLA-SciML/tree/main/vRBA_variational_residual_based_attention_PINNs_Operator_learning.

References

Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019).
Toscano, J. D. et al. From pinns to pikans: Recent advances in physics-informed machine learning. Mach. Learn. Comput. Sci. Eng. 1, 1–43 (2025).
Lu, L., Jin, P., Pang, G., Zhang, Z. & Karniadakis, G. E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 3, 218–229 (2021).
Li, Z. et al. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations (2021).
Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989).
Liu, Z. et al. KAN: Kolmogorov-Arnold Networks. In International Conference on Learning Representations (2025).
Hu, Z., Shukla, K., Karniadakis, G. E. & Kawaguchi, K. Tackling the curse of dimensionality with physics-informed neural networks. Neural Netw. 176, 106369 (2024).
Jnini, A., Vella, F. & Zeinhofer, M. Gauss-Newton natural gradient descent for physics-informed computational fluid dynamics. Comput. & Fluids 307, 106955 (2025).
Urbán, J. F., Stefanou, P. & Pons, J. A. Unveiling the optimization process of Physics Informed Neural Networks: How accurate and competitive can PINNs be? J. Comput. Phys. 523, 113656 (2025).
Kiyani, E., Shukla, K., Urbán, J. F., Darbon, J. & Karniadakis, G. E. Optimizing the optimizer for physics-informed neural networks and kolmogorov-arnold networks. Comput. Methods Appl. Mech. Eng. 446, 118308 (2025).
Zeinhofer, M., Masri, R. & Mardal, K.-A. A unified framework for the error analysis of physics-informed neural networks. IMA J. Numer. Anal. drae081 (2024).
Lu, L., Meng, X., Mao, Z. & Karniadakis, G. E. DeepXDE: A deep learning library for solving differential equations. SIAM Rev. 63, 208–228 (2021).
Wu, C., Zhu, M., Tan, Q., Kartha, Y. & Lu, L. A comprehensive study of non-adaptive and residual-based adaptive sampling for physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 403, 115671 (2023).
McClenny, L. D. & Braga-Neto, U. M. Self-adaptive physics-informed neural networks. J. Comput. Phys. 474, 111722 (2023).
Anagnostopoulos, S. J., Toscano, J. D., Stergiopulos, N. & Karniadakis, G. E. Residual-based attention in physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 421, 116805 (2024).
Chen, W., Howard, A. A. & Stinis, P. Self-adaptive weights based on balanced residual decay rate for physics-informed neural networks and deep operator networks. J. Comput. Phys. 114226 (2025).
Zhang, G. et al. DASA-PINNs: Differentiable adversarial self-adaptive pointwise weighting scheme for physics-informed neural networks. SSRN (2023).
Basir, S. & Senocak, I. Physics and equality constrained artificial neural networks: Application to forward and inverse problems with multi-fidelity data fusion. J. Comput. Phys. 463, 111301 (2022).
Ramirez, I. Residual-based attention physics-informed neural networks for spatio-temporal ageing assessment of transformers operated in renewable power plants. Eng. Appl. Artif. Intell. 139, 109556 (2025).
Wang, S., Zhao, P. & Song, T. Aspinn: An asymptotic strategy for solving singularly perturbed differential equations. arXiv preprint arXiv:2409.13185 (2024).
Ramirez, I. et al. Residual-based attention physics-informed neural networks for spatio-temporal ageing assessment of transformers operated in renewable power plants. Eng. Appl. Artif. Intell. 139, 109556 (2025).
Wang, S., Zhao, P., Ma, Q. & Song, T. General-kindred physics-informed neural network to the solutions of singularly perturbed differential equations. Phys. Fluids 36 (2024).
Rigas, S., Papachristou, M., Papadopoulos, T., Anagnostopoulos, F. & Alexandridis, G. Adaptive training of grid-dependent physics-informed Kolmogorov-Arnold networks. IEEE Access 12, 176982–176998 (2024).
Wu, C. et al. Fmenets: Flow, material, and energy networks for non-ideal plug flow reactor design. Chem. Eng. Sci. 320, 122348 (2026).
Si, C. & Yan, M. Convolution-weighting method for the physics-informed neural network: A primal-dual optimization perspective. J. Comput. Phys. 555, 114773 (2026).
Toscano, J. D. et al. Mr-aiv reveals in-vivo brain-wide fluid flow with physics-informed ai. bioRxiv 2025–07 (2025).
Ovadia, O. et al. Real-time inference and extrapolation with time-conditioned unet: Applications in hypersonic flows, incompressible flows, and global temperature forecasting. Comput. Methods Appl. Mech. Eng. 441, 117982 (2025).
Wang, C., Li, S., He, D. & Wang, L. Is L2 Physics Informed Loss Always Suitable for Training Physics Informed Neural Network?. Adv. Neural Inf. Process. Syst. 35, 8278–8290 (2022).
Rényi, A. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, vol. 4, 547–562 (University of California Press, 1961).
Dembo, A. & Zeitouni, O. Large Deviations Techniques and Applications, vol. 38 (Springer Science & Business Media, 2009).
Dupuis, P. & Ellis, R. S. A weak convergence approach to the theory of large deviations (John Wiley & Sons, 2011).
Budhiraja, A. & Dupuis, P. Analysis and approximation of rare events. Representations Weak Convergence Methods Ser. Prob. Theory Stoch. Model. 94, 8 (2019).
Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
Alberts, A. & Bilionis, I. Physics-informed information field theory for modeling physical systems with uncertainty quantification. J. Comput. Phys. 486, 112100 (2023).
Wang, S., Sankaran, S. & Perdikaris, P. Respecting causality for training physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 421, 116813 (2024).
Peyvan, A., Oommen, V., Jagtap, A. D. & Karniadakis, G. E. Riemannonets: Interpretable neural operators for riemann problems. Comput. Methods Appl. Mech. Eng. 426, 116996 (2024).
Shin, Y., Darbon, J. & Karniadakis, G. E. On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs. arXiv preprint arXiv:2004.01806 (2020).
Schaul, T., Zhang, S. & LeCun, Y. No more pesky learning rates. In International conference on machine learning, 343–351 (PMLR, 2013).
Tishby, N. & Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), 1–5 (IEEE, 2015).
Anagnostopoulos, S. J., Toscano, J. D., Stergiopulos, N. & Karniadakis, G. E. Learning in pinns: Phase transition, diffusion equilibrium, and generalization. Neural Netw. 193, 107983 (2026).
Shwartz-Ziv, R. & Tishby, N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810 (2017).
Toscano, J. D., Wang, L.-L. & Karniadakis, G. E. Kkans: Kurkova-kolmogorov-arnold networks and their learning dynamics. Neural Networks 107831 (2025).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Dherin, B., Munn, M., Rosca, M. & Barrett, D. Why neural networks find simple solutions: The many regularizers of geometric complexity. Adv. Neural Inf. Process. Syst. 35, 2333–2349 (2022).
Toscano, J. D. et al. Aivt: Inference of turbulent thermal convection from measured 3d velocity data by physics-informed kolmogorov-arnold networks. Sci. Adv. 11, eads5236 (2025).
Geman, S. & Hwang, C.-R. Diffusions for global optimization. SIAM J. Control Optim. 24, 1031–1043 (1986).
Wang, S., Teng, Y. & Perdikaris, P. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 43, A3055–A3081 (2021).
Lin, C. et al. Operator learning for predicting multiscale bubble growth dynamics. J. Chem. Phys. 154 (2021).
Lee, S. & Shin, Y. On the training and generalization of deep operator networks. SIAM J. Sci. Comput. 46, C273–C296 (2024).
Perez, E., Strub, F., De Vries, H., Dumoulin, V. & Courville, A. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, 32 (2018).
Khodakarami, S., Oommen, V., Bora, A. & Karniadakis, G. E. Mitigating spectral bias in neural operators via high-frequency scaling for physical systems. Neural Net. 108027 (2025).
Oommen, V., Bora, A., Zhang, Z. & Karniadakis, G. E. Integrating neural operators with diffusion models improves spectral representation in turbulence modeling. arXiv preprint arXiv:2409.08477 (2024).
Chandler, G. J. & Kerswell, R. R. Invariant recurrent solutions embedded in a turbulent two-dimensional kolmogorov flow. J. Fluid Mech. 722, 554–595 (2013).
Jagtap, A. D., Kawaguchi, K. & Em Karniadakis, G. Locally adaptive activation functions with slope recovery for deep and physics-informed neural networks. Proc. R. Soc. A 476, 20200334 (2020).
Gao, Z., Yan, L. & Zhou, T. Failure-informed adaptive sampling for pinns. SIAM J. Sci. Comput. 45, A1971–A1994 (2023).
Wang, S., Li, B., Chen, Y. & Perdikaris, P. PirateNets: Physics-informed deep learning with residual adaptive networks. J. Mach. Learn. Res. 25, 1–51 (2024).
Wu, H. et al. Propinn: Demystifying propagation failures in physics-informed neural networks. arXiv preprint arXiv:2502.00803 (2025).
Zhao, Z., Ding, X. & Prakash, B. A. Pinnsformer: A transformer-based framework for physics-informed neural networks. In The Twelfth International Conference on Learning Representations https://openreview.net/forum?id=a6f2763089c0bd8f56006c42f09ee24c (2024).
Xu, C., Liu, D., Nassereldine, A. & Xiong, J. Fp64 is all you need: Rethinking failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems (2025).
Wang, S., bhartari, A. K., Li, B. & Perdikaris, P. Gradient alignment in physics-informed neural networks: A second-order optimization perspective. In The Thirty-ninth Annual Conference on Neural Information Processing Systems https://openreview.net/forum?id=iweeVl1RHU (2025).
Zhongkai, H. et al. Pinnacle: A comprehensive benchmark of physics-informed neural networks for solving pdes. Adv. Neural Inf. Process. Syst. 37, 76721–76774 (2024).
Wang, S., Wang, H. & Perdikaris, P. On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 384, 113938 (2021).
Wang, S., Sankaran, S., Wang, H. & Perdikaris, P. An expert’s guide to training physics-informed neural networks. arXiv preprint arXiv:2308.08468 (2023).
Wang, S., Yu, X. & Perdikaris, P. When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys. 449, 110768 (2022).

Acknowledgements

This work was supported by the NIH grant R01AT012312, the MURI/AFOSR FA9550-20-1-0358 project, the DOE-MMICS SEA-CROGS DE-SC0023191 award, and the ONR Vannevar Bush Faculty Fellowship (N00014-22-1-2795).

Author information

These authors contributed equally: Juan Diego Toscano, Daniel T. Chen.

Authors and Affiliations

Division of Applied Mathematics, Brown University, Providence, RI, USA
Juan Diego Toscano, Daniel T. Chen, Jérôme Darbon & George Em Karniadakis
School of Engineering, Brown University, Providence, RI, USA
Vivek Ooomen
Pacific Northwest National Laboratory, Richland, WA, USA
George Em Karniadakis

Contributions

1. Conceptualization: J.D.T., D.T.C.
2. Methodology: J.D.T., D.T.C., G.E.K.
3. Software: J.D.T., V.O.
4. Formal analysis: J.D.T., D.T.C., J.D.
5. Investigation: J.D.T., D.T.C., V.O.
6. Resources: G.E.K., J.D.
7. Writing (original draft): J.D.T., D.T.C., V.O.
8. Writing (review & editing): J.D.T., D.T.C., V.O., J.D., G.E.K.
9. Visualization: J.D.T., V.O.
10. Supervision: G.E.K., J.D.
11. Project administration: G.E.K.
12. Funding acquisition: G.E.K., J.D.

Corresponding author

Correspondence to George Em Karniadakis.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (download PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Toscano, J.D., Chen, D.T., Ooomen, V. et al. A variational framework for residual-based adaptivity in neural PDE solvers and operator learning. npj Artif. Intell. 2, 32 (2026). https://doi.org/10.1038/s44387-026-00084-4

Received: 24 October 2025
Accepted: 18 February 2026
Published: 07 March 2026
Version of record: 07 March 2026
DOI: https://doi.org/10.1038/s44387-026-00084-4
