Introduction

Long-Term Time Series Forecasting (LTSF) predicts future values over extended horizons from historical observations and supports applications in finance, energy, meteorology, transportation, and healthcare1. The problem frequently encounters nonstationarity and distribution shifts, so models trained offline become miscalibrated as the underlying data distribution changes over time2. Training also suffers from noisy gradients and high-variance updates under limited budgets, making local optimization sensitive to initialization and prone to becoming trapped in poor local optima3. Outliers and measurement noise further distort the loss surface and slow convergence, increasing the risk of unstable generalization across horizons4. These challenges are especially critical under fixed optimization budgets, where stable adaptation matters as much as accuracy. Evolutionary-Guided Module Fusion with Gradient Refinement (EGMF-GR) is positioned as a training framework that optimizes how existing forecasters are trained, rather than a new forecasting architecture.

When applying artificial intelligence to LTSF, researchers have pursued multiple approaches. (1) Transformer-based methods5,6,7. These models utilize self-attention, which effectively captures long-range dependencies. Related Transformer-inspired designs have also demonstrated strong feature interaction and fusion capability in other deep learning tasks, although these studies are not specific to LTSF8,9,10. However, Transformer-based forecasters face high computational complexity and are difficult to optimize due to their intricate internal mechanisms. (2) Evolutionary Algorithm (EA)-based methods11,12,13. These methods offer flexibility in capturing both linear and nonlinear patterns in time series. They are well suited to non-differentiable series but suffer from slow convergence and high computational cost, limiting their practical use.

Transformer-based and EA-based LTSF methods offer complementary modeling and search capabilities, but their performance can be constrained by unstable optimization under nonstationarity and distribution shifts and by the computational burden of population-level evaluation, which together limit robustness and efficiency under fixed training budgets. Existing hybrid attempts often remain coarse-grained and lack a reproducible, fine-grained mechanism to decide when to copy, fuse, or preserve module states under a fixed budget. Moreover, state inconsistencies after merging are commonly overlooked when non-learnable buffers such as running statistics are left unsynchronized, which degrades stability.

In particular, existing hybrid approaches for LTSF rarely provide a reproducible, architecture-agnostic mechanism for fine-grained weight adaptation during training under a fixed budget. EGMF-GR fills this gap through a hybrid evolutionary-gradient framework. It targets general training-optimization challenges for differentiable LTSF forecasters and is not restricted to Transformer backbones, although it requires aligned modules with comparable intermediate outputs for discrepancy evaluation. The framework maintains multiple weight-diverse model instances, selects an appropriate global model for guidance, identifies aligned modules, measures module discrepancy using multiple complementary criteria, and applies a hybrid threshold to decide whether weighted module-state fusion is activated or the original module state is retained. These fusion or retention decisions generate offspring models, which then undergo brief gradient-based fine-tuning to strengthen local adaptation. The EA enhances global exploration and increases the likelihood of reaching a strong global solution, while gradient-based fine-tuning improves local optimization and convergence quality. Under a matched optimization budget, this design supports stable and accurate long-term forecasting across different backbones, datasets, and prediction horizons.

The main contributions of this paper are summarized as follows.

  • An architecture-agnostic hybrid framework is proposed to couple EA with gradient-based refinement for LTSF. The framework requires no model redesign and can be integrated into existing differentiable backbones in a plug-and-play manner. Extensive experiments on eight standard benchmarks demonstrate improvements in forecasting accuracy and stability across diverse datasets and prediction horizons.

  • A module-level, multi-metric adaptive fusion mechanism is designed by jointly leveraging Jensen–Shannon divergence (JSD), Kullback–Leibler divergence (KLD), Mean Squared Error (MSE), and Mean Absolute Error (MAE). Instead of relying on naive weighting, a robust hybrid threshold is derived by combining local module-wise discrepancies with a global statistic based on the interquartile range (IQR), which mitigates unstable or suboptimal fusion decisions. This thresholding strategy balances conservative replacement and weighted aggregation, improving reliability under heterogeneous modules and datasets.

  • A model-state-level fusion strategy is introduced to update learnable parameters and to synchronize non-learnable buffers such as running statistics when present. This synchronization reduces state inconsistency after merging and improves practical stability when internal running statistics exist.

The remainder of this paper is organized as follows. “Related work” reviews recent advances in LTSF and weight optimization. “Proposed method” introduces the proposed evolutionary-gradient training framework, including multi-criteria module scoring and a hybrid-threshold fusion strategy for combining global and local model states. “Experimental results and analysis” reports results on diverse LTSF benchmarks and compares EGMF-GR with representative state-of-the-art methods. “Conclusions” provides concluding remarks and outlines future research directions.

Related work

LTSF

The core objective of LTSF is to infer future trends or values from historical data14. Contemporary methods combine signal-processing techniques, such as filtering, transformation, and feature extraction, with multi-scale analysis15. Widely used statistical and deep-learning models include Autoregressive Integrated Moving Average (ARIMA), which suits linear short-term forecasting; and Long Short-Term Memory (LSTM) networks, which capture long-range dependencies in sequences. In addition, machine-learning approaches such as Support Vector Machines (SVM), Random Forests, Convolutional Neural Networks (CNN), and ensemble methods like XGBoost and LightGBM are commonly applied to enhance prediction accuracy and robustness16,17,18. Wei et al.19 apply wavelet analysis to develop a wavelet attention mechanism that extracts multiple periodic features and mitigates the influence of anomalies, thereby strengthening seasonal-pattern capture and improving forecasting accuracy. Cai et al.20 employ standard Empirical Mode Decomposition (EMD) to decompose each variable separately and align the resulting subsequences to the intrinsic frequency components of the target series, which enhances the quality and reliability of multivariate forecasting. For multi-scale analysis, the original time series are decomposed into components at different time scales to more precisely capture the data’s information. Hou et al.21 introduce a method that combines spatial–temporal attention with multi-scale dynamic graph generation to automatically capture implicit dependencies and produce multi-scale structures, and they augment this approach with a global graph convolution module to improve multivariate forecasting across time scales. 
Beyond generic forecasting benchmarks, related building-operation time-series tasks, such as Air-Handling Unit Fault Detection and Diagnosis (AFDD), are also studied with Transformer-based diagnosis, self-supervised temporal representation learning, and cross-building training strategies, supported by recent real operational and semi-labelled AFDD datasets22,23,24,25,26,27.

Weight optimization

Weight optimization in neural learning can be broadly grouped into gradient-based optimization, evolutionary and other population-based optimization, and model combination strategies that transfer information across trained or partially trained models28,29. Gradient-based methods such as Stochastic Gradient Descent and adaptive optimizers remain the dominant choice for differentiable deep networks because they provide efficient large-scale training through backpropagation30,31. However, under nonstationarity, noisy updates, and limited optimization budgets, purely gradient-based training can remain sensitive to initialization, local minima, and unstable update trajectories. Evolutionary computation provides a complementary search paradigm by maintaining and evolving a population of candidate solutions. In neural learning, evolutionary algorithms (EAs) have been used for weight search, hyperparameter optimization, architecture search, feature selection, and hybrid forecasting pipelines32,33. Population Based Training follows a related idea by combining population-level exploration with per-individual gradient updates, while recent model-merging studies also show that population-style search can be used to identify effective combinations of trained neural networks34. These studies confirm that evolutionary ideas in neural network learning are well established, and the contribution of EGMF-GR is not to introduce evolutionary optimization itself. Instead, the relevance of EGMF-GR lies in how evolutionary guidance is integrated with gradient-based learning for budget-constrained LTSF training. Existing population-based or evolutionary schemes often operate at the whole-model level, focus on hyperparameters or global recombination rules, or perform post-hoc combination after standard training. By contrast, EGMF-GR performs training-time selective transfer at the module level.
The transfer decision is triggered online by discrepancies between aligned intermediate module outputs on the current minibatch, rather than by a fixed global recombination rule. The transfer operator also acts on module states, including learnable parameters and, when applicable, non-learnable buffers, so that internal state consistency is better preserved after fusion. This distinction is important because module heterogeneity makes indiscriminate whole-model recombination potentially harmful, especially under distribution shift or noisy gradients. For this reason, EGMF-GR uses a multi-metric discrepancy score together with an IQR-regularized robust threshold to decide whether transfer is beneficial for each aligned module. A short gradient refinement stage then restores local adaptation after discrete state transfer under a matched optimization budget. In this sense, the method is positioned as a specific hybrid training framework rather than a generic claim that EAs are newly applied to neural network learning. Related directions such as CGO Ensemble, Population Based Training, Model Soups, Model Ratatouille, and knowledge distillation all transfer useful information across models in different ways35,36,37. However, they generally do not provide discrepancy-triggered module-level state transfer during training with explicit synchronization of non-learnable buffers. This gap motivates the design of EGMF-GR for differentiable LTSF backbones under constrained optimization budgets.

Proposed method

Overview

EGMF-GR maintains a population of N architecture-matched model instances. Individuals share the same structure but differ in their weights. Training uses three splits with non-overlapping roles to keep evaluation leakage-free. The training split \(\mathcal {D}_{\textrm{tr}}\) is used for gradient updates only. The selection split \(\mathcal {D}_{\textrm{sel}}\) is an internal holdout inside the training portion under time order, and it is used for population ranking only. The current best individual in the population is determined by the selection objective \(\mathcal {L}_{\textrm{sel}}\) as in Eq. (1). The validation split \(\mathcal {D}_{\textrm{val}}\) is reserved for early stopping and final reporting only, and it does not participate in population ranking, fusion triggering, or global best updates.
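As an illustration, the three time-ordered splits can be sketched as index ranges over a series of length n; the split fractions below are illustrative placeholders, not the paper's actual settings:

```python
import numpy as np

def time_ordered_splits(n: int, train_frac: float = 0.7, sel_frac: float = 0.1):
    """Return index ranges for D_tr, D_sel, and D_val under time order.

    D_sel is an internal holdout carved from the end of the training
    portion; train_frac and sel_frac are illustrative assumptions."""
    n_train = int(n * train_frac)
    n_sel = int(n * sel_frac)
    d_tr = np.arange(0, n_train - n_sel)         # gradient updates only
    d_sel = np.arange(n_train - n_sel, n_train)  # population ranking only
    d_val = np.arange(n_train, n)                # early stopping / reporting
    return d_tr, d_sel, d_val
```

Because the three ranges are disjoint and ordered in time, the selection loss used for population ranking never sees the samples reserved for early stopping and final reporting.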

$$\begin{aligned} \textbf{w}_{\textrm{best}}=\arg \min _{\textbf{w}_i\in \mathcal {P}} \mathcal {L}_{\textrm{sel}}\!\left( \textbf{w}_i\right) \end{aligned}$$
(1)

where, \(\mathcal {P}=\{\textbf{w}_i\}_{i=1}^{N}\) denotes the population.

Figure 1 summarizes the workflow and Algorithm 1 provides the full procedure. Each iteration selects the global best individual using \(\mathcal {L}_{\textrm{sel}}\) on \(\mathcal {D}_{\textrm{sel}}\). A paired forward pass monitors aligned modules from the current individual and the global best individual. Let \(M=|\mathcal {S}_{\textrm{mon}}|\) denote the number of monitored aligned modules. Here, a module refers to a named neural-network subcomponent that can be uniquely matched across individuals under the same architecture, and it is typically instantiated as a repeated trunk block of the forecasting backbone. Module alignment means that the current individual and the global best individual share the same module naming scheme so that each monitored module has a one-to-one counterpart. A multi-metric discrepancy score based on JSD, KLD, MSE, and MAE is computed at the module level, and a robust hybrid threshold decides whether module-state transfer is activated. When activated, the operator applies module-state fusion and synchronizes non-learnable buffers when present to keep the model state consistent. A brief gradient refinement step on \(\mathcal {D}_{\textrm{tr}}\) then stabilizes the offspring after the discrete transfer.

The evolutionary component serves as population-based global exploration, and the framework is not tied to a specific evolutionary operator set. The algorithm description is therefore framework-level, while the experimental setting later specifies one concrete instantiation to ensure reproducibility.

Common training strategies such as exponential moving average and stochastic weight averaging perform trajectory-level smoothing for a single model instance and include neither population-based selection nor module-level transfer. A matched comparison with these strategies is reported in the experimental evaluation under the same backbone and training recipe.

Fig. 1

Overview of EGMF-GR. A population of architecture-matched individuals is maintained with weight diversity. Training integrates population-based exploration with gradient-based refinement by selecting a global best individual, generating an offspring via module-level state transfer, and refining the offspring with a short backpropagation stage. Module transfer is applied only to aligned modules and is triggered by a robust threshold on multi-metric discrepancies. When transfer is activated, learnable parameters are updated and non-learnable buffers are synchronized when present to keep the model state consistent.

Algorithm 1

EGMF-GR

The description in Algorithm 1 is framework-level and remains compatible with different population-based evolutionary search choices. The reported results use a simple generational-style instantiation in which each generation retains the global best individual and replaces the remaining individuals with offspring ranked by the selection loss evaluated on \(\mathcal {D}_{\textrm{sel}}\).

Problem definition

Assume that, at time t, the observed multivariate time series within a look-back window of length L is \(\mathcal {X}_{t-L+1:t} \in \mathbb {R}^{L \times C}\), where C denotes the number of variables. The forecasting model \(F(\cdot ;\textbf{w})\) maps the historical input window to a T-step-ahead prediction:

$$\begin{aligned} \hat{\mathcal {Y}}_{t+1:t+T} = F\!\left( \mathcal {X}_{t-L+1:t};\textbf{w}\right) \end{aligned}$$
(2)

where, as defined in Eq. (2), the objective is to predict the future segment of length T following time t using the observations in the historical segment of length L preceding time t.
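The windowing in Eq. (2) can be sketched as a simple slicing helper; the 0-based indexing convention below (treating `t` as the number of already-observed steps) is an illustrative assumption:

```python
import numpy as np

def make_window(series: np.ndarray, t: int, L: int, T: int):
    """Slice one (input, target) pair from a (time, C) multivariate series.

    x is the historical segment of length L ending at step t, and y is the
    future segment of length T that the model must predict."""
    x = series[t - L:t]   # look-back window of length L
    y = series[t:t + T]   # T-step-ahead forecasting target
    return x, y
```

Sliding `t` over the series with stride 1 yields the usual overlapping training samples for a fixed look-back length L and horizon T.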

Evolutionary-gradient optimization framework

EGMF-GR is a hybrid training framework that combines the EA with conventional gradient-based optimization. A core component is the fusion operator, which selectively merges the global best individual with other candidates at the module level to transfer useful weight patterns while suppressing noisy or harmful updates. The procedure maintains a population of weight-diverse instances, selects the global best individual using the selection loss on \(\mathcal {D}_{\textrm{sel}}\), evaluates architecture-aligned module outputs with a multi-metric criterion, and applies discrepancy-aware weighted fusion when the trigger condition is met, otherwise retaining the original module state under an IQR-regularized hybrid threshold. After generating the offspring, a gradient-based refinement is enabled to further improve local optimality. This schedule preserves the efficiency of backpropagation while injecting population-level global guidance, which mitigates the tendency of conventional gradient descent to become trapped in suboptimal local minima and explicitly promotes a more global search behavior.

Initialization

As the first step of the evolutionary procedure, a population of N neural network models is created by sampling different random seeds while keeping the architecture fixed. The initial population is defined in Eq. (3) as

$$\begin{aligned} \mathcal {P} = \{\textbf{w}_i\}_{i=1}^{N} \end{aligned}$$
(3)

where \(\textbf{w}_i\) denotes the weights of the i-th individual at initialization. Equivalently, the i-th individual corresponds to the model instance \(F(\cdot ;\textbf{w}_i)\). The selection objective \(\mathcal {L}_{\textrm{sel}}\) evaluated on \(\mathcal {D}_{\textrm{sel}}\) is used to select the global best individual.

Globally guided module-level fusion

Globally guided module-level fusion serves as the core operator in EGMF-GR, and its workflow is illustrated in Fig. 2. The framework is inspired by evolutionary search while remaining independent of any specific EA. It requires a population of individuals that share the same architecture but differ in weights, together with a global best individual selected by the selection loss \(\mathcal {L}_{\textrm{sel}}\) on \(\mathcal {D}_{\textrm{sel}}\). At each iteration, the globally best individual provides module-wise guidance at the model-state level.

Module monitoring in EGMF-GR refers to the automatic identification of aligned modules between the current individual model and the global best model. A monitored module is a module that appears at the same structural position in the two models, shares the same module name and parameter organization, and therefore supports direct state-level operations. Under this definition, monitoring does not mean manual selection of a few preferred layers. Instead, it means establishing a one-to-one correspondence between the same modules in two neural network models with the same architecture.

Once aligned modules are identified, the two models run the same module under the same input, and the resulting intermediate outputs are used for discrepancy evaluation. Since the paired modules are structurally consistent, their parameters can also participate in subsequent weight computation, state copying, or weighted fusion. Therefore, module monitoring serves two purposes at the same time. It provides matched intermediate features for discrepancy measurement, and it determines the valid module pairs on which parameter level operations can be performed.
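Name-based alignment can be sketched as a set intersection over module names; the `exclude` keywords mirror the fixed exclusion rule for the input embedding and final head, and the specific keyword strings are illustrative assumptions:

```python
def aligned_modules(modules_cur, modules_best, exclude=("embed", "head")):
    """Names of trunk modules with a one-to-one counterpart in both models.

    modules_cur / modules_best map module names to module objects or
    states; names matching an excluded keyword are skipped."""
    shared = set(modules_cur) & set(modules_best)
    return sorted(name for name in shared
                  if not any(key in name for key in exclude))
```

Because both individuals share the same architecture, the intersection normally equals the full trunk, and the sorted name list fixes a deterministic monitoring order for reproducibility.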

For each individual, modules are aligned under the same architecture. Each aligned module is monitored during forward propagation, and its output is collected for discrepancy assessment. Intermediate outputs are captured via forward hooks under a paired forward pass of the current individual and the global best individual. For the aligned module, let the output tensor be \(z \in \mathbb {R}^{B \times C \times \cdots }\). It is summarized into a compact channel descriptor \(m \in \mathbb {R}^{C}\), whose c-th entry is defined in Eq. (4) as

$$\begin{aligned} m_c = \textrm{mean}_{b,\ldots }\left( \left| z_{b,c,\ldots } \right| \right) \end{aligned}$$
(4)

The compact descriptor is adopted because the output tensor of an aligned module is often high-dimensional and its non-channel dimensions depend on the module type, such as time steps in sequence layers, tokens in patch embeddings, or heads in attention blocks. A channel-wise summary therefore provides a unified representation for discrepancy computation. The channel-wise mean absolute activation is a stable, lightweight signal that remains invariant to auxiliary dimensions such as time steps, heads, and spatial tokens.
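The reduction in Eq. (4) amounts to averaging the absolute activations over every axis except the channel axis; a minimal sketch:

```python
import numpy as np

def channel_descriptor(z: np.ndarray) -> np.ndarray:
    """Summarize a module output of shape (B, C, ...) into m in R^C.

    Entry c is the mean absolute activation over the batch and all
    auxiliary (non-channel) dimensions, as in Eq. (4)."""
    B, C = z.shape[0], z.shape[1]
    # Flatten all auxiliary axes so the same code handles any module type.
    return np.abs(z).reshape(B, C, -1).mean(axis=(0, 2))
```

The same function applies unchanged to sequence layers, patch embeddings, or attention blocks, since any trailing dimensions are folded into the flattened axis.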

On the same input, let \(m_{\text {best}}\in \mathbb {R}^{C}\) and \(m_{\text {cur}}\in \mathbb {R}^{C}\) denote the descriptors obtained from the global-best and current individuals, respectively. To compute distribution-based metrics, descriptors are converted into normalized nonnegative vectors as in Eq. (5):

$$\begin{aligned} P=\frac{m_{\text {best}}+\epsilon \textbf{1}}{\textbf{1}^\top m_{\text {best}}+\epsilon C},\quad Q=\frac{m_{\text {cur}}+\epsilon \textbf{1}}{\textbf{1}^\top m_{\text {cur}}+\epsilon C} \end{aligned}$$
(5)

where, \(\textbf{1}\in \mathbb {R}^{C}\) is an all-ones vector and \(\epsilon =10^{-12}\) ensures positivity.

JSD is computed on P and Q according to Eq. (6):

$$\begin{aligned} \textrm{JSD}\!\left( P \parallel Q\right) =\frac{1}{2}\textrm{KLD}\!\left( P \parallel M\right) +\frac{1}{2}\textrm{KLD}\!\left( Q \parallel M\right) \end{aligned}$$
(6)

where, \(M=\frac{1}{2}P+\frac{1}{2}Q\) and \(P,Q,M\in \mathbb {R}^{C}\) are probability vectors.

The KLD is defined in Eq. (7):

$$\begin{aligned} \textrm{KLD}\!\left( P \parallel Q\right) =\sum _{c=1}^{C} P_c \log \frac{P_c}{Q_c} \end{aligned}$$
(7)
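Equations (5)-(7) translate directly into a few lines of numpy; this sketch follows the stated \(\epsilon =10^{-12}\) smoothing and computes JSD through its midpoint definition:

```python
import numpy as np

EPS = 1e-12  # epsilon in Eq. (5), ensuring strictly positive entries

def to_prob(m: np.ndarray) -> np.ndarray:
    # Eq. (5): shift by epsilon and normalize into a probability vector.
    return (m + EPS) / (m.sum() + EPS * m.size)

def kld(p: np.ndarray, q: np.ndarray) -> float:
    # Eq. (7): Kullback-Leibler divergence between probability vectors.
    return float(np.sum(p * np.log(p / q)))

def jsd(p: np.ndarray, q: np.ndarray) -> float:
    # Eq. (6): symmetric Jensen-Shannon divergence via the midpoint M.
    mid = 0.5 * (p + q)
    return 0.5 * kld(p, mid) + 0.5 * kld(q, mid)
```

Applying `to_prob` to \(m_{\text {best}}\) and \(m_{\text {cur}}\) yields P and Q, after which both divergences are well defined even when some channel activations are exactly zero.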
Fig. 2

Pipeline of globally guided module-level fusion. A current individual and the global best individual execute paired forward propagation under the same inputs, and aligned modules are monitored to collect intermediate outputs. For each module, a multi-metric discrepancy score is computed from JSD, KLD, MSE, and MAE after per-metric normalization. A hybrid threshold based on the third quartile and a \(\gamma\)-scaled IQR decides whether fusion is triggered. When triggered, the fusion weight interpolates between the best and current module states, optionally with a small Gaussian perturbation; otherwise no change is applied. Module monitoring automatically matches architecture-aligned trunk modules under a fixed exclusion rule for the input embedding and final head, so the operator remains reproducible and not tied to a specific backbone family.

MSE and MAE are computed consistently at the module level, as given in Eq. (8):

$$\begin{aligned} \textrm{MSE}=\frac{1}{C}\sum _{c=1}^{C}\left( m_{\text {best},c}-m_{\text {cur},c}\right) ^2,\qquad \textrm{MAE}=\frac{1}{C}\sum _{c=1}^{C}\left| m_{\text {best},c}-m_{\text {cur},c}\right| \end{aligned}$$
(8)

To make heterogeneous criteria comparable within the same iteration, each metric is normalized over the monitored modules before aggregation. For the aligned module l, the resulting normalized multi-metric discrepancy is defined in Eq. (9). Here, \(\textrm{norm}(\cdot )\) denotes an iteration-wise normalization operator applied to the module-level values computed on the monitored set.

$$\begin{aligned} \textrm{Fit}^{(l)} = \textrm{norm}\!\left( \textrm{JSD}^{(l)}\right) +\textrm{norm}\!\left( \textrm{KLD}^{(l)}\right) +\textrm{norm}\!\left( \textrm{MSE}^{(l)}\right) +\textrm{norm}\!\left( \textrm{MAE}^{(l)}\right) \end{aligned}$$
(9)

where an equal-weight sum is adopted to avoid introducing additional tunable coefficients. Since each criterion is normalized within an iteration, the resulting score serves as a relative triggering signal rather than an absolute task-level objective. This design keeps the trigger rule simple and reproducible across different model architectures and datasets.

Multiple discrepancy criteria serve distinct signals for reliable triggering. JSD and KLD quantify distributional mismatch between normalized channel descriptors and remain sensitive to channel-wise shape changes. MSE and MAE quantify point-wise magnitude differences in the original descriptor space and capture scale deviations that may not be reflected by divergence alone. The combined score therefore supports a trigger that responds to both distribution shift and amplitude deviation at the module-output level.
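The aggregation in Eq. (9) can be sketched as follows; since the text only requires an iteration-wise normalization over the monitored modules, min-max scaling is used here as one plausible instantiation of \(\textrm{norm}(\cdot )\), not as the paper's confirmed choice:

```python
import numpy as np

def minmax_norm(values):
    # One plausible instantiation of norm(.): min-max scaling over the
    # monitored modules within the current iteration (an assumption).
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def fit_scores(jsd_vals, kld_vals, mse_vals, mae_vals):
    # Eq. (9): equal-weight sum of the four normalized criteria,
    # one value per monitored module.
    return (minmax_norm(jsd_vals) + minmax_norm(kld_vals)
            + minmax_norm(mse_vals) + minmax_norm(mae_vals))
```

Each input list holds one value per monitored module, so the returned vector gives \(\textrm{Fit}^{(l)}\) for every aligned module in a single call.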

For the aligned module l, the module discrepancy is defined as in Eq. (10):

$$\begin{aligned} \Delta \textrm{Fit}^{(l)}=\textrm{Fit}^{(l)} \end{aligned}$$
(10)

To reduce unstable fusion decisions caused by heterogeneous module scales and noisy updates, a robust trigger threshold is derived from the dispersion of module discrepancies across monitored modules, so fusion activates only for outlier-level mismatch, as specified in Eq. (11). Let \(\Delta \textrm{Fit}^{(l)}\) denote the discrepancy of the lth monitored module, and let M be the number of monitored modules.

$$\begin{aligned} \tau = Q_{3}\!\left( \{\Delta \textrm{Fit}^{(l)}\}_{l=1}^{M}\right) + \gamma \, \textrm{IQR}\!\left( \{\Delta \textrm{Fit}^{(l)}\}_{l=1}^{M}\right) \end{aligned}$$
(11)

where \(Q_{3}(\cdot )\) denotes the third quartile and \(\textrm{IQR}(\cdot )=Q_{3}(\cdot )-Q_{1}(\cdot )\) denotes the IQR. For the aligned module l, fusion is triggered when \(\Delta \textrm{Fit}^{(l)} > \tau\).

This rule activates fusion only when a module discrepancy exceeds a robust population-wide dispersion level, which reduces over-frequent fusion under distribution shifts and noisy updates. The threshold acts as a practical robustness control, and its effect is validated by ablations and a sensitivity check on \(\gamma\).
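The trigger rule of Eq. (11) is a standard upper-fence outlier test over the module discrepancies; in this sketch the default \(\gamma =1.5\) is an illustrative value, not the paper's reported setting:

```python
import numpy as np

def hybrid_threshold(delta_fit, gamma=1.5):
    # Eq. (11): tau = Q3 + gamma * IQR over the monitored modules.
    # gamma = 1.5 is an illustrative default, not the paper's setting.
    q1, q3 = np.percentile(delta_fit, [25, 75])
    return q3 + gamma * (q3 - q1)

def fusion_triggered(delta_fit, gamma=1.5):
    # A module fuses only when its discrepancy is an outlier-level value.
    tau = hybrid_threshold(delta_fit, gamma)
    return [d > tau for d in delta_fit]
```

When all modules disagree by a similar amount, no module exceeds the fence and every module state is retained, which is exactly the conservative behavior the rule is designed to produce.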

Fusion is applied module-wise through discrepancy-aware weighted fusion, while the original module state is retained when the trigger condition is not satisfied. The module state includes learnable parameters and, when present, non-learnable buffers such as running statistics, so that internal inconsistencies after merging are reduced. Optimizer states are not transferred. When the trigger condition holds, discrepancy-aware weighted fusion is applied to the learnable parameters as in Eq. (12), where the fusion weight is defined in Eq. (13). Non-learnable buffers are synchronized separately when present. A small Gaussian perturbation is added to maintain population diversity, and its standard deviation \(\sigma\) is set proportional to the mean absolute magnitude of the module parameters with a small scaling coefficient. Unless otherwise stated, the perturbation scale is kept fixed across all training settings.

$$\begin{aligned} & \textbf{w}^{(l)} \leftarrow \alpha ^{(l)}\,\textbf{w}_{\textrm{best}}^{(l)} + \left( 1-\alpha ^{(l)}\right) \textbf{w}_{\textrm{cur}}^{(l)} + \mathcal {N}(0,\sigma ^2) \end{aligned}$$
(12)
$$\begin{aligned} & \alpha ^{(l)} = 1 - \exp \!\left( -\Delta \textrm{Fit}^{(l)}\right) \end{aligned}$$
(13)
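Equations (12)-(13) can be sketched per module as below; the scaling coefficient `noise_coef` stands in for the unspecified "small scaling coefficient" that sets \(\sigma\) proportional to the mean absolute parameter magnitude:

```python
import numpy as np

def fuse_module(w_cur, w_best, delta_fit, rng, noise_coef=1e-3):
    """Discrepancy-aware weighted fusion of one module's parameters.

    noise_coef is an illustrative value for the small scaling
    coefficient that ties sigma to the parameter magnitude."""
    alpha = 1.0 - np.exp(-delta_fit)           # Eq. (13)
    fused = alpha * w_best + (1.0 - alpha) * w_cur
    sigma = noise_coef * np.abs(fused).mean()  # diversity perturbation scale
    return fused + rng.normal(0.0, sigma, size=fused.shape)  # Eq. (12)
```

A vanishing discrepancy gives \(\alpha ^{(l)}\approx 0\) and leaves the current module essentially unchanged, while a large discrepancy drives \(\alpha ^{(l)}\rightarrow 1\) and pulls the module toward the global best state.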

Alternation with gradient refinement

This schedule alternates between population-level selection and a short gradient refinement stage. A population of candidate networks evolves under a selection-driven criterion, while a brief backpropagation phase stabilizes the parameters after fusion and restores local fit on the training split. Figure 3 summarizes the full loop, including initialization, fitness evaluation, global-best selection, fusion, and refinement.

At generation g, the population is denoted by \(\mathcal {P}^{(g)} = \{\textbf{w}_i^{(g)}\}_{i=1}^{N}\), where \(\textbf{w}_i^{(g)}\) represents the full model state of the i-th individual. The global best individual follows the selection criterion in Eq. (14):

$$\begin{aligned} \textbf{w}_{\textrm{best}}^{(g)} = \arg \min _{\textbf{w}_i^{(g)} \in \mathcal {P}^{(g)}} \mathcal {L}_{\textrm{sel}}\left( \textbf{w}_i^{(g)}\right) \end{aligned}$$
(14)

For each individual, a fused offspring is constructed by the selective module-wise transfer operator in Eq. (15):

$$\begin{aligned} \textbf{w}_{i,\textrm{off}}^{(g)} = \mathcal {M}\left( \textbf{w}_i^{(g)}, \textbf{w}_{\textrm{best}}^{(g)}\right) \end{aligned}$$
(15)

The operator acts only on architecture-aligned modules. It transfers module states at the model-state level, including learnable parameters and non-learnable buffers when applicable, so that the offspring preserves internal consistency after merging.

A short gradient refinement stage is then applied to the offspring to reduce instability introduced by discrete module transfer. Let \(\mathcal {L}_{\textrm{tr}}(\textbf{w})\) denote the training objective. The refined candidate is produced by the refinement mapping in Eq. (16):

$$\begin{aligned} \tilde{\textbf{w}}_i^{(g)} = \mathcal {G}_{\eta }^{K}\left( \textbf{w}_{i,\textrm{off}}^{(g)}\right) \end{aligned}$$
(16)

The refinement mapping applies the per-step update in Eq. (17):

$$\begin{aligned} \textbf{w} \leftarrow \textbf{w} - \eta \nabla _{\textbf{w}} \mathcal {L}_{\textrm{tr}}(\textbf{w}) \end{aligned}$$
(17)

The refinement stage is intentionally short to keep the added backward cost modest, while still providing an efficient local search that complements the population step. After refinement, the selection loss on \(\mathcal {D}_{\textrm{sel}}\) is evaluated to update the global-best record and to form the next generation.
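The alternation in Eqs. (14)-(17) can be illustrated on a toy problem. The quadratic loss, population size, uniform fusion weight, and step counts below are illustrative stand-ins for \(\mathcal {L}_{\textrm{sel}}\), \(\mathcal {L}_{\textrm{tr}}\), and the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for both L_sel and L_tr: a quadratic with optimum w_star.
w_star = np.array([1.0, -2.0, 0.5])
loss = lambda w: float(np.sum((w - w_star) ** 2))
grad = lambda w: 2.0 * (w - w_star)

N, G, K, eta, alpha = 6, 20, 3, 0.1, 0.5     # illustrative settings
population = [rng.normal(0.0, 3.0, size=3) for _ in range(N)]

for g in range(G):
    best = min(population, key=loss)                  # Eq. (14): global best
    new_population = [best]                           # elitism: retain best
    for w in population:
        if w is best:
            continue
        offspring = alpha * best + (1.0 - alpha) * w  # Eq. (15), simplified
        for _ in range(K):                            # Eqs. (16)-(17)
            offspring = offspring - eta * grad(offspring)
        new_population.append(offspring)
    population = new_population

final_loss = loss(min(population, key=loss))
```

In EGMF-GR the transfer in Eq. (15) is module-wise and discrepancy-triggered rather than the uniform whole-vector interpolation used here; the sketch only shows how selection, state transfer, and short refinement alternate across generations.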

Fig. 3

Training loop with alternation between population evolution and short gradient refinement.

Weak stability of the alternating evolutionary-gradient procedure

A weak stability rationale is provided for the alternating procedure with module-wise fusion and gradient-based refinement. The objective is to rule out unbounded weight growth under standard boundedness conditions. One iteration is summarized by the composite update in Eq. (18):

$$\begin{aligned} \textbf{w}^{+} = \mathcal {G}_{\eta }^{K}\left( \mathcal {M}\left( \textbf{w}, \textbf{w}_{\textrm{best}}\right) \right) \end{aligned}$$
(18)

For each fused module l, a convex move toward the best module with bounded zero-mean noise is assumed in Eq. (19):

$$\begin{aligned} \textbf{w}^{(l)} \leftarrow \alpha ^{(l)} \textbf{w}_{\textrm{best}}^{(l)} + \left( 1 - \alpha ^{(l)}\right) \textbf{w}_{\textrm{cur}}^{(l)} + \varvec{\xi }, \quad \mathbb {E}\left[ \varvec{\xi }\right] = \textbf{0}, \quad \mathbb {E}\left[ \Vert \varvec{\xi }\Vert _2^2\right] \le \sigma ^2 \end{aligned}$$
(19)

The assumption implies the expected distance bound in Eq. (20):

$$\begin{aligned} \mathbb {E}\left[ \left\| \textbf{w}^{(l)} - \textbf{w}^{(l)}_{\textrm{best}}\right\| _2^2\right] \le \left( 1 - \alpha ^{(l)}\right) ^2 \left\| \textbf{w}^{(l)}_{\textrm{cur}} - \textbf{w}^{(l)}_{\textrm{best}}\right\| _2^2 + \sigma ^2 \end{aligned}$$
(20)
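For completeness, the step from Eq. (19) to Eq. (20) follows by writing the fused state as \(\textbf{w}^{(l)}-\textbf{w}^{(l)}_{\textrm{best}}=\left( 1-\alpha ^{(l)}\right) \left( \textbf{w}^{(l)}_{\textrm{cur}}-\textbf{w}^{(l)}_{\textrm{best}}\right) +\varvec{\xi }\), expanding the squared norm, and using that the perturbation is zero-mean, so the cross term vanishes in expectation:

$$\begin{aligned} \mathbb {E}\left[ \left\| \textbf{w}^{(l)}-\textbf{w}^{(l)}_{\textrm{best}}\right\| _2^2\right] = \left( 1-\alpha ^{(l)}\right) ^2 \left\| \textbf{w}^{(l)}_{\textrm{cur}}-\textbf{w}^{(l)}_{\textrm{best}}\right\| _2^2 + \mathbb {E}\left[ \Vert \varvec{\xi }\Vert _2^2\right] \le \left( 1-\alpha ^{(l)}\right) ^2 \left\| \textbf{w}^{(l)}_{\textrm{cur}}-\textbf{w}^{(l)}_{\textrm{best}}\right\| _2^2 + \sigma ^2 \end{aligned}$$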

Let \(\mathcal {L}_{\textrm{tr}}(\textbf{w})\) denote the refinement-stage objective, and assume it has \(\beta\)-Lipschitz continuous gradients in the explored region with \(\eta \le 1/\beta\). This condition restricts the refinement step size and supports bounded iterates in the explored region.

Computational complexity

The dominant backward and optimizer updates are kept identical to the baseline, so the main backward cost remains controlled and the extra cost comes mainly from forward-only evaluations and lightweight discrepancy computation. Hooks do not cache full activations. Each monitored output \(z^{(l)}\in \mathbb {R}^{B\times C_l\times \cdots }\) is immediately reduced to a channel descriptor \(m^{(l)}\in \mathbb {R}^{C_l}\), yielding a hook memory cost of \(O\!\left( \sum _{l=1}^{M} C_l\right)\).

Discrepancy metrics computed on \(m^{(l)}\) have linear cost, so the per iteration metric FLOPs scale as \(O\!\left( \sum _{l=1}^{M} C_l\right)\). The evolutionary stage adds approximately one extra forward only evaluation per step by pairing the current and global best individuals, while the refinement stage has the same asymptotic complexity as standard training. Measured overhead is reported using wall clock time and peak GPU memory under the same hardware and software setting.
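A sketch of a linear-cost multi metric score on two such descriptors follows; the softmax normalization applied before the divergence terms and the equal weighting of the four terms are assumptions made for illustration, not the paper's exact formula.

```python
import math

def _softmax(v):
    # Map a channel descriptor to a probability vector for the divergences.
    mx = max(v)
    e = [math.exp(x - mx) for x in v]
    s = sum(e)
    return [x / s for x in e]

def kld(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, mid) + 0.5 * kld(q, mid)

def discrepancy(desc_a, desc_b):
    """Multi metric discrepancy between two channel descriptors.
    Every term is a single pass over the C channels, hence O(C)."""
    p, q = _softmax(desc_a), _softmax(desc_b)
    n = len(desc_a)
    mse = sum((a - b) ** 2 for a, b in zip(desc_a, desc_b)) / n
    mae = sum(abs(a - b) for a, b in zip(desc_a, desc_b)) / n
    return jsd(p, q) + kld(p, q) + mse + mae  # equal weights: an assumption

score = discrepancy([0.1, 0.9], [0.9, 0.1])
```

Identical descriptors score exactly zero, and any deviation increases the score, which is the property the fusion trigger relies on.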

Experimental results and analysis

Datasets and evaluation protocol

EGMF-GR is the current name of the framework; it refers to the same method previously reported under the name E-Informer. Experiments cover eight public benchmarks from energy, transportation, economy, and weather, as summarized in Table 1. All methods use the same data split, normalization pipeline, input length, label length, prediction horizon, optimizer, learning rate schedule, batch size, and early stopping rule.

The baseline of each backbone corresponds to the official training recipe with all EGMF-GR specific operations disabled. Backbone implementations and default hyperparameters follow the official implementations released by the original authors, with only minimal interface level adjustments to enable a unified data loader, normalization pipeline, and evaluation protocol across backbones. Under this definition, the baseline set includes iTransformer, Crossformer38, and TimesNet39 under the standard long term time series forecasting setting.

The reported experiments instantiate the population based exploration as a simple generational genetic algorithm style procedure under the matched optimization budget. Population size is set to \(N=10\) and the number of generations to \(G=10\) unless otherwise stated. Fitness is defined as the selection loss on \(\mathcal {D}_{\textrm{sel}}\), a holdout split inside the training portion under time order. The validation split \(\mathcal {D}_{\textrm{val}}\) is reserved for early stopping and final reporting and does not enter the evolutionary loop. Each generation retains the global best individual and constructs one offspring for each remaining individual by discrepancy triggered module level weighted fusion with the global best individual, retaining the original module state when the trigger condition is not satisfied. After transfer, a short gradient refinement stage is executed using the same optimizer setting as the corresponding backbone baseline recipe.
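The per-generation offspring construction can be sketched as follows, with module weights flattened to plain lists for readability; the fusion coefficient `alpha`, the discrepancy scores, and the per-module thresholds are hypothetical inputs supplied for illustration.

```python
def make_offspring(cur, best, disc, threshold, alpha=0.5):
    """Build one offspring for a non-best individual.

    cur, best: dict mapping module name -> list of weights.
    disc:      dict mapping module name -> discrepancy score.
    Modules whose discrepancy exceeds their robust threshold are fused
    toward the global best by a convex combination (Eq. 19 without noise);
    all other modules keep their current state unchanged.
    """
    child = {}
    for name, w_cur in cur.items():
        if disc[name] > threshold[name]:
            w_best = best[name]
            child[name] = [alpha * b + (1 - alpha) * c
                           for b, c in zip(w_best, w_cur)]
        else:
            child[name] = list(w_cur)  # preserve module state
    return child

cur  = {"enc": [0.0, 0.0], "dec": [1.0, 1.0]}
best = {"enc": [2.0, 2.0], "dec": [3.0, 3.0]}
disc = {"enc": 0.9, "dec": 0.1}   # only "enc" exceeds its threshold
thr  = {"enc": 0.5, "dec": 0.5}
child = make_offspring(cur, best, disc, thr)
```

After this transfer step, the short gradient refinement stage would be run on `child` with the backbone's own optimizer recipe before the next generation is formed.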

MSE and MAE are both reported because they emphasize different error geometries. MSE assigns larger penalty to occasional large deviations and reflects sensitivity to tail events, while MAE reflects the typical magnitude of deviations and is less dominated by rare spikes. Reporting both metrics avoids conclusions tied to a single error geometry and supports a simple cross check of method ordering. Substantial variation between the two metrics is not a requirement for this role, since the purpose is to test whether conclusions remain stable under two complementary error views.
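A small numeric example makes the two error geometries concrete: a single large spike dominates MSE but not MAE.

```python
def mse(residuals):
    return sum(r * r for r in residuals) / len(residuals)

def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)

spiky  = [0.0, 0.0, 0.0, 8.0]   # one rare large deviation
steady = [3.0, 3.0, 3.0, 3.0]   # moderate deviations throughout
```

Here the spiky forecast has the larger MSE (16 vs 9) but the smaller MAE (2 vs 3), so the two metrics rank the forecasts in opposite order, which is exactly why both are reported.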

Statistical significance is evaluated under the leakage free rolling origin protocol. For each backbone, dataset, and horizon setting, forecasting errors across rolling origins are collected over multiple random seeds, and a paired Wilcoxon signed rank test compares EGMF-GR with the corresponding backbone baseline under the same setting. Benjamini Hochberg correction is applied at \(q = 0.05\) to the family of p values obtained from all backbone, dataset, and horizon settings to control the false discovery rate.
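The correction step can be sketched in a few lines of pure Python; the per setting p values themselves would come from the paired Wilcoxon signed rank tests (for example via `scipy.stats.wilcoxon`), which are not reproduced here.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return a reject flag per p value under FDR control at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= q * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

# Four hypothetical settings (backbone x dataset x horizon).
flags = benjamini_hochberg([0.010, 0.040, 0.030, 0.500], q=0.05)
```

With these illustrative p values only the first setting survives the correction, even though two raw p values fall below 0.05, which is the intended behavior of FDR control across the whole family.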

Table 1 Summary of benchmarks used in this work.

A fair comparison uses a matched optimization budget defined by the number of optimizer updates that include backpropagation. Additional forward only operations from module monitoring and discrepancy evaluation are treated as extra cost and are reported via wall clock time under the same environment.

A quantitative comparison of training time versus the baseline is reported under matched optimizer updates. Table 2 reports wall clock training time for representative datasets, together with overhead percent and speed ratio. The same table also provides a complementary runtime breakdown for a representative configuration and isolates the forward only fusion operator cost from standard evaluation time.

Table 2 Wall clock results under matched settings and the same environment. Time is measured in seconds. Overhead is measured in percent. Speed ratio equals baseline time divided by EGMF-GR time. The first block reports training time. The second block reports a runtime breakdown for a representative configuration.

Main forecasting results

EGMF-GR is integrated into iTransformer, Crossformer38, and TimesNet39. Table 3 reports MSE and MAE under four forecasting horizons, namely 96, 192, 336, and 720, under matched settings. Consistent ordering under both metrics indicates that conclusions do not depend on a single metric choice, even when the absolute differences between metrics are small. Our iTransformer, Our Crossformer, and Our TimesNet denote the corresponding backbone equipped with EGMF-GR under the same recipe.

EGMF-GR improves iTransformer and TimesNet on most datasets and horizons. Clear gains appear on Traffic and Weather, which is consistent with improved tolerance to high dimensional and noisy regimes. On Exchange, performance remains competitive across horizons, which indicates that the benefit does not rely on explicit seasonal structure. Figure 4 summarizes the improvement percentage on average MSE across horizons for each dataset. Positive values indicate reduced error relative to the corresponding backbone baseline.

Crossformer shows stronger dataset dependence. Gains appear on Electricity, Exchange, Traffic, and Weather, while degradations appear on the ETT benchmarks, most notably ETTm2 and ETTh2. This pattern indicates that the benefit of module level fusion depends on the backbone and the data regime, and it motivates additional diagnostic analysis in the later ablation and robustness studies.

Table 3 Performance analysis of EGMF-GR across datasets. Avg denotes the arithmetic mean over horizons 96, 192, 336, and 720 under the matched protocol. A dagger indicates statistical significance after Benjamini Hochberg correction based on paired Wilcoxon signed rank tests on rolling origin errors, comparing EGMF-GR with the corresponding backbone baseline under the same setting. For compactness, significance marks are reported for Avg MSE.
Fig. 4

Improvement percentage of EGMF-GR on average MSE over three long term forecasting backbones on eight benchmarks. The average MSE is computed over horizons 96, 192, 336, and 720. Positive values indicate reduced error relative to the corresponding backbone baseline.

Evidence beyond common training strategies

This subsection clarifies what improves beyond common weight smoothing baselines and provides matched evidence under the same backbone and training recipe. Exponential moving average (EMA) and stochastic weight averaging (SWA) perform trajectory level weight averaging for a single model instance and do not include population based selection, module level fusion, or a trigger mechanism. In contrast, EGMF-GR maintains a population of weight diverse individuals during training, ranks individuals with a selection objective, and performs module level fusion only when a multi metric discrepancy score exceeds an IQR regularized robust threshold. After fusion, a short gradient refinement stage and state level synchronization, including buffer alignment when applicable, promote stable transfer.

To strengthen evidence beyond a vanilla baseline, EGMF-GR is compared with two widely used training strategies under the same backbone and training recipe, namely EMA and SWA. The comparison uses the same data split and the same evaluation protocol as the main results. Table 4 reports two complementary views.

The first block evaluates iTransformer at horizon 96 on four ETT benchmarks, including ETTm1, ETTm2, ETTh1, and ETTh2. EGMF-GR achieves lower error than Vanilla, EMA, and SWA on ETTm1, ETTm2, and ETTh1. On ETTh2, EMA achieves the lowest error, while EGMF-GR stays lower than Vanilla and SWA.

The second block evaluates iTransformer on ETTm1 across horizons 96, 192, 336, and 720 under the same recipe. EGMF-GR achieves lower error than Vanilla, EMA, and SWA on horizons 96, 192, and 336. On horizon 720, MSE is comparable to EMA while MAE is lower. Overall, the evidence across datasets and horizons supports that EGMF-GR provides consistent benefits beyond common training strategies under matched training settings.

Table 4 Comparison with common training strategies under the same backbone and training recipe. Block one reports horizon 96 across four ETT datasets. Block two reports ETTm1 across four horizons. Lower is better.

Ablation study

This subsection isolates the contribution of key components under the same backbone and training recipe. All variants share the same data split, normalization, optimizer updates, batch size, early stopping rule, and leakage free rolling origin evaluation protocol. Baseline and EGMF-GR are reported together as reference anchors to support direct comparison under identical settings.

Key component variants are evaluated on ETTm2 under the matched optimization budget. Table 5 reports results across horizons 96, 192, 336, and 720 and includes an average summary over the four horizons. No JSD and KLD removes divergence based terms from module scoring and retains only normalized MSE and normalized MAE. Fixed threshold replaces the adaptive robust threshold with a single constant threshold shared by all modules and horizons. Another fusion mode replaces discrepancy aware weighted fusion with hard selection that copies parameters from the parent module with smaller module level fitness. The results indicate that multi metric scoring and the adaptive robust trigger support reliable module transfer, while fixed thresholding or simplified fusion rules reduce stability across horizons.

Table 5 Comparison of key component variants for multi horizon prediction on ETTm2 under the same backbone and training recipe. No JSD and KLD removes divergence based terms from module scoring and retains only normalized MSE and normalized MAE. Fixed threshold uses a single constant threshold shared by all modules and horizons, and the threshold is set as the median of warm up adaptive thresholds from the same run. Another fusion mode replaces weighted fusion with hard selection that copies parameters from the parent module with smaller module level fitness. Lower is better.

Sensitivity to the robust trigger hyperparameter gamma

Sensitivity to the robust trigger hyperparameter gamma is evaluated on ETTm1, ETTm2, ETTh1, and ETTh2 under the same backbone and training recipe. Figure 5 visualizes forecasting error and fusion activity on a shared gamma grid. Table 6 summarizes baseline anchored performance together with the best gamma selected by the lowest MSE on the tested grid. Larger gamma reduces trigger rate and suppresses fusion, while smaller gamma increases trigger frequency and strengthens transfer.
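As an illustrative sketch, the trigger can be modeled as a median plus gamma times IQR rule over a window of recent module discrepancy scores; the exact statistic inside the IQR regularized threshold may differ in the implementation, so this form is an assumption.

```python
def robust_threshold(scores, gamma):
    """Median + gamma * IQR over a window of recent discrepancy scores.
    A module fuses only when its current score exceeds this value."""
    s = sorted(scores)
    n = len(s)

    def quantile(q):
        # Linear interpolation between order statistics.
        idx = q * (n - 1)
        lo = int(idx)
        hi = min(lo + 1, n - 1)
        frac = idx - lo
        return s[lo] * (1 - frac) + s[hi] * frac

    iqr = quantile(0.75) - quantile(0.25)
    return quantile(0.5) + gamma * iqr

scores = [0.2, 0.3, 0.4, 0.5, 0.6]
low, high = robust_threshold(scores, 0.5), robust_threshold(scores, 2.0)
```

Raising gamma raises the threshold (here from about 0.5 to about 0.8), so fewer modules cross it and fusion fires less often, matching the qualitative trend described above.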

Table 6 Sensitivity to the robust trigger hyperparameter gamma across multiple datasets with baseline references. Baseline and EGMF-GR serve as reference anchors under identical settings for each dataset. Best gamma reports the lowest MSE on the tested grid.
Fig. 5

Effect of the robust trigger hyperparameter gamma on forecasting error and fusion activity across multiple datasets under the same backbone and training recipe. Larger gamma reduces trigger rate and suppresses fusion.

Robustness under distribution shift

Robustness under distribution shift is evaluated using a leakage free rolling origin protocol under identical settings. Three shift settings are considered, including stress period, noise injection, and missing blocks. The stress period follows a deterministic selection rule on the clean test rolling origins of the baseline model. All test rolling origins are partitioned into four contiguous chronological segments of equal size, per origin forecasting errors are computed, and the segment with the largest average MSE is selected as the stress period. MSE and MAE are then reported using only the rolling origins inside that worst segment. For noise injection, corruption is applied to the input window only, while prediction targets remain unchanged. Additive Gaussian noise with zero mean is injected, and the noise standard deviation is set to 0.05 times the standard deviation of the current input batch. For missing blocks, one contiguous block of length 24 is masked in each input window, the start index is sampled uniformly within the valid input range, and masked values are filled with zero. Under the default input length of 96, the missing ratio equals 0.25.

Table 7 reports baseline and EGMF-GR under the same protocol for direct comparison. A clear pattern appears across the three settings. EGMF-GR provides consistent gains under noise injection and missing blocks, but does not show the same advantage during the stress period. This difference suggests that the method is more effective against input corruption than against abrupt regime volatility. Under noise injection and missing blocks, the main challenge lies in degraded or incomplete observations, and the fusion mechanism helps stabilize intermediate representations by transferring more reliable module level information from the current best individual. During the stress period, however, the dominant difficulty is a rapid change in temporal dynamics rather than simple corruption of the input.
In that case, aggressive fusion can reduce useful diversity across individuals and can make adaptation to the new regime less flexible, which explains the slightly weaker result relative to the baseline. This observation also indicates a practical direction for improvement. For volatility dominated shifts, fusion strength can be reduced adaptively when module discrepancy rises sharply across rolling origins. A stress aware trigger can also be introduced so that the threshold becomes more conservative under rapid distributional movement. Such designs can preserve the benefit of fusion under noisy or incomplete inputs while reducing the risk of over alignment during abrupt regime transition.
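The two input corruption protocols can be reproduced in a few lines; approximating the batch level standard deviation by the window standard deviation is a simplification made here so the sketch stays self-contained.

```python
import random
import statistics

def inject_noise(window, rel_scale=0.05, rng=random):
    """Additive zero mean Gaussian noise on the input window only;
    noise std = rel_scale times the std of the inputs (targets untouched)."""
    sd = statistics.pstdev(window)
    return [x + rng.gauss(0.0, rel_scale * sd) for x in window]

def mask_block(window, block_len=24, rng=random):
    """Zero fill one contiguous block with a uniformly sampled start index."""
    start = rng.randrange(0, len(window) - block_len + 1)
    out = list(window)
    out[start:start + block_len] = [0.0] * block_len
    return out

window = [float(t % 24 + 1) for t in range(96)]  # strictly positive inputs
noisy = inject_noise(window)
masked = mask_block(window)  # missing ratio 24 / 96 = 0.25
```

Both corruptions operate on copies of the input window, leaving the prediction targets and the clean evaluation pipeline unchanged, as the protocol requires.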

Table 7 Robustness under distribution shift with baseline references. Stress period results are computed on the chronological quarter of clean test rolling origins with the largest average MSE under the baseline model. Noise injection applies additive Gaussian noise with zero mean and standard deviation equal to 0.05 times the standard deviation of the current input batch. Missing blocks mask one contiguous block of length 24 in each input window with zero filling. Under input length 96, the corresponding missing ratio is 0.25. Lower is better.

Conclusions

This paper presents EGMF-GR, a hybrid training framework that couples evolutionary search with gradient based optimization for LTSF under nonstationarity, noisy updates, and distribution shift. Rather than redesigning backbone architectures, EGMF-GR improves robustness from the training perspective by maintaining a population of weight diverse individuals and introducing globally guided module level fusion. The fusion decision is governed by a multi metric discrepancy score that integrates JSD, KLD, MSE, and MAE, together with an IQR regularized hybrid threshold. After fusion, a short gradient based refinement stage promotes local optimality, while model state level synchronization, including buffer alignment when applicable, improves internal consistency after merging. A weak stability rationale supports bounded iterates of the alternating fusion and refinement procedure under standard boundedness conditions.

Experiments on eight public benchmarks show that EGMF-GR improves forecasting accuracy, training stability, and optimization robustness under matched settings. By combining population level guidance with local gradient refinement, the framework helps suppress unstable updates and supports more reliable learning across diverse time series conditions. Future work will focus on adaptive discrepancy monitoring, more refined fusion control, and stronger evaluation under distribution shift.