Introduction

Rolling bearing remaining useful life (RUL) prediction is a crucial technology for maximizing continuous running time and reducing equipment maintenance costs1,2. However, the actual crack propagation of rolling bearings is exceptionally irregular, making it challenging to construct an effective failure-mechanism model3. In contrast, data-driven approaches can infer relationships hidden in the data without requiring knowledge of the machine’s explicit failure mechanism4.

Deep learning techniques, particularly Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks5,6,7,8,9,10,11, have demonstrated superior representational learning capabilities compared to traditional machine learning methods. LSTM effectively addresses gradient explosion and vanishing gradient problems faced by earlier RNNs. However, its inherent chain structure precludes parallel processing, leading to computational inefficiencies and slow operation speeds12.

To overcome these efficiency bottlenecks, Convolutional Neural Networks (CNNs) have been widely adopted for their parallel computing capabilities. For instance, Wang et al.13 combined a deep convolution autoencoder with CNN for RUL prediction, while Ma et al.14 proposed a channel attention CNN algorithm. Bai et al.15 further evaluated convolutional structures for time series modeling. Despite these advancements, traditional CNNs suffer from two significant disadvantages: (1) They rely on fixed geometric structures, lacking the flexibility to adapt to varying time window sizes; (2) They require deep stacking to achieve sufficient receptive fields for global feature modeling, which significantly increases the risk of gradient vanishing in deep networks.

Beyond conventional CNNs, advanced deep learning architectures leveraging attention mechanisms and graph neural networks (GNNs) have recently emerged to capture more complex dependencies. For instance, Zhang et al.16 proposed a self-supervised graph feature enhancement method with scale attention to improve node-level representation. Zhang et al.17 developed a graph feature dynamic fusion framework based on multi-head attention, which realizes effective fault diagnosis of large rotating machinery with multi-sensor data. Furthermore, graph-driven calibration frameworks18,19 have been introduced to address distribution shifts in RUL prediction across different operating conditions. While these methods offer strong global modeling capabilities, they often entail high computational complexity and require explicit topology construction.

To address these limitations, Temporal Convolutional Networks (TCNs)20,21,22,23,24 introduced dilated convolutions to exponentially expand the receptive field without increasing depth excessively, utilizing residual structures to mitigate gradient vanishing. Shang et al.24 further enhanced TCNs by incorporating multiple self-attention layers to adaptively assign feature weights. However, existing TCN-based approaches still face challenges. Standard convolutions are translation-invariant and cannot explicitly perceive the global sequence order between distant time steps. Furthermore, most methods lack an effective mechanism to adaptively calibrate feature responses from both local (time-step) and global (channel) perspectives to filter noise in complex degradation signals.

Inspired by gated convolutional networks25, this study proposes a novel RUL prediction framework integrating Gated Dilated Causal Convolution (GDCC) with Multi-Scale Encoding Units (MSEU). The main contributions of this paper are summarized as follows:

  1. A novel Gated Dilated Causal Convolution (GDCC) network is proposed. We construct a GDCC module in which the linear Dilated Causal Convolution (DCC) provides a direct linear path for gradients, significantly reducing the probability of gradient vanishing in deep networks. A gating mechanism controls the information flow, preserving the network’s nonlinear modeling capability and allowing it to adapt dynamically to information changes during the degradation process. This structure differs from prior TCN variants by integrating gating directly into the dilated causal branch while maintaining stability under large receptive fields.

  2. A Multi-Scale Encoding Unit (MSEU) is designed for adaptive feature calibration. The MSEU cell establishes correlations between multiple feature variables from both local and global perspectives. It assesses the information content of each feature map and adaptively recalibrates feature responses, suppressing useless features and uncorrelated noise and enhancing the predictive network’s ability to resolve informative degradation patterns. Unlike conventional channel attention blocks, the MSEU is strategically positioned after feature compression to better target multi-sensor fusion.

  3. Sinusoidal positional encoding is introduced to enforce global temporal order. Because standard convolution perceives only local order information, we inject explicit location information into the features via sinusoidal positional encoding. This adds global order relationships between distant time steps, improving the accuracy of time series predictions. We place the positional encoding after the MSEU (rather than at the input) to avoid information loss under low channel counts and to align with the downstream stacked GDCC modules.

  4. State-of-the-art performance is demonstrated. Comprehensive comparative experiments and ablation studies on the PHM 2012 and XJTU-SY datasets verify the effectiveness of the proposed method. The results demonstrate that our framework outperforms advanced baselines, including TCN-SA and CNN-BiLSTM, in prediction accuracy and robustness under various operating conditions.

Prerequisites

Description of the problem

This article frames multivariate time series RUL prediction as a sequence-to-sequence problem. A time series signal \(X_{I} = (x_{1} ,x_{2} ,...,x_{T} )\) is given as the input series to predict the output \(Y_{O} = (y_{1} ,y_{2} ,...,y_{T} )\) at each time step. The sequence modeling network z establishes a mapping between the monitoring data and the mechanical degradation process:

$$(y_{1} ,y_{2} ,...,y_{T} ) = z(x_{1} ,x_{2} ,...,x_{T} )$$
(1)

Causal convolution and dilated convolution

Causal convolution follows two principles: “the input and output scales remain unchanged” and “future data is not visible.” In this framework, the known condition at time t is the set of all preceding time steps, so the model’s prediction \({\text{P}}(x_{t} |x_{t - 1} ,x_{t - 2} ,...,x_{1} )\) cannot depend on any future time steps. The causal convolution at \(x_{t}\) is:

$$(F * X)(x_{t} ) = \sum\limits_{k = 1}^{K} {f_{k} x_{t - K + k} }$$
(2)

where \(F = (f_{1} ,f_{2} ,...,f_{K} )\) is the filter; K is the filter size; \(*\) is the convolution operation.
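To make Eq. (2) concrete, the following minimal NumPy sketch implements causal convolution with zero left-padding so the output length matches the input; the filter values and test signal are illustrative assumptions:

```python
import numpy as np

def causal_conv(x, f):
    """Causal 1-D convolution (Eq. 2): output at t uses only x[t-K+1..t].
    The input is left-padded with zeros so input/output lengths match."""
    K = len(f)
    x_pad = np.concatenate([np.zeros(K - 1), x])  # "future data is not visible"
    return np.array([np.dot(f, x_pad[t:t + K]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
f = np.array([0.5, 0.5])            # K = 2: a simple two-tap average
y = causal_conv(x, f)
# y[0] depends only on x[0]; y[t] depends on x[t-1] and x[t]
```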

However, causal convolution has a drawback: tracing information further back in time requires more network layers, which can lead to vanishing gradients. Dilated convolution solves this problem.

Dilated convolution skips parts of the input, allowing the convolution kernel to cover a region larger than its own length. With dilated convolution, convolutional networks can achieve a large receptive field with fewer layers. At \(x_{t}\), the dilated convolution with dilation factor d is:

$$(F * {}_{d}X)(x_{t} ) = \sum\limits_{k = 1}^{K} {f_{k} x_{t - (K - k)d} }$$
(3)

When the dilation factor \(d = 1\), dilated convolution reduces to ordinary convolution. Both K and d can enlarge the receptive field (of size \((K - 1)d + 1\)). In practice, the dilation factor usually grows exponentially with base 2 as the number of layers increases.
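A minimal sketch of Eq. (3) and of the receptive-field formula \((K-1)d+1\); the helper names and the example dilation schedule 1, 2, 4, 8 are assumptions for illustration:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Dilated causal convolution (Eq. 3): y_t = sum_k f_k * x_{t-(K-k)d}."""
    K = len(f)
    pad = (K - 1) * d
    x_pad = np.concatenate([np.zeros(pad), x])
    return np.array([sum(f[k] * x_pad[t + k * d] for k in range(K))
                     for t in range(len(x))])

def receptive_field(K, dilations):
    """Stacked layers each add (K - 1) * d extra past steps to the view."""
    return 1 + sum((K - 1) * d for d in dilations)

# With d doubling per layer, four kernel-size-3 layers already see 31 steps.
rf = receptive_field(3, [1, 2, 4, 8])
```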

Squeeze excitation block

The squeeze excitation block explicitly models the correlation between feature channels26. It learns the importance of each channel’s characteristics, and each output channel is reweighted by a predicted scalar weight. The block is shown in Fig. 1.

Fig. 1

Squeeze excitation block.

Figure 1 presents the squeeze as follows:

$$Z = S(T) = \frac{1}{H \times W}\sum\limits_{i = 1}^{H} {\sum\limits_{j = 1}^{W} {T_{C} (i,j)} }$$
(4)

The excitation is expressed as follows:

$$E\left( {Z,W} \right) = \sigma \left( {W_{2} \delta (W_{1} Z)} \right)$$
(5)

where T is the feature representation; H, W, and C are the height, width, and channel of T, respectively; \(\delta\) and \(\sigma\) represent the rectified linear unit (ReLU) and sigmoid activation functions, respectively; \(W_{1}\) and \(W_{2}\) are fully connected layers; S is the squeeze operation; E is the excitation operation.
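Eqs. (4) and (5) can be sketched as follows; the random W1/W2 matrices stand in for the learned fully connected layers, and the reduction ratio r is an assumed hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_block(T, r=2):
    """Squeeze-and-excitation over an (H, W, C) feature map (Eqs. 4-5).
    W1/W2 are random stand-ins here; in practice they are learned."""
    H, W, C = T.shape
    z = T.mean(axis=(0, 1))                       # squeeze: (C,) channel summary
    W1 = rng.standard_normal((C // r, C))         # reduction FC layer
    W2 = rng.standard_normal((C, C // r))         # expansion FC layer
    relu = lambda a: np.maximum(a, 0.0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    s = sigmoid(W2 @ relu(W1 @ z))                # excitation: (C,) weights in (0, 1)
    return T * s                                  # recalibrate each channel

T = rng.standard_normal((4, 4, 8))
T_out = se_block(T)
```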

Proposed methodology

This study introduces a framework, as illustrated in Fig. 2, whose components jointly enhance prediction accuracy. Specifically, feature compression fusion reduces the feature size and compacts the extracted features. The MSEU adaptively recalibrates feature responses, enhancing the network’s ability to resolve informative patterns. Sinusoidal position encoding is added to the hidden-layer features to inject global order relationships. GDCC modules are built and stacked to learn the sequence features of the preceding layer. Finally, the network is trained with an optimization algorithm to obtain the optimal weights.

Fig. 2

Overall framework of the proposed rolling bearing RUL prediction algorithm.

The MSEU is proposed because feature compression fusion lacks a precise learning mechanism to identify differences between feature variables. As the C structure in Fig. 2 shows, it attends to informative variables and suppresses uncorrelated clutter by refining the input features twice, helping information flow through the network. The first step learns, from a local perspective, the dependence among feature variables at each time step. The second step aggregates the feature variables over all time steps and learns, from a global perspective, the dependence between channels. Finally, the method recalibrates the features and feature maps. The implementation details of the MSEU are summarized in Algorithm 1.

Algorithm 1

MSEU Process.

where N is the number of convolution kernels of the previous convolutional layer. The convolution kernels \(F_{1,N/2}\) and \(F_{1,N}\) model the correlation between feature variables at each time step; M is average pooling; and the convolution kernels \(F_{2,N/2}\) and \(F_{2,N}\) model the correlations between channels under the global distribution.

Note that the conventional one-dimensional convolutions used in steps 1 and 2 are not DCC, because they do not model associations across time steps. A single time step generally involves few feature variables, so a large receptive field is unnecessary there.
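Since Algorithm 1 is not reproduced here, the following is a speculative NumPy sketch of the two-stage recalibration described above; the bottleneck shapes (N to N/2 and back to N), the activations, and the random stand-in weights are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
relu = lambda a: np.maximum(a, 0.0)

def mseu(X):
    """Speculative MSEU sketch for X of shape (T, N).
    Step 1 (local): a per-time-step bottleneck (N -> N/2 -> N) yields a
    weight for every feature variable at every step.
    Step 2 (global): average-pool over time, then a second bottleneck
    yields one weight per channel. Weights are random stand-ins."""
    T, N = X.shape
    W1a, W1b = rng.standard_normal((N // 2, N)), rng.standard_normal((N, N // 2))
    W2a, W2b = rng.standard_normal((N // 2, N)), rng.standard_normal((N, N // 2))
    local = sigmoid(relu(X @ W1a.T) @ W1b.T)      # (T, N) per-step weights
    z = X.mean(axis=0)                            # global average pooling M
    glob = sigmoid(W2b @ relu(W2a @ z))           # (N,) per-channel weights
    return X * local * glob                       # twice-refined features

X = rng.standard_normal((16, 8))
X_out = mseu(X)
```

Both weight maps lie in (0, 1), so the unit can only attenuate features, which matches its role of suppressing uncorrelated clutter.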

Sinusoidal position encoding

The order relationship between time steps in time series data often affects the final prediction results because the information carried by the latest time step is more conducive to prediction than the early information. Convolutional kernels can extract local sequential information from sequences but cannot obtain information between distant time steps.

This article adds a position vector to \(X^{\prime}_{f}\) to enhance the “position sense” of GDCC. The position vector is constructed following the scheme of27:

$$PE = \left\{ \begin{gathered} PE_{(pos,2i)} = \sin \left( {\frac{pos}{{10000^{{\frac{2i}{{d_{m} }}}} }}} \right) \hfill \\ PE_{(pos,2i + 1)} = \cos \left( {\frac{pos}{{10000^{{\frac{2i}{{d_{\text{m}} }}}} }}} \right) \hfill \\ \end{gathered} \right.$$
(9)

where pos is an integer in \([0,T_{f} - 1]\); \(T_{f}\) is the total number of time steps of \(X^{\prime}_{f}\); \(d_{m}\) is the dimension of the position vector, equal to the dimension of \(X^{\prime}_{f}\); and i is an integer in \(\left[ {0,\frac{{d_{m} }}{2} - 1} \right]\).
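Eq. (9) can be computed in vectorized form as below; the function name is an assumption:

```python
import numpy as np

def sinusoidal_pe(T_f, d_m):
    """Sinusoidal position encoding (Eq. 9): one d_m-dimensional vector
    per position pos = 0 .. T_f - 1."""
    pos = np.arange(T_f)[:, None]                  # (T_f, 1)
    i = np.arange(d_m // 2)[None, :]               # (1, d_m/2)
    angle = pos / (10000.0 ** (2 * i / d_m))
    pe = np.zeros((T_f, d_m))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(2560, 16)   # the setting visualized in Fig. 3
```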

In natural language processing, position encoding is usually applied immediately after the input. Here, however, it is placed after the MSEU for two reasons. First, the channel dimension C of the input data equals 2; as Fig. 3 shows, a 2-dimensional position vector would inevitably lose significant location information. Second, the MSEU does not involve sequential relationships between time steps, unlike the stacked GDCC modules, which learn across them. This paper therefore places the sinusoidal position encoding after the MSEU and before the stacked GDCC modules, shown as the D structure in Fig. 2.

Fig. 3

Sinusoidal position encoding (pos = 2560, \(d_{m}\) = 16).

The output of the sinusoidal position encoding module is as follows:

$$X_{PE} = X^{\prime}_{f} + PE$$
(10)

Stack GDCC modules and predict RUL

The vector sequence \(X_{PE}\) is processed using DCC. Inspired by the gate mechanism, this section adds a gating unit to the DCC, as indicated in (11):

$$\begin{gathered} g(X_{PE} ) = \sigma (DCC_{1} (X_{PE} )) \\ G(X_{PE} ) = DCC_{2} (X_{PE} ) \times g(X_{PE} ) \\ \end{gathered}$$
(11)

This structure is referred to as GDCC. \(DCC_{1}\) and \(DCC_{2}\) have the same number of convolution kernels, window size, and dilation factor; however, they are independent, i.e., their weights are not shared. The \(\sigma\) function preserves the network’s nonlinearity, so the network does not collapse into a purely linear model. The range of \(\sigma\) is (0,1): intuitively, each output of \(DCC_{2}\) passes through a “valve” that controls the information flow to the next neuron. During training, useful information consistent with historical degradation data is retained.

One advantage of GDCC is that it largely avoids the vanishing-gradient problem: \(DCC_{2}\) applies no activation function, so its derivative is constant. This provides a linear path for the gradient, alleviating gradient vanishing in deep networks.

We further add the input through a residual structure, as indicated in (12):

$$G(X_{PE} ) = G(X_{PE} ) + X_{PE}$$
(12)

When the dimension sizes of the input and output do not match, a convolution \(DCC_{3}\) is applied to the input instead, as indicated in (13):

$$G(X_{PE} ) = G(X_{PE} ) + DCC_{3} (X_{PE} )$$
(13)

Combining (11) and (13) and rewriting into an equivalent form makes the information flow clearer, as indicated in (14):

$$\begin{aligned} G(X_{{PE}} ) = & DCC_{2} (X_{{PE}} ) \times g(X_{{PE}} ) + DCC_{3} (X_{{PE}} ) \\ & = (DCC_{2} (X_{{PE}} ) - DCC_{3} (X_{{PE}} )) \times g(X_{{PE}} ) + DCC_{3} (X_{{PE}} ) \\ & = DCC_{2} (X_{{PE}} )g(X_{{PE}} ) + DCC_{3} (X_{{PE}} )(1 - g(X_{{PE}} )) \\ \end{aligned}$$
(14)

\(DCC_{2} (X_{PE} )\) is a linear transformation, so \(DCC_{2} (X_{PE} ) - DCC_{3} (X_{PE} )\) is itself an equivalent linear transformation. In other words, during training the network can learn \(DCC_{2} (X_{PE} ) - DCC_{3} (X_{PE} )\) just as it would learn \(DCC_{2} (X_{PE} )\).
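Eqs. (11)–(13) can be sketched with single-channel sequences; the random filters stand in for the learned \(DCC_{1}\), \(DCC_{2}\), and \(DCC_{3}\), and multi-channel details are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def dcc(x, f, d):
    """Linear dilated causal convolution (no activation function)."""
    K, pad = len(f), (len(f) - 1) * d
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(f[k] * xp[t + k * d] for k in range(K))
                     for t in range(len(x))])

def gdcc(x, d=2, K=3):
    """GDCC sketch: gated dilated causal convolution with a residual
    branch (Eqs. 11 and 13). Filters are random stand-ins."""
    f1, f2, f3 = (rng.standard_normal(K) for _ in range(3))
    g = sigmoid(dcc(x, f1, d))          # gate values in (0, 1), Eq. (11)
    G = dcc(x, f2, d) * g               # gated linear path
    return G + dcc(x, f3, d)            # residual DCC_3 branch, Eq. (13)

y = gdcc(rng.standard_normal(32))
```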

In summary, this section proposed a gated DCC and designed a GDCC module, as shown in Fig. 4. Unlike the full pre-activation structure, which uses the ReLU activation function, this section uses the gating mechanism. The specific implementation of the E structure is detailed in Algorithm 2.

Algorithm 2

E structure Process.

Figure 4 shows the structure of the GDCC module (the E structure in Fig. 2):

Fig. 4

GDCC module structure.

Experimental analysis

Experimental instructions

Performance metrics

This study selects the mean absolute error (MAE) and the root mean square error (RMSE) to measure the distance between prediction and true value. MAE and RMSE are defined in (15) and (16):

$$\begin{gathered} er_{t} = RUL_{t}^{act} - RUL_{t}^{pre} \\ MAE = \frac{1}{T}\sum\limits_{t = 1}^{T} {\left| {er_{t} } \right|} \\ \end{gathered}$$
(15)
$$RMSE = \sqrt {\frac{1}{T}\sum\limits_{t = 1}^{T} {(er_{t} )^{2} } }$$
(16)

In (15), \(er_{t}\) is the error between the actual RUL (\(RUL_{t}^{act}\)) and the predicted RUL (\(RUL_{t}^{pre}\)) at time t; T is the total time.

The coefficient of determination (R2) measures how much of the information in the data the model captures, indicating whether the model explains variation beyond the mean value, as indicated in (17):

$$R^{2} = 1 - \frac{{\sum\limits_{t = 1}^{T} {(er_{t} )^{2} } }}{{\sum\limits_{t = 1}^{T} {\left( {RUL_{t}^{act} - \frac{1}{T}\sum\limits_{t = 1}^{T} {RUL_{t}^{act} } } \right)^{2} } }}$$
(17)
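Eqs. (15)–(17) can be computed directly; the example values are illustrative:

```python
import numpy as np

def rul_metrics(rul_act, rul_pre):
    """MAE, RMSE (Eqs. 15-16) and R^2 (Eq. 17) for RUL predictions."""
    er = rul_act - rul_pre
    mae = np.mean(np.abs(er))
    rmse = np.sqrt(np.mean(er ** 2))
    ss_res = np.sum(er ** 2)
    ss_tot = np.sum((rul_act - rul_act.mean()) ** 2)
    return mae, rmse, 1.0 - ss_res / ss_tot

# Illustrative RUL percentages (actual vs. predicted)
act = np.array([100.0, 75.0, 50.0, 25.0, 0.0])
pre = np.array([90.0, 80.0, 45.0, 30.0, 5.0])
mae, rmse, r2 = rul_metrics(act, pre)
```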

\(A_{t}\) is an asymmetric function, as indicated in (18). As shown in Fig. 5, overestimation is punished more severely than underestimation: the worst outcome of underestimation is replacing equipment early, whereas overestimation can lead to severe mechanical accidents. Finally, the predictive model scoring function is defined in (19):

$$A_{t} = \left\{ {\begin{array}{*{20}l} {\exp \left( { - \ln (0.6) \times \left( {\frac{{er_{t} }}{10}} \right)} \right),} \hfill & {\quad er_{t} \le 0} \hfill \\ {\exp \left( {\ln (0.6) \times \left( {\frac{{er_{t} }}{40}} \right)} \right),} \hfill & {\quad er_{t} \ge 0} \hfill \\ \end{array} } \right.$$
(18)
$$Score = w_{1} \frac{1}{m}\sum\limits_{t = 1}^{m} {A_{t} } + w_{2} \frac{1}{T - m}\sum\limits_{t = m + 1}^{T} {A_{t} }$$
(19)
Fig. 5

\(A_{t}\) function.

In (19), m marks the end of the early stage; \(w_{1}\) and \(w_{2}\) are the early- and late-stage weights, respectively.

In real scenarios, the accuracy of late-stage RUL predictions is more critical than that of early-stage ones, so more weight is assigned to the late-stage evaluation. We set \(w_{1} = 0.35\), \(w_{2} = 0.65\), and \(m = \frac{T}{2}\).
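Eqs. (18) and (19) with these weight settings can be sketched as follows; note that a perfect prediction yields a score of 1, and an overestimation of 10 time units already drops \(A_t\) to 0.6, whereas an underestimation must reach 40 units to do the same:

```python
import numpy as np

def score(rul_act, rul_pre, w1=0.35, w2=0.65):
    """Asymmetric scoring (Eqs. 18-19): over-estimation (er < 0) is
    penalized more heavily than under-estimation (er > 0)."""
    er = rul_act - rul_pre
    A = np.where(er <= 0,
                 np.exp(-np.log(0.6) * er / 10.0),   # fast decay: overestimation
                 np.exp(np.log(0.6) * er / 40.0))    # slow decay: underestimation
    m = len(er) // 2                                 # early/late boundary, m = T/2
    return w1 * A[:m].mean() + w2 * A[m:].mean()

perfect = score(np.array([100.0, 75.0, 50.0, 25.0]),
                np.array([100.0, 75.0, 50.0, 25.0]))
```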

Dataset and parameter setting

This study analyzes bearing faults using the dataset published by the PHM 2012 Data Challenge28, collected on an accelerated aging platform that gathers vibration data, as shown in Fig. 6. The experiments in this article use data from two operating conditions, listed in Table 1.

Fig. 6

PRONOSTIA bearing accelerated degradation test data acquisition platform.

Table 1 Operating conditions and bearing numbers.

Train/Test Split Protocol. We adopt a leave-one-out protocol within the same operating condition. When testing a bearing in condition 1 (e.g., B1-1), the training set consists exclusively of the remaining condition-1 bearings (B1-2–B1-7); no condition-2 bearings are used. Likewise, when testing a bearing in condition 2 (e.g., B2-4), the training set consists of the remaining condition-2 bearings (B2-1–B2-3 and B2-5–B2-7). We run seven leave-one-out experiments per condition (14 in total) and report the best checkpoint per run.

This article utilizes vertical and horizontal vibration signals; each recorded vibration signal is split into samples of 0.1 s duration. The output label is the percentage of each bearing’s life, with the actual RUL of each bearing standardized to the range 0–100%. Equation (20) normalizes the actual RUL:

$$RUL_{t}^{act} = \frac{{y_{t} }}{T} \times 100$$
(20)

In the above equation, \(y_{t}\) is the actual remaining life at time t.

The entire neural network was trained using the Keras framework, with the mean squared error (MSE) as the training loss function. The network’s learning process is as follows:

Inputs: Training data X and label Y.

Output: Network weight \(W_{network}\).

Step 1 The feature compression fusion module initially extracts X, which reduces the feature size to obtain a compact feature \(X_{f}\);

Step 2 We adaptively adjust the feature response \(X_{f}\) to obtain an attention-weighted feature \(X^{\prime}_{f}\) ;

Step 3 \(X^{\prime}_{f}\) adds sinusoidal position coding to obtain the feature \(X_{PE}\) with global position information.

Step 4 The stacked GDCC module further learns \(X_{PE}\) and uses it for the final RUL prediction.

Step 5 We update the network weight \(W_{network}\) using the gradient descent algorithm until the iteration terminates. The algorithm ends.
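The five steps above can be sketched as a toy end-to-end pipeline; every stage here is a simplified stand-in rather than the actual learned module, and the gradient step uses a stub linear head, so all names and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stages standing in for the real learned modules.
compress = lambda X: X[:, ::2]              # Step 1: feature compression fusion
mseu_attn = lambda Xf: Xf * 0.9             # Step 2: attention weighting (stub)

def add_pe(Xf):                             # Step 3: sinusoidal position encoding
    T, d = Xf.shape
    pos, i = np.arange(T)[:, None], np.arange(d)[None, :]
    return Xf + np.where(i % 2 == 0,
                         np.sin(pos / 10000 ** (i / d)),
                         np.cos(pos / 10000 ** ((i - 1) / d)))

gdcc_stack = lambda Xpe, W: Xpe @ W         # Step 4: stacked GDCC (linear stub)
predict = lambda H: H.mean(axis=1)          # prediction head -> RUL percentage

def train_step(X, Y, W, lr=1e-2):
    """Step 5: one gradient-descent update of the stub weight W (MSE loss)."""
    Xpe = add_pe(mseu_attn(compress(X)))
    err = predict(gdcc_stack(Xpe, W)) - Y
    grad = Xpe.T @ np.outer(err, np.ones(W.shape[1])) / (len(Y) * W.shape[1])
    return W - lr * grad

X, Y = rng.standard_normal((16, 8)), rng.random(16)
W = rng.standard_normal((4, 4))
W_new = train_step(X, Y, W)
```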

Table 2 lists the parameter settings.

Table 2 Model structure parameter settings.

In Table 2, filters, strides, and l2 denote the number of convolution kernels, the convolution step size, and the l2 regularization coefficient, respectively; pool_size is the pooling window size; use_bias = False indicates no bias term; and lr is the learning rate.

Experimental design and analysis

We experimented from the following four perspectives, analyzing and comparing the results. Each experiment uses a consistent division of data into training and test sets. We performed 7 experiments for each working condition, 14 in total, retaining the best weights during training. Specifically, we use a within-condition leave-one-out split: when one bearing serves as the test set, the remaining bearings under the same condition form the training set; data from other conditions are not used.

  1. We verify the method’s effectiveness and compare it with other advanced methods29,30.

  2. Experiment 1 alone cannot validate the MSEU and sinusoidal position coding designs, so ablation studies are used to analyze them quantitatively.

  3. GDCC is compared with benchmark methods mentioned in the literature30, such as CNN and TCN.

  4. The outputs of key modules are visualized.

Comparative experiment

This paper compares the proposed bearing RUL prediction algorithm with advanced algorithms, as shown in Table 3.

Table 3 Predictive evaluation of each test bearing.

The proposed algorithm’s results are visualized using bearings B1-1, B1-3, B2-4, and B2-6. Figure 7 shows the RUL prediction results of the proposed algorithm under the two working conditions, displaying the true RUL label, the predicted value, and error bounds of ±15% around the true label.

Fig. 7

The proposed method predicts the result on the bearing. (A) B1-1; (B) B1-2; (C) B2-4; (D) B2-6.

Table 3 shows that the prediction model presented in this paper outperforms several other advanced RUL prediction algorithms across the performance evaluation indicators. The performance of the recent CNN-BiLSTM31 is also significantly lower than that of the proposed method. For test bearing B1-1, although the proposed method ties the method of literature30 on Score, its MAE and RMSE are significantly lower than those of the other models, and its R2 is strong. As shown in Fig. 7, the predicted bearing RUL follows its true label. This evidence indicates that the presented algorithm provides more accurate RUL predictions.

Ablation experiment

Ablation experiments are conducted to quantify the impact of different components on the prediction results.

Table 4 presents the performance of each ablation network in predicting the RUL of different bearings after sequentially removing the sinusoidal position coding and the MSEU from the network. The test bearings B1-1, B1-3, B2-4, and B2-6 are used as examples.

Table 4 Ablation experiments.

Table 4 shows that sinusoidal position coding and the MSEU improve RUL prediction accuracy to varying degrees. For bearing B1-1, removing the sinusoidal position coding first increased MAE by 52.5% and RMSE by 52.3%; subsequently removing the MSEU increased MAE by a further 26.2% and RMSE by 19.8%. The prediction error thus grows as the ablation progresses, suggesting that RUL prediction benefits both from the global order relationships injected by the sinusoidal position coding and from the MSEU’s adaptive recalibration of feature responses from local and global perspectives.

Figure 8 shows the RUL prediction results when the network removes the sinusoidal position code. Figure 9 shows the results of the RUL prediction when the network removes the sinusoidal position coding and MSEU.

Fig. 8

Remove the sinusoidal position code. (A) B1-1; (B) B1-2; (C) B2-4; (D) B2-6.

Fig. 9

Remove sinusoidal position coding and MSEU. (A) B1-1; (B) B1-2; (C) B2-4; (D) B2-6.

Benchmark model experiment

This study compared GDCC with two popular time series forecasting benchmark models, conventional CNN and TCN, as shown in Fig. 10. The benchmark models follow the traditional supervised learning paradigm. The experiments show that the benchmark models yield larger prediction errors and that GDCC outperforms both. These results indicate that DCC combined with a gating mechanism can flexibly enlarge the temporal receptive field, allowing information to flow more efficiently through the deep network.

Fig. 10

Evaluation of RUL prediction results for GDCC and benchmark models. (A) Mean absolute error; (B) Score.

Visualization and analysis

Visualizing the features extracted by key modules further confirms the validity of the GDCC network based on MSEU and sinusoidal position coding. Figure 11 shows the feature maps for the first sample of test set B1-1. For clarity, only the first four features of each module’s output are drawn. As shown in Fig. 11, after MSEU adaptive recalibration the features are relatively concentrated and uncorrelated clutter is filtered out. After sinusoidal position coding, the extracted features gradually separate as the model deepens, then aggregate after GDCC module 1, and become progressively ordered through the outputs of GDCC modules 2 and 3, reflecting the network’s strong feature extraction ability.

Fig. 11

Output characteristics of the visualization network module. (A) Feature compression fusion; (B) MSEU; (C) Sinusoidal position encoding; (D) GDCC module 1; (E) GDCC module 2; (F) GDCC module 3.

Generalization experiment on XJTU-SY dataset

To further verify the generalization capability and robustness of the proposed GDCC-MSEU framework, we conducted supplementary experiments on the XJTU-SY bearing dataset32. Unlike the PHM 2012 dataset, the XJTU-SY dataset contains real-world run-to-failure data collected under three different operating conditions (35Hz/12kN, 37.5Hz/11kN, 40Hz/10kN), characterized by significantly higher noise levels and complex degradation patterns. This poses a greater challenge for RUL prediction models. We adopted the same leave-one-out cross-validation protocol and compared our method with the advanced baselines: DANN29, TCN-SA30, and CNN-BiLSTM31.

The comparative results for all 15 test bearings are summarized in Table 5. As observed, the proposed method consistently achieves the lowest MAE and RMSE across most test cases. Specifically, under the high-speed 40 Hz condition (Bearings 3-1 to 3-5), where rapid degradation limits the available training samples, our method outperforms the second-best model (CNN-BiLSTM) by reducing MAE by approximately 15% on average. Although the overall prediction errors on the XJTU-SY dataset are higher than on the PHM 2012 dataset, owing to the data’s inherent complexity and noise, the proposed framework maintains a relatively high Score and R2, demonstrating its superior ability to capture the underlying degradation trend amidst severe noise. These results confirm that integrating the gated mechanism and multi-scale encoding effectively enhances feature extraction, ensuring robust performance under varying and harsh operating conditions.

Table 5 Predictive evaluation of each test bearing.

As shown in Table 5, the proposed GDCC framework achieves the best overall balance of accuracy and stability, with an average MAE of 26.15, compared with DANN (23.92, but markedly less stable), CNN-BiLSTM (30.43), and TCN-SA (32.16). Notably, our method demonstrates superior stability under the complex operating conditions (e.g., Conditions 2 and 3), achieving the highest R2 values on challenging bearings such as B1-2 (0.68) and B2-4 (0.64). In contrast, the baselines often exhibit negative R2 values, indicating poor fitting on unseen data. These results confirm that the proposed GDCC network generalizes well across datasets and working conditions.

Conclusion

This study proposes a new deep prediction network, the GDCC network, based on MSEU and sinusoidal position coding. Because feature compression fusion lacks an effective attention mechanism for identifying differences between feature variables, the MSEU is designed to recalibrate multiple feature variables from both local and global perspectives, effectively distinguishing features after compression and fusion. Furthermore, traditional convolutional structures adapt poorly to varying time windows, and deep convolutional stacks are prone to gradient vanishing when processing long series; the GDCC module addresses these issues. Dilation factors flexibly accommodate various time window sizes, and stacking GDCC modules yields a sufficient receptive field without rapid gradient vanishing. Sinusoidal position coding embedded in the network allows it to relate information separated by long steps, enhancing the GDCC’s “position sense.” Together, these components improve the accuracy of bearing RUL prediction, as verified through comparative experiments, ablation studies, and visual analysis. However, the dilation factor may cause loss of time series feature information. In future research, we plan to explore ways to prevent such information loss in time series modeling with dilated convolution.