Introduction

Rolling bearing remaining useful life (RUL) prediction is a crucial technology for maximizing continuous running time and reducing equipment maintenance costs1,2. However, the actual crack propagation of rolling bearings is exceptionally irregular, making it challenging to construct an effective failure-mechanism model3. In contrast, data-driven approaches can infer relationships hidden in the data without requiring knowledge of the machine’s explicit failure mechanism4.

Deep learning techniques, particularly Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks5,6,7,8,9,10,11, have demonstrated superior representational learning capabilities compared to traditional machine learning methods. LSTM effectively addresses gradient explosion and vanishing gradient problems faced by earlier RNNs. However, its inherent chain structure precludes parallel processing, leading to computational inefficiencies and slow operation speeds12.

To overcome these efficiency bottlenecks, Convolutional Neural Networks (CNNs) have been widely adopted for their parallel computing capabilities. For instance, Wang et al.13 combined a deep convolution autoencoder with CNN for RUL prediction, while Ma et al.14 proposed a channel attention CNN algorithm. Bai et al.15 further evaluated convolutional structures for time series modeling. Despite these advancements, traditional CNNs suffer from two significant disadvantages: (1) They rely on fixed geometric structures, lacking the flexibility to adapt to varying time window sizes; (2) They require deep stacking to achieve sufficient receptive fields for global feature modeling, which significantly increases the risk of gradient vanishing in deep networks.

Beyond conventional CNNs, advanced deep learning architectures leveraging attention mechanisms and graph neural networks (GNNs) have recently emerged to capture more complex dependencies. For instance, Zhang et al.16 proposed a self-supervised graph feature enhancement method with scale attention to improve node-level representation. Zhang et al.17 developed a graph feature dynamic fusion framework based on multi-head attention, which realizes effective fault diagnosis of large rotating machinery with multi-sensor data. Furthermore, graph-driven calibration frameworks18,19 have been introduced to address distribution shifts in RUL prediction across different operating conditions. While these methods offer strong global modeling capabilities, they often entail high computational complexity and require explicit topology construction.

To address these limitations, Temporal Convolutional Networks (TCNs)20,21,22,23,24 introduced dilated convolutions to exponentially expand the receptive field without increasing depth excessively, utilizing residual structures to mitigate gradient vanishing. Shang et al.24 further enhanced TCNs by incorporating multiple self-attention layers to adaptively assign feature weights. However, existing TCN-based approaches still face challenges. Standard convolutions are translation-invariant and cannot explicitly perceive the global sequence order between distant time steps. Furthermore, most methods lack an effective mechanism to adaptively calibrate feature responses from both local (time-step) and global (channel) perspectives to filter noise in complex degradation signals.

Inspired by gated convolutional networks25, this study proposes a novel RUL prediction framework integrating Gated Dilated Causal Convolution (GDCC) with Multi-Scale Encoding Units (MSEU). The main contributions of this paper are summarized as follows:

  1. A novel Gated Dilated Causal Convolution (GDCC) network is proposed. We construct a GDCC module in which the linear Dilated Causal Convolution (DCC) provides a direct linear path for gradients, significantly reducing the probability of gradient vanishing in deep networks. A gating mechanism controls the information flow, preserving the network’s nonlinear modeling capability and allowing it to adapt dynamically to information changes during the degradation process. This structure differs from prior TCN variants by integrating gating directly into the dilated causal branch while maintaining stability under large receptive fields.

  2. A Multi-Scale Encoding Unit (MSEU) is designed for adaptive feature calibration. The MSEU cell establishes correlations between multiple feature variables from both local and global perspectives. It assesses the information content of each feature map and adaptively recalibrates feature responses, suppressing useless features and uncorrelated noise and enhancing the predictive network’s ability to resolve informative degradation patterns. Unlike conventional channel attention blocks, the MSEU is strategically positioned after feature compression to better target multi-sensor fusion.

  3. Sinusoidal positional encoding is introduced to enforce global temporal order. Because standard convolution perceives only local order information, we inject explicit location information into the features via sinusoidal positional encoding. This adds global order relationships between distant time steps, improving the accuracy of time series predictions. We place the positional encoding after the MSEU (rather than at the input) to avoid information loss under low channel counts and to align with the downstream stacked GDCC modules.

  4. State-of-the-art performance is demonstrated. Comprehensive comparative experiments and ablation studies on the PHM 2012 and XJTU-SY datasets verify the effectiveness of the proposed method. The results demonstrate that our framework outperforms advanced baselines, including TCN-SA and CNN-BiLSTM, in prediction accuracy and robustness under various operating conditions.

Prerequisites

Description of the problem

This article frames multivariate time series RUL prediction as a sequence-to-sequence problem. A time series signal \(X_{I} = (x_{1} ,x_{2} ,...,x_{T} )\) is given as the input series to predict the output \(Y_{O} = (y_{1} ,y_{2} ,...,y_{T} )\) at each time step. The sequence modeling network z establishes a mapping between the monitoring data and the mechanical degradation process:

$$(y_{1} ,y_{2} ,...,y_{T} ) = z(x_{1} ,x_{2} ,...,x_{T} )$$
(1)

Causal convolution and dilated convolution

Causal convolution follows two principles: “the input and output scales remain unchanged” and “future data is not visible.” In this framework, the known condition at time t is the set of all preceding time steps, so the model’s prediction \({\text{P}}(x_{t} |x_{t - 1} ,x_{t - 2} ,...,x_{1} )\) cannot depend on any future time steps. The causal convolution at \(x_{t}\) is:

$$(F * X)(x_{t} ) = \sum\limits_{k = 1}^{K} {f_{k} x_{t - K + k} }$$
(2)

where \(F = (f_{1} ,f_{2} ,...,f_{K} )\) is the filter; K is the filter size; \(*\) is the convolution operation.
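To make Eq. (2) concrete, the following minimal NumPy sketch implements causal convolution with zero left-padding so the output length matches the input; the filter values and test signal are illustrative assumptions:

```python
import numpy as np

def causal_conv(x, f):
    """Causal 1-D convolution (Eq. 2): output at t uses only x[t-K+1..t].
    The input is left-padded with zeros so input/output lengths match."""
    K = len(f)
    x_pad = np.concatenate([np.zeros(K - 1), x])  # "future data is not visible"
    return np.array([np.dot(f, x_pad[t:t + K]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
f = np.array([0.5, 0.5])            # K = 2: a simple two-tap average
y = causal_conv(x, f)
# y[0] depends only on x[0]; y[t] depends on x[t-1] and x[t]
```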

However, causal convolution has a drawback: tracing information further back in time requires more network layers, which can lead to vanishing gradients. Dilated convolution solves this problem.

Dilated convolution skips parts of the input, allowing the convolution kernel to cover a region larger than its own length. With dilated convolution, convolutional networks can achieve a large receptive field with fewer layers. At \(x_{t}\), the dilated convolution with dilation factor d is:

$$(F * {}_{d}X)(x_{t} ) = \sum\limits_{k = 1}^{K} {f_{k} x_{t - (K - k)d} }$$
(3)

When the dilation factor \(d = 1\), dilated convolution reduces to ordinary convolution. Both K and d can enlarge the receptive field (of size \((K - 1)d + 1\)). In practice, the dilation factor usually grows exponentially with base 2 as the number of layers increases.
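A minimal sketch of Eq. (3) and of the receptive-field formula \((K-1)d+1\); the helper names and the example dilation schedule 1, 2, 4, 8 are assumptions for illustration:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Dilated causal convolution (Eq. 3): y_t = sum_k f_k * x_{t-(K-k)d}."""
    K = len(f)
    pad = (K - 1) * d
    x_pad = np.concatenate([np.zeros(pad), x])
    return np.array([sum(f[k] * x_pad[t + k * d] for k in range(K))
                     for t in range(len(x))])

def receptive_field(K, dilations):
    """Stacked layers each add (K - 1) * d extra past steps to the view."""
    return 1 + sum((K - 1) * d for d in dilations)

# With d doubling per layer, four kernel-size-3 layers already see 31 steps.
rf = receptive_field(3, [1, 2, 4, 8])
```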

Squeeze excitation block

The squeeze excitation block explicitly models the correlation between feature channels26. It learns the importance of each channel’s characteristics, and each output channel is reweighted by a predicted scalar weight. The block is shown in Fig. 1.

Fig. 1

Squeeze excitation block.

Figure 1 presents the squeeze as follows:

$$Z = S(T) = \frac{1}{H \times W}\sum\limits_{i = 1}^{H} {\sum\limits_{j = 1}^{W} {T_{C} (i,j)} }$$
(4)

The excitation is expressed as follows:

$$E\left( {Z,W} \right) = \sigma \left( {W_{2} \delta (W_{1} Z)} \right)$$
(5)

where T is the feature representation; H, W, and C are the height, width, and channel of T, respectively; \(\delta\) and \(\sigma\) represent the rectified linear unit (ReLU) and sigmoid activation functions, respectively; \(W_{1}\) and \(W_{2}\) are fully connected layers; S is the squeeze operation; E is the excitation operation.
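Eqs. (4) and (5) can be sketched as follows; the random W1/W2 matrices stand in for the learned fully connected layers, and the reduction ratio r is an assumed hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_block(T, r=2):
    """Squeeze-and-excitation over an (H, W, C) feature map (Eqs. 4-5).
    W1/W2 are random stand-ins here; in practice they are learned."""
    H, W, C = T.shape
    z = T.mean(axis=(0, 1))                       # squeeze: (C,) channel summary
    W1 = rng.standard_normal((C // r, C))         # reduction FC layer
    W2 = rng.standard_normal((C, C // r))         # expansion FC layer
    relu = lambda a: np.maximum(a, 0.0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    s = sigmoid(W2 @ relu(W1 @ z))                # excitation: (C,) weights in (0, 1)
    return T * s                                  # recalibrate each channel

T = rng.standard_normal((4, 4, 8))
T_out = se_block(T)
```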

Proposed methodology

This study introduces a framework, as illustrated in Fig. 2, whose components jointly enhance prediction accuracy. Specifically, feature compression fusion reduces the feature size and compacts the extracted features. The MSEU adaptively recalibrates feature responses, enhancing the network’s ability to resolve informative patterns. Sinusoidal position encoding is added to the hidden-layer features to inject global order relationships. GDCC modules are built and stacked to learn the sequence features of the preceding layer. Finally, the network is trained with an optimization algorithm to obtain the optimal weights.

Fig. 2

Overall framework of the proposed rolling bearing RUL prediction algorithm.

The MSEU is proposed because feature compression fusion lacks a precise learning mechanism to identify differences between feature variables. As the C structure in Fig. 2 shows, it attends to informative variables and suppresses uncorrelated clutter by refining the input features twice, helping information flow through the network. The first step learns, from a local perspective, the dependence among feature variables at each time step. The second step aggregates the feature variables over all time steps and learns, from a global perspective, the dependence between channels. Finally, the method recalibrates the features and feature maps. The implementation details of the MSEU are summarized in Algorithm 1.

Algorithm 1

MSEU Process.

where N is the number of convolution kernels of the previous convolutional layer. The convolution kernels \(F_{1,N/2}\) and \(F_{1,N}\) model the correlation between feature variables at each time step; M is average pooling; and the convolution kernels \(F_{2,N/2}\) and \(F_{2,N}\) model the correlations between channels under the global distribution.

Note that the conventional one-dimensional convolutions used in steps 1 and 2 are not DCC, because they do not model associations across time steps. A single time step generally involves few feature variables, so a large receptive field is unnecessary there.
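Since Algorithm 1 is not reproduced here, the following is a speculative NumPy sketch of the two-stage recalibration described above; the bottleneck shapes (N to N/2 and back to N), the activations, and the random stand-in weights are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
relu = lambda a: np.maximum(a, 0.0)

def mseu(X):
    """Speculative MSEU sketch for X of shape (T, N).
    Step 1 (local): a per-time-step bottleneck (N -> N/2 -> N) yields a
    weight for every feature variable at every step.
    Step 2 (global): average-pool over time, then a second bottleneck
    yields one weight per channel. Weights are random stand-ins."""
    T, N = X.shape
    W1a, W1b = rng.standard_normal((N // 2, N)), rng.standard_normal((N, N // 2))
    W2a, W2b = rng.standard_normal((N // 2, N)), rng.standard_normal((N, N // 2))
    local = sigmoid(relu(X @ W1a.T) @ W1b.T)      # (T, N) per-step weights
    z = X.mean(axis=0)                            # global average pooling M
    glob = sigmoid(W2b @ relu(W2a @ z))           # (N,) per-channel weights
    return X * local * glob                       # twice-refined features

X = rng.standard_normal((16, 8))
X_out = mseu(X)
```

Both weight maps lie in (0, 1), so the unit can only attenuate features, which matches its role of suppressing uncorrelated clutter.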

Sinusoidal position encoding

The order relationship between time steps in time series data often affects the final prediction results because the information carried by the latest time step is more conducive to prediction than the early information. Convolutional kernels can extract local sequential information from sequences but cannot obtain information between distant time steps.

This article adds a position vector to \(X^{\prime}_{f}\) to enhance the “position sense” of GDCC. The position vector is constructed following the scheme of27:

$$PE = \left\{ \begin{gathered} PE_{(pos,2i)} = \sin \left( {\frac{pos}{{10000^{{\frac{2i}{{d_{m} }}}} }}} \right) \hfill \\ PE_{(pos,2i + 1)} = \cos \left( {\frac{pos}{{10000^{{\frac{2i}{{d_{\text{m}} }}}} }}} \right) \hfill \\ \end{gathered} \right.$$
(9)

where pos is an integer in \([0,T_{f} - 1]\); \(T_{f}\) is the total number of time steps of \(X^{\prime}_{f}\); \(d_{m}\) is the dimension of the position vector, equal to the dimension of \(X^{\prime}_{f}\); and i is an integer in \(\left[ {0,\frac{{d_{m} }}{2} - 1} \right]\).
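Eq. (9) can be computed in vectorized form as below; the function name is an assumption:

```python
import numpy as np

def sinusoidal_pe(T_f, d_m):
    """Sinusoidal position encoding (Eq. 9): one d_m-dimensional vector
    per position pos = 0 .. T_f - 1."""
    pos = np.arange(T_f)[:, None]                  # (T_f, 1)
    i = np.arange(d_m // 2)[None, :]               # (1, d_m/2)
    angle = pos / (10000.0 ** (2 * i / d_m))
    pe = np.zeros((T_f, d_m))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(2560, 16)   # the setting visualized in Fig. 3
```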

In natural language processing, position encoding is usually applied immediately after the input. Here, however, it is placed after the MSEU for two reasons. First, the channel dimension C of the input data equals 2; as Fig. 3 shows, a 2-dimensional position vector would inevitably lose significant location information. Second, the MSEU does not involve sequential relationships between time steps, unlike the stacked GDCC modules, which learn across them. This paper therefore places the sinusoidal position encoding after the MSEU and before the stacked GDCC modules, shown as the D structure in Fig. 2.

Fig. 3

Sinusoidal position encoding (pos = 2560, \(d_{m}\) = 16).

The output of the sinusoidal position encoding module is as follows:

$$X_{PE} = X^{\prime}_{f} + PE$$
(10)

Stack GDCC modules and predict RUL

The vector sequence \(X_{PE}\) is processed using DCC. Inspired by the gate mechanism, this section adds a gating unit to the DCC, as indicated in (11):

$$\begin{gathered} g(X_{PE} ) = \sigma (DCC_{1} (X_{PE} )) \\ G(X_{PE} ) = DCC_{2} (X_{PE} ) \times g(X_{PE} ) \\ \end{gathered}$$
(11)

This structure is referred to as GDCC. \(DCC_{1}\) and \(DCC_{2}\) have the same number of convolution kernels, window size, and dilation factor; however, they are independent, i.e., their weights are not shared. The \(\sigma\) function preserves the network’s nonlinearity, so the network does not collapse into a purely linear model. The range of \(\sigma\) is (0,1): intuitively, each output of \(DCC_{2}\) passes through a “valve” that controls the information flow to the next neuron. During training, useful information consistent with historical degradation data is retained.

One advantage of GDCC is that it largely avoids the vanishing-gradient problem: \(DCC_{2}\) applies no activation function, so its derivative is constant. This provides a linear path for the gradient, alleviating gradient vanishing in deep networks.

We further add the input through a residual structure, as indicated in (12):

$$G(X_{PE} ) = G(X_{PE} ) + X_{PE}$$
(12)

When the dimension sizes of the input and output do not match, a convolution \(DCC_{3}\) is applied to the input instead, as indicated in (13):

$$G(X_{PE} ) = G(X_{PE} ) + DCC_{3} (X_{PE} )$$
(13)

Combining (11) and (13) and rewriting into an equivalent form makes the information flow clearer, as indicated in (14):

$$\begin{aligned} G(X_{{PE}} ) = & DCC_{2} (X_{{PE}} ) \times g(X_{{PE}} ) + DCC_{3} (X_{{PE}} ) \\ & = (DCC_{2} (X_{{PE}} ) - DCC_{3} (X_{{PE}} )) \times g(X_{{PE}} ) + DCC_{3} (X_{{PE}} ) \\ & = DCC_{2} (X_{{PE}} )g(X_{{PE}} ) + DCC_{3} (X_{{PE}} )(1 - g(X_{{PE}} )) \\ \end{aligned}$$
(14)

\(DCC_{2} (X_{PE} )\) is a linear transformation, so \(DCC_{2} (X_{PE} ) - DCC_{3} (X_{PE} )\) is itself an equivalent linear transformation. In other words, during training the network can learn \(DCC_{2} (X_{PE} ) - DCC_{3} (X_{PE} )\) just as it would learn \(DCC_{2} (X_{PE} )\).
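Eqs. (11)–(13) can be sketched with single-channel sequences; the random filters stand in for the learned \(DCC_{1}\), \(DCC_{2}\), and \(DCC_{3}\), and multi-channel details are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def dcc(x, f, d):
    """Linear dilated causal convolution (no activation function)."""
    K, pad = len(f), (len(f) - 1) * d
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(f[k] * xp[t + k * d] for k in range(K))
                     for t in range(len(x))])

def gdcc(x, d=2, K=3):
    """GDCC sketch: gated dilated causal convolution with a residual
    branch (Eqs. 11 and 13). Filters are random stand-ins."""
    f1, f2, f3 = (rng.standard_normal(K) for _ in range(3))
    g = sigmoid(dcc(x, f1, d))          # gate values in (0, 1), Eq. (11)
    G = dcc(x, f2, d) * g               # gated linear path
    return G + dcc(x, f3, d)            # residual DCC_3 branch, Eq. (13)

y = gdcc(rng.standard_normal(32))
```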

In summary, this section proposed a gated DCC and designed a GDCC module, as shown in Fig. 4. Unlike the full pre-activation structure, which uses the ReLU activation function, this section uses the gating mechanism. The specific implementation of the E structure is detailed in Algorithm 2.

Algorithm 2

E structure Process.

Figure 4 shows the structure of the GDCC module (the E structure in Fig. 2):

Fig. 4

GDCC module structure.

Experimental analysis

Experimental instructions

Performance metrics

This study selects the mean absolute error (MAE) and the root mean square error (RMSE) to measure the distance between prediction and true value. MAE and RMSE are defined in (15) and (16):

$$\begin{gathered} er_{t} = RUL_{t}^{act} - RUL_{t}^{pre} \\ MAE = \frac{1}{T}\sum\limits_{t = 1}^{T} {\left| {er_{t} } \right|} \\ \end{gathered}$$
(15)
$$RMSE = \sqrt {\frac{1}{T}\sum\limits_{t = 1}^{T} {(er_{t} )^{2} } }$$
(16)

In (15), \(er_{t}\) is the error between the actual RUL (\(RUL_{t}^{act}\)) and the predicted RUL (\(RUL_{t}^{pre}\)) at time t; T is the total time.

The coefficient of determination (R2) measures how much of the information in the data the model captures, indicating whether the model explains variation beyond the mean value, as indicated in (17):

$$R^{2} = 1 - \frac{{\sum\limits_{t = 1}^{T} {(er_{t} )^{2} } }}{{\sum\limits_{t = 1}^{T} {\left( {RUL_{t}^{act} - \frac{1}{T}\sum\limits_{t = 1}^{T} {RUL_{t}^{act} } } \right)^{2} } }}$$
(17)
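Eqs. (15)–(17) can be computed directly; the example values are illustrative:

```python
import numpy as np

def rul_metrics(rul_act, rul_pre):
    """MAE, RMSE (Eqs. 15-16) and R^2 (Eq. 17) for RUL predictions."""
    er = rul_act - rul_pre
    mae = np.mean(np.abs(er))
    rmse = np.sqrt(np.mean(er ** 2))
    ss_res = np.sum(er ** 2)
    ss_tot = np.sum((rul_act - rul_act.mean()) ** 2)
    return mae, rmse, 1.0 - ss_res / ss_tot

# Illustrative RUL percentages (actual vs. predicted)
act = np.array([100.0, 75.0, 50.0, 25.0, 0.0])
pre = np.array([90.0, 80.0, 45.0, 30.0, 5.0])
mae, rmse, r2 = rul_metrics(act, pre)
```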

\(A_{t}\) is an asymmetric function, as indicated in (18). As shown in Fig. 5, overestimation is punished more severely than underestimation: the worst outcome of underestimation is replacing equipment early, whereas overestimation can lead to severe mechanical accidents. Finally, the predictive model scoring function is defined in (19):

$$A_{t} = \left\{ {\begin{array}{*{20}l} {\exp \left( { - \ln (0.6) \times \left( {\frac{{er_{t} }}{10}} \right)} \right),} \hfill & {\quad er_{t} \le 0} \hfill \\ {\exp \left( {\ln (0.6) \times \left( {\frac{{er_{t} }}{40}} \right)} \right),} \hfill & {\quad er_{t} \ge 0} \hfill \\ \end{array} } \right.$$
(18)
$$Score = w_{1} \frac{1}{m}\sum\limits_{t = 1}^{m} {A_{t} } + w_{2} \frac{1}{T - m}\sum\limits_{t = m + 1}^{T} {A_{t} }$$
(19)
Fig. 5

\(A_{t}\) function.

In (19), m marks the end of the early stage; \(w_{1}\) and \(w_{2}\) are the early- and late-stage weights, respectively.

In real scenarios, the accuracy of late-stage RUL predictions is more critical than that of early-stage ones, so more weight is assigned to the late-stage evaluation. We set \(w_{1} = 0.35\), \(w_{2} = 0.65\), and \(m = \frac{T}{2}\).
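Eqs. (18) and (19) with these weight settings can be sketched as follows; note that a perfect prediction yields a score of 1, and an overestimation of 10 time units already drops \(A_t\) to 0.6, whereas an underestimation must reach 40 units to do the same:

```python
import numpy as np

def score(rul_act, rul_pre, w1=0.35, w2=0.65):
    """Asymmetric scoring (Eqs. 18-19): over-estimation (er < 0) is
    penalized more heavily than under-estimation (er > 0)."""
    er = rul_act - rul_pre
    A = np.where(er <= 0,
                 np.exp(-np.log(0.6) * er / 10.0),   # fast decay: overestimation
                 np.exp(np.log(0.6) * er / 40.0))    # slow decay: underestimation
    m = len(er) // 2                                 # early/late boundary, m = T/2
    return w1 * A[:m].mean() + w2 * A[m:].mean()

perfect = score(np.array([100.0, 75.0, 50.0, 25.0]),
                np.array([100.0, 75.0, 50.0, 25.0]))
```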

Dataset and parameter setting

This study analyzes bearing faults using the dataset published by the PHM 2012 Data Challenge28, collected on an accelerated aging platform that gathers vibration data, as shown in Fig. 6. The experiments in this article use data from two operating conditions, listed in Table 1.

Fig. 6

PRONOSTIA bearing accelerated degradation test data acquisition platform.

Table 1 Operating conditions and bearing numbers.

Train/Test Split Protocol. We adopt a leave-one-out protocol within the same operating condition. When testing a bearing in condition 1 (e.g., B1-1), the training set consists exclusively of the remaining condition-1 bearings (B1-2–B1-7); no condition-2 bearings are used. Likewise, when testing a bearing in condition 2 (e.g., B2-4), the training set consists of the remaining condition-2 bearings (B2-1–B2-3 and B2-5–B2-7). We run seven leave-one-out experiments per condition (14 in total) and report the best checkpoint per run.

This article utilizes vertical and horizontal vibration signals; each recorded vibration signal is split into samples of 0.1 s duration. The output label is the percentage of each bearing’s life, with the actual RUL of each bearing standardized to the range 0–100%. Equation (20) normalizes the actual RUL:

$$RUL_{t}^{act} = \frac{{y_{t} }}{T} \times 100$$
(20)

In the above equation, \(y_{t}\) is the actual remaining life at time t.

The entire neural network was trained using the Keras framework, with the mean squared error (MSE) as the training loss function. The network’s learning process is as follows:

Inputs: Training data X and label Y.

Output: Network weight \(W_{network}\).

Step 1 The feature compression fusion module initially extracts X, which reduces the feature size to obtain a compact feature \(X_{f}\);

Step 2 We adaptively adjust the feature response \(X_{f}\) to obtain an attention-weighted feature \(X^{\prime}_{f}\) ;

Step 3 \(X^{\prime}_{f}\) adds sinusoidal position coding to obtain the feature \(X_{PE}\) with global position information.

Step 4 The stacked GDCC module further learns \(X_{PE}\) and uses it for the final RUL prediction.

Step 5 We update the network weight \(W_{network}\) using the gradient descent algorithm until the iteration terminates. The algorithm ends.
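The five steps above can be sketched as a toy end-to-end pipeline; every stage here is a simplified stand-in rather than the actual learned module, and the gradient step uses a stub linear head, so all names and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stages standing in for the real learned modules.
compress = lambda X: X[:, ::2]              # Step 1: feature compression fusion
mseu_attn = lambda Xf: Xf * 0.9             # Step 2: attention weighting (stub)

def add_pe(Xf):                             # Step 3: sinusoidal position encoding
    T, d = Xf.shape
    pos, i = np.arange(T)[:, None], np.arange(d)[None, :]
    return Xf + np.where(i % 2 == 0,
                         np.sin(pos / 10000 ** (i / d)),
                         np.cos(pos / 10000 ** ((i - 1) / d)))

gdcc_stack = lambda Xpe, W: Xpe @ W         # Step 4: stacked GDCC (linear stub)
predict = lambda H: H.mean(axis=1)          # prediction head -> RUL percentage

def train_step(X, Y, W, lr=1e-2):
    """Step 5: one gradient-descent update of the stub weight W (MSE loss)."""
    Xpe = add_pe(mseu_attn(compress(X)))
    err = predict(gdcc_stack(Xpe, W)) - Y
    grad = Xpe.T @ np.outer(err, np.ones(W.shape[1])) / (len(Y) * W.shape[1])
    return W - lr * grad

X, Y = rng.standard_normal((16, 8)), rng.random(16)
W = rng.standard_normal((4, 4))
W_new = train_step(X, Y, W)
```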

Table 2 lists the parameter settings.

Table 2 Model structure parameter settings.

In Table 2, filters, strides, and l2 denote the number of convolution kernels, the convolution step size, and the l2 regularization coefficient, respectively; pool_size is the pooling window size; use_bias = False indicates no bias term; and lr is the learning rate.

Experimental design and analysis

We experimented from the following four perspectives, analyzing and comparing the results. Each experiment uses a consistent division of data into training and test sets. We performed 7 experiments for each working condition, 14 in total, retaining the best weights during training. Specifically, we use a within-condition leave-one-out split: when one bearing serves as the test set, the remaining bearings under the same condition form the training set; data from other conditions are not used.

  1. We verify the method’s effectiveness and compare it with other advanced methods29,30.

  2. Experiment 1 alone cannot validate the MSEU and sinusoidal position coding designs, so ablation studies are used to analyze them quantitatively.

  3. GDCC is compared with benchmark methods mentioned in the literature30, such as CNN and TCN.

  4. The outputs of key modules are visualized.

Comparative experiment

This paper compares the proposed bearing RUL prediction algorithm with advanced algorithms, as shown in Table 3.

Table 3 Predictive evaluation of each test bearing.

The proposed algorithm’s results are visualized using bearings B1-1, B1-3, B2-4, and B2-6. Figure 7 shows the RUL prediction results of the proposed algorithm under the two working conditions, displaying the true RUL label, the predicted value, and error bounds of ±15% around the true label.

Fig. 7

The proposed method predicts the result on the bearing. (A) B1-1; (B) B1-2; (C) B2-4; (D) B2-6.

Table 3 shows that the prediction model presented in this paper outperforms several other advanced RUL prediction algorithms across the performance evaluation indicators. The performance of the recent CNN-BiLSTM31 is also significantly lower than that of the proposed method. For test bearing B1-1, although the proposed method ties the method of literature30 on Score, its MAE and RMSE are significantly lower than those of the other models, and its R2 is strong. As shown in Fig. 7, the predicted bearing RUL follows its true label. This evidence indicates that the presented algorithm provides more accurate RUL predictions.

Ablation experiment

Ablation experiments are conducted to quantify the impact of different components on the prediction results.

Table 4 presents the performance of each ablation network in predicting the RUL of different bearings after sequentially removing the sinusoidal position coding and the MSEU from the network. The test bearings B1-1, B1-3, B2-4, and B2-6 are used as examples.

Table 4 Ablation experiments.

Table 4 shows that sinusoidal position coding and the MSEU improve RUL prediction accuracy to varying degrees. For bearing B1-1, removing the sinusoidal position coding first increased MAE by 52.5% and RMSE by 52.3%; subsequently removing the MSEU increased MAE by a further 26.2% and RMSE by 19.8%. The prediction error thus grows as the ablation progresses, suggesting that RUL prediction benefits both from the global order relationships injected by the sinusoidal position coding and from the MSEU’s adaptive recalibration of feature responses from local and global perspectives.

Figure 8 shows the RUL prediction results when the network removes the sinusoidal position code. Figure 9 shows the results of the RUL prediction when the network removes the sinusoidal position coding and MSEU.

Fig. 8

Remove the sinusoidal position code. (A) B1-1; (B) B1-2; (C) B2-4; (D) B2-6.

Fig. 9

Remove sinusoidal position coding and MSEU. (A) B1-1; (B) B1-2; (C) B2-4; (D) B2-6.

Benchmark model experiment

This study compared GDCC with two popular time series forecasting benchmark models, conventional CNN and TCN, as shown in Fig. 10. The benchmark models follow the traditional supervised learning paradigm. The experiments show that the benchmark models yield larger prediction errors and that GDCC outperforms both. These results indicate that DCC combined with a gating mechanism can flexibly enlarge the temporal receptive field, allowing information to flow more efficiently through the deep network.

Fig. 10

Evaluation of RUL prediction results for GDCC and benchmark models. (A) Mean absolute error; (B) Score.

Visualization and analysis

Visualizing the features extracted by key modules further confirms the validity of the GDCC network based on MSEU and sinusoidal position coding. Figure 11 shows the feature maps for the first sample of test set B1-1. For clarity, only the first four features of each module’s output are drawn. As shown in Fig. 11, after MSEU adaptive recalibration the features are relatively concentrated and uncorrelated clutter is filtered out. After sinusoidal position coding, the extracted features gradually separate as the model deepens, then aggregate after GDCC module 1, and become progressively ordered through the outputs of GDCC modules 2 and 3, reflecting the network’s strong feature extraction ability.

Fig. 11

Output characteristics of the visualization network module. (A) Feature compression fusion; (B) MSEU; (C) Sinusoidal position encoding; (D) GDCC module 1; (E) GDCC module 2; (F) GDCC module 3.

Generalization experiment on XJTU-SY dataset

To further verify the generalization capability and robustness of the proposed GDCC-MSEU framework, we conducted supplementary experiments on the XJTU-SY bearing dataset32. Unlike the PHM 2012 dataset, the XJTU-SY dataset contains real-world run-to-failure data collected under three different operating conditions (35Hz/12kN, 37.5Hz/11kN, 40Hz/10kN), characterized by significantly higher noise levels and complex degradation patterns. This poses a greater challenge for RUL prediction models. We adopted the same leave-one-out cross-validation protocol and compared our method with the advanced baselines: DANN29, TCN-SA30, and CNN-BiLSTM31.

The comparative results for all 15 test bearings are summarized in Table 5. As observed, the proposed method consistently achieves the lowest MAE and RMSE across most test cases. Specifically, under the high-speed 40 Hz condition (Bearings 3-1 to 3-5), where rapid degradation limits the available training samples, our method outperforms the second-best model (CNN-BiLSTM) by reducing MAE by approximately 15% on average. Although the overall prediction errors on the XJTU-SY dataset are higher than on the PHM 2012 dataset, owing to the data’s inherent complexity and noise, the proposed framework maintains a relatively high Score and R2, demonstrating its superior ability to capture the underlying degradation trend amidst severe noise. These results confirm that integrating the gated mechanism and multi-scale encoding effectively enhances feature extraction, ensuring robust performance under varying and harsh operating conditions.

Table 5 Predictive evaluation of each test bearing.

As shown in Table 5, the proposed GDCC framework achieves the best overall balance of accuracy and stability, with an average MAE of 26.15, compared with DANN (23.92, but markedly less stable), CNN-BiLSTM (30.43), and TCN-SA (32.16). Notably, our method demonstrates superior stability under the complex operating conditions (e.g., Conditions 2 and 3), achieving the highest R2 values on challenging bearings such as B1-2 (0.68) and B2-4 (0.64). In contrast, the baselines often exhibit negative R2 values, indicating poor fitting on unseen data. These results confirm that the proposed GDCC network generalizes well across datasets and working conditions.

Conclusion

This study proposes a new deep prediction network, the GDCC network, based on MSEU and sinusoidal position coding. Because feature compression fusion lacks an effective attention mechanism for identifying differences between feature variables, the MSEU is designed to recalibrate multiple feature variables from both local and global perspectives, effectively distinguishing features after compression and fusion. Furthermore, traditional convolutional structures adapt poorly to varying time windows, and deep convolutional stacks are prone to gradient vanishing when processing long series; the GDCC module addresses these issues. Dilation factors flexibly accommodate various time window sizes, and stacking GDCC modules yields a sufficient receptive field without rapid gradient vanishing. Sinusoidal position coding embedded in the network allows it to relate information separated by long steps, enhancing the GDCC’s “position sense.” Together, these components improve the accuracy of bearing RUL prediction, as verified through comparative experiments, ablation studies, and visual analysis. However, the dilation factor may cause loss of time series feature information. In future research, we plan to explore ways to prevent such information loss in time series modeling with dilated convolution.