Introduction

With the rapid advancement of modern industry, rotating machinery is increasingly utilized across various fields. Bearings, as key components of these systems, operate under harsh conditions, including high loads, temperatures, and speeds. They are widely employed in industries such as wind power, aerospace, and railroad transportation1. However, in these extreme environments, bearings are highly susceptible to vibration, impact, and wear, making them the most vulnerable part of the machinery. Bearing failures not only lead to equipment damage but can also result in significant economic losses and safety risks. Therefore, timely fault diagnosis is critical to enhance equipment reliability and ensure safety. Since vibration signals in industrial environments are often contaminated by noise, and fault features are weak and difficult to extract, achieving accurate and robust fault diagnosis in such complex settings remains a significant challenge2,3. The issue of fault diagnosis has attracted considerable attention from researchers. Currently, bearing fault diagnosis methods are primarily divided into signal-based and data-driven approaches4.

Signal-based fault diagnosis

Signal analysis methods extract bearing fault features by processing the raw signal in the time domain, frequency domain, or time-frequency domain, thus enabling fault diagnosis. In complex noise environments in particular, many advanced signal processing techniques are used to achieve more accurate fault diagnosis, such as the Short-Time Fourier Transform (STFT), Wavelet Transform (WT), Empirical Mode Decomposition (EMD), Variational Mode Decomposition (VMD), and Singular Value Decomposition (SVD)5,6,7,8,9,10. These methods effectively reduce noise interference and separate sensitive features, thereby reflecting the fault condition of the bearing in strong noise environments.

For example, Chen et al. demonstrated the effectiveness of an improved ensemble EMD and Hilbert square demodulation method for fault diagnosis11; Bin et al. proposed a method combining wavelet packet decomposition and EMD to extract fault eigenfrequencies12; and Jiang et al. introduced the neighboring singular value ratio based on the singular value decomposition concept and combined it with a Hidden Markov Model (HMM) to identify bearing faults13. Although signal-analysis methods can improve the accuracy of fault diagnosis to a certain extent, bearing signals are usually highly nonlinear and non-stationary, which limits these methods when dealing with high-noise signals under complex operating conditions. In addition, they usually rely on manual feature extraction, which makes it difficult to comprehensively capture fault information and degrades the robustness and generalization ability of the diagnosis.

Data-driven fault diagnosis

With the development of deep learning technology, data-driven fault diagnosis methods have gradually become a research hotspot. These methods can automatically learn complex features in vibration signals and show excellent performance in fault classification and identification14,15. Compared with traditional methods, data-driven methods do not rely on extensive prior knowledge and have strong adaptive capability. Zhang et al. proposed a subset-based deep autoencoder feature learning model, developed an adaptive fine-tuning operation that enhanced feature learning, and used a particle swarm algorithm to optimize the key parameters16. Wang et al. proposed an industrial motor bearing fault diagnosis algorithm based on multi-local-model decision conflict resolution (MLMF-CR): after initial selection and cleanup of the characteristic vibration and current signals of industrial motor bearings, local fault diagnosis models based on bidirectional long short-term memory networks (Bi-LSTM) mine the characteristics of the bearing signals in each fault state to form local diagnoses, which are then fused using evidence theory17. Ma et al. proposed a new deep neural network combining the advantages of CNN and long short-term memory; the contribution of this method is the use of CNNs to process signals in the joint time-frequency domain, preserving feature information18. Dibaj et al. identified unanticipated and untrained composite fault states19 by using a CNN model trained on three classes of healthy states, single bearing faults, and single gear faults in conjunction with probabilistic conditions.

Despite the significant results of deep learning-based methods in bearing fault diagnosis, their robustness still suffers in noisy environments. Traditional convolutional neural networks mainly process one-dimensional signals, are limited by a small receptive field, and may lose critical information. In addition, current methods mainly rely on single-modal data and cannot fully utilize multimodal information for diagnosis, resulting in poor generalization ability in strong noise environments20,21,22,23,24. To address these challenges, researchers have developed various dual-/multi-channel and multimodal fusion models. For instance, the cross-domain time-frequency adaptive fusion network (CDTFAFN)25 introduced a coarse-to-fine dual-scale attention mechanism to fuse raw acoustic and vibration signals, achieving robust performance in noisy environments. Chen et al.26 proposed a self-supervised framework that jointly leverages original time-series signals and their Fourier-transformed counterparts to enhance performance under few-label conditions. These studies highlight the effectiveness of dual-stream architectures and attention-based modeling for robust and generalizable fault diagnosis. Other representative examples include the multi-information fusion deep ensemble learning network (MIFDELN)27, which employs weighted sensor signal fusion and cross-scale attention modules to improve feature discriminability under noisy conditions, and the multi-sensor residual convolutional fusion network (MRCFN)28, which combines residual modules and global perception mechanisms to effectively extract features in small-sample and high-noise scenarios. Jiang et al.29 proposed a deep convolutional multi-adversarial adaptation network with correlation alignment for cross-condition fault diagnosis, while Zhang et al.30 developed a multi-scale deep feature memory and recovery network tailored for multi-sensor fault diagnosis under channel-missing scenarios. These works not only present advanced architectures but also provide detailed parameter configurations, which inspired the way we present and clarify the setup of TFDFNet in this study. In addition, dual-channel techniques have demonstrated great potential in other domains. For example, Chen et al.31 constructed a dual-channel SE-3DUNet-based detection model for cerebral aneurysm screening in clinical medicine, which demonstrated a lower false alarm rate and better sensitivity in a noisy environment. In the field of instrumentation, Zhang et al.32 realized multi-signal detection and analysis through a sensor technology based on dual-channel signal readout, which effectively reduces false-positive signals and improves detection sensitivity and accuracy.

Although dual-branch models have demonstrated strong performance by leveraging multimodal features, many existing approaches suffer from limitations such as inadequate interaction between branches, rigid fusion strategies (e.g., direct concatenation or addition), and insufficient adaptability to varying signal characteristics. Moreover, some methods rely heavily on manual architecture design without exploring lightweight or flexible attention mechanisms. These limitations motivate the development of a more dynamic and effective cross-modal fusion strategy. Motivated by the strengths and limitations of existing approaches, this paper proposes a novel dual-branch fault diagnosis framework, the time-frequency and time-series dual-branch fusion network (TFDFNet), which integrates both time-frequency image features and one-dimensional time-series features in a parallel structure. Unlike existing methods that focus on a single modality or use simple feature concatenation, TFDFNet adopts a lightweight cross-attention fusion mechanism and a GABlock-based signal encoder to better capture complementary and discriminative representations. The proposed model aims to improve robustness and classification accuracy in real-world, high-noise environments.

The primary contributions of this paper are as follows:

  • Proposes TFDFNet, a parallel dual-branch model for mechanical fault classification that leverages multimodal fusion of time-frequency image features and one-dimensional time-series signal features to fully exploit their complementary information and enhance fault recognition accuracy.

  • Designs a specialized signal feature extraction module, the Gated Attention Block (GABlock), which accurately captures key information from sequential data and significantly improves the model’s representational power and classification performance.

  • Extensive experiments on multiple public datasets validate the effectiveness of the proposed method and demonstrate its strong robustness under various noise conditions.

Preliminaries

Swin Transformer

Swin Transformer33 is a hierarchical vision transformer architecture based on a shifted window mechanism, which demonstrates excellent performance in image feature extraction tasks. Swin Transformer introduces the shifted window attention mechanism: by dividing the input features into fixed-size windows and computing self-attention locally within each window, the computational complexity of global self-attention is significantly reduced, making it especially suitable for processing high-resolution images. Meanwhile, the shifted window strategy effectively captures the contextual relationship between local and global features through information interaction across windows. In addition, Swin Transformer adopts a hierarchical feature extraction structure to extract multi-scale features while reducing the resolution of the feature map layer by layer, which makes it perform well in tasks such as image classification, object detection, and instance segmentation.

In terms of time-frequency map feature extraction, Swin Transformer’s local-awareness and multi-scale modeling capabilities have significant advantages. As a two-dimensional image representation, the time-frequency diagram can intuitively reflect the time-frequency distribution characteristics of the signal. By utilizing the layered architecture of Swin Transformer, key local and global features in the time-frequency diagram can be fully captured, while retaining the rich information of the signal in the time and frequency dimensions. Compared with traditional convolutional neural networks, Swin Transformer not only has stronger modeling capabilities, but also captures the long-distance dependencies between features in the time-frequency diagram through the self-attention mechanism, which improves the comprehensiveness and robustness of feature representation and provides a more efficient solution for signal processing and analysis tasks.

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is a simplified version of the Long Short-Term Memory (LSTM). The GRU combines the input and forget gates in the LSTM by integrating them into a single update gate and contains two key gate structures: the reset gate and the update gate. The update gate is responsible for controlling the relationship between the current hidden layer and the hidden layer at the previous moment. The larger the value of the update gate, the more influence the previous moment hidden layer has on the current hidden layer. On the contrary, the reset gate determines the extent to which the previous moment’s hidden layer information is ignored. The smaller the value of the reset gate, the more information from the previous moment is ignored. Specifically, the reset gate mainly controls how the previous hidden state and the current input information are combined, while the update gate determines how much a priori information should be retained at the current moment34. The gating structure of the GRU is shown in Fig. 1.

Fig. 1
figure 1

Gated recurrent unit structure.

In Fig. 1, \(x_t\) denotes the input data at time t, \(h_t\) denotes the output of the GRU unit, and \(r_t\) and \(z_t\) are the reset gate and update gate, respectively. With \(r_t\) and \(z_t\), the GRU computes a new hidden state \(h_t\) based on the previous hidden state \(h_{t-1}\). The calculation process can be expressed as follows:

$$\begin{aligned} z_t= & \sigma (W_z x_t + U_z h_{t-1} + b_z) \end{aligned}$$
(1)
$$\begin{aligned} r_t= & \sigma (W_r x_t + U_r h_{t-1} + b_r) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{h}_t= & \tanh \left( W_h x_t + U_h (r_t \odot h_{t-1}) + b_h \right) \end{aligned}$$
(3)
$$\begin{aligned} h_t= & (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{aligned}$$
(4)

Here, \(\sigma (\cdot )\) denotes the sigmoid activation function, \(\tanh (\cdot )\) denotes the hyperbolic tangent activation, \(W_z, W_r, W_h\) are the input weight matrices, \(U_z, U_r, U_h\) are the recurrent weight matrices, and \(b_z, b_r, b_h\) are the bias terms. The update gate \(z_t\) controls the degree to which the previous hidden state contributes to the current hidden state, while the reset gate \(r_t\) decides how much past information should be ignored when computing the candidate hidden state \(\tilde{h}_t\). Finally, the hidden state \(h_t\) is obtained by interpolating between the previous hidden state \(h_{t-1}\) and the candidate state \(\tilde{h}_t\) according to the update gate.
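For illustration, Eqs. (1)-(4) map directly to code. The following is a minimal PyTorch sketch of a single GRU step; in practice the built-in torch.nn.GRU implements a closely related variant of the same recurrence.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """Minimal GRU cell following Eqs. (1)-(4); biases b_z, b_r, b_h live inside the W_* layers."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W_z, self.U_z = nn.Linear(input_size, hidden_size), nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_r, self.U_r = nn.Linear(input_size, hidden_size), nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_h, self.U_h = nn.Linear(input_size, hidden_size), nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        z_t = torch.sigmoid(self.W_z(x_t) + self.U_z(h_prev))         # update gate, Eq. (1)
        r_t = torch.sigmoid(self.W_r(x_t) + self.U_r(h_prev))         # reset gate, Eq. (2)
        h_tilde = torch.tanh(self.W_h(x_t) + self.U_h(r_t * h_prev))  # candidate state, Eq. (3)
        return (1 - z_t) * h_prev + z_t * h_tilde                     # interpolation, Eq. (4)

cell = GRUCellSketch(input_size=32, hidden_size=64)
h = cell(torch.randn(8, 32), torch.zeros(8, 64))  # one recurrent step for a batch of 8
```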

Proposed methods

Overall framework

This study proposes a novel dual-branch fault diagnosis model that incorporates time-frequency features and time-series features to achieve accurate fault classification. As shown in Fig. 2, the model mainly consists of a time-frequency feature extraction branch and a time-series feature extraction branch. These two branches extract complementary information from the original signal and subsequently perform feature fusion to enhance the model’s representational capability. In the time-frequency feature extraction branch, the original signal is first converted into time-frequency images via the continuous wavelet transform. These time-frequency images are then fed into a Swin Transformer, which serves as an efficient image encoder capable of capturing multi-scale spatial dependencies while reducing computational complexity. The time-frequency features extracted in this branch are denoted as \(F_1\). In the time-series feature extraction branch, the original signal is first segmented using sliding window sampling to preserve its sequential information. The segmented signal is then fed into the feature extraction module (GABlock), which captures the local and global information of the signal; the extracted time-series features are denoted as \(F_2\). Before feature fusion, a feature alignment strategy is applied to unify the feature dimensions of \(F_1 \in \mathbb {R}^{M \times d_1}\) and \(F_2 \in \mathbb {R}^{N \times d_2}\). Specifically, we apply a linear transformation:

$$\begin{aligned} F_1' = F_1 W_1,\quad F_2' = F_2 W_2 \end{aligned}$$
(5)

where \(W_1 \in \mathbb {R}^{d_1 \times d}\) and \(W_2 \in \mathbb {R}^{d_2 \times d}\) are learnable projection matrices, projecting both modalities into the same feature space \(\mathbb {R}^d\).
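Concretely, Eq. (5) amounts to two learnable linear layers. In the sketch below, the dimensions follow the implementation details reported later (Swin output \(d_1 = 1024\), signal branch \(d_2 = 128\), unified \(d = 128\)); the token counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

d1, d2, d = 1024, 128, 128                   # per the implementation details
W1 = nn.Linear(d1, d, bias=False)            # W_1 in Eq. (5)
W2 = nn.Linear(d2, d, bias=False)            # W_2 in Eq. (5)

F1 = torch.randn(49, d1)                     # M x d1 time-frequency tokens (illustrative M)
F2 = torch.randn(32, d2)                     # N x d2 time-series tokens (illustrative N)
F1p, F2p = W1(F1), W2(F2)                    # both now live in the shared space R^d
```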

In order to effectively integrate the features extracted from the two branches, we introduce a feature fusion mechanism that enhances cross-modal feature interaction. Through this mechanism, \(F_1\) and \(F_2\) are fused so that the time-frequency features and the time-series features complement each other, enhancing the completeness and discriminative power of the feature representation. Compared with simple fusion methods, cross-attention adaptively focuses on salient features from the other modality, defined as:

$$\begin{aligned} \textrm{CrossAtt}(Q, K, V) = \textrm{softmax}\left( \frac{QK^T}{\sqrt{d}}\right) V \end{aligned}$$
(6)

In this equation, Q, K, and V represent the query, key, and value matrices derived from the input features, and d is the feature dimension used for scaling the dot product. This operation allows the model to dynamically emphasize key parts of the counterpart signal, enabling finer discrimination of similar fault types. Finally, the fused feature representation F passes through a fully connected layer (FC) and is classified by a Softmax classifier to obtain the final fault diagnosis result. The framework makes full use of the complementary nature of different types of features, improves classification accuracy, and demonstrates good robustness under multiple operating conditions. To provide a clearer understanding of the processing steps in TFDFNet, Algorithm 1 summarizes the overall workflow of the proposed model from data preprocessing to fault diagnosis.
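Eq. (6) itself is only a few lines of code; a minimal PyTorch sketch, with illustrative token counts, is:

```python
import torch
import torch.nn.functional as F

def cross_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product cross-attention of Eq. (6): softmax(Q K^T / sqrt(d)) V."""
    scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5  # modality-to-modality similarity
    return F.softmax(scores, dim=-1) @ V                  # weighted sum of the other modality's values

# Example: 49 time-frequency tokens attending to 32 time-series tokens, d = 128
out = cross_attention(torch.randn(49, 128), torch.randn(32, 128), torch.randn(32, 128))
```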

Algorithm 1
figure a

TFDFNet: Dual-Branch Fault Diagnosis with Cross-Attention Fusion

Fig. 2
figure 2

Architecture of the proposed dual-branch fault diagnosis model.

Multimodal feature extraction module

Time-frequency feature extraction branch

In order to fully utilize the time-frequency information of the signal, this study adopts the continuous wavelet transform to convert the one-dimensional signal into a time-frequency map and extracts time-frequency features using a deep network. Although CNNs have commonly been used in prior work for feature extraction on such maps, the Swin Transformer provides significant advantages. It utilizes window-based local self-attention to reduce computational complexity, while its shifted window mechanism enables effective cross-region modeling and enhances global context awareness. In addition, the hierarchical structure and patch merging strategy of the Swin Transformer allow for multi-scale feature modeling, which is particularly beneficial for capturing fault patterns at different temporal and frequency resolutions.

Let the input time-frequency map be denoted as \(S \in \mathbb {R}^{H \times W \times C}\), where \(H\) and \(W\) denote the frequency and time dimensions, respectively, and \(C\) is the number of channels. First, a linear projection layer is used to divide the input into a series of non-overlapping patches as the input feature representation of the Transformer:

$$\begin{aligned} Z_0 = W_{emb} S + b_{emb} \end{aligned}$$
(7)

where \(W_{emb}\) and \(b_{emb}\) are learnable parameters for feature mapping. The projected features are then passed through multiple Swin Transformer layers for feature extraction. Each Swin Transformer layer consists of window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), where the attention is computed as follows:

$$\begin{aligned} \textrm{Attention}(Q, K, V) = \textrm{softmax} \left( \frac{QK^T}{\sqrt{d}} \right) V \end{aligned}$$
(8)

where \(Q, K, V\) are the query, key, and value matrices, and \(d\) is the scaling factor. W-MSA computes attention within local windows, while SW-MSA shifts the window positions to facilitate cross-region feature interactions, thereby enabling the network to model both local and global dependencies efficiently. This allows the model to maintain a balance between representation richness and computational efficiency, which is particularly important for large-scale industrial fault data.

Finally, after processing in multiple Transformer layers, the extracted time-frequency features are represented as:

$$\begin{aligned} H_{TF} \in \mathbb {R}^{M \times d_1} \end{aligned}$$
(9)

where \(M\) is the feature sequence length and \(d_1\) is the feature dimension. The feature will be fused with the timing features in the subsequent branch fusion module to improve the fault diagnosis performance.
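For illustration, the encoder named in Sect. "Implementation details" matches a model identifier in the timm library. Assuming timm is used for loading (the paper does not state the loading mechanism, so this is a sketch, not the released code), the branch can be instantiated as follows; the exact feature layout may vary with the library version.

```python
import timm
import torch

# Swin encoder for CWT time-frequency images; num_classes=0 drops the classifier head
encoder = timm.create_model("swin_base_patch4_window7_224",
                            pretrained=True, num_classes=0)

x = torch.randn(8, 3, 224, 224)       # batch of resized, normalized time-frequency images
feats = encoder.forward_features(x)   # token/spatial features before pooling
pooled = encoder(x)                   # pooled feature vector per image, e.g. shape (8, 1024)
```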

Time-series feature extraction branch

The time-series feature extraction branch aims to extract deep temporal patterns from one-dimensional sensor signals to enhance the representation of fault characteristics. Given the original sensor signal \(X = \{x_1, x_2, \dots , x_R\}\), the signal is first reconstructed by applying the sliding window technique to obtain successive localized temporal segments:

$$\begin{aligned} X^{sw} = \{X_1, X_2, \dots , X_N\}, \quad X_i = \{x_i, x_{i+1}, \dots , x_{i+w-1}\} \end{aligned}$$
(10)

where \(N\) denotes the number of sequences after window division, \(w\) is the length of the window, and each \(X_i\) is passed as an independent input to the temporal feature extraction module.

To efficiently extract temporal patterns, we construct a feature extraction module named GABlock, which integrates the strengths of recurrent neural networks and attention mechanisms. Specifically, GABlock combines GRU-based modeling of sequential dependencies with a gated attention mechanism to adaptively emphasize critical time steps. Let \(H_0 \in \mathbb {R}^{N \times d}\) be the initialized representation of the input signals, where \(d\) denotes the feature dimension of each window segment; GABlock generates deep time-series features through a series of mapping functions \(f(\cdot )\):

$$\begin{aligned} H = f(H_0) \end{aligned}$$
(11)

in this process, the hidden representation \(h_t\) of each time step \(t\) is computed from the correlated signal dynamics before and after that step:

$$\begin{aligned} h_t = \varphi (W_h H_t + b_h) \end{aligned}$$
(12)

where \(W_h\) and \(b_h\) are trainable parameters and \(\varphi (\cdot )\) denotes the nonlinear activation function. Subsequently, the importance coefficient \(\alpha _t\) is computed for each time step using the feature weighting mechanism:

$$\begin{aligned} \alpha _t = \frac{\exp (W_a h_t)}{\sum _{j=1}^{T} \exp (W_a h_j)} \end{aligned}$$
(13)

the global temporal feature representation is obtained after weighting:

$$\begin{aligned} H_{TS} = \sum _{t=1}^{T} \alpha _t h_t \end{aligned}$$
(14)

In implementation, GABlock consists of two 1D convolution layers (kernel size = 5, stride = 1) with GELU activations and BatchNorm for robust feature learning. A global average pooling layer aggregates contextual cues, and the attention mechanism dynamically adjusts feature importance across time. This allows the model to capture both transient and long-term dependencies in the input signal. We also provide a comparative analysis in Sect. "Comparison of temporal feature extractors" to validate the effectiveness of GABlock against classical temporal models such as GRU and LSTM.
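A minimal sketch consistent with this description (two kernel-5 convolutions with BatchNorm and GELU, followed by attention weighting over time steps as in Eqs. (12)-(14)) is given below; the channel sizes are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class GABlockSketch(nn.Module):
    """Illustrative GABlock: local convolutional encoding plus global attention pooling."""

    def __init__(self, in_ch: int = 32, hid_ch: int = 64, att_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, hid_ch, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm1d(hid_ch), nn.GELU(),
            nn.Conv1d(hid_ch, hid_ch, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm1d(hid_ch), nn.GELU(),
        )
        self.score = nn.Sequential(nn.Linear(hid_ch, att_dim), nn.Tanh(),
                                   nn.Linear(att_dim, 1))  # plays the role of W_a in Eq. (13)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); convolutions extract local temporal patterns
        h = self.conv(x).transpose(1, 2)             # (batch, time, hid_ch): h_t as in Eq. (12)
        alpha = torch.softmax(self.score(h), dim=1)  # per-step importance, Eq. (13)
        return (alpha * h).sum(dim=1)                # attention-weighted global feature, Eq. (14)

print(GABlockSketch()(torch.randn(8, 32, 32)).shape)  # torch.Size([8, 64])
```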

Finally, the timing features \(H_{TS}\) generated by GABlock serve as the time-series representation of the input signal and are passed to the subsequent fusion module.

Branch fusion

After completing the time-frequency and time-series feature extraction, the features from the two branches need to be fused to fully leverage their complementary information and improve fault classification performance. Let the output features of the time-frequency branch be denoted as \(H_{TF} \in \mathbb {R}^{M \times d_1}\) and the output features of the time-series branch as \(H_{TS} \in \mathbb {R}^{N \times d_2}\), where \(M\) and \(N\) represent the feature lengths of the two branches and \(d_1\) and \(d_2\) their feature dimensions. To effectively utilize the complementary information from both branches, we introduce a cross-attention mechanism, which enhances the interaction between the time-frequency and time-series features, allowing them to focus on each other’s important information.

First, the time-frequency features \(H_{TF}\) and time-series features \(H_{TS}\) are mapped to query, key, and value representations via learnable linear transformation matrices:

$$\begin{aligned} Q_{TF}= & W_Q H_{TF}, \quad K_{TS} = W_K H_{TS}, \quad V_{TS} = W_V H_{TS} \end{aligned}$$
(15)
$$\begin{aligned} Q_{TS}= & W'_Q H_{TS}, \quad K_{TF} = W'_K H_{TF}, \quad V_{TF} = W'_V H_{TF} \end{aligned}$$
(16)

where \(W_Q, W_K, W_V, W'_Q, W'_K, W'_V\) are trainable parameter matrices used to project the features into the same attention space. Then, the attention scores between the two branches are computed to quantify the similarity between the feature sets:

$$\begin{aligned} A_{TF \rightarrow TS}= & \textrm{softmax} \left( \frac{Q_{TF} K_{TS}^{\textrm{T}}}{\sqrt{d}} \right) \end{aligned}$$
(17)
$$\begin{aligned} A_{TS \rightarrow TF}= & \textrm{softmax} \left( \frac{Q_{TS} K_{TF}^{\textrm{T}}}{\sqrt{d}} \right) \end{aligned}$$
(18)

where \(d\) is the scaling factor to prevent excessively large attention scores. The computed attention scores are used to weight the corresponding values to generate the enhanced feature representations:

$$\begin{aligned} H_{TF}^{cross}= & A_{TF \rightarrow TS} V_{TS} \end{aligned}$$
(19)
$$\begin{aligned} H_{TS}^{cross}= & A_{TS \rightarrow TF} V_{TF} \end{aligned}$$
(20)

which capture the important information from the complementary modality.

Finally, the final fused feature representation is obtained by concatenating the enhanced features and applying a nonlinear transformation:

$$\begin{aligned} H_{fusion} = \sigma \left( W_f \left[ H_{TF}^{cross}, H_{TS}^{cross} \right] + b_f \right) \end{aligned}$$
(21)

where \(W_f\) and \(b_f\) are learnable parameters, and \(\sigma (\cdot )\) represents a nonlinear activation function, such as the rectified linear unit (ReLU) or the Gaussian error linear unit (GELU). The resulting \(H_{fusion}\) is then passed to the final classification layer for fault classification. After the final fully connected layer and softmax classifier, the probability distribution over \(C\) fault classes is obtained:

$$\begin{aligned} \hat{y} = \textrm{softmax}(W_c H_{fusion} + b_c) \end{aligned}$$
(22)

where \(W_c\) and \(b_c\) are trainable parameters, and \(\hat{y} \in \mathbb {R}^{C}\) represents the predicted probability distribution. This cross-attention fusion mechanism enables the model to adaptively focus on the most salient and informative regions of the complementary modality, significantly enhancing the robustness and discriminability of the learned feature representations. The overall process of cross-attention feature interaction and fusion is illustrated in Fig. 3.
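Putting Eqs. (15)-(22) together, the fusion stage can be sketched as follows. The mean pooling over tokens before concatenation is our assumption, since Eq. (21) does not specify how the enhanced sequences are collapsed; the dimension d = 128 and the ten output classes follow the implementation details and the CWRU label set.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of bidirectional cross-attention fusion, Eqs. (15)-(22)."""

    def __init__(self, d: int = 128, num_classes: int = 10):
        super().__init__()
        self.q_tf, self.k_ts, self.v_ts = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.q_ts, self.k_tf, self.v_tf = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.fuse = nn.Linear(2 * d, d)               # W_f, b_f in Eq. (21)
        self.act = nn.GELU()                          # sigma(.) in Eq. (21)
        self.classifier = nn.Linear(d, num_classes)   # W_c, b_c in Eq. (22)

    def forward(self, h_tf: torch.Tensor, h_ts: torch.Tensor) -> torch.Tensor:
        # h_tf: (B, M, d) time-frequency tokens; h_ts: (B, N, d) time-series tokens
        scale = h_tf.size(-1) ** 0.5
        a_tf = torch.softmax(self.q_tf(h_tf) @ self.k_ts(h_ts).transpose(1, 2) / scale, -1)  # Eq. (17)
        a_ts = torch.softmax(self.q_ts(h_ts) @ self.k_tf(h_tf).transpose(1, 2) / scale, -1)  # Eq. (18)
        h_tf_cross = a_tf @ self.v_ts(h_ts)           # Eq. (19): TF attends to TS
        h_ts_cross = a_ts @ self.v_tf(h_tf)           # Eq. (20): TS attends to TF
        # Mean-pool over tokens, then concatenate and fuse (Eq. (21)); pooling is assumed
        fused = self.act(self.fuse(torch.cat([h_tf_cross.mean(1), h_ts_cross.mean(1)], -1)))
        return self.classifier(fused)                 # logits; softmax yields Eq. (22)

logits = CrossAttentionFusion()(torch.randn(4, 49, 128), torch.randn(4, 32, 128))
```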

Fig. 3
figure 3

Detailed flowchart of the cross-attention-based feature fusion process.

Experiments

To assess the effectiveness, superiority, and robustness of the proposed model, experimental analyses are carried out in this paper using the CWRU and Ottawa bearing fault datasets.

Dataset description

The Case Western Reserve University Bearing Data Center provides the CWRU dataset35, which serves as a widely recognized benchmark for bearing fault diagnosis research. As depicted in Fig. 4, the experimental setup includes a drive motor, torque sensor, dynamometer, and associated control electronics. The test bearings are mounted on the motor shaft. Single-point defects of varying severities were introduced into the bearings using electro-discharge machining (EDM), with fault diameters of 0.007 inches, 0.014 inches, 0.021 inches, and 0.028 inches. Owing to incomplete measurements for the 0.028-inch faults, only data corresponding to the other three fault sizes are utilized in this study. The complete dataset comprises 161 samples, categorized into four groups: 48 kHz normal baseline data, 48 kHz drive-end fault samples, 12 kHz drive-end fault samples, and 12 kHz fan-end fault samples. Each category includes faults of different types, namely ball defects, inner race defects, and outer race defects. Furthermore, the outer race faults are classified according to their angular positions relative to the load zone: ‘centre’ (6 o’clock position), ‘orthogonal’ (3 o’clock), and ‘reverse’ (12 o’clock). The experiments were conducted under four distinct load conditions: 0 hp, 1 hp, 2 hp, and 3 hp. The rotational speed of the motor varied between 1797 rpm and 1730 rpm. Vibration signals were collected from three accelerometers mounted on the drive end, fan end, and motor base. Each sample was recorded over a duration of 10 seconds, with a consistent sampling frequency.

Fig. 4
figure 4

The CWRU bearing failure experiment platform.

In this study, the experimental dataset is derived from the vibration signals collected at the drive-end bearing. The data acquisition was performed using an accelerometer operating at a sampling frequency of 12 kHz, under a load torque of 0 hp and a rotational speed of 1797 rpm. The fault location corresponds to the outer race at the 6 o’clock position. Detailed parameter settings of the dataset are summarized in Table 1, and the corresponding time-domain waveform is illustrated in Fig. 5.

Fig. 5
figure 5

The time domain waveform of CWRU dataset.

Table 1 Description of sample size and labels (CWRU dataset).

The Ottawa dataset is generated through experimental studies using the SpectraQuest machinery fault simulator (MFS-PK5M)36. As illustrated in Fig. 6, vibration signals were acquired by accelerometers installed on the bearing housing, with a maximum sampling rate of 200 kHz. This dataset includes two core experimental factors: bearing health conditions and rotational speed variations. The health conditions encompass five types: healthy bearings, inner race (IR) faults, outer race (OR) faults, ball faults, and compound faults involving simultaneous defects in the IR, OR, and ball. For the speed variations, four scenarios are considered: monotonically increasing speed, monotonically decreasing speed, increasing then decreasing speed, and decreasing then increasing speed. By combining these two experimental factors, the dataset covers a total of 20 distinct operating conditions. In this study, we select the ramp-up speed condition as the experimental dataset. Detailed configuration parameters are listed in Table 2, and the time-domain waveform of the vibration signals is presented in Fig. 7.

Fig. 6
figure 6

The Ottawa bearing failure experiment platform.

Fig. 7
figure 7

The time domain waveform of Ottawa dataset.

Table 2 Description of sample size and labels (Ottawa dataset).

Data preprocessing

Slide window resampling

The sliding window resampling technique is commonly employed in time-series analysis and signal processing to extract local features and capture temporal dependencies. It involves sliding a fixed-size window over the input data, where each window represents a subset of the sequence used for further analysis or model training. The primary advantage of this method lies in its ability to maintain the sequential structure of the data while reducing the impact of noise by focusing on localized segments. Mathematically, the sliding window approach can be described as follows: given a time series \(\{x_1, x_2, \dots , x_T\}\), where \(T\) represents the total number of time steps, a sliding window of size \(W\) is defined. The window slides across the data with a step size \(S\), which determines the overlap between consecutive windows. The i-th window \(W_i\) can be expressed as:

$$\begin{aligned} W_i = \{x_{iS}, x_{iS+1}, \dots , x_{iS+W-1}\} \end{aligned}$$
(23)
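Eq. (23) can be implemented directly. The following NumPy sketch uses the 1024-point window and 50% overlap adopted later in this study; the signal length is an illustrative assumption.

```python
import numpy as np

def sliding_window(x: np.ndarray, window: int, step: int) -> np.ndarray:
    """Segment a 1-D signal into overlapping windows following Eq. (23)."""
    n = (len(x) - window) // step + 1            # number of complete windows
    return np.stack([x[i * step: i * step + window] for i in range(n)])

# Window length 1024 with 50% overlap (step = 512), as used in this study
segments = sliding_window(np.random.randn(120_000), window=1024, step=512)
print(segments.shape)  # (233, 1024) -> (n_windows, window_length)
```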

Gaussian noise

In real industrial environments, noise usually originates from the superposition of a number of factors, including insufficient lubrication due to improper installation and external objects, friction due to indentation or rust, and irregular noise caused by cracks. Each of these noise sources has different probability distribution characteristics. Based on the central limit theorem, the sum of several independent random variables tends to follow a Gaussian distribution, regardless of the individual distributions of these variables. As such, by superimposing multiple random variables with different statistical characteristics, the resulting noise closely approximates the complex noise encountered in real industrial environments. In this study, Gaussian noise is introduced into the original vibration signals to emulate such real-world disturbances, enabling the evaluation of the diagnostic model’s robustness under noisy conditions. The noise intensity relative to the signal is quantified using the signal-to-noise ratio (SNR), which is mathematically defined as:

$$\begin{aligned} \textrm{SNR}_{\textrm{dB}} = 10 \log _{10} \left( \frac{P_s}{P_n} \right) \end{aligned}$$
(24)

where \(P_s\) and \(P_n\) represent the power of the signal and the noise, respectively. As the level of noise interference increases, the signal-to-noise ratio decreases accordingly. In the special case where the SNR equals zero, the noise power and the signal power are identical.

Let \(x_i\) denote the signal sequence, where \(i = 1, 2, \dots , T\). The signal power \(P_s\) can be computed using the following formula:

$$\begin{aligned} P_s = \frac{1}{T} \sum _{i=1}^T x_i^2 \end{aligned}$$
(25)

based on different SNR levels, the noise power \(P_n\) can be determined using:

$$\begin{aligned} P_n = \frac{P_s}{10^{\frac{\textrm{SNR}}{10}}} \end{aligned}$$
(26)
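Eqs. (24)-(26) translate into a short noise-injection routine; a minimal NumPy sketch (the test signal below is illustrative) is:

```python
import numpy as np

def add_gaussian_noise(x: np.ndarray, snr_db: float) -> np.ndarray:
    """Corrupt a signal with Gaussian noise at a target SNR, per Eqs. (24)-(26)."""
    p_s = np.mean(x ** 2)                # signal power, Eq. (25)
    p_n = p_s / (10 ** (snr_db / 10))    # required noise power, Eq. (26)
    noise = np.random.normal(0.0, np.sqrt(p_n), size=x.shape)
    return x + noise

# Example: corrupt a synthetic signal at SNR = -8 dB (noise power > signal power)
noisy = add_gaussian_noise(np.sin(np.linspace(0, 100, 12_000)), snr_db=-8)
```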

Continuous wavelet transform

In fault diagnosis, sensor signals containing temporal information are usually recorded as time series of data points37. Due to the time-varying and non-stationary nature of sensor signals, their global characteristics in the time or frequency domain alone are not sufficient for effective analysis and usually fail to reveal the intrinsic patterns of the fault states. Therefore, joint time-frequency analysis (JTFA) is needed to reveal the evolution of the signal spectrum over time. The time-frequency diagrams obtained by JTFA provide comprehensive information about the signal. The continuous wavelet transform (CWT) is one of the most commonly applied JTFA techniques. It possesses several important properties, including superposition, invariance under time shifts, and scale covariance. By adjusting the scale and translation parameters of the mother wavelet, CWT enables multi-resolution analysis, allowing the extraction of features from signals across a range of time and frequency scales. Specifically, larger scales correspond to broader time windows for analyzing low-frequency components, while smaller scales provide narrower time windows for capturing high-frequency details. This facilitates a comprehensive decomposition of the signal into its time-frequency representation.

Let \(\psi (t) \in L^2(\mathbb {R})\) be a basic (mother) wavelet; then

$$\begin{aligned} \psi _{a, \tau }(t) = \frac{1}{\sqrt{a}} \psi \left( \frac{t - \tau }{a}\right) \end{aligned}$$
(27)

where \(a \in \mathbb {R}\), \(a> 0\), is the scale parameter controlling dilation, and \(\tau \in \mathbb {R}\) is the translation parameter controlling time shift. For a square-integrable signal \(f(t) \in L^2(\mathbb {R})\), the CWT is mathematically expressed as:

$$\begin{aligned} CWT_{f}(a, \tau )=\frac{1}{\sqrt{a}} \int f(t)\, \psi ^{*}\left( \frac{t-\tau }{a}\right) dt=\int f(t)\, \psi _{a, \tau }^{*}(t)\, dt=\left\langle f(t), \psi _{a, \tau }(t)\right\rangle \end{aligned}$$
(28)

When the scale parameter \(a\) increases, the transform captures low-frequency components; conversely, smaller \(a\) values emphasize high-frequency components in the signal.

A critical aspect of CWT is the selection of an appropriate wavelet basis function, as this choice directly impacts both the accuracy and efficiency of the transform. Common wavelet bases include the Haar wavelet, Coiflet wavelet, Morlet wavelet, and the complex Morlet (cmor) wavelet. Among these, the cmor wavelet, a complex extension of the Morlet wavelet, is frequently preferred due to its superior adaptability. By applying CWT to time-series signals, a two-dimensional time-frequency representation can be obtained, effectively revealing fault-related features within the signal.
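As a hedged illustration, such time-frequency maps can be produced with the PyWavelets library. The cmor bandwidth and center-frequency parameters and the number of scales below are assumptions, not the study's released configuration.

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt

fs = 12_000                                  # CWRU drive-end sampling rate
segment = np.random.randn(1024)              # one sliding-window segment (stand-in data)
scales = np.arange(1, 129)                   # 128 scales, an illustrative choice

# Complex Morlet wavelet; returns complex coefficients and the matching frequencies
coefs, freqs = pywt.cwt(segment, scales, "cmor1.5-1.0", sampling_period=1 / fs)

plt.imshow(np.abs(coefs), aspect="auto", cmap="jet")  # |CWT| rendered as a 2-D image
plt.xlabel("Time"); plt.ylabel("Scale")
plt.savefig("tf_map.png")                    # saved maps feed the Swin Transformer branch
```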

Data preprocessing

In the original dataset, the raw vibration signals are represented as long discrete time series, characterized by a limited number of samples and a high temporal resolution within each sample. Such characteristics pose challenges for deep learning models: the small sample size may compromise training accuracy and increase the risk of overfitting, while the excessive time dimension of individual samples can reduce training efficiency. To address these issues, we first apply the sliding window sampling method described in Sect. “Slide window resampling”, selecting a window length of 1024. This window size is determined by balancing the sampling rates of the respective datasets with the need to effectively capture fault features. For instance, in the CWRU dataset, the drive-end sampling rate is 12 kHz and the motor rotates at 1797 rpm, producing approximately 403 data points per revolution. A window of 1024 thus spans roughly 2.54 rotations, sufficiently covering the complete vibration signature of the bearing. In contrast, the Ottawa dataset operates at a sampling rate of 200 kHz and a maximum rotational speed of 28.5 Hz under ramp-up conditions, yielding about 7018 points per rotation. Under these conditions, a 1024-point window corresponds to roughly 0.15 rotations, reducing computational complexity while still preserving key feature information. The use of sliding windows not only increases the total number of samples but also reduces the time dimension of each sample, thereby enhancing the model’s generalization performance. An overlap rate of 50% is applied during windowing to mitigate the loss of edge features in the original vibration signals. Following window segmentation, the data is divided into training, validation, and testing subsets in a ratio of 7:2:1. Finally, the continuous wavelet transform method introduced in Sect. “Continuous wavelet transform” is employed to convert the segmented signals into two-dimensional time-frequency representations. Example time-frequency maps for four distinct locations are shown in Fig. 8.

Fig. 8
figure 8

Time-frequency diagram after CWT.

To evaluate the bearing fault diagnosis performance under noise interference, we created datasets under three different operating conditions:

Condition 1: The training and validation sets are corrupted with Gaussian noise at SNRs of –4 dB and 4 dB, respectively. The model is then tested on signals with SNR of –8 dB, –6 dB, 0 dB, 6 dB, and 8 dB. This setting evaluates the model’s generalization ability under varying noise levels.

Condition 2: The original dataset is divided into three equal parts. Gaussian noise with an SNR of −8 dB is added to the first third, 0 dB to the middle third, and 8 dB to the final third. After noise injection, sliding window sampling is applied to construct the training, validation, and test sets.

Condition 3: The training and validation sets are segmented into three equal portions. The first third is corrupted with 4 dB noise, the middle third with 0 dB, and the final third with 8 dB. The model is then tested on a test set with an SNR of −4 dB to evaluate its generalization capability in unseen noisy environments.

Figure 9 shows the original vibration signal, Gaussian noise signals with SNRs of 8 dB, 0 dB, and −8 dB, and the mixed signal of the original signal and Gaussian noise. The original vibration signal corresponds to the ball defect fault bearing condition from the Ottawa dataset. It is evident that after adding Gaussian noise, the original vibration signal becomes significantly distorted, with the distortion becoming more severe as the SNR decreases. Therefore, accurately identifying bearing fault patterns in a noisy environment is challenging.

Fig. 9
figure 9

The time-domain waveforms of the raw vibration signal and the composite noisy signals with SNR = 8 dB, 0 dB, and −8 dB.

Experimental setup

Baseline systems

We selected three advanced models as baselines for our comparison experiments based on their excellent performance in fault diagnosis. These models help us evaluate the effectiveness, superiority, and robustness of the proposed TFDFNet. In the following, we briefly describe the baseline models. For comprehensive details on their structure and implementation, please refer to the cited sources.

  1. (1)

Efficient channel attention network (ECANet)38 is a neural network architecture designed for image processing tasks. It improves feature representation by efficiently capturing inter-channel relationships in images while maintaining high efficiency. ECANet achieves superior performance by introducing a channel attention mechanism to capture relationships between different channels; the weights of the channel features are adaptively adjusted to enhance the representational power of the network without adding excessive parameters or computational cost.

  2. (2)

The dense connection ResNet (DResNet)39 utilizes ResNet18 as the backbone network and introduces dense connections between the residual blocks. In addition, a transition layer is used to adjust the feature channel dimensions, and a pre-activation layer is used to integrate the combined features across channel information. The dense connections between the residual blocks transfer shallow feature information into the deeper features. The final classifier fuses the shallow features to enhance the utilization of the extracted features.

  3. (3)

    The Multi-sensor Residual Convolutional Fusion Network (MRCFN)28 is a deep learning model designed for robust fault diagnosis under noisy and small-sample conditions. By integrating vibration and acoustic signals, MRCFN enhances feature representation through a double ring residual structure, which captures local discriminative features efficiently. A spatial channel reconstruction module is introduced to suppress redundant information and highlight salient features, while a global interactive fusion mechanism enables effective cross-modal feature integration. This architecture achieves high diagnostic accuracy and strong robustness, outperforming existing multi-sensor fusion methods with relatively low computational overhead.

Implementation details

In our image branch, we use \(swin\_base\_patch4\_window7\_224\) as the image encoder. All input images are resized to \(224 \times 224\) and normalized with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5) before being fed into the network. In the signal branch, each signal is reconstructed into a sequence of 32 time steps with 32 features, which are then processed by a two-layer bidirectional GRU. The hidden sizes of the BiGRU layers are 32 and 64, respectively, resulting in a final output dimension of 128 due to bidirectionality. To further emphasize informative temporal patterns, a global attention module with a hidden size of 64 is applied on top of the BiGRU outputs.

The operating system of the running environment is Ubuntu 20.04, equipped with an RTX 4060 Ti GPU with 24 GB of memory, Python 3.10, and PyTorch 2.0.1. To ensure fairness, all comparative experiments are conducted on the same hardware platform. The model is trained with the Adam optimizer, an initial learning rate of \(1 \times 10^{-3}\) (reduced by a factor of 0.1 if the validation performance does not improve for 10 consecutive epochs), and the cross-entropy loss function. The batch size is set to 32, and training runs for 50 epochs with early stopping applied using a patience of 15 epochs. To ensure compatibility between the two feature branches during fusion, all features are projected into a unified dimension of 128 through learnable linear transformations before entering the attention module.
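A condensed sketch of this training configuration is given below. The tiny stand-in model and random batch make the snippet self-contained; in the actual pipeline they would be TFDFNet and the windowed dataset loader.

```python
import torch
import torch.nn as nn

# Stand-in model so the sketch runs standalone; TFDFNet replaces this in practice
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10)  # decay lr when val accuracy stalls
criterion = nn.CrossEntropyLoss()

for epoch in range(50):                              # 50 epochs, batch size 32
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))  # stand-in batch
    loss = criterion(model(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    val_acc = 0.0                                    # placeholder validation metric
    scheduler.step(val_acc)                          # plateau-based decay (patience 10)
    # early stopping with patience 15 would track the best val_acc here
```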

Result and analysis

Experimental results on the CWRU dataset

First, we validate the diagnostic models on the dataset without added noise; the results are shown in Table 3. The proposed model and the baseline models all perform very well in the fault diagnosis task on noise-free test data, with fault diagnosis accuracies exceeding 85% in all cases. Our model exhibits excellent performance, achieving state-of-the-art results: its fault diagnosis accuracy remains stable at 100% over multiple validations, significantly higher than that of the other baseline models. Compared with MRCFN (97.68%) and ECANet (95.74%), TFDFNet achieves higher classification accuracy, which indicates that TFDFNet is able to learn the feature information in the bearing signals more fully through the dual-channel architecture combined with the GABlock module, effectively reducing information loss.

Table 3 The fault diagnosis accuracy of each model on CWRU testing dataset without noise.

Subsequently, in order to evaluate the noise resistance of the models, experiments were conducted under three working conditions. As can be seen from Table 4, under the first working condition there is a significant difference in fault diagnosis accuracy across models at different signal-to-noise ratios (SNRs). Overall, the diagnostic accuracy of the models generally improves as the SNR increases. When \(\textrm{SNR} \ge 6\) dB, the accuracy of all models is above 85%, indicating that they are robust in weak noise environments. However, at SNR = −8 dB and −6 dB, the performance of DResNet and ECANet is lower, with accuracies of only 51.69% and 53.85% (DResNet) and 53.38% and 59.69% (ECANet), respectively, indicating that these two models are less adaptable to strong noise environments. MRCFN, on the other hand, still performs well under lower SNR conditions, achieving an accuracy of 81.99% at SNR = −8 dB, demonstrating some noise resistance. The proposed method (ours) exhibits excellent fault diagnosis capability at all noise levels, especially at SNR = 0 dB, 6 dB, and 8 dB, where its accuracy is consistently 100%, far exceeding the other baseline models.

From the results in Table 5, under Condition 2 the fault diagnosis accuracy of each model is improved compared to Condition 1, but differences remain. The accuracy of DResNet in this condition is 77.52% and that of ECANet is 86.06%, both improved compared to the low-SNR tests of Condition 1. The performance of MRCFN is more stable at 95.87%, indicating that it retains strong feature extraction capability under different noise distributions. However, our accuracy reaches 99.48%, again better than the other methods, indicating high robustness in complex noise environments.

In the experiments of Condition 3, we further evaluated the performance of each model under more complex noise conditions. As shown in Table 6, the fault diagnosis accuracy drops to 69.12% for DResNet and 80.01% for ECANet, indicating that these two methods are less adaptable to non-uniform noise. MRCFN still maintains a relatively high accuracy (93.58%), though lower than in Condition 2. In this case, our model still maintains a high accuracy of 98.66%, much better than the other methods. This shows that the proposed method can still effectively capture fault characteristics when facing non-uniform noise with different signal-to-noise ratios, demonstrating strong noise resistance.

Table 4 The fault diagnosis accuracy of each model on CWRU testing dataset under condition1.
Table 5 The fault diagnosis accuracy of each model on CWRU testing dataset under condition2.
Table 6 The fault diagnosis accuracy of each model on CWRU testing dataset under condition3.
Fig. 10
figure 10

Confusion matrix on the CWRU testing dataset with SNR = \(-8\,\textrm{dB}\). (a) DResnet, (b) ECANet, (c) MRCFN, (d) ours.

Figure 10 shows the confusion matrices of the four models when predicting the ten fault types in the CWRU test dataset at SNR = −8 dB. From the confusion matrices, it can be observed that DResNet has poor ability to discriminate between fault categories, especially between faults C2 (inner race fault 0.007), C3 (ball fault 0.007), C8 (inner race fault 0.021), and C9 (ball fault 0.021), where the model shows a high confusion rate. In particular, DResNet can hardly distinguish categories C2 and C9 accurately, leading to many misclassifications. In addition, the C1 (normal) state is also misclassified into multiple fault types, which further shows the weak robustness of the model at low signal-to-noise ratios. The classification performance of ECANet is slightly better than that of DResNet, but its confusion matrix still reveals obvious confusion. For example, the confusion between faults C2 and C9 remains serious, indicating that the model's feature extraction capability needs to be further enhanced for these fault types. Moreover, the C1 state is also misclassified as C3 and C9, which further shows that ECANet suffers some classification instability under noise interference. In contrast, MRCFN achieves more satisfactory classification results on most categories, especially C1, C6 (ball fault 0.014), and C5 (inner race fault 0.014), with very high correct classification rates. However, some confusion between C2 and C9 can still be seen, showing that MRCFN makes errors when dealing with similar fault types. From the confusion matrix, our model exhibits almost no misclassification, and almost all fault categories are classified accurately; on categories C1 and C10 (outer race fault 0.021) in particular, the correct classification rate is 100%.

Fig. 11
figure 11

Visualization of the extracted features of four models for the CWRU testing dataset with SNR = \(-8\,\textrm{dB}\). (a) DResnet, (b) ECANet, (c) MRCFN, (d) ours.

To further investigate the feature extraction capability of the four comparative models in noisy environments, we visualize representative features for fault classification at SNR = −8 dB, as shown in Fig. 11. We used the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction method40 to map the feature vectors into a two-dimensional space. According to the t-SNE visualizations, the performance of the four models on the test dataset varies significantly. DResNet (Fig. 11a) shows strong category confounding; in particular, the distinction between normal and fault categories is not obvious, indicating weak feature extraction under high noise. ECANet (Fig. 11b), although improved, with reduced overlap between categories, still shows some confusion between fault types, especially between “de_7_outer” and “de_14_inner”. MRCFN (Fig. 11c) demonstrates more distinct category separation, especially between normal and fault categories, showing stronger feature learning capability. Ours (Fig. 11d) performs the best, with almost no category overlap and a clear distribution of data points across categories, highlighting the model’s strong robustness and classification accuracy in noisy environments.

Overall, TFDFNet achieves 100% classification accuracy in a noiseless environment and still maintains a clear performance advantage under high noise and complex working conditions, far exceeding existing methods such as DResNet, ECANet, and MRCFN. In particular, under the extreme noise environment with SNR = −8 dB, the classification accuracy of TFDFNet remains as high as 96.48%, while the accuracies of DResNet and ECANet drop to 50.32% and 58.74%, respectively. This result indicates that TFDFNet can still extract stable and effective fault features under high noise, improving the model's anti-interference ability. This performance improvement is attributed to the introduction of the GABlock temporal feature extraction module, which enhances the modeling of temporal information through the global attention mechanism, making the dependence of fault features in the time dimension clearer and maintaining high-precision classification even under complex working conditions.

Experimental results on the Ottawa dataset

As shown in Table 7, the fault diagnosis accuracies of all models on the Ottawa test set are generally high under noise-free conditions. Specifically, DResNet achieves an accuracy of 87.43%, ECANet reaches 93.24%, and MRCFN attains 96.67%. Our proposed model achieves the highest accuracy of 99.44%, indicating its strong ability to learn discriminative fault features and achieve near-perfect classification performance in ideal environments.

In Table 8, the performance of all models declines after introducing Gaussian noise with varying signal-to-noise ratios (SNRs). DResNet and ECANet exhibit poor robustness at SNR = −8 dB, with accuracies dropping to 50.76% and 54.41%, respectively, highlighting their vulnerability to strong noise. As the SNR increases, both models show gradual performance improvement, reaching 88.31% and 95.06% at SNR = 8 dB, respectively. In contrast, MRCFN maintains relatively high accuracy even under severe noise, achieving 80.03% at SNR = −8 dB and 98.26% at SNR = 8 dB, demonstrating moderate noise resistance. Our proposed model exhibits remarkable stability across all noise levels, achieving 98.82% at SNR = −8 dB and maintaining accuracy above 98% at SNRs of 0 dB, 6 dB, and 8 dB, significantly outperforming all other methods.

Table 9 presents the results under Condition 2. In this scenario, DResNet, ECANet, and MRCFN achieve accuracies of 75.33%, 86.87%, and 94.68%, respectively. Although MRCFN shows better noise robustness than DResNet and ECANet, it still fails to completely mitigate the effects of complex noise distributions. In contrast, our model maintains a high accuracy of 98.36%, further demonstrating its superior capability in handling intricate noisy environments.

Under Condition 3, which involves a more complex, non-uniform noise distribution (Table 10), the overall performance of the models deteriorates further. DResNet and ECANet experience significant accuracy drops to 67.21% and 78.17%, respectively, indicating their limited adaptability in complex noise scenarios. MRCFN continues to exhibit a certain level of noise resilience with an accuracy of 94.25%. Notably, our proposed model still achieves a high accuracy of 98.41%, exceeding MRCFN by more than 4%, which further confirms its exceptional robustness and effectiveness in challenging noise conditions.

Table 7 The fault diagnosis accuracy of each model on Ottawa testing dataset without noise.
Table 8 The fault diagnosis accuracy of each model on Ottawa testing dataset under condition1.
Table 9 The fault diagnosis accuracy of each model on Ottawa testing dataset under condition2.
Table 10 The fault diagnosis accuracy of each model on Ottawa testing dataset under condition3.

Figure 12 shows the confusion matrices of the four models when predicting the five fault types in the Ottawa test dataset at SNR = −8 dB. From the confusion matrices, it can be seen that DResNet exhibits significant confusion between several categories; in particular, distinguishing the C1 (ball fault) category from the other fault categories is difficult. DResNet also fails to show strong robustness in identifying fault categories, resulting in a high misclassification rate, which indicates that the model's feature extraction capability and classification accuracy are limited under complex noise conditions. Compared to DResNet, ECANet shows improved classification, but its confusion matrix still reveals confusion in some fault categories, especially between C2 (compound fault) and C4 (inner race fault). This indicates that despite ECANet's optimized feature learning, the model still has difficulty distinguishing some similar faults in a low-SNR environment. MRCFN significantly outperforms both DResNet and ECANet in this task: as seen in the confusion matrix, MRCFN performs better on most categories, with very high classification accuracy especially for C1 and C3 (healthy). However, despite MRCFN's better feature extraction on complex data, its confusion between categories C2 and C4 is still significant, suggesting that the model's robustness needs improvement for certain fault types. From the confusion matrix, our model shows almost no misclassification, and almost all categories are classified accurately, especially categories C2 and C3, where the classification is almost error-free. Despite the similarity between compound faults (C2) and inner race faults (C4), TFDFNet achieves nearly perfect classification, whereas the other models suffer from category confusion, further demonstrating TFDFNet's advantage in resolving closely related fault types.

According to the t-SNE visualizations, the four models (DResNet, ECANet, MRCFN, and ours) differ significantly on the test dataset. DResNet (Fig. 13a) shows strong category overlap, and the boundary between the fault categories and the normal state is particularly ambiguous, suggesting weak discriminative ability on this complex task. ECANet (Fig. 13b) improves on DResNet, but some overlap between categories remains, most notably between the C-A and H-A categories. MRCFN (Fig. 13c) achieves better category separation, with less overlap between the B-A and C-A categories, demonstrating more effective feature differentiation. Our model (Fig. 13d) performs best: the distributions of all categories are almost completely separated in the 2D embedding space, reflecting its superior feature extraction and classification accuracy.
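For readers reproducing this analysis, the following minimal sketch shows how such a t-SNE projection of extracted features can be generated with scikit-learn; the `features` and `labels` arrays here are hypothetical stand-ins for the embeddings produced by each model.

```python
# Illustrative t-SNE visualization of learned features (sketch only; the
# feature arrays below are random stand-ins, not the models' actual outputs).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(500, 128)        # stand-in for (N, D) embeddings
labels = np.random.randint(0, 5, size=500)  # stand-in for 5 fault classes

# Project the high-dimensional features to 2D for inspection.
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of extracted features (illustrative)")
plt.show()
```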

Overall, the experimental results on the two datasets demonstrate the excellent performance of TFDFNet across different noise environments and complex working conditions. Its dual-branch structure effectively fuses temporal and time-frequency information, and the cross-attention mechanism further refines the feature fusion strategy, improving the robustness and accuracy of fault diagnosis. TFDFNet significantly outperforms the existing baseline models in noise immunity at low SNR and in stability under complex operating conditions, showing strong potential for application in industrial environments.

Fig. 12 Confusion matrix on the Ottawa testing dataset with SNR = \(-8\,\textrm{dB}\). (a) DResNet, (b) ECANet, (c) MRCFN, (d) ours.

Fig. 13 Visualization of the extracted features of four models for the Ottawa testing dataset with SNR = \(-8\,\textrm{dB}\). (a) DResNet, (b) ECANet, (c) MRCFN, (d) ours.

Ablation experiments

To assess the contributions of key architectural components in TFDFNet, we conduct a series of ablation studies on the CWRU and Ottawa datasets. The analysis examines the effects of branch configurations, core modules, fusion mechanisms, temporal feature extractors, and image encoder choices on overall diagnostic performance.

Effect of dual-branch architecture

To verify the contribution of each input modality, we compare the performance of the model using only the raw time-series signal branch, only the time-frequency image branch, and the proposed dual-branch configuration. Table 11 summarizes the results.

Table 11 Fault classification accuracy of different branching models on CWRU and Ottawa datasets (%).

The experimental results show that the model achieves a high accuracy of 93.3% on the CWRU dataset when only the raw-signal branch is used, significantly higher than the 89.73% obtained with only the time-frequency image branch. This suggests that the temporal information in the raw signal is strongly discriminative for fault diagnosis on the CWRU dataset. On the Ottawa dataset, the situation is reversed: the time-frequency image branch performs slightly better (96.57%) than the raw-signal branch (95.65%), which may be related to the signal characteristics of this dataset and the effectiveness of its time-frequency representation. Importantly, when the image and signal feature extraction branches are combined in the dual-branch model, performance improves over either single-branch model, reaching 100% accuracy on the CWRU dataset and 99.44% on the Ottawa dataset (see Tables 3 and 7 for details). By fusing multimodal features, the model captures key information from both the temporal signal and its time-frequency representation more comprehensively, raising the overall fault classification accuracy. This demonstrates the advantage of multimodal information fusion: the raw signal carries the temporal evolution of the faults, while the time-frequency image reveals the frequency content of the signal, which is particularly expressive for nonlinear and complex fault modes.
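To make the dual-branch idea concrete, the skeleton below sketches a minimal version under our own assumptions: lightweight stand-in encoders replace the paper's actual branch networks, and plain concatenation stands in for the cross-attention fusion compared in a later subsection.

```python
# Schematic sketch of a dual-branch diagnoser (module names are ours, not the
# paper's code): one branch encodes the raw 1-D signal, the other encodes its
# time-frequency image, and the fused representation feeds the classifier.
import torch
import torch.nn as nn

class DualBranchSketch(nn.Module):
    def __init__(self, num_classes: int = 10, dim: int = 128):
        super().__init__()
        # Time-series branch: 1-D convs as a stand-in for the temporal encoder.
        self.signal_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, dim))
        # Image branch: 2-D convs as a stand-in for the time-frequency encoder.
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, signal, image):
        # Concatenation fusion for brevity; see the cross-attention sketch later.
        z = torch.cat([self.signal_branch(signal),
                       self.image_branch(image)], dim=1)
        return self.classifier(z)

# Smoke test: a batch of 4 raw segments and 4 time-frequency images.
logits = DualBranchSketch()(torch.randn(4, 1, 2048), torch.randn(4, 3, 224, 224))
```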

Effect of key components in TFDFNet

To further investigate the contribution of each feature interaction mechanism in our model to fault diagnosis performance, we conducted ablation experiments by selectively removing specific modules from TFDFNet and evaluating the impact on classification accuracy using the CWRU and Ottawa datasets. The experimental results are summarized in Table 12.

Table 12 TFDFNet component ablation results (%).

The results demonstrate that each component contributes significantly to model performance. Removing the cross-attention fusion module, which integrates time-frequency and time-series features, leads to a noticeable performance drop, confirming its effectiveness in enhancing inter-modality feature interaction. Eliminating the global attention mechanism from the temporal branch also reduces accuracy, highlighting the importance of dynamic temporal weighting in capturing long-term dependencies. When both components are removed, performance degrades further, falling below either single-component ablation. These findings underscore that the two mechanisms are complementary, and that their joint use is key to achieving optimal fault diagnosis performance.

Comparison of fusion strategies

To validate the superiority of the proposed cross-attention mechanism over simpler alternatives, we compare it with concatenation and weighted-averaging fusion methods. The results are shown in Table 13.

Table 13 Comparison of different fusion strategies on CWRU and Ottawa datasets (%).

Cross-attention consistently outperforms simpler fusion strategies by a large margin, illustrating its effectiveness in selectively emphasizing relevant features across modalities.
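A hedged sketch of what cross-attention fusion between the two branch embeddings can look like is given below, built on PyTorch's standard multi-head attention; the paper's exact layer configuration (dimensions, heads, normalization) is not specified here and is assumed.

```python
# Illustrative cross-attention fusion: each modality's tokens attend to the
# other modality's tokens, then the enhanced sequences are pooled and joined.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn_sig2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_img2sig = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, sig_tokens, img_tokens):
        # Signal tokens query the image tokens, and vice versa.
        sig_enh, _ = self.attn_sig2img(sig_tokens, img_tokens, img_tokens)
        img_enh, _ = self.attn_img2sig(img_tokens, sig_tokens, sig_tokens)
        # Mean-pool each enhanced sequence and concatenate for the classifier.
        return torch.cat([sig_enh.mean(dim=1), img_enh.mean(dim=1)], dim=1)

# Smoke test with hypothetical token sequences of length 49 and width 128.
fused = CrossAttentionFusion()(torch.randn(4, 49, 128), torch.randn(4, 49, 128))
```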

Comparison of temporal feature extractors

We also compare GABlock with standard sequence modeling approaches such as GRU and LSTM. Each variant replaces the GABlock module in the time-series branch with a two-layer GRU or LSTM. Table 14 shows the results.

Table 14 Comparison of temporal extractors on CWRU and Ottawa datasets (%).

GABlock outperforms both GRU and LSTM: its hybrid structure, integrating convolutional processing, global average pooling, and attention-based weighting, provides enhanced capacity to capture both local and global temporal dependencies. This design enables more discriminative and robust representation of fault-related patterns in time-series signals.
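Based only on that description (convolutional processing, global average pooling, and attention-based weighting), a GABlock-style module might be reconstructed as follows; this is an illustrative approximation, not the authors' implementation.

```python
# GABlock-style temporal module reconstructed from the textual description:
# a conv stage followed by a squeeze-and-excitation-like channel gate.
import torch
import torch.nn as nn

class GABlockSketch(nn.Module):
    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels), nn.ReLU())
        # Global average pooling feeds an attention gate over channels.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                 # x: (batch, channels, length)
        y = self.conv(x)
        w = self.gate(y).unsqueeze(-1)    # per-channel attention weights
        return y * w                      # globally re-weighted features

out = GABlockSketch()(torch.randn(4, 64, 256))
```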

Comparison of image encoders

To further verify the effectiveness of using Swin Transformer as the image encoder for time-frequency representations, we conducted comparative experiments against classical CNN-based backbones, including ResNet18 and a standard shallow CNN. For a fair comparison, only the image branch is used, and all models share the same input resolution (224\(\times\)224) and training settings. The results are shown in Table 15.

Table 15 Comparison of different image encoders on CWRU and Ottawa datasets (%).

The results demonstrate that the Swin Transformer achieves the best performance among the evaluated encoders. Compared to the traditional CNN and ResNet18, Swin Transformer shows a clear advantage in modeling time-frequency images, benefiting from its ability to capture both local and global contextual information through shifted window self-attention mechanisms.
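For illustration, a Swin Transformer encoder for 224\(\times\)224 time-frequency images can be instantiated with the `timm` library as sketched below; the choice of `timm` and of the `swin_tiny_patch4_window7_224` variant are our assumptions, since the paper does not specify an implementation.

```python
# One way to obtain a Swin Transformer feature extractor for the image branch
# (tooling choice is ours; set pretrained=True to load ImageNet weights).
import timm
import torch

encoder = timm.create_model("swin_tiny_patch4_window7_224",
                            pretrained=False, num_classes=0)  # features only
with torch.no_grad():
    feats = encoder(torch.randn(2, 3, 224, 224))
print(feats.shape)  # pooled embedding per image, e.g. (2, 768) for swin-tiny
```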

Complexity analysis

In addition to classification accuracy, it is essential to evaluate the computational complexity of different models, as complexity directly affects the feasibility of deployment in real-world industrial environments. A model with excessive parameters may achieve high accuracy but could be limited by hardware constraints in terms of memory consumption and inference efficiency. Therefore, we further compare the proposed TFDFNet with representative methods in terms of parameter size, training speed, and inference speed. The results are summarized in Table 16.

Table 16 Comparison of network complexity in terms of parameter size, training speed, and inference speed.

From Table 16, we observe that TFDFNet has a larger number of parameters (88.15M) than the other models, owing to its dual-branch architecture and attention mechanisms. As a result, both its training and inference speeds are lower than those of lightweight models such as MRCFN. Nevertheless, the inference speed of TFDFNet still reaches 166 FPS (about 6 ms per sample), which is sufficient for real-time fault diagnosis in most industrial scenarios. This result highlights a trade-off between model accuracy and computational complexity: TFDFNet achieves significantly higher diagnostic accuracy while maintaining acceptable efficiency, making it a practical solution for engineering applications where robustness and reliability are paramount.
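For reference, figures like the parameter counts and FPS in Table 16 can be measured with a simple PyTorch routine such as the sketch below; the warm-up/iteration protocol is our assumption, and the `nn.Linear` model is a stand-in for an actual diagnosis network.

```python
# Hedged sketch for measuring model size and inference throughput in PyTorch.
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Total trainable-and-frozen parameter count, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def inference_fps(model, sample, warmup: int = 10, iters: int = 100) -> float:
    """Average samples processed per second at batch size 1."""
    model.eval()
    for _ in range(warmup):
        model(sample)                      # warm-up passes, not timed
    start = time.perf_counter()
    for _ in range(iters):
        model(sample)
    return iters / (time.perf_counter() - start)

model = torch.nn.Linear(2048, 10)          # stand-in for a diagnosis model
print(f"{count_params_m(model):.2f}M params, "
      f"{inference_fps(model, torch.randn(1, 2048)):.0f} FPS")
```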

Conclusions and future work

In this paper, we propose a novel fault diagnosis model that integrates raw signal features with time-frequency image features to achieve more accurate fault classification. The model adopts a parallel dual-branch architecture: one branch extracts features from time-frequency representations, while the other learns temporal features directly from the raw signals. To effectively fuse these heterogeneous modalities, we introduce a feature interaction mechanism that enhances information exchange between the two branches, thereby fully exploiting the complementarity of different feature types. Extensive experiments conducted on publicly available datasets demonstrate that the proposed dual-branch model outperforms traditional single-branch architectures, confirming the effectiveness of multimodal feature fusion. Furthermore, ablation studies verify the individual contributions of each model component, highlighting the significant role of combining diverse feature representations in boosting diagnostic performance. Notably, our model maintains high classification accuracy even under various noise conditions, indicating its strong robustness and practical potential for real-world fault diagnosis applications.

Despite these promising results, several limitations remain when applying the proposed method in real-world industrial environments. The dual-branch architecture increases computational complexity, which may hinder real-time performance and deployment on resource-constrained devices. In addition, the model assumes consistent distributions between training and testing data, which is often not the case due to domain shifts caused by different machines, sensors, or operating conditions. The method also relies heavily on labeled training data, which are often limited or costly to obtain in practice. To address these limitations, we aim to enhance the practicality and adaptability of TFDFNet in future work. We will explore lightweight model designs to reduce computational cost and enable deployment on edge devices. Online learning mechanisms will be considered to allow the model to adapt dynamically to new data during operation. We also plan to investigate transfer learning and domain adaptation techniques to improve robustness across varying machines and conditions. Moreover, to alleviate the reliance on large-scale annotated data, we will study few-shot and self-supervised learning strategies. These efforts will collectively improve the generalization ability, efficiency, and real-time applicability of the proposed framework.