TFDFNet: a dual-branch fault diagnosis model for bearings under noisy and complex industrial environments

Tang, Wuxue; Chen, Mingxuan

doi:10.1038/s41598-025-19258-2

Download PDF

Article
Open access
Published: 09 October 2025

TFDFNet: a dual-branch fault diagnosis model for bearings under noisy and complex industrial environments

Wuxue Tang¹ &
Mingxuan Chen¹

Scientific Reports volume 15, Article number: 35371 (2025) Cite this article

1082 Accesses
Metrics details

Subjects

Abstract

Unexpected failures in rotating machinery can cause costly downtime and safety hazards in industrial systems, highlighting the need for accurate and robust fault diagnosis. However, fault-related signals are often weak and easily obscured by noise, making reliable diagnosis challenging in real-world environments. To address this, we propose Time-frequency and time-series dual-branch fusion network(TFDFNet), a novel dual-branch deep learning model designed to improve fault classification performance under noisy and complex conditions. The model combines two complementary types of information: time-frequency representations derived from continuous wavelet transform and raw time-sequence data captured through sliding-window sampling. A Swin Transformer is used to extract deep features from time-frequency images, while a specially designed module called Gated attention block(GABlock) learns key temporal patterns from the sequence data. These features are fused using a cross-attention mechanism to enhance fault-related information. Extensive experiments on two public bearing fault datasets (CWRU and Ottawa) show that TFDFNet achieves outstanding accuracy, even under severe noise interference. The model reaches up to 100% accuracy on CWRU and 99.44% on Ottawa, and consistently outperforms existing convolutional neural network (CNN) baselines. These results demonstrate the practical potential and robustness of TFDFNet for intelligent fault diagnosis in industrial applications.

One-dimensional time-frequency dual-channel visual transformer for bearing fault diagnosis under strong noise and limited data conditions

Article Open access 20 July 2025

Intelligent fault diagnosis based on multi-source information fusion and attention-enhanced networks

Article Open access 16 October 2025

Implementation of MF block in CNN for advanced REB fault diagnosis

Article Open access 25 May 2025

Introduction

With the rapid advancement of modern industry, rotating machinery is increasingly utilized across various fields. Bearings, as key components of these systems, operate under harsh conditions, including high loads, temperatures, and speeds. They are widely employed in industries such as wind power, aerospace, and railroad transportation¹. However, in these extreme environments, bearings are highly susceptible to vibration, impact, and wear, making them the most vulnerable part of the machinery. Bearing failures not only lead to equipment damage but can also result in significant economic losses and safety risks. Therefore, timely fault diagnosis is critical to enhance equipment reliability and ensure safety. Since vibration signals in industrial environments are often contaminated by noise, and fault features are weak and difficult to extract, achieving accurate and robust fault diagnosis in such complex settings remains a significant challenge^2,3. The issue of fault diagnosis has attracted considerable attention from researchers. Currently, bearing fault diagnosis methods are primarily divided into signal-based and data-driven approaches⁴.

Signal-based fault diagnosis

Signal analysis methods extract bearing fault features by processing the raw signal in the time domain, frequency domain, or time-frequency domain, thus enabling fault diagnosis. Especially in complex noise environments, many advanced signal processing techniques are used to achieve more accurate fault diagnosis, such as Short-Time Fourier Transform (STFT), Wavelet Transform (WT), Empirical Mode Decomposition (EMD), Variational Mode Decomposition (VMD), and Singular Value Decomposition (SVD)^5,6,7,8,9,10.These methods effectively reduce noise interference and separate sensitive features, thereby reflecting the fault condition of the bearing in strong noise environments.

For example, Chen et al. proved their effectiveness in fault diagnosis through an improved integrated EMD and Hilbert square demodulation method¹¹; Bin et al. proposed a method combining wavelet packet decomposition and EMD for extracting the fault eigenfrequencies¹²; and Jiang et al. introduced the neighboring singular value ratio based on the singular value decomposition concept and combined it with Hidden Markov Model (HMM) to realize the identification of bearing faults¹³; Although the methods based on signal analysis can improve the accuracy of fault diagnosis to a certain extent, they have limitations in dealing with high-noise signals under complex operating conditions due to the fact that the bearing signals are usually highly nonlinear and nonsmooth. In addition, they usually rely on manual feature extraction, which makes it difficult to comprehensively capture fault information and affects the robustness and generalization ability of diagnosis.

Data-driven fault diagnosis

With the development of deep learning technology, data-driven fault diagnosis methods based on data have gradually become a research hotspot.These methods can automatically learn complex features in vibration signals and show excellent performance in fault classification and identification^14,15. Compared with traditional methods, data-driven methods do not rely on excessive a priori knowledge and have strong adaptive capability.Zhang et al.proposed a subset-based deep self-encoder feature learning model, developed an adaptive fine-tuning operation hey enhanced feature learning, and used a particle swarm algorithm to optimize the key parameters¹⁶. Wang et al.proposed an industrial motor bearing fault diagnosis algorithm based on multi-local model decision conflict resolution (MLMF-CR), which, after the initial characteristic signal selection and cleanup of industrial motor bearing vibration and current signals, digs deeper into the characteristics of the bearing signals in each fault state through the local fault diagnosis model based on bidirectional long- and short-term memory network (Bi-LSTM). Information to form a local diagnosis, and after making a decision, use the evidence theory for fusion¹⁷. Ma et al.proposed a new deep neural network by combining the advantages of CNN and long-term memory. The contribution of this method is the use of CNNs to process signals in the joint time-frequency domain, preserving feature information¹⁸. Dibaj et al. were able to identify unanticipated and untrained composite fault states¹⁹ by using a CNN model trained with three classes of healthy states, single bearing faults, and single gear faults in conjunction with probabilistic conditions.

Despite the significant results of deep learning-based methods in bearing fault diagnosis, they still face the impact of noisy environments on model robustness. Traditional convolutional neural networks mainly process one-dimensional signals, which are limited by a small receptive field and may lose critical information. In addition, current methods mainly rely on single-modal data and cannot fully utilize multimodal information for diagnosis, resulting in poor generalization ability in strong noise environments^{20,21,22,23,24}. To address these challenges, researchers have developed various dual-/multi-channel and multimodal fusion models.For instance,Cross-domain time-frequency adaptive fusion network(CDTFAFN)²⁵ introduced a coarse-to-fine dual-scale attention mechanism to fuse raw acoustic and vibration signals, achieving robust performance in noisy environments. Chen et al.²⁶ proposed a self-supervised framework that jointly leverages original time-series signals and their Fourier-transformed counterparts to enhance performance under few-label conditions. These studies highlight the effectiveness of dual-stream architectures and attention-based modeling for robust and generalizable fault diagnosis.Multi-information fusion deep ensemble learning network(MIFDELN)²⁷, which employs weighted sensor signal fusion and cross-scale attention modules to improve feature discriminability under noisy conditions; and Multi-sensor residual convolutional fusion network(MRCFN)²⁸, which combines residual modules and global perception mechanisms to effectively extract features in small sample and high-noise scenarios. For instance, Jiang et al.²⁹ proposed a deep convolutional multi-adversarial adaptation network with correlation alignment for cross-condition fault diagnosis, while Zhang et al.³⁰ developed a multi-scale deep feature memory and recovery network tailored for multi-sensor fault diagnosis under channel-missing scenarios. These works not only present advanced architectures but also provide detailed parameter configurations, which inspire the way we present and clarify the setup of TFDFNet in this study. In addition, dual-channel technology has demonstrated great potential in other domains. For example, Chen et al.³¹ constructed a dual-channel SE-3DUNet-based detection model for cerebral aneurysm screening in the field of clinical medicine, which demonstrated a lower false alarm rate and better sensitivity in a noisy environment. In the field of instrumentation, Zhang et al.³² realized multi-signal detection and analysis by constructing a sensor technology based on dual-channel signal reading, which effectively reduces false-positive signals and improves the sensitivity and accuracy of detection.

Although dual-branch models have demonstrated strong performance by leveraging multimodal features, many existing approaches suffer from limitations such as inadequate interaction between branches, rigid fusion strategies (e.g., direct concatenation or addition), and insufficient adaptability to varying signal characteristics. Moreover, some methods rely heavily on manual architecture design without exploring lightweight or flexible attention mechanisms. These limitations motivate the development of a more dynamic and effective cross-modal fusion strategy. Motivated by the strengths and limitations of existing approaches, this paper proposes a novel dual-branch fault diagnosis framework,Time-frequency and time-series dual-branch fusion network(TFDFNet),which integrates both time-frequency image features and one-dimensional time-series features in a parallel structure. Unlike existing methods that focus on a single modality or use simple feature concatenation, TFDFNet adopts a lightweight cross-attention fusion mechanism and a GABlock-based signal encoder to better capture complementary and discriminative representations. The proposed model aims to improve robustness and classification accuracy in real-world, high-noise environments.

The primary contributionMotivated by the strengths and limitations of existings of this paper are as follows:

Proposes TFDFNet, a parallel dual-branch model for mechanical fault classification that leverages multimodal fusion of time-frequency image features and one-dimensional time-series signal features to fully exploit their complementary information and enhance fault recognition accuracy.
Designs a specialized signal feature extraction module, Gated attention block(GABlock), which accurately captures key information from sequential data and significantly improves the model’s representational power and classification performance.
Extensive experiments on multiple public datasets validate the effectiveness of the proposed method and demonstrate its strong robustness under various noise conditions.

Preliminaries

Swin Transformer

Swin Transformer³³ is a hierarchical visual transformer architecture based on the sliding window mechanism, which demonstrates excellent performance in image feature extraction tasks.Swin Transformer introduces the Shifted Window Attention mechanism, which significantly reduces the computational complexity of the global self-attention by dividing the input features into fixed Swin Transformer introduces the Shifted Window Attention mechanism, by dividing the input features into fixed-size windows and locally calculating the self-attention within the windows, the computational complexity of the global self-attention is significantly reduced, which is especially suitable for the processing of high-resolution images. Meanwhile, the sliding window strategy effectively captures the contextual relationship between local and global through the information interaction across windows. In addition, Swin Transformer adopts a hierarchical feature extraction structure to extract multi-scale features while reducing the resolution of the feature map layer by layer, which makes it perform well in tasks such as image classification, target detection, and instance segmentation.

In terms of time-frequency map feature extraction, Swin Transformer’s local-awareness and multi-scale modeling capabilities have significant advantages. As a two-dimensional image representation, the time-frequency diagram can intuitively reflect the time-frequency distribution characteristics of the signal. By utilizing the layered architecture of Swin Transformer, key local and global features in the time-frequency diagram can be fully captured, while retaining the rich information of the signal in the time and frequency dimensions. Compared with traditional convolutional neural networks, Swin Transformer not only has stronger modeling capabilities, but also captures the long-distance dependencies between features in the time-frequency diagram through the self-attention mechanism, which improves the comprehensiveness and robustness of feature representation and provides a more efficient solution for signal processing and analysis tasks.

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is a simplified version of the Long Short-Term Memory (LSTM). The GRU combines the input and forget gates in the LSTM by integrating them into a single update gate and contains two key gate structures: the reset gate and the update gate. The update gate is responsible for controlling the relationship between the current hidden layer and the hidden layer at the previous moment. The larger the value of the update gate, the more influence the previous moment hidden layer has on the current hidden layer. On the contrary, the reset gate determines the extent to which the previous moment’s hidden layer information is ignored. The smaller the value of the reset gate, the more information from the previous moment is ignored. Specifically, the reset gate mainly controls how the previous hidden state and the current input information are combined, while the update gate determines how much a priori information should be retained at the current moment³⁴. The gating structure of the GRU is shown in Fig. 1.

In Fig. 1, $x_t$ denotes the input data at time t, $h_t$ denotes the output of the GRU unit, and $r_t$ and $z_t$ are the reset gate and update gate, respectively. With $r_t$ and $z_t$, the GRU computes a new hidden state $h_t$ based on the previous hidden state $h_{t-1}$. The calculation process can be expressed as follows:

$$\begin{aligned} z_t= & \sigma (W_z x_t + U_z h_{t-1} + b_z) \end{aligned}$$

(1)

$$\begin{aligned} r_t= & \sigma (W_r x_t + U_r h_{t-1} + b_r) \end{aligned}$$

(2)

$$\begin{aligned} \tilde{h}_t= & \tanh \left( W_h x_t + U_h (r_t \odot h_{t-1}) + b_h \right) \end{aligned}$$

(3)

$$\begin{aligned} h_t= & (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{aligned}$$

(4)

Here, $\sigma (\cdot )$ denotes the sigmoid activation function, $\tanh (\cdot )$ denotes the hyperbolic tangent activation, $W_z, W_r, W_h$ are the input weight matrices, $U_z, U_r, U_h$ are the recurrent weight matrices, and $b_z, b_r, b_h$ are the bias terms. The update gate $z_t$ controls the degree to which the previous hidden state contributes to the current hidden state, while the reset gate $r_t$ decides how much past information should be ignored when computing the candidate hidden state $\tilde{h}_t$. Finally, the hidden state $h_t$ is obtained by interpolating between the previous hidden state $h_{t-1}$ and the candidate state $\tilde{h}_t$ according to the update gate.

Proposed methods

Overall framework

This study proposes a novel two-branch fault diagnosis model that incorporates time-frequency features and time-sequence features to achieve accurate fault classification. As shown in Fig. 2, the model mainly consists of a time-frequency feature extraction branch and a time-sequence feature extraction branch. These two branches respectively extract complementary information from the original signal and subsequently perform feature fusion to enhance the model’s characterization capability. In the time-frequency feature extraction branch, the original signal is first subjected to continuous wavelet transform to convert it into time-frequency images. These time-frequency images are then fed into Swin Transformer for feature extraction, which serves as an efficient image encoder capable of capturing multi-scale spatial dependencies while reducing computational complexity. Finally, the time-frequency features extracted in this branch are denoted as $F_1$. In the time-series feature extraction branch, the original signal is first segmented using sliding window sampling to preserve its time-series information. The segmented signal is then fed into the feature extraction module (GABlock), which is used to capture the local and global information of the signal, and the extracted time-series features are noted as $F_2$. Before feature fusion, a feature alignment strategy is applied to unify the feature dimensions of $F_1 \in \mathbb {R}^{M \times d_1}$ and $F_2 \in \mathbb {R}^{N \times d_2}$. Specifically, we apply a linear transformation:

$$\begin{aligned} F_1' = F_1 W_1,\quad F_2' = F_2 W_2 \end{aligned}$$

(5)

where $W_1 \in \mathbb {R}^{d_1 \times d}$ and $W_2 \in \mathbb {R}^{d_2 \times d}$ are learnable projection matrices, projecting both modalities into the same feature space $\mathbb {R}^d$.

In order to effectively integrate the features extracted from the two branches, we introduce a feature fusion mechanism that enhances the cross-modal feature interaction. Through this mechanism, $F_1$ and$F_2$ are fused, so that the time-frequency features and the time-sequence features can complement each other, and the completeness and discriminative power of the feature representation are enhanced. Compared with simple fusion methods, cross-attention adaptively focuses on salient features from the other modality, defined as:

$$\begin{aligned} \textrm{CrossAtt}(Q, K, V) = \textrm{softmax}\left( \frac{QK^T}{\sqrt{d}}\right) V \end{aligned}$$

(6)

In this equation, Q, K, and V represent the query, key, and value matrices derived from the input features, and d is the feature dimension used for scaling the dot product.This operation allows the model to dynamically emphasize key parts of the counterpart signal, enabling finer discrimination of similar fault types. Finally, the fused feature representation F passes through the full connectivity layer (FC) and is classified by the Softmax classifier to obtain the final fault diagnosis results. The framework makes full use of the complementary nature of different types of features, improves the classification accuracy, and demonstrates good robustness under multiple operating conditions.To provide a clearer understanding of the processing steps in TFDFNet, Algorithm 1 summarizes the overall workflow of the proposed model from data preprocessing to fault diagnosis.

Multimodal feature extraction module

Time-frequency feature extraction branch

In order to fully utilize the time-frequency information of the signal, this study adopts the continuous wavelet transform to transform the one-dimensional signal into a time-frequency map, and extracts the time-frequency features using a deep network. Although CNNs have commonly been used in prior work for feature extraction on such maps, the Swin Transformer provides significant advantages. It utilizes window-based local self-attention to reduce computational complexity, while its shifted window mechanism enables effective cross-region modeling and enhances global context awareness.In addition, the hierarchical structure and patch merging strategy of the Swin Transformer allow for multi-scale feature modeling, which is particularly beneficial in capturing fault patterns at different temporal and frequency resolutions.

Let the input time-frequency map be denoted as $S \in \mathbb {R}^{H \times W \times C}$, where $H$ and $W$ denote the frequency and time dimensions, respectively, and $C$ is the number of channels. First, a linear projection layer is used to divide the input into a series of non-overlapping patches as the input feature representation of the Transformer:

$$\begin{aligned} Z_0 = W_{emb} S + b_{emb} \end{aligned}$$

(7)

where $W_{emb}$ and $b_{emb}$ are learnable parameters for feature mapping. The projected features are then passed through multiple Swin Transformer layers for feature extraction. Each Swin Transformer layer consists of Window Self-Attention (W-MSA) and Shift Window Self-Attention (SW-MSA), where the attention is computed as follows:

$$\begin{aligned} \textrm{Attention}(Q, K, V) = \textrm{softmax} \left( \frac{QK^T}{\sqrt{d}} \right) V \end{aligned}$$

(8)

where $Q, K, V$ is the query, key, and value matrix, and $d$ is the scaling factor. W-MSA computes attention within local windows, while SW-MSA shifts the window positions to facilitate cross-region feature interactions, thereby enabling the network to model both local and global dependencies efficiently. This enables the model to maintain a balance between representation richness and computational efficiency, which is particularly important for large-scale industrial fault data.

Finally, after processing in multiple Transformer layers,the extracted time-frequency features are represented as:

$$\begin{aligned} H_{TF} \in \mathbb {R}^{M \times d_1} \end{aligned}$$

(9)

where $M$ is the feature sequence length and $d_1$ is the feature dimension. The feature will be fused with the timing features in the subsequent branch fusion module to improve the fault diagnosis performance.

Time-series feature extraction branch

The temporal signal feature extraction method is in extracting deep temporal patterns from one-dimensional sensing signals to enhance the representation of technical fault characteristics. Given the original sensing signal $X = \{x_1, x_2, \dots , x_R\}$, the signal is first reconstructed by applying the sliding window technique to obtain successive localized timing segments:

$$\begin{aligned} X^{sw} = \{X_1, X_2, \dots , X_N\}, \quad X_i = \{x_i, x_{i+1}, \dots , x_{i+w-1}\} \end{aligned}$$

(10)

where $N$ denotes the number of sequences after window division, $w$ is the length of the window, and each $X_i$ is passed as an independent input to the temporal feature extraction module.

To efficiently extract temporal patterns, we construct a feature extraction module named GABlock, which integrates the strengths of recurrent neural networks and attention mechanisms.Specifically, GABlock combines GRU-based modeling of sequential dependencies with a gated attention mechanism to adaptively emphasize critical time steps. Let $H_0 \in \mathbb {R}^{N \times d}$ be the initialized representation of the input signals, where $d$ denotes the feature dimensions of each window segment, and GABlock generates deep time-series features through a series of mapping functions $f(\cdot )$:

$$\begin{aligned} H = f(H_0) \end{aligned}$$

(11)

in this process, the implicit representation $h_t$ of each time step $t$ is calculated from the dynamics of the signals before and after the correlation:

$$\begin{aligned} h_t = \varphi (W_h H_t + b_h) \end{aligned}$$

(12)

where $W_h$ and $b_h$ are trainable parameters and $\varphi (\cdot )$ denotes the nonlinear activation function. Subsequently, the importance coefficient $\alpha _t$ is computed for each time step using the feature weighting mechanism:

$$\begin{aligned} \alpha _t = \frac{\exp (W_a h_t)}{\sum _{j=1}^{T} \exp (W_a h_j)} \end{aligned}$$

(13)

the global temporal feature representation is obtained after weighting:

$$\begin{aligned} H_{TS} = \sum _{t=1}^{T} \alpha _t h_t \end{aligned}$$

(14)

In implementation, GABlock consists of two 1D convolution layers (kernel size = 5, stride = 1) with GELU activations and BatchNorm for robust feature learning. A global average pooling layer aggregates contextual cues, and the attention mechanism dynamically adjusts feature importance across time.This allows the model to capture both transient and long-term dependencies in the input signal.We also provide a comparative analysis in Sect. "Comparison of temporal feature extractors" to validate the effectiveness of GABlock against classical temporal models such as GRU and LSTM.

Finally, the timing features $H_{TS}$ generated by GABlock serve as the time-series representation of the input signal and are passed to the subsequent fusion module.

Branch fusion

After completing the time-frequency feature extraction and the time-sequence feature extraction, the features from the two branches need to be fused to fully leverage their complementary information and improve fault classification performance. Let the output features of the time-frequency feature extraction branch be denoted as $H_{TF} \in \mathbb {R}^{M \times d_1}$, and the output features of the time-sequence feature extraction branch as $H_{TS} \in \mathbb {R}^{N \times d_2}$, where $M$ and $N$ represent the feature lengths of the two branches, $d_1$ and $d_2$ represent their feature dimensions, respectively. To effectively utilize the complementary information from both branches, we introduce the Cross-Attention mechanism, which enhances the interaction between the time-frequency and time-sequence features, allowing them to focus on each other’s important information.

First, the time-frequency features $H_{TF}$ and time-sequence features $H_{TS}$ are mapped to Query, Key, and Value representations via learnable linear transformation matrices:

$$\begin{aligned} Q_{TF}= & W_Q H_{TF}, \quad K_{TS} = W_K H_{TS}, \quad V_{TS} = W_V H_{TS} \end{aligned}$$

(15)

$$\begin{aligned} Q_{TS}= & W'_Q H_{TS}, \quad K_{TF} = W'_K H_{TF}, \quad V_{TF} = W'_V H_{TF} \end{aligned}$$

(16)

where $W_Q, W_K, W_V, W'_Q, W'_K, W'_V$ are trainable parameter matrices used to project the features into the same attention space. Then, the attention scores between the two branches are computed to quantify the similarity between the feature sets:

$$\begin{aligned} A_{TF \rightarrow TS}= & \textrm{softmax} \left( \frac{Q_{TF} K_{TS}^{\textrm{T}}}{\sqrt{d}} \right) \end{aligned}$$

(17)

$$\begin{aligned} A_{TS \rightarrow TF}= & \textrm{softmax} \left( \frac{Q_{TS} K_{TF}^{\textrm{T}}}{\sqrt{d}} \right) \end{aligned}$$

(18)

where $d$ is the scaling factor to prevent excessively large attention scores. The computed attention scores are used to weight the corresponding values to generate the enhanced feature representations:

$$\begin{aligned} H_{TF}^{cross}= & A_{TF \rightarrow TS} V_{TS} \end{aligned}$$

(19)

$$\begin{aligned} H_{TS}^{cross}= & A_{TS \rightarrow TF} V_{TF} \end{aligned}$$

(20)

which capture the important information from the complementary modality.

Finally, the final fused feature representation is obtained by concatenating the enhanced features and applying a nonlinear transformation:

$$\begin{aligned} H_{fusion} = \sigma (W_f \left[ H_{TF}^{cross}, H_{TS}^{cross} \right] ) + b_f \end{aligned}$$

(21)

where $W_f$ and $b_f$ are learnable parameters, and $\sigma (\cdot )$ represents a nonlinear activation function, such as Rectified linear unit(ReLU) or Gaussian error linear unit(GELU). The resulting $H_{fusion}$ is then passed to the final classification layer for fault classification. After the final fully connected layer and softmax classifier, the probability distribution over $C$ fault classes is obtained:

$$\begin{aligned} \hat{y} = \textrm{softmax}(W_c H_{fusion} + b_c) \end{aligned}$$

(22)

where $W_c$ and $b_c$ are trainable parameters, and $\hat{y} \in \mathbb {R}^{C}$ represents the predicted probability distribution. This cross-attention fusion mechanism enables the model to adaptively focus on the most salient and informative regions of the complementary modality, significantly enhancing the robustness and discriminability of the learned feature representations. The overall process of cross-attention feature interaction and fusion is illustrated in Fig. 3.

Experiments

In order to assess the validity, progress and robustness of the proposed model, experimental analyses are carried out in this paper using CWRU and Ottawa bearing failure datasets.

Dataset description

The Case Western Reserve University Bearing Data Center provides the CWRU dataset³⁵, which serves as a widely recognized benchmark for bearing fault diagnosis research. As depicted in Fig. 4, the experimental setup includes a drive motor, torque sensor, dynamometer, and associated control electronics. The test bearings are mounted on the motor shaft. Single-point defects of varying severities were introduced into the bearings using electro-discharge machining (EDM), with fault diameters of 0.007 inches, 0.014 inches, 0.021 inches, and 0.028 inches. Owing to incomplete measurements for the 0.028-inch faults, only data corresponding to the other three fault sizes are utilized in this study.The complete dataset comprises 161 samples, categorized into four groups: 48k normal baseline, 48k drive-end fault samples, 12k drive-end fault samples, and 12k fan-end fault samples. Each category includes faults of different types, namely ball defects, inner race defects, and outer race defects. Furthermore, the outer race faults are classified according to their angular positions relative to the load zone: ‘centre’ (6 o’clock position), ‘orthogonal’ (3 o’clock), and ‘reverse’ (12 o’clock).The experiments were conducted under four distinct load conditions: 0 hp, 1 hp, 2 hp, and 3 hp. The rotational speed of the motor varied between 1797 rpm and 1730 rpm. Vibration signals were collected from three accelerometers mounted on the drive end, fan end, and motor base. Each sample was recorded over a duration of 10 seconds, with a consistent sampling frequency.

In this study, the experimental dataset is derived from the vibration signals collected at the drive end bearing. The data acquisition was performed using an accelerometer operating at a sampling frequency of 12kHz, under a load torque of 0hp and a rotational speed of 1797rpm. The fault location corresponds to the outer race at the 6 o’clock position. Detailed parameter settings of the dataset are summarized in Table 1, and the corresponding time-domain waveform is illustrated in Fig. 5.

Table 1 Description of sample size and labels (CWRU dataset).

Subjects

Abstract

Similar content being viewed by others

One-dimensional time-frequency dual-channel visual transformer for bearing fault diagnosis under strong noise and limited data conditions

Intelligent fault diagnosis based on multi-source information fusion and attention-enhanced networks

Implementation of MF block in CNN for advanced REB fault diagnosis

Introduction

Signal-based fault diagnosis

Data-driven fault diagnosis

Preliminaries

Swin Transformer

Gated Recurrent Unit

Proposed methods

Overall framework

Multimodal feature extraction module

Time-frequency feature extraction branch

Time-series feature extraction branch

Branch fusion

Experiments

Dataset description

Data preprocessing

Slide window resampling

Gaussian noise

Continuous wavelet transform

Data preprocessing

Experimental setup

Baseline systems

Implementation details

Result and analysis

Ablation experiments

Effect of dual-branch architecture

Effect of key components in TFDFNet

Comparison of fusion strategies

Comparison of temporal feature extractors

Comparison of image encoders

Complexity analysis

Conclusions and future work

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links