Introduction

Existing studies1,2 have demonstrated that mechanical faults are typically preceded by characteristic precursors in acoustic vibration signals. This study focuses on fault monitoring of conveyor idlers, establishing a non-invasive real-time monitoring system through mechanistic analysis of the coupling between fiber-optic scattering light fields and mechanical vibrations.

Depending on the scattering generation mechanisms3,4, optical wave propagation in fibers simultaneously produces three distinct scattering phenomena with different central frequencies: Rayleigh scattering, Brillouin scattering, and Raman scattering, as shown in Fig. 1. Rayleigh scattering, as an elastic scattering phenomenon, maintains identical frequency to the incident light while demonstrating the highest scattering intensity. Its distinct correlation with external vibrations has established Rayleigh scattering as the predominant mechanism for distributed vibration sensing applications.

Fig. 1. Classification of optical scattering phenomena.

In recent years, accurate and effective identification of fiber optic sensing events has attracted significant research attention globally5,6. Lin et al.7 proposed a recognition framework for fiber optic vibration signals based on wavelet packet Shannon entropy for feature extraction. Combined with a radial basis function neural network, their method achieved an average recognition rate of 82.67% for three vibration events: climbing, walking, and knocking. Ma et al.8 developed a classification algorithm using Mel-frequency cepstral coefficients (MFCCs) and support vector data description (SVDD), achieving 86.67% average accuracy for rain, trampling, and climbing events. Sun et al.9 introduced a Hilbert-Huang transform algorithm with complementary ensemble empirical mode decomposition, attaining 85% recognition accuracy for four typical fiber optic vibration signals. Xu et al.10 analyzed idler fault signals under locked, bearingless, and fractured conditions. Jia11 designed an idler fault detection system for major failure modes, while Roy et al.12 proposed a hybrid cepstrum method combining MFCCs and inverted MFCCs with long short-term memory (LSTM) networks to diagnose normal idlers, rolling element faults, and eccentric rotation faults. Although these methods significantly improve time efficiency in feature extraction, their limited capability to precisely characterize time–frequency information hinders effective differentiation of complex fiber optic vibration events.

To address the efficiency degradation caused by high-dimensional manual feature vectors, this study proposes a deep learning framework based on adaptive feature extraction. Leveraging signal two-dimensionalization preprocessing, we exploit GoogLeNet13,14,15—an optimized convolutional neural network (CNN) with multi-scale hierarchical architecture—to effectively capture both local and global spatial features while reducing computational complexity. The proposed method converts one-dimensional time-series signals from Mel-distributed fiber optic vibration monitoring systems into two-dimensional image representations, preserving complete temporal dependencies and signal integrity.

As demonstrated in prior studies16,17,18,19, GoogLeNet excels at adaptively extracting spatial correlation features through convolutional and pooling operations, whereas LSTM networks specialize in capturing temporal correlation features via precise memory cell control. Unlike conventional deep learning models, the integrated GoogLeNet-LSTM architecture jointly learns spatial–temporal patterns from two-dimensional inputs, enabling automated extraction of time-domain, frequency-domain, and spatial-domain characteristics for enhanced classification.

For practical validation, vibration signals were collected from belt conveyor idlers under typical failure scenarios in open-pit mines, including dust-induced overheating, bearing wear, and shaft fracture. After preprocessing with Mel filtering, the signals were used to train a GoogLeNet-LSTM classifier and establish an idler fault database. To address structural damage-induced intrinsic vibrations in harsh mining environments—where altered system stiffness and damping generate anomalous vibrations—we further propose a dynamics model-driven fault diagnosis method. This approach integrates a conveyor system dynamics model into the fiber optic vibration processing framework. Field-acquired optical vibration signals are compared with theoretically predicted values, and qualified data (within predefined dynamic amplitude thresholds) undergo short-time Fourier transform (STFT) to generate time–frequency images. These images are processed via Mel filtering and input into the trained database for final fault diagnosis, as detailed in Fig. 2.

Fig. 2. Flowchart of dynamics model-driven fiber optic sensing for idler fault diagnosis.

Because structure and load differ along the conveying path, the conveyor is divided into zones 1 to N. On-site conditions such as rain-cover noise and fiber damage may interfere with fault identification. To screen out such interference, the theoretical vibration amplitude \(M_{m}\) obtained from the dynamic model is compared with the amplitude \(S_{m}\) measured on site. When \(0.8S_{m} < M_{m} < 1.2S_{m}\) is satisfied, the on-site measurement is taken to be free of obvious problems, and signal feature processing and analysis proceed. Otherwise, the characteristic signals of that zone are extracted and an alarm prompts on-site workers to review them, as shown in Fig. 3.

Fig. 3. Schematic diagram of fiber optic vibration fault detection driven by dynamic model.
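The amplitude gate described above can be sketched as follows. This is a minimal illustration, assuming that one model amplitude and one measured amplitude are available per zone as plain numbers; the function names, the `tolerance` parameter, and the sample values are hypothetical and are not taken from the deployed system.

```python
def measurement_is_plausible(model_amp, measured_amp, tolerance=0.2):
    """Return True when (1 - tol) * S < M < (1 + tol) * S, i.e. 0.8 S < M < 1.2 S."""
    lower = (1.0 - tolerance) * measured_amp
    upper = (1.0 + tolerance) * measured_amp
    return lower < model_amp < upper


def screen_zones(model_amps, measured_amps, tolerance=0.2):
    """Split zone numbers (1..N) into zones passed on and zones flagged for review."""
    passed, flagged = [], []
    for zone, (m, s) in enumerate(zip(model_amps, measured_amps), start=1):
        if measurement_is_plausible(m, s, tolerance):
            passed.append(zone)
        else:
            flagged.append(zone)
    return passed, flagged


# Hypothetical amplitudes (mm) for three zones: model values, then measured values.
passed, flagged = screen_zones([0.040, 0.060, 0.050], [0.042, 0.058, 0.100])
print(passed, flagged)
```

Zones in `passed` would continue to STFT and Mel processing, while zones in `flagged` would trigger the alarm for manual review.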

Fiber optic vibration sensing-based fault location analysis in conveyor idlers

Assuming the frequency of the laser pulse signal is \(\nu\) with a pulse duration of W, the backward Rayleigh scattering light at the input end of the sensing fiber after propagating through the entire sensing fiber of length l can be expressed as20,21:

$$e(t) = \sum\limits_{{i = 1}}^{{N \times M}} {a_{i} } \exp \left( { - \alpha \frac{{c\tau _{i} }}{n}} \right)\exp \{ j2\pi \nu (t - \tau _{i} )\} rect\left( {\frac{{t - \tau _{i} }}{W}} \right)$$
(1)

In the equation, \(a_{i}\) is the amplitude of the i-th scatterer, \(c\) represents the speed of light, \(\alpha\) represents the attenuation coefficient of the sensing fiber, \(\tau_{i}\) represents the total round-trip time of the i-th scatterer, and n represents the effective refractive index of the fiber, where22,23,24:

$$rect\left( {\frac{{t - \tau_{i} }}{W}} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {0 \le \frac{{t - \tau_{i} }}{W} \le 1} \hfill \\ {0,} \hfill & {{\text{otherwise}}{.}} \hfill \\ \end{array} } \right.$$
(2)

Based on Eqs. (1) and (2), the optical power of the backward Rayleigh scattered light can be calculated as:

$$\begin{aligned} p(t) = & |e(t)|^{2} = \sum\limits_{{i = 1}}^{{N \times M}} {a_{i}^{2} } \exp \left( { - 2\alpha \frac{{c\tau _{i} }}{n}} \right)rect\left( {\frac{{t - \tau _{i} }}{W}} \right) \\ & + 2\sum\limits_{{i = 1}}^{{N \times M}} {\sum\limits_{{j = i + 1}}^{{N \times M}} {a_{i} } } a_{j} \cos \varphi _{{i,j}} \exp \left[ { - \alpha \frac{{c(\tau _{i} + \tau _{j} )}}{n}} \right]rect\left( {\frac{{t - \tau _{i} }}{W}} \right)rect\left( {\frac{{t - \tau _{j} }}{W}} \right) \\ \end{aligned}$$
(3)

In Eq. (3), \(\varphi_{i,j}\) denotes the phase difference between the i-th and j-th scattered light waves. The first term on the right-hand side (RHS) is the sum of the power contributions from all independent scattered light waves within the sensing fiber; it is unaffected by the source frequency and by external vibration events. The second term on the RHS is the power arising from interference between backward Rayleigh scattered waves within the pulse width. When the distance between the i-th scatterer and the input end of the sensing fiber is denoted as \(d_{i}\), the following relationship holds25,26:

$$d_{i} = \frac{{c\tau_{i} }}{2n}$$
(4)

Therefore, the phase difference formed between the i-th and j-th scattered light waves can be calculated as:

$$\varphi_{i,j} = 2\pi \nu (\tau_{i} - \tau_{j} ) = \frac{4\pi \nu n}{c}(d_{i} - d_{j} )$$
(5)
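Eqs. (4) and (5) can be checked numerically with a short sketch. The refractive index, source wavelength, and round-trip times below are assumed values chosen only for illustration; the function names are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in vacuum (m/s)


def scatterer_distance(tau, n):
    """Eq. (4): distance (m) of a scatterer from its round-trip time tau (s)."""
    return C * tau / (2.0 * n)


def phase_difference(nu, n, d_i, d_j):
    """Eq. (5): phase difference (rad) between two scattered waves."""
    return 4.0 * math.pi * nu * n * (d_i - d_j) / C


# Assumed values for illustration only.
n = 1.468                        # effective refractive index
nu = C / 1550e-9                 # optical frequency of a 1550 nm source (Hz)
tau_i, tau_j = 10.0e-6, 9.99e-6  # round-trip times (s)

d_i = scatterer_distance(tau_i, n)
d_j = scatterer_distance(tau_j, n)

# Both forms of Eq. (5) should give the same phase difference.
phi_from_distance = phase_difference(nu, n, d_i, d_j)
phi_from_time = 2.0 * math.pi * nu * (tau_i - tau_j)
print(d_i, phi_from_distance, phi_from_time)
```

The two expressions agree because substituting \(\tau_{i} = 2nd_{i}/c\) from Eq. (4) into \(2\pi \nu (\tau_{i} - \tau_{j})\) gives the distance form directly.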

When a vibration event with amplitude A acts on the sensing fiber, the strain effect and elasto-optic effect induce simultaneous changes in both the fiber length and refractive index at the affected position. Assuming the vibration occurs near the i-th and j-th scattering light waves, the phase difference \(\Phi_{i,j}\) between them can be expressed as27,28,29:

$$\begin{aligned} \Phi_{i,j} = & \varphi_{i,j} + \Delta \varphi_{i,j} = \frac{4\pi \nu n}{c}(d_{i} - d_{j} ) + \frac{4\pi \nu }{c}(d_{i} - d_{j} )(n + \Delta n) \cdot \Delta A \\ = & \frac{4\pi \nu }{c}(d_{i} - d_{j} )[n + (n + \Delta n) \cdot \Delta A] \\ \end{aligned}$$
(6)

According to Eq. (6), the phase variation of backward Rayleigh scattered light is directly proportional to the amplitude of vibration events applied to the sensing fiber. Therefore, the specific vibration location can be demodulated through a differential processing algorithm.
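The text does not specify the differential processing algorithm. As one possible reading, the sketch below subtracts two consecutive backscatter traces and converts the sample index of the largest change into a distance with Eq. (4). The sampling rate, refractive index, and synthetic traces are assumptions made for illustration.

```python
C = 299_792_458.0  # speed of light in vacuum (m/s)


def locate_vibration(trace_before, trace_after, sample_rate, n=1.468):
    """Return the fiber distance (m) at which two traces differ the most."""
    diffs = [abs(b - a) for a, b in zip(trace_before, trace_after)]
    k = max(range(len(diffs)), key=diffs.__getitem__)  # index of largest change
    tau = k / sample_rate        # round-trip time of sample k (s)
    return C * tau / (2.0 * n)   # Eq. (4)


# Synthetic traces: identical except for a disturbance at sample 500.
before = [1.0] * 1000
after = list(before)
after[500] = 1.3

position = locate_vibration(before, after, sample_rate=100e6)
print(position)
```

In practice the difference would be averaged over many pulse pairs to suppress noise; that step is omitted here for brevity.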

Dynamic modeling of belt conveyor systems

Through holistic mechanical analysis of the conveyor system, it is evident that the idler frames are longitudinally distributed along the belt with intricate mass distribution and complex kinematic characteristics. Therefore, formulating dynamic equations for the conveyor system using Lagrangian equations proves particularly effective. The dynamic model of the conveyor system is illustrated in Fig. 4. The complete Lagrangian equations can generally be expressed as follows30,31,32,33,34:

$$\frac{d}{{dt}}\left( {\frac{{\partial T}}{{\partial \dot{x}_{j} }}} \right) - \frac{{\partial T}}{{\partial x_{j} }} + \frac{{\partial V}}{{\partial x_{j} }} + \frac{{\partial D}}{{\partial \dot{x}_{j} }} = F_{j} \left( t \right)$$
(7)

Fig. 4. Dynamic model of the belt conveyor system.

where:

\(F_{j} (t)\) is the external excitation force, j = 1,2,3,…;

\(x_{j}\) is the generalized displacement, j = 1,2,3,…;

\(\dot{x}_{j}\) is the generalized velocity, j = 1,2,3,…;

\(T\) is the kinetic energy of the system;

\(V\) is the potential energy of the system;

\(D\) is the energy dissipation function of the system.

m1 ~ m4: Masses of the support frame, lower roller, left roller, and right roller, respectively (kg).

x1 ~ x4: Displacements of the support frame, lower roller, left roller, and right roller, respectively (m).

k11,c11 and k12,c12: Stiffness and damping coefficients between the left/right sides of the support frame and the ground, respectively.

k21,c21 and k22,c22: Stiffness and damping coefficients between the left/right sides of the lower roller and the support frame, respectively.

k31,c31 and k41,c41: Stiffness and damping coefficients between the lower roller and the left/right rollers, respectively.

k32,c32 and k42,c42: Stiffness and damping coefficients between the left/right rollers and the support frame, respectively.

J: Moment of inertia of the support frame (kg·m²).

a: Horizontal displacement between the center of the support frame and the lower support point (m).

b: Thickness of the support frame (m).

c: Horizontal distance from the center of the support frame (point O) to points B and D (m); this distance is written as e in Eq. (11) and the equations that follow.

θ: Small angular displacement of the support frame (rad).

α: Angle between the left roller and the horizontal plane (rad).

β: Angle between the right roller and the horizontal plane (rad).

Ft1,Ft2: Left and right support forces acting on the support frame (N).

F2,F3,F4: Forces acting on the lower roller, left roller, and right roller, respectively (N).

Using the energy method, the kinetic energy of the conveyor system can be determined as:

$$T = \frac{1}{2}\left[ {m_{1} \left( {\dot{x}_{1} } \right)^{2} + m_{2} \left( {\dot{x}_{2} } \right)^{2} + m_{3} \left( {\dot{x}_{3} } \right)^{2} + m_{4} \left( {\dot{x}_{4} } \right)^{2} + J\dot{\theta }^{2} } \right]$$
(8)

In the equation, \(\dot{x}_{i} (i = 1\sim 4)\) represents the velocity of each mass block.

During actual operation of the conveyor system, uneven material distribution—particularly at the chute discharge port—can induce significant lateral vibrations. Considering the relatively small vibration amplitude of the support frame, it is assumed that \(\sin \theta \approx \theta\). Consequently, the corresponding displacements and velocities at the two ends beneath the support are as follows:

$$\left\{ \begin{aligned} x_{11} = & x_{1} - OA\sin \theta = x_{1} - OA\theta \\ x_{12} = & x_{1} + OA\sin \theta = x_{1} + OA\theta \\ OA = & l = \sqrt {a^{2} + (b/2)^{2} } \\ \end{aligned} \right.$$
(9)
$$\left\{ \begin{aligned} \dot{x}_{11} = & \dot{x}_{1} - OA\dot{\theta } \\ \dot{x}_{12} = & \dot{x}_{1} + OA\dot{\theta } \\ \end{aligned} \right.$$
(10)

According to the relative form of Newton’s second law, it can be concluded that:

$$\left\{ \begin{aligned} x_{21} = & x_{B} - x_{2} = x_{1} - x_{2} - OB\theta \\ x_{22} = & x_{D} - x_{2} = x_{1} - x_{2} + OD\theta \\ OB = & OD = e \\ \end{aligned} \right.$$
(11)
$$\left\{ \begin{aligned} \dot{x}_{21} = & \dot{x}_{B} - \dot{x}_{2} = \dot{x}_{1} - \dot{x}_{2} - OB\dot{\theta } \\ \dot{x}_{22} = & \dot{x}_{D} - \dot{x}_{2} = \dot{x}_{1} - \dot{x}_{2} + OD\dot{\theta } \\ \end{aligned} \right.$$
(12)

Furthermore, based on the positional relationship between the left and right idlers, it can be deduced that their respective displacements and velocities are:

$$\left\{ \begin{aligned} x_{32} = & x_{3} \sin \alpha \\ x_{31} = & x_{2} \sin \alpha - x_{32} = (x_{2} - x_{3} )\sin \alpha \\ x_{42} = & x_{4} \sin \beta \\ x_{41} = & x_{2} \sin \beta - x_{42} = (x_{2} - x_{4} )\sin \beta \\ \end{aligned} \right.$$
(13)
$$\left\{ \begin{aligned} \dot{x}_{32} = & \dot{x}_{3} \sin \alpha \\ \dot{x}_{31} = & \dot{x}_{2} \sin \alpha - \dot{x}_{32} = (\dot{x}_{2} - \dot{x}_{3} )\sin \alpha \\ \dot{x}_{42} = & \dot{x}_{4} \sin \beta \\ \dot{x}_{41} = & \dot{x}_{2} \sin \beta - \dot{x}_{42} = (\dot{x}_{2} - \dot{x}_{4} )\sin \beta \\ \end{aligned} \right.$$
(14)

From the relative displacements in Eqs. (9), (11), and (13), the potential energy of the system is:

$$\begin{aligned} V = & \frac{1}{2}k_{11} (x_{1} - l\theta )^{2} + \frac{1}{2}k_{12} \left( {x_{1} + l\theta } \right)^{2} + \frac{1}{2}k_{21} \left( {x_{1} - x_{2} - e\theta } \right)^{2} \\ & + \frac{1}{2}k_{22} \left( {x_{1} - x_{2} + e\theta } \right)^{2} + \frac{1}{2}k_{31} \left[ {\left( {x_{2} - x_{3} } \right)\sin \alpha } \right]^{2} + \frac{1}{2}k_{32} (x_{3} \sin \alpha )^{2} \\ & + \frac{1}{2}k_{41} \left[ {\left( {x_{2} - x_{4} } \right)\sin \beta } \right]^{2} + \frac{1}{2}k_{42} \left( {x_{4} \sin \beta } \right)^{2} \\ \end{aligned}$$
(15)

Similarly, from the relative velocities in Eqs. (10), (12), and (14), the energy dissipation function is:

$$\begin{aligned} D = & \frac{1}{2}c_{11} (\dot{x}_{1} - l\dot{\theta })^{2} + \frac{1}{2}c_{12} \left( {\dot{x}_{1} + l\dot{\theta }} \right)^{2} + \frac{1}{2}c_{21} \left( {\dot{x}_{1} - \dot{x}_{2} - e\dot{\theta }} \right)^{2} \\ & + \frac{1}{2}c_{22} \left( {\dot{x}_{1} - \dot{x}_{2} + e\dot{\theta }} \right)^{2} + \frac{1}{2}c_{31} \left[ {\left( {\dot{x}_{2} - \dot{x}_{3} } \right)\sin \alpha } \right]^{2} + \frac{1}{2}c_{32} (\dot{x}_{3} \sin \alpha )^{2} \\ & + \frac{1}{2}c_{41} \left[ {\left( {\dot{x}_{2} - \dot{x}_{4} } \right)\sin \beta } \right]^{2} + \frac{1}{2}c_{42} \left( {\dot{x}_{4} \sin \beta } \right)^{2} \\ \end{aligned}$$
(16)

Substituting Eqs. (8), (15), and (16) into Eq. (7) gives the equations of motion of the system, shown here without the external excitation terms \(F_{j} (t)\):

$$\left\{ \begin{aligned} m_{1} \ddot{x}_{1} = & - [(k_{11} + k_{12} + k_{21} + k_{22} )x_{1} + (k_{12} - k_{11} )l\theta - (k_{21} + k_{22} )x_{2} + (k_{22} - k_{21} )e\theta ] \\ & - [(c_{11} + c_{12} + c_{21} + c_{22} )\dot{x}_{1} + (c_{12} - c_{11} )l\dot{\theta } - (c_{21} + c_{22} )\dot{x}_{2} + (c_{22} - c_{21} )e\dot{\theta }] \\ m_{2} \ddot{x}_{2} = & - [ - (k_{21} + k_{22} )x_{1} + (k_{21} + k_{22} + k_{31} \sin^{2} \alpha + k_{41} \sin^{2} \beta )x_{2} - k_{31} \sin^{2} \alpha \, x_{3} - k_{41} \sin^{2} \beta \, x_{4} + (k_{21} - k_{22} )e\theta ] \\ & - [ - (c_{21} + c_{22} )\dot{x}_{1} + (c_{21} + c_{22} + c_{31} \sin^{2} \alpha + c_{41} \sin^{2} \beta )\dot{x}_{2} - c_{31} \sin^{2} \alpha \, \dot{x}_{3} - c_{41} \sin^{2} \beta \, \dot{x}_{4} + (c_{21} - c_{22} )e\dot{\theta }] \\ m_{3} \ddot{x}_{3} = & - [(k_{31} + k_{32} )\sin^{2} \alpha \, x_{3} - k_{31} \sin^{2} \alpha \, x_{2} ] - [(c_{31} + c_{32} )\sin^{2} \alpha \, \dot{x}_{3} - c_{31} \sin^{2} \alpha \, \dot{x}_{2} ] \\ m_{4} \ddot{x}_{4} = & - [(k_{41} + k_{42} )\sin^{2} \beta \, x_{4} - k_{41} \sin^{2} \beta \, x_{2} ] - [(c_{41} + c_{42} )\sin^{2} \beta \, \dot{x}_{4} - c_{41} \sin^{2} \beta \, \dot{x}_{2} ] \\ J\ddot{\theta } = & - [(k_{12} l - k_{11} l + k_{22} e - k_{21} e)x_{1} + (k_{21} - k_{22} )e x_{2} + (k_{11} l^{2} + k_{12} l^{2} + k_{21} e^{2} + k_{22} e^{2} )\theta ] \\ & - [(c_{12} l - c_{11} l + c_{22} e - c_{21} e)\dot{x}_{1} + (c_{21} - c_{22} )e \dot{x}_{2} + (c_{11} l^{2} + c_{12} l^{2} + c_{21} e^{2} + c_{22} e^{2} )\dot{\theta }] \\ \end{aligned} \right.$$
(17)
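The stiffness terms of the equations of motion follow from differentiating the potential energy of Eq. (15) with respect to each generalized coordinate. The sketch below checks this numerically: it assembles a stiffness matrix K for \(q = [x_{1}, x_{2}, x_{3}, x_{4}, \theta ]\) and compares \(Kq\) with a finite-difference gradient of V. All parameter values are hypothetical and serve only to exercise the formulas; the damping terms follow the same pattern with \(c\) in place of \(k\).

```python
import math

# Hypothetical parameter values (N/m, m, rad), chosen only for illustration.
k11, k12, k21, k22 = 5.0e5, 6.0e5, 3.0e5, 3.5e5
k31, k32, k41, k42 = 2.0e5, 2.5e5, 2.2e5, 2.4e5
l, e = 0.45, 0.30
alpha, beta = math.radians(35.0), math.radians(35.0)
sa2, sb2 = math.sin(alpha) ** 2, math.sin(beta) ** 2


def potential(q):
    """Potential energy V of Eq. (15) for q = [x1, x2, x3, x4, theta]."""
    x1, x2, x3, x4, th = q
    return 0.5 * (
        k11 * (x1 - l * th) ** 2 + k12 * (x1 + l * th) ** 2
        + k21 * (x1 - x2 - e * th) ** 2 + k22 * (x1 - x2 + e * th) ** 2
        + k31 * sa2 * (x2 - x3) ** 2 + k32 * sa2 * x3 ** 2
        + k41 * sb2 * (x2 - x4) ** 2 + k42 * sb2 * x4 ** 2
    )


# Stiffness matrix K such that dV/dq = K q (closed form from Eq. (15)).
kt = (k12 - k11) * l + (k22 - k21) * e
K = [
    [k11 + k12 + k21 + k22, -(k21 + k22), 0.0, 0.0, kt],
    [-(k21 + k22), k21 + k22 + k31 * sa2 + k41 * sb2,
     -k31 * sa2, -k41 * sb2, (k21 - k22) * e],
    [0.0, -k31 * sa2, (k31 + k32) * sa2, 0.0, 0.0],
    [0.0, -k41 * sb2, 0.0, (k41 + k42) * sb2, 0.0],
    [kt, (k21 - k22) * e, 0.0, 0.0,
     (k11 + k12) * l ** 2 + (k21 + k22) * e ** 2],
]


def gradient(f, q, h=1e-5):
    """Central-difference gradient of f at q."""
    grad = []
    for i in range(len(q)):
        qp, qm = list(q), list(q)
        qp[i] += h
        qm[i] -= h
        grad.append((f(qp) - f(qm)) / (2.0 * h))
    return grad


q = [1.0e-3, 0.5e-3, -0.2e-3, 0.3e-3, 2.0e-3]  # arbitrary small displacements
closed_form = [sum(K[i][j] * q[j] for j in range(5)) for i in range(5)]
numeric = gradient(potential, q)
print(closed_form)
print(numeric)
```

The matrix is symmetric, as expected for a system derived from a quadratic potential, and the two gradients agree to rounding error.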

Signal feature extraction and recognition methodology

Research on two-dimensional feature extraction method based on Mel spectrogram

To enhance the effectiveness and real-time performance of feature vector construction in adaptive feature extraction for deep learning models, this paper proposes a two-dimensional transformation method based on Mel spectrograms. This approach enables deep learning models to efficiently extract time-domain, frequency-domain, and spatial-domain features from sensing signals in real-time.

By directly applying the Short-Time Fourier Transform (STFT) to a one-dimensional time-series sensing signal, its corresponding time–frequency spectrogram can be obtained. This spectrogram can then be directly fed into a two-dimensional convolutional neural network (2D-CNN) for recognition and classification. If the sensing signal acquired by a distributed optical fiber vibration sensing (DOVS) system is denoted as \(x(t)\), its STFT-processed spectrum \(y(f,\tau )\) can be expressed as35,36,37:

$$y(f,\tau ) = \int_{ - \infty }^{ + \infty } h (t - \tau )x(t)e^{ - j2\pi ft} dt$$
(18)

In the equation, \(h(t - \tau )\) represents the Hamming window. With a window length of 2048 and a step size of 1024, applying the STFT directly to the sensing signal yields a time–frequency spectrum of dimensions 1025 × 28 × 3. Feeding this 1025 × 28 × 3 spectrum directly into a 2D convolutional neural network (CNN) would significantly degrade training efficiency and computational performance in recognition and classification tasks, especially on standard host computers with limited computing resources.
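The origin of the 1025 × 28 shape can be illustrated with a small sketch. A one-sided STFT with a window of 2048 samples has 2048/2 + 1 = 1025 frequency bins; 28 frames at a step of 1024 correspond to a signal of 29,696 samples if no padding is applied, which is an assumption made here because the signal length is not stated. The naive transform below is written for clarity rather than speed and is demonstrated on a short synthetic tone; all function names are hypothetical.

```python
import cmath
import math


def stft_shape(n_samples, win_length, hop):
    """(frequency bins, frames) of a one-sided STFT without padding."""
    n_frames = 1 + (n_samples - win_length) // hop
    return win_length // 2 + 1, n_frames


def hamming(n_points):
    """Hamming window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n_points - 1))
            for k in range(n_points)]


def stft_magnitude(signal, win_length, hop):
    """Naive O(N^2) STFT magnitude; rows are frequency bins, columns are frames."""
    window = hamming(win_length)
    n_bins, n_frames = stft_shape(len(signal), win_length, hop)
    spec = [[0.0] * n_frames for _ in range(n_bins)]
    for t in range(n_frames):
        frame = [signal[t * hop + k] * window[k] for k in range(win_length)]
        for f in range(n_bins):
            acc = sum(frame[k] * cmath.exp(-2j * math.pi * f * k / win_length)
                      for k in range(win_length))
            spec[f][t] = abs(acc)
    return spec


# Shape for the window settings quoted in the text (signal length is assumed).
print(stft_shape(29696, 2048, 1024))

# Small demo: a tone that sits exactly on bin 8 of a 64-point window.
tone = [math.sin(2.0 * math.pi * 8 * k / 64) for k in range(256)]
spec = stft_magnitude(tone, win_length=64, hop=32)
peak_bin = max(range(len(spec)), key=lambda f: spec[f][0])
print(len(spec), len(spec[0]), peak_bin)
```

A production pipeline would use an FFT-based routine instead of the direct sum; the direct form is kept here only because it mirrors Eq. (18) term by term.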

To effectively address the issue of excessive dimensionality in the time–frequency diagrams generated by the STFT of sensing signals, this paper employs the Mel time–frequency spectrum method to accurately characterize the frequency energy distribution of the sensing signals across different time scales. This approach not only reduces the dimensionality of the time–frequency representation but also preserves the salient time–frequency features of the original signal38,39.

First, the Short-Time Fourier Transform (STFT) is performed on the sensing signal to obtain its corresponding time–frequency spectrum. Subsequently, this spectrum is multiplied by a predefined Mel filter bank to derive a reduced time–frequency spectrum matrix. Finally, a logarithmic operation is applied to the reduced matrix to obtain the final Mel time–frequency spectrum matrix40, as illustrated in Fig. 5.

Fig. 5. Signal processing workflow for Mel-scale time–frequency spectrum conversion.

If the frequency of the sensing signal on the Mel scale is denoted as \(f_{Mel}\), its relationship with the linear frequency f is expressed as:

$$f_{Mel} = 2595\log_{10} \left( {1 + \frac{f}{700}} \right)$$
(19)
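Eq. (19) and its inverse can be written directly as two small functions. The inverse is obtained by solving Eq. (19) for f; the sample frequencies below are taken from values mentioned elsewhere in the text and are used only as inputs.

```python
import math


def hz_to_mel(f_hz):
    """Eq. (19): linear frequency (Hz) to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse of Eq. (19): Mel scale back to linear frequency (Hz)."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


for f in (100.0, 600.0, 1500.0, 8000.0):
    print(f, round(hz_to_mel(f), 1))
```

Because the mapping is logarithmic, equal steps on the Mel scale correspond to narrow bands at low frequency and progressively wider bands at high frequency.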

According to the conversion principle of the Mel time–frequency spectrum, the Mel spectrum is generated by applying the Mel filter bank to amplify the low-to-mid frequency components and attenuate the mid-to-high frequency components in the original time–frequency representation. This ensures the preservation of critical time–frequency information while significantly reducing dimensionality. If the Mel filter bank contains M equal-height triangular filter functions, denoted as \(H_{Mel} (f)\), then:

$$H_{Mel} (f) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {f < f_{c} (m - 1)} \hfill \\ {\frac{{f - f_{c} (m - 1)}}{{f_{c} (m) - f_{c} (m - 1)}},} \hfill & {f_{c} (m - 1) \le f \le f_{c} (m)} \hfill \\ {\frac{{f_{c} (m + 1) - f}}{{f_{c} (m + 1) - f_{c} (m)}},} \hfill & {f_{c} (m) < f \le f_{c} (m + 1)} \hfill \\ {0,} \hfill & {f > f_{c} (m + 1)} \hfill \\ \end{array} } \right.$$
(20)

In the equation, \(f_{c} (m)\) represents the center frequency of the m-th equal-height triangular filter, where \(1 \le m \le M\). With M = 28, the Mel time–frequency spectrum after the transformation has dimensions of 28 × 28 × 3, so the frequency axis shrinks from 1025 bins to 28. Therefore, compared to the time–frequency spectrum obtained via STFT, the Mel spectrum significantly enhances both the training efficiency and recognition performance of deep learning models.
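The filter bank of Eq. (20) and the subsequent logarithm can be sketched as follows. The sketch assumes center frequencies equally spaced on the Mel scale between 0 Hz and the Nyquist frequency, a 16 kHz sampling rate, a natural logarithm, and a flat placeholder spectrum in place of real STFT output; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_bins, sample_rate):
    """Equal-height triangular filters per Eq. (20): n_filters rows of n_bins weights."""
    f_max = sample_rate / 2.0
    # n_filters + 2 points equally spaced on the Mel scale: fc(0) .. fc(M + 1).
    mel_max = hz_to_mel(f_max)
    fc = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_freqs = [f_max * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        row = []
        for f in bin_freqs:
            if fc[m - 1] <= f <= fc[m]:
                row.append((f - fc[m - 1]) / (fc[m] - fc[m - 1]))
            elif fc[m] < f <= fc[m + 1]:
                row.append((fc[m + 1] - f) / (fc[m + 1] - fc[m]))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_spectrogram(frames, bank, eps=1e-10):
    """frames: list of per-frame power spectra; returns n_filters x n_frames."""
    return [[math.log(sum(w * p for w, p in zip(row, frame)) + eps)
             for frame in frames]
            for row in bank]


bank = mel_filter_bank(n_filters=28, n_bins=1025, sample_rate=16000)
frames = [[1.0] * 1025 for _ in range(28)]  # flat placeholder spectrum, 28 frames
mel = log_mel_spectrogram(frames, bank)
print(len(mel), len(mel[0]))
```

Each row of `bank` weights the 1025 STFT bins into one Mel band, which is how the frequency axis is reduced to 28 values per frame.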

Construction of a hybrid deep learning model based on GoogLeNet-LSTM

The GoogLeNet architecture proposed in this paper draws inspiration from Inception v3, incorporating three types of convolutional kernels: 1 × 1, 5 × 5, and 7 × 7. Specifically, the 5 × 5 kernel is decomposed into two cascaded 3 × 3 kernels, and the 7 × 7 kernel is implemented using three stacked 3 × 3 kernels. To further reduce model parameters, each 3 × 3 convolutional layer is itself factorized into a combination of two 1 × 3 kernels and one 3 × 1 kernel41,42. The final design stacks ten such Inception modules, omits the conventional auxiliary classifier components, retains average pooling functionality43,44,45, and forms the complete GoogLeNet architecture as illustrated in Fig. 6.

Fig. 6. Architecture of the GoogLeNet network.
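The benefit of replacing large kernels with stacked 3 × 3 kernels can be illustrated with a short weight count. The sketch assumes stride-1 convolutions, equal input and output channel counts, and no bias terms; the channel count of 64 is hypothetical.

```python
def conv_weights(kernel_h, kernel_w, channels_in, channels_out):
    """Number of weights in one convolution layer (bias ignored)."""
    return kernel_h * kernel_w * channels_in * channels_out


def stacked_receptive_field(kernel_size, n_layers):
    """Receptive field of n stacked stride-1 convolutions of one kernel size."""
    return 1 + n_layers * (kernel_size - 1)


c = 64  # hypothetical channel count, equal for input and output
direct_5x5 = conv_weights(5, 5, c, c)
two_3x3 = 2 * conv_weights(3, 3, c, c)
direct_7x7 = conv_weights(7, 7, c, c)
three_3x3 = 3 * conv_weights(3, 3, c, c)

# Receptive field of the stack, and its weight count relative to the single kernel.
print(stacked_receptive_field(3, 2), two_3x3 / direct_5x5)
print(stacked_receptive_field(3, 3), three_3x3 / direct_7x7)
```

Two stacked 3 × 3 layers cover the same 5 × 5 receptive field with 18/25 of the weights, and three stacked layers cover 7 × 7 with 27/49 of the weights.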

Building on the adaptive feature extraction capabilities of GoogLeNet and LSTM neural networks in signal processing, this paper proposes an enhanced GoogLeNet-LSTM hybrid deep learning model specifically designed for distributed optical fiber vibration sensing systems46,47,48. This architecture enables adaptive simultaneous extraction of both spatial and temporal features embedded in sensing signals49,50, as illustrated in Fig. 7 below.

Fig. 7. Hybrid deep learning model: GoogLeNet-LSTM.

Data acquisition and experimental validation

A field-deployed experimental test platform was established, where a fiber-optic vibration monitoring host is housed within the laboratory. Vibration-sensitive optical fibers are deployed bilaterally along the conveyor line, enabling the host to acquire substantial field-operational acoustic-vibration data. This data is subsequently uploaded to a mobile server, where embedded algorithmic modules perform real-time feature extraction and pattern recognition on the acquired data, as illustrated in Fig. 8.

Fig. 8. Field testing setup for fiber optic sensing.

During on-site data collection, four types of support rollers were selected and installed in different preset positions for data collection and training, as shown in Fig. 9.

Fig. 9. Physical states of idlers.

As shown in Fig. 10(a–d), the time-domain waveforms obtained from on-site vibration optical fiber data show distinct characteristics across roller conditions: normal support rollers exhibit stable waveform fluctuations with amplitudes near 0.04 mm; dust-induced overheating rollers show localized energy surges with periodic concentrations peaking near 0.06 mm; bearing-worn rollers show shaft-drum separation with more pronounced periodic fluctuations than overheated rollers, reaching maximum amplitudes near 0.08 mm; and shaft-fractured rollers show periodic energy oscillations with peak vibration amplitudes near 0.15 mm.

Fig. 10. Time-domain waveforms under different idler states.

As illustrated in Fig. 11(a–d), the time–frequency images derived from STFT and Mel transformation of the four support roller types reveal distinct signatures: normal rollers concentrate energy primarily in low-frequency ranges with negligible high-frequency fluctuations; dust-induced overheating rollers exhibit broadband energy fluctuations during the 9–10 s interval with significant energy accumulation near 600 Hz; bearing-worn rollers show peak energy within the 300–800 Hz band accompanied by a secondary energy zone near 1500 Hz; and shaft-fractured rollers display pervasive broadband oscillations throughout the 0–8 s period, with maximum energy intensity centered near 600 Hz.

Fig. 11. Time–frequency distributions under different idler states.

The experimental configuration is as follows. The sampling rate is 16 kHz. The STFT feature extraction parameters were optimized via orthogonal experiments: a Hanning window of 1024 points, an overlap of 512 points, 1024 FFT points, a 40-channel Mel filter bank, and retention of the first 13 cepstral coefficients. Training uses the Adam optimizer (initial learning rate 0.001 with a cosine decay strategy) for 40 epochs with a batch size of 64, and cross-validation is performed every 30 iterations. Data preprocessing includes Z-score normalization and random sample shuffling to support model generalization.
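Two elements of this configuration, the cosine learning-rate decay and Z-score normalization, can be sketched in a few lines. The text does not give the exact decay formula, so the common half-cosine form with a final rate of zero and one update per epoch is assumed here. The sample values passed to `z_score` reuse the peak amplitudes quoted in the text purely as example input.

```python
import math
import statistics


def cosine_decay_lr(step, total_steps, initial_lr=0.001, final_lr=0.0):
    """Learning rate after `step` of `total_steps` under half-cosine decay."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def z_score(values):
    """Normalize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


epochs = 40
schedule = [cosine_decay_lr(epoch, epochs) for epoch in range(epochs + 1)]
normalised = z_score([0.04, 0.06, 0.08, 0.15])
print(schedule[0], schedule[20], schedule[40])
print(normalised)
```

The schedule starts at the initial rate, passes through half of it at the midpoint, and approaches the final rate at the last epoch.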

As shown in Figs. 12, 13, 14, and 15, the training curves of the four models (CNN, LSTM, GoogLeNet-LSTM, and the dynamics-driven GoogLeNet-LSTM, abbreviated DY-GoogLeNet-LSTM) show distinct performance characteristics. The CNN improves slowly in accuracy and stabilizes only around the 15th iteration round, with significant fluctuations on the training set. The LSTM model stabilizes by the 10th iteration round, yet its training-set performance consistently falls below that of the validation set. The GoogLeNet-LSTM model stabilizes at the 3rd iteration round, with the training set consistently outperforming the validation set. The DY-GoogLeNet-LSTM model stabilizes by the 5th iteration round, with training-set performance consistently close to that of the validation set.

Fig. 12. Convergence curve of traditional CNN model.

Fig. 13. Training iteration diagram of the LSTM model.

Fig. 14. Training iteration diagram of the improved GoogLeNet-LSTM model.

Fig. 15. Convergence analysis of dynamics-driven GoogLeNet-LSTM.

Figure 16 presents the confusion matrices obtained on the test set, which show substantial improvements: accuracy rises from 83.7% for the CNN model to 85.7% for the LSTM model, the enhanced GoogLeNet-LSTM model reaches 95.3%, and the further optimized DY-GoogLeNet-LSTM achieves 96.7%. These results validate the effectiveness of the improved model in acoustic vibration fault detection and provide a reliable technical solution for conveyor fault diagnosis.

Fig. 16. Comparative confusion matrices of network models.
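Overall accuracy is read from a confusion matrix as the sum of the diagonal divided by the total number of samples. The sketch below shows the calculation on a hypothetical four-class matrix; the counts are invented for illustration and are not the values shown in Fig. 16.

```python
def accuracy(confusion):
    """Overall accuracy: correctly classified samples over all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_recall(confusion):
    """Recall for each true class (rows are true labels, columns are predictions)."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


# Hypothetical counts for: normal, overheating, bearing wear, shaft fracture.
# These numbers are NOT taken from the paper.
confusion = [
    [48, 1, 1, 0],
    [2, 46, 2, 0],
    [1, 2, 47, 0],
    [0, 0, 1, 49],
]
print(accuracy(confusion))
print(per_class_recall(confusion))
```

Per-class recall is included because a single accuracy figure can hide a class that is recognized poorly.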

Conclusions

Based on theoretical research, model construction, and experimental validation, this study forms the following conclusions regarding the proposed fiber-optic sensing fault diagnosis method integrating dynamics model-driven and improved hybrid deep learning for fault identification of open-pit mine conveyor rollers under extreme operating conditions:

(1) Normal rollers exhibit stable vibration amplitudes of 0.04 mm, whereas dust-induced overheating, bearing wear, and shaft fracture faults show peak amplitudes of 0.06 mm, 0.08 mm, and 0.15 mm respectively, with the shaft fracture amplitude reaching 3.75 times that of normal conditions.

(2) Normal rollers concentrate energy in low-frequency bands, while shaft fractures show broadband energy oscillations across 0–8 s with pronounced energy concentration near 600 Hz. Bearing-worn states exhibit characteristic energy in the 300–800 Hz band and near 1500 Hz, and dust-induced overheating displays transient broadband energy aggregation during the 9–10 s interval.

(3) The proposed DY-GoogLeNet-LSTM model achieves a test accuracy of 96.7%, outperforming the baseline CNN model (83.7%) by 13 percentage points. By leveraging dynamics model-driven strategies, it effectively suppresses the fluctuation error observed in the CNN model at the 30th training iteration, enabling stable and efficient fault feature extraction and classification.