Introduction

The conveyor belt is an essential piece of equipment in modern safe production and transport, providing continuous transit of bulk materials. A belt conveyor primarily consists of the belt, rollers, a frame, tensioning devices, transmission devices, and other auxiliary devices. It offers several advantages, including large conveying volume, long transport distance, stable operation, low power consumption, and easy loading and unloading. It is widely used in industries such as coal mining, construction, power supply, and metallurgy. With the expansion of industrial production scale, the carrying capacity of the conveyor belt has increased, bringing substantial economic benefits to the material handling industry. As the conveyor belt accounts for 40–60% of the total cost of the conveyor, continuous improvement in its design and performance is crucial. The length of a conveyor can reach several kilometers, and its running status directly affects the stability of the entire production and transportation process1.

The harsh and complicated working environment of conveyor belts often leads to deviation faults2. Conveyor belt deviation refers to the condition in which, during operation, the center line of the belt surface departs from the center line of the frame and the belt gradually runs to one side. According to statistics, conveyor belt stoppage, production stoppage, and other accidents caused by belt deviation account for about 10–30%. Belt deviation results in belt wear, material spillage, wasted resources, and environmental pollution. Severe cases can cause fires, injuries, and property damage3. Moreover, if the deviation problem is not well solved, it restricts the wider application of belt conveyors and impedes their development.

With the advancement of state detection technology, conveyor belt runout detection methods have evolved from contact methods to non-contact methods. Contact methods fall mainly into three types: manual inspection, sensor-based, and mechanical measurement4. The manual inspection method is labor-intensive, and detection accuracy decreases over time. Sensor-based methods typically require deviation sensors5, symmetrically mounted on the frame on either side of the conveyor belt to activate travel switches. When the conveyor belt deviates, its edge pushes the deviation sensors, which trigger the travel switches to send deviation correction signals; these are fed back to the terminals to complete deviation detection and correction. Mechanical measurement methods include installing correction rollers or emergency switches for the conveyor belt, but they only work when the conveyor belt touches the vertical rollers or emergency switches. The structure and principle of this method are simple, but it causes wear on the conveyor belt edges and has poor detection accuracy.

Non-contact detection methods, which incorporate machine vision6,7,8 algorithms into conveyor belt fault prediction, obtain relevant fault information from the conveyor’s operating state and predict future fault trends9. Many traditional machine vision methods use the Canny edge detection algorithm10 to extract belt edge features. Liu11 used edge detection and line fitting to extract the belt edge and proposed an analysis method to quickly determine whether the belt has deviated from its path. Wang12 divided the conveyor belt image into regions of interest, extracted straight-line information of the belt edges through image enhancement, wavelet transformation, Canny edge detection, and the Hough transform, and detected belt deviation against a reference line taken from a normally running conveyor belt. In recent years, deep learning has been broadly applied to machine vision for image recognition, target detection, and classification13. More scholars are using deep-learning-based target detection to solve conveyor belt runout detection in complex environments14,15.

Wlodarczyk-Sielicka16 introduced a method for belt edge detection using deep convolutional networks, including FCN (Fully Convolutional Networks)17, DeepLab18, and HED (Holistically-Nested Edge Detection)19. This method offers reliable target detection and anti-interference capabilities, addressing the slow response of traditional mechanical anti-deviation methods and the diminished accuracy of machine vision for belt edge detection. However, this approach incurs a relatively large deviation degree (DD) error. Unlu20 employed a real-time belt detection algorithm based on multi-scale feature fusion networks, which improves accuracy and real-time performance by using depthwise separable convolution to make the model lightweight and feature pyramid networks (FPNs) to fuse information from different feature layers. However, because image segmentation operates at the pixel level, its speed reaches only 13.4 FPS. Zhang21 proposed a deep learning-based method for belt deviation monitoring, integrating CSPDarknet and Spatial Pyramid Pooling (SPP) as the backbone feature extraction network of YOLOv522,23,24, boosting straight-line detection and improving belt edge detection accuracy. This addresses the challenge of belt edge detection against complex backgrounds. Sun25 developed a system for curved belt deviation assessment using an ARIMA-LSTM combined prediction model and the OC-SVM algorithm, enabling real-time detection, prediction, correction, and warning of abnormal belt deviation. Wang26 introduced a variable step size row anchor segmentation method employing the UFLD algorithm, incorporated a Convolutional Block Attention Module (CBAM) for enhanced feature extraction, and improved the convolutional and downsampling operations in the ResNet-18 stem and residual modules to better detect conveyor belt edges.

In summary, this paper aims to accurately detect conveyor belt runout in complex environments and ensure safe material transport. It uses an improved YOLOv827,28,29 algorithm combined with a conveyor belt runout discrimination criterion to determine whether the transport state is safe. First, the paper adds the EffectiveSE mechanism to the YOLOv8 target detection network, enhancing attention to feature information in target regions and improving detection accuracy for blurred and small target regions. Second, the BiFPN_DoubleAttention module is introduced for bidirectional feature fusion and weighted adjustment across multiple feature scales, enabling better combination of low-level detail and high-level semantic information to improve detection accuracy and robustness. Finally, the P5/32-large layer is equipped with the Multi-Head Self-Attention (MHSA) mechanism, allowing better understanding of global structure and relative positional relationships for small (rollers) and larger (conveyor belt area) target objects, improving detection performance. Compared to the original YOLOv8 algorithm, this method performs better for roller and conveyor belt detection, enhancing model accuracy, robustness, and application performance.

Materials and methods

YOLOv8 Model Architecture

YOLOv8 is a state-of-the-art (SOTA) model built upon previous versions of the YOLO series. It introduces improvements that enhance performance and flexibility, making it a strong choice for tasks such as image classification, object detection, and instance segmentation. The model introduces a new backbone network, a new anchor-free detection head, and a new loss function, and it runs efficiently on a wide range of hardware platforms. The network structure of YOLOv8 consists mainly of a backbone, a neck, and a head.

The backbone adopts the concept of the CSP module and replaces the C3 module in YOLOv5 with the C2f module, which combines C3 with ELAN30 from YOLOv731,32. C2f adopts gradient-diverting connections, enriching the information flow of the feature extraction network while keeping it lightweight. The SPPF (Spatial Pyramid Pooling Fast) module from YOLOv5 is retained; it passes the feature map through three max-pooling layers connected in series and fuses their outputs, improving the accuracy of object detection at different scales while reducing computational workload and latency.

The neck utilizes the PAN-FPN33 structure to enhance the fusion of features of differing dimensions and generate a feature pyramid. This structure comprises two parts: the feature pyramid network (FPN) and the path aggregation network (PAN). The FPN enables the bottom layer of the feature map to contain stronger image semantic information via top-down upsampling, while the PAN enables the top layer to contain image position information via bottom-up downsampling. The two sets of features are finally fused, so that shallow position information and deep semantic information complement each other, achieving feature diversity and completeness and improving recognition performance.

The head is the prediction output part of the whole network. Unlike the coupled head of YOLOv5, YOLOv8 adopts a decoupled head design. The decoupled head uses two independent branches for target classification and bounding box regression, with different loss functions for each task. Binary cross-entropy loss (BCE loss) is used for classification, while distribution focal loss (DFL)34 and CIoU35 are used for bounding box regression. This detection structure improves accuracy and accelerates model convergence. Additionally, YOLOv8 is an anchor-free detection model that directly predicts the object center, rather than predicting the offset from a known anchor box. This reduces the number of box predictions and accelerates the non-maximum suppression (NMS) process.

To meet different detection scene requirements, YOLOv8 is provided in five versions: YOLOv8n36,37,38, YOLOv8s39, YOLOv8m40, YOLOv8l41, and YOLOv8x42. Each version corresponds to a different network depth and width. Because the conveyor belt runout detection model must be both precise and lightweight, this paper selects the YOLOv8n network, which has a small size and high precision.

Improved methodology

Because traditional conveyor belt deflection detection methods have limited accuracy, this paper introduces a YOLOv8-based deflection detection model with an improved backbone, neck, and head. The network structure of the improved algorithm is shown in Fig. 1.

Fig. 1

The architecture of the improved YOLOv8 network. The red boxes mark the points of improvement. C2f concatenates different feature maps, among other operations; CBS consists of Conv, BatchNorm, and the SiLU activation function for feature extraction; Spatial Pyramid Pooling Fast (SPPF) is used to increase feature diversity.

Backbone

The C2f module is a vital component of the YOLOv8 backbone network; it enhances feature representation and multi-scale information fusion through branching and connecting operations on the feature map. In this paper, we incorporate the EffectiveSE (Effective Squeeze-and-Excitation, ESE)43 module into the C2f module, further improving the model’s feature selection and weight adjustment capabilities. This supports accurate detection of rollers and belts in complex, dynamically changing real-world environments such as mines, sand fields, and quarries.

EffectiveSE is a channel attention module for convolutional neural networks, an improved version of the Squeeze-Excitation (SE) module in SENet (Squeeze-and-Excitation Networks)44. The SE channel attention module in SENet was found to inadvertently reduce computational efficiency due to the use of two fully connected (FC) layers in its design. To address this, the SE module is redesigned as ESE by replacing these two FC layers with a single FC layer that maintains the channel dimension, thus avoiding loss of channel information and improving performance. The ESE process is defined as:

$${A_{eSE}}\left( {{X_{div}}} \right)=\upsigma \left( {{W_C}\left( {{F_{gap}}\left( {{X_{div}}} \right)} \right)} \right)$$
(1)
$${X_{refine}}={A_{eSE}}\left( {{X_{div}}} \right) \otimes {X_{div}}$$
(2)

where \({X_{div}} \in {R^{C \times W \times H}}\) is the diversified feature map calculated by the \(1 \times 1\) conv in the OSA module, \({F_{gap}}\) is channel-wise global average pooling, \({W_C}\) is the weight of the FC layer, and \(\upsigma\) is the sigmoid function. As a channel attentive feature descriptor, \({A_{eSE}} \in {R^{C \times 1 \times 1}}\) is applied to the diversified feature map \({X_{div}}\) to make the diversified feature more informative. Finally, when the residual connection is used, the input feature map is added to the refined feature map \({X_{refine}}\). The details of how the eSE module is plugged into the OSA module are displayed in Fig. 2.

Fig. 2

The overall architecture of the ESE module. \(\otimes\) indicates element-wise multiplication and \(\oplus\) denotes element-wise addition.
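To make Eqs. (1) and (2) concrete, the following is a minimal sketch in plain Python, not the implementation used in this paper. It assumes the feature map is a nested list of shape C×H×W; the function name `ese`, the bias term `b_c`, and the toy inputs are hypothetical and chosen only for illustration.

```python
import math

def ese(x, w_c, b_c):
    """Illustrative eSE channel attention.

    x   : feature map X_div as nested lists, shape C x H x W
    w_c : C x C weight of the single FC layer W_C (channel count preserved)
    b_c : length-C bias (an assumption; Eq. (1) shows only W_C)
    """
    c = len(x)
    # F_gap: global average pooling gives one scalar per channel
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # W_C: one fully connected layer mapping C channels to C channels
    fc = [sum(w_c[i][j] * gap[j] for j in range(c)) + b_c[i] for i in range(c)]
    # sigma: sigmoid turns the FC output into per-channel weights in (0, 1)
    attn = [1.0 / (1.0 + math.exp(-v)) for v in fc]
    # X_refine = A_eSE (x) X_div: rescale every channel by its weight
    refined = [[[attn[i] * v for v in row] for row in x[i]] for i in range(c)]
    return attn, refined

# Toy input: two 2x2 channels with constant values 1 and 2
x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attn, refined = ese(x, identity, [0.0, 0.0])
```

Because the single FC layer keeps all C channels, no channel is squeezed away before the attention weights are computed, which is the point of the redesign described above.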

Neck

In the neck network structure of YOLOv8, we introduce the BiFPN_DoubleAttention module, which combines BiFPN_Concat2 and DoubleAttention. BiFPN_Concat2 enables weighted fusion of multi-layer features, effectively consolidating feature information from different scales. The DoubleAttention mechanism adjusts the fused features with attention, enhancing the ability to focus on key regions. Overall, this module optimizes multi-scale feature fusion and improves the ability to accurately locate multi-scale targets, including rollers and conveyor belts, in complex environments.

BiFPN45 is a feature fusion module for target detection that is built upon the PANet structure. It reduces complexity by removing nodes that have only a single input edge, since they contribute little to feature fusion. Additionally, it assigns learnable weights to the input features to differentiate their importance in the fusion process, thereby enhancing multi-scale target feature fusion, as shown in Fig. 3.

Fig. 3

The network architecture of PANet and BiFPN. (a) PANet adds an additional bottom-up pathway on top of FPN; (b) BiFPN is based on PAN and uses fast normalized fusion to combine features with learned weights, giving a better trade-off between accuracy and efficiency.

The BiFPN module adopts fast normalized fusion to avoid both the unstable training caused by unbounded weights in weighted feature fusion and the significant slowdown on GPU hardware caused by Softmax-based fusion. The calculation formula is as follows:

$$O=\sum\nolimits_{i} {\frac{{{w_i}}}{{\epsilon +\sum\nolimits_{j} {{w_j}} }}} \cdot {I_i}$$
(3)

where \({I_i}\) is the i-th input feature, the ReLU activation function is applied to each \({w_i}\) to ensure that \({w_i} \geqslant 0\), \(\epsilon =0.0001\) is used to avoid numerical instability, and the value of each normalized weight falls between 0 and 1. Finally, BiFPN integrates bidirectional cross-scale connectivity and fast normalized fusion to obtain the final weighted feature pyramid network.
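The fusion rule in Eq. (3) can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes the input features have already been resized to the same shape and flattened into equal-length lists, and the function name `fast_normalized_fusion` is hypothetical.

```python
def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse equally sized feature vectors with non-negative normalized weights.

    inputs  : list of feature vectors I_i (flattened, same length)
    weights : list of learnable scalars w_i, one per input
    eps     : small constant from Eq. (3) that keeps the denominator positive
    """
    # ReLU keeps every weight non-negative
    w = [max(0.0, wi) for wi in weights]
    denom = eps + sum(w)
    # Each normalized weight lies in [0, 1) without needing a softmax
    norm = [wi / denom for wi in w]
    length = len(inputs[0])
    fused = [sum(norm[k] * inputs[k][i] for k in range(len(inputs)))
             for i in range(length)]
    return fused, norm

# Two toy feature vectors with equal weights
fused, norm = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

Compared with softmax-based fusion, this rule needs only a ReLU, a sum, and a division, which is the efficiency argument made above.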

The attention mechanism imitates the human visual system, enabling a neural network to focus on the most relevant parts of the input data as it is processed. In computer vision, this means the network can dynamically select which parts of an image to focus on to improve performance on a task. Double Attention Networks (DAN)46 is a neural network architecture for computer vision tasks designed to efficiently capture both global and local information in an image to improve task performance (as shown in Fig. 4). It achieves this through two attention modules, one for global information focusing on the overall structure of the image and one for local information focusing on local details. This architecture makes full use of the different information in the image within a deep neural network and is formulated as follows:

$$\begin{aligned} Z&={F_{distr}}\left( {{G_{gather}}\left( X \right),V} \right) \\ &={G_{gather}}\left( X \right)\operatorname{softmax} \left( {\uprho \left( {X;{W_\uprho }} \right)} \right) \\ &=\left[ {\phi \left( {X;{W_\phi }} \right)\operatorname{softmax} {{\left( {\uptheta \left( {X;{W_\uptheta }} \right)} \right)}^T}} \right]\operatorname{softmax} \left( {\uprho \left( {X;{W_\uprho }} \right)} \right) \\ \end{aligned}$$
(4)
Fig. 4

The computational graph of the proposed double attention block.

Head

After the last C2f module in the head of the YOLOv8 network, a Multi-Head Self-Attention (MHSA)47 mechanism is added to the P5/32-large layer to enhance the model’s ability to express detailed features in complex scenes, enabling better capture of positional features for small targets (rollers) and large targets (conveyor belts) against complex and dim backgrounds.

The Multi-Head Self-Attention (MHSA) module combines N single-head self-attention modules, and multi-head attention forms a central component of the Transformer encoder structure. The feature map sequence is normalized and passed to the multi-head attention layer, which consists of single-head self-attention units. Using three linear mappings, each element is projected into a query vector, a key vector, and a value vector. The single-head attention mechanism then weights the sequence: the query and key vectors undergo matrix multiplication, scaling, masking, and SoftMax operations, and the resulting weights are applied to the value vectors, so that each element is weighted according to the other elements. Figure 5 shows the structure of the MHSA mechanism.

Fig. 5

The architecture of multi-head attention mechanism.

In the multi-head attention mechanism, the input sequence is first converted into query vectors, key vectors, and value vectors. Each head learns its own projections for the query, key, and value vectors. Then, each head computes the similarity between each query vector and each key vector and derives the corresponding weights. These weights are used to perform a weighted sum over the value vectors to generate an output sequence for each head. Finally, these output sequences are concatenated to form the final output sequence. The multi-head attention mechanism can effectively improve model performance and enable better capture of relevant information in input sequences. The MHSA output formula is:

$$\begin{aligned} MultiHead\left( {Q,K,V} \right)&=Concat\left( {\mathrm{head}_1,\ldots,\mathrm{head}_h} \right){W^O} \\ \mathrm{head}_i&=Attention\left( {QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}} \right) \\ \end{aligned}$$
(5)

where \({W^O}\), \(W_{i}^{Q}\), \(W_{i}^{K}\), and \(W_{i}^{V}\) are weight matrices satisfying \({W^O} \in {R^{h{d_v} \times {d_{model}}}}\), \(W_{i}^{Q} \in {R^{{d_{model}} \times {d_Q}}}\), \(W_{i}^{K} \in {R^{{d_{model}} \times {d_K}}}\), and \(W_{i}^{V} \in {R^{{d_{model}} \times {d_v}}}\), respectively; h is the number of attention heads; and \({d_{model}}\) and \({d_v}\) are the dimensions of the model and of V.
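As a rough illustration of Eq. (5), the sketch below implements multi-head self-attention with plain Python lists. It is not the module used in the network: it assumes the standard scaled dot-product form of Attention (division by the square root of the key dimension), omits masking, and uses the input sequence X for Q, K, and V. All helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def attention(q, k, v):
    """Scaled dot-product attention (assumed form of Attention in Eq. (5))."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))          # query-key similarity
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)    # weighted sum of values

def multi_head(x, w_q, w_k, w_v, w_o):
    """x: n x d_model; w_q, w_k, w_v: per-head lists of projection matrices."""
    heads = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concat(head_1, ..., head_h): join head outputs along the feature axis
    concat = [sum((h[i] for h in heads), []) for i in range(len(x))]
    return matmul(concat, w_o)

# Toy example: 2 tokens, d_model = 2, h = 2 heads with d_k = d_v = 1
x = [[1.0, 0.0], [0.0, 1.0]]
w_q = [[[1.0], [0.0]], [[0.0], [1.0]]]
w_k = [[[0.0], [0.0]], [[0.0], [0.0]]]   # zero keys give uniform attention
w_v = [[[1.0], [0.0]], [[0.0], [1.0]]]
w_o = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head(x, w_q, w_k, w_v, w_o)
```

In the toy example the keys are zero, so every query attends uniformly and each head simply averages its value vectors; with learned projections, different heads can attend to different positions.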

Experiments

Experimental environment

All experiments were run on the Linux operating system, with an Intel(R) Xeon(R) E5-2680 v4 CPU, an NVIDIA GeForce RTX 4090 GPU, and 32 GB of RAM. The programming environment uses the deep learning framework PyTorch version 1.8.0, implemented in Python 3.8. During model training, the number of epochs is set to 300, the batch size is set to 32, and the initial learning rate is set to 0.01.

Datasets and preprocessing

To validate the effectiveness of the method, experiments were conducted on a self-constructed dataset. A total of 5,800 images were collected and categorized into three types: underground mines, sand fields, and ore yards. The coal mine data consisted of 4,241 images, with 3,493 images self-collected from several coal mines and 748 images selected from public datasets. Additionally, 1,000 images of sand fields and 559 images of ore yards were selected from public datasets. These data cover various working scenarios such as empty and loaded conditions, as well as day and night settings. The data were labeled with “0” for rollers and “1” for conveyor belts. The dataset was then divided into training, testing, and validation sets in a 7:2:1 ratio. Figure 6 illustrates a portion of the dataset.

Fig. 6

Enumeration of datasets in different scenarios.
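For reference, a 7:2:1 split of 5,800 images yields 4,060 training, 1,160 testing, and 580 validation images. The sketch below shows one way such a split could be produced; the shuffling, the fixed seed, and the function name `split_dataset` are assumptions, since the paper does not describe how the split was drawn.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle items and split them 7:2:1 into train, test, and validation."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for a repeatable split
    n = len(items)
    # Integer arithmetic avoids floating-point rounding of the split sizes
    n_train = n * 7 // 10
    n_test = n * 2 // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]       # remainder goes to validation
    return train, test, val

# Image indices stand in for the 5,800 image files
train, test, val = split_dataset(range(5800))
print(len(train), len(test), len(val))   # 4060 1160 580
```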

Evaluation index

To assess the detection performance of the model, four indicators were used to evaluate the experimental results: Precision (P), Recall (R), F1-Score, and Mean Average Precision (mAP).

In the target detection test, TP (true positive) denotes a positive sample predicted by the model to be in the positive category, FP (false positive) denotes a negative sample predicted by the model to be in the positive category, and FN (false negative) denotes a positive sample predicted by the model to be in the negative category.

Precision is the ratio of true positive predictions to the total number of samples detected, and is used to assess model accuracy. The equation for precision is:

$$Precision=\frac{{TP}}{{TP+FP}}$$
(6)

Recall is the ratio of the number of positive samples correctly predicted by the model to the number of actual positive samples, and is used to assess the comprehensiveness of the model’s detection. The equation for recall is:

$$Recall=\frac{{TP}}{{TP+FN}}$$
(7)

The F1 score is the harmonic mean of precision and recall. The equation for the F1 score is defined as follows:

$$F1=\frac{{2 \times Precision \times Recall}}{{Precision+Recall}}=\frac{{2TP}}{{2TP+FP+FN}}$$
(8)
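Eqs. (6)–(8) translate directly into code. The following minimal sketch uses made-up counts (TP = 90, FP = 10, FN = 30) purely for illustration; it does not guard against zero denominators.

```python
def precision(tp, fp):
    """Eq. (6): share of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (7): share of actual positives that are found."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Eq. (8): harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts, not results from this paper
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)          # 0.9
r = recall(tp, fn)             # 0.75
f1 = f1_score(tp, fp, fn)      # equals 2*TP / (2*TP + FP + FN)
```

With these counts both forms of Eq. (8) give 180/220, about 0.818, which sits below the arithmetic mean of 0.825 because the harmonic mean penalizes the lower of the two values.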

The mean average precision (mAP) is one of the key metrics used to evaluate target detection performance. The average precision (AP) of a class is the area under its precision-recall (P-R) curve, and mAP is the mean of the AP values over all classes. A higher mAP value indicates better detection performance of the model. The equation for mAP is:

$$mAP=\frac{1}{C}\sum\limits_{{i=1}}^{C} {A{P_i}} ,\quad A{P_i}=\int\limits_{0}^{1} {Precision_i\left( {Recall} \right)d\left( {Recall} \right)}$$
(9)

where C is the total number of classes and \(A{P_i}\) is the AP value of the i-th class. In this study, two classes of targets are detected, so C = 2.
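The integral in Eq. (9) is evaluated numerically in practice. The sketch below approximates each class's AP by the trapezoidal rule over sampled points of its P-R curve and then averages over classes. This is a simplification: common detection toolkits apply interpolation to the P-R curve before integrating, and the curves used here are invented toy data.

```python
def average_precision(recalls, precisions):
    """Approximate the area under a P-R curve with the trapezoidal rule."""
    points = sorted(zip(recalls, precisions))   # order by increasing recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

def mean_average_precision(curves):
    """Average the per-class AP values; curves is a list of (recalls, precisions)."""
    aps = [average_precision(r, p) for r, p in curves]
    return sum(aps) / len(aps)

# Two toy classes (C = 2): a perfect detector and an imperfect one
roller_curve = ([0.0, 1.0], [1.0, 1.0])
belt_curve = ([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])
m_ap = mean_average_precision([roller_curve, belt_curve])
print(m_ap)   # 0.875
```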

Results and analysis

Ablation experiment

To verify the effectiveness of the improved modules described above, ablation experiments were performed with different combinations of the modules, using the original YOLOv8n as the baseline model and Precision, F1 score, and mAP0.5 as the evaluation indexes. The experimental results are shown in Table 1, and the training process is shown in Fig. 7.

Table 1 Ablation experiment on each improved module.
Fig. 7

Training result of the improved YOLOv8.

The improved model’s training process showed changes in accuracy and loss over iterations. As the number of iterations increased, the model updated its weights, resulting in increasing accuracy and decreasing loss. This indicates that within a certain range of iterations, more iterations led to the model learning more feature information and achieving higher accuracy. In the early stages of iteration, the loss decreased rapidly, and accuracy increased quickly. At around 100 iterations, mAP@0.5 stabilized at approximately 0.99, and the loss function reached a relatively stable state.

Comparison with other algorithms

To verify the superiority of the proposed algorithm over currently popular conveyor belt runout detection algorithms, it is compared under the same conditions with Mask R-CNN, DHT, UFLD, YOLOv5, and YOLOv8. The evaluation indices used are Precision, F1, model size, mAP0.5, and FPS. The experimental results are shown in Table 2.

Table 2 Comparison of the performance of different algorithmic models.

Compared to other conveyor belt detection algorithms, the improved YOLOv8 algorithm proposed in this paper demonstrates superior detection performance on our self-built conveyor belt oscillation dataset. This dataset encompasses diverse scenes, transmission states, and targets from various angles. Despite this diversity, the enhanced algorithm exhibits remarkable detection capabilities, achieving a mean Average Precision (mAP) of 99% at an IoU threshold of 0.5, with a detection speed of 46 frames per second (FPS) and a model size of 19.8 MB. Some test results are showcased in Fig. 8.

Fig. 8

The roller and conveyor belt detection effect in different areas.

Conveyor belt offset judgment method

Belt conveyors operate in three states: normal, left deviation, and right deviation. This paper presents a method for detecting conveyor belt misalignment by measuring the distance between two centerlines: the centerline of the trapezoidal region bounded by the two outer edge lines of the roller supports, and the centerline of the belt edge lines, as shown in Fig. 9. The detection step locates the belt and its edge lines in surveillance video images. Because the relative position between the camera and the belt conveyor remains unchanged after installation, and the outer edges of the idlers on each side of the belt conveyor form a straight line, the positions of the idler edge lines in the image can be determined in advance and stored in the program as reference lines. Under normal circumstances, the two centerlines coincide. In actual operation, the belt surface usually does not run strictly in the middle and may deviate slightly to the left or right. Therefore, a reasonable range of misalignment is allowed, and minor misalignments are not considered faults. Threshold selection depends on working conditions and detection accuracy requirements.

Fig. 9

Offset judgment method.

Determination of the centerline

To determine the centerline between the edge lines of the two roller supports, the edge line equations of the roller supports are defined first. The left edge line of the roller support can be expressed as:

$${y_l}={a_l}x+{b_l}$$
(10)

The equation for the right edge line of the roller support is expressed as:

$${y_r}={a_r}x+{b_r}$$
(11)

Here, x represents the horizontal coordinate in the image, corresponding to the pixel position; \({y_l}\) and \({y_r}\) denote the vertical coordinates of the left and right edge lines of the roller support at position x, respectively. Parameters \({a_l}\) and \({a_r}\) are the slopes of the left and right edge lines, while \({b_l}\) and \({b_r}\) are their intercepts.

The center line of the roller support region can be determined from the edge line equations. The equation of the center line is expressed as:

$${y_{roller\_center}}=\frac{{{a_l}+{a_r}}}{2}x+\frac{{{b_l}+{b_r}}}{2}$$
(12)

\({y_{roller\_center}}\) represents the vertical coordinate of the center line of the roller support area at position x, indicating the central position of the roller support region. This centerline is derived by calculating the midpoints between the left and right outer edge lines of the roller supports and fitting a line through these midpoints.

Following the same approach, the centerline equation for the conveyor belt is:

$${y_{belt\_center}}=\frac{{{m_l}+{m_r}}}{2}x+\frac{{{c_l}+{c_r}}}{2}$$
(13)

In this equation, \({y_{belt\_center}}\) represents the vertical coordinate of the centerline between the conveyor belt edge lines at position x, \({m_l}\) and \({m_r}\) are the slopes of the left and right belt edge lines, and \({c_l}\) and \({c_r}\) are their intercepts. This vertical coordinate gives the central position of the conveyor belt, which is derived by calculating the midpoints between the two belt edge lines and fitting a line through them.

Calculate the deviation

In conveyor belt deviation detection, the offset is defined as the vertical distance between the conveyor belt centerline and the centerline of the outer edges of the two roller supports at the same x-coordinate. To calculate this offset, first compute, for a given x-coordinate, the y-coordinates of both the conveyor belt centerline and the roller support centerline at that position. The absolute difference between these two y-coordinates is the offset, given by the formula:

$${d_{offset}}=\left| {{y_{belt\_center}} - {y_{roller\_center}}} \right|$$
(14)

In practice, the offset can be calculated at multiple x-coordinates and averaged to obtain the overall offset. The average offset can be expressed as:

$${\bar d_{offset}}=\frac{1}{n}\sum\limits_{{i=1}}^{n} {\left| {{y_{belt\_center}}\left( {{x_i}} \right) - {y_{roller\_center}}\left( {{x_i}} \right)} \right|}$$
(15)

where n represents the number of x-coordinate points selected and \({x_i}\) represents the i-th x position.

Conveyor belt deviation judgment:

$$state=\left\{ {\begin{array}{*{20}{l}} {deviation,}&{{\bar d_{offset}}>\tau } \\ {normal,}&{{\bar d_{offset}} \leqslant \tau } \end{array}} \right.$$
(16)

where \(\tau\) is the threshold for determining whether the conveyor belt has deviated.
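The judgment procedure of Eqs. (10)–(16) can be sketched as follows. This is a minimal illustration under stated assumptions: each edge line is given as a (slope, intercept) pair in image coordinates, the offset is averaged over the sampled x positions before thresholding, and the function names, line parameters, and threshold values are hypothetical.

```python
def centerline(left, right):
    """Eqs. (12)/(13): average the slopes and intercepts of two edge lines."""
    (a_l, b_l), (a_r, b_r) = left, right
    return ((a_l + a_r) / 2.0, (b_l + b_r) / 2.0)

def mean_offset(belt_center, roller_center, xs):
    """Eq. (15): mean vertical distance between the centerlines over xs."""
    m, c = belt_center
    a, b = roller_center
    return sum(abs((m * x + c) - (a * x + b)) for x in xs) / len(xs)

def judge_deviation(belt_left, belt_right, roller_left, roller_right, xs, tau):
    """Eq. (16): report 'deviation' when the mean offset exceeds tau."""
    d = mean_offset(centerline(belt_left, belt_right),
                    centerline(roller_left, roller_right), xs)
    return ("deviation" if d > tau else "normal"), d

# Hypothetical edge lines in pixel units: the belt centerline sits
# 10 pixels away from the roller support centerline at every x
roller_left, roller_right = (1.0, 0.0), (-1.0, 100.0)
belt_left, belt_right = (1.0, 10.0), (-1.0, 110.0)
xs = [100, 200, 300]

state, d = judge_deviation(belt_left, belt_right,
                           roller_left, roller_right, xs, tau=5.0)
print(state, d)   # deviation 10.0
```

Because the roller support lines are fixed after camera installation, their centerline can be computed once and reused, and only the belt edge lines need to be re-estimated for each frame.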

Conclusion

This paper proposes an improved conveyor belt deviation detection algorithm based on YOLOv8, addressing the challenges faced by traditional methods in object detection applications. First, to enhance precise localization across multiple scales in complex environments, an efficient ESE module is integrated into the backbone. Second, a BiFPN_DoubleAttention module is constructed in the neck network of YOLOv8, augmenting feature extraction for multi-scale targets in complex environments. Third, an MHSA mechanism is introduced, which improves detection accuracy for small target rollers. Finally, a deviation judgment method is designed to enhance the generalization ability of our model.

Experiments demonstrate that our model has achieved higher detection accuracy than other existing models while meeting real-time requirements and reducing demand for computational and storage resources, making it easily deployable on resource-constrained devices. Future work will continue to expand the self-built dataset presented here, incorporating samples from different transportation environments to enhance industrial application potential.