Abstract
Existing foreign object detection models for coal mine conveyor belts perform poorly in the complex underground environment: they are prone to false and missed detections of slender foreign objects and small-target foreign objects, and their large size, numerous parameters, heavy computation, and slow inference make them difficult to deploy on edge devices. To address these problems, a foreign object detection algorithm for underground conveyor belts based on an improved YOLOv11 is proposed. Firstly, the ADown downsampling module is incorporated to enhance the detection of small defects and reduce the number of parameters. Secondly, the SegNext attention mechanism is integrated to improve the model's performance in image segmentation. Thirdly, the C3k2 module is optimized by integrating the Light-weight Context Guided mechanism from the CGNet framework, which significantly boosts the model's deployment flexibility and detection speed in complex underground environments. Finally, the lightweight detection head LSCD is adopted to strengthen the model's handling of multi-scale features; its shared convolutional layers effectively reduce the computational load and parameter size. The effectiveness of the enhanced model is further validated through extensive experimental comparisons. The experimental results show that, compared with the original model, the improved model achieves a 1.5% increase in mAP, a 1.2% increase in Precision, and a 2% increase in Recall, together with a 28% reduction in the number of parameters, a 33% decrease in computational load, and a 29% reduction in model storage size. This demonstrates the effectiveness of the method for detecting foreign objects on coal mine conveyor belts and provides a useful reference for real-time foreign object detection on the conveyor belts of underground coal mines.
Introduction
In the modernization process of the coal industry, intelligent construction is not only an urgent need to promote the high-quality development of the coal industry, but also a key technical support for achieving it. To enhance the production efficiency of coal mines, countries around the world are increasingly emphasizing the deep integration of information technology and the coal industry1,2. As a key component of underground coal transportation, the stable functioning of coal mine conveyor belts is crucial for both coal production and operator safety. However, during the production process, foreign objects such as gangue and ironware are often mixed onto the conveyor belt. These foreign objects may tear the conveyor belt, and if not detected in time they may also cause material to accumulate and block the coal drop port or transfer point, leading to problems such as uneven coal distribution, conveyor belt deviation, and wear, thereby seriously threatening production safety. Therefore, leveraging computer vision technology and deep learning neural networks to achieve rapid and accurate identification of foreign objects on coal mine conveyor belts is of great significance for the safe production and intelligent construction of the coal industry3.
Under the complex and harsh environmental conditions underground, the application of cutting-edge technologies such as computer technology and artificial intelligence in the mining field can significantly enhance the efficiency and reliability of coal mining and transportation. Liu Haiqiang et al.4 introduced an enhanced foreign object detection algorithm for coal mine conveyor belts based on YOLOv7. The algorithm enhances image clarity through histogram equalization and incorporates the SOCA module into the YOLOv7 backbone to emphasize critical information. At the same time, an ASPP module is embedded to extract multi-scale context features, thereby optimizing the accuracy of target detection. It effectively addresses the issues of poor image clarity in underground coal mines and the significant localization errors of YOLOv7. Zonglin Li et al.5 implemented a lightweight optimization of the CSPDarkNet53 backbone network by leveraging GhostNetV2, thereby reducing the number of model parameters and alleviating the computational load. Additionally, the headC2f_CA module was designed and a channel attention mechanism was introduced to accurately extract multi-scale foreign object features and enhance feature expression capability. The improved model not only alleviates the hardware pressure on edge devices but also ensures the accuracy of coal mine safety monitoring. Hong Yan et al.6 proposed a foreign object detection method for coal mine conveyor belts based on an improved YOLOv8. Specifically, the Bottleneck structure of the C2f module in the YOLOv8 backbone network is reconstructed into a DSBlock by integrating depthwise separable convolution with a squeeze-and-excitation (SE) network. Additionally, adaptive average pooling and adaptive maximum pooling operations are applied to the input layer of the ECA module, enabling the method to fully satisfy the requirements for real-time detection of foreign objects on coal mine conveyor belts. Gao Han et al.7 introduced a foreign object detection algorithm named Feature Enhancement and Transformer YOLO (FET-YOLO). This algorithm incorporates low-level feature enhancement and a transformer mechanism to address the challenges of detecting elongated objects and extracting weak semantic features in conveyor belt foreign object detection. Li Bin et al.8 enhanced the YOLOv11n model through several modifications: they optimized the C3k2 component using the RFCBAMConv module, developed the Dilated Feature Pyramid Convolution (DFPC) module, and introduced the CSPOK module and ContextGuidedBlock_Down (CGBD) convolution, significantly enhancing the feature extraction capability. The aforementioned research findings sufficiently demonstrate the applicability of the YOLO series of models in this domain and offer valuable reference for this study.
At present, the majority of computer vision-based detection methods in the field of underground conveyor belt foreign object detection have yet to incorporate the more efficient YOLOv11 model. In view of the requirements of real-time detection and actual deployment, this study comprehensively evaluated the detection accuracy, parameter quantity, and computational efficiency of various models, and ultimately selected YOLOv11 as the baseline. The model not only ensures detection accuracy but also optimizes recognition speed, effectively fulfilling the requirements for real-time detection of foreign objects on underground conveyor belts. YOLOv11, one of the advanced real-time object detection models based on deep learning and computer vision, is the latest version of the YOLO series. It inherits the advantages of the previous models and introduces new features and improvements, thereby achieving better performance and detection efficiency.
Based on the above research, this paper improves the YOLOv11 model and proposes a real-time, efficient and effective foreign object detection algorithm for coal mine conveyor belts that remains robust in the complex underground environment. Firstly, the ADown downsampling module is added to improve the detection of small defects and reduce the number of parameters. Secondly, the integration of the SegNext9 attention mechanism enables the model to more effectively leverage global contextual information and to precisely focus on regions of interest, thereby enhancing the performance of image segmentation10. Lastly, the Light-weight Context Guided mechanism from CGNet11 is applied to optimize C3k2, and a self-developed Lightweight Shared Convolutional Detection Head (LSCD) is introduced, significantly enhancing the detection efficiency of the network.
Yolov11 algorithm
As the latest advancement in the YOLO series, YOLOv11 has achieved significant improvements in detection accuracy, operational speed and computational efficiency. Firstly, a core innovation of YOLOv11 compared to YOLOv8 is the replacement of the traditional C2f module with the C3k2 module in the backbone network. The advantage of C3k2 over C2f lies in the fact that C3k2 can use the C3k parameter to determine whether to enable the variable convolution kernel size in the C3k module. This feature gives C3k2 an advantage in scenarios where more flexible feature extraction is required; for instance, when dealing with scenarios that need different receptive fields, C3k can be adjusted to meet specific feature extraction requirements. Therefore, YOLOv11 demonstrates greater flexibility in handling complex features, capable of adapting to various task requirements and enhancing the efficiency of feature extraction12. Secondly, the C2PSA module was introduced, an extended version of the C2f module. By integrating the Pointwise Spatial Attention (PSA) block13, the C2PSA module further enhances the attention mechanism. This significantly boosts the model's capacity to identify critical features and offers more robust support for complex tasks. Furthermore, the C2PSA module not only enhances the capability for multi-scale feature extraction and improves computational efficiency but also refines the spatial pyramid pooling technique14 to enrich the diversity of feature representation. Finally, YOLOv11 incorporates the PAN-FPN structure15 into its neck network, which effectively merges deep and shallow features to achieve a notable enhancement in target localization accuracy. Additionally, two DWConv layers are integrated into the classification branch of the original decoupled detection head. This design significantly reduces the number of model parameters and the computational burden by leveraging the decoupled architecture and depthwise separable convolution (DWConv) operations. YOLOv11 achieves substantial reductions in computational resource demands through optimized network architecture and parameter refinement. This results in faster inference speeds and a more compact model size, making it particularly suitable for devices with limited resources. The YOLOv11 series, a single-stage object detection network, is divided into five versions (n, s, m, l, and x) based on model size. These versions feature progressively increasing depth and width, which correspond to higher detection accuracy but also longer training times. In response to the practical demands of foreign object detection on underground conveyor belts, this study selects YOLOv11n as the base model. While ensuring detection accuracy, it also takes recognition speed into account, meeting the application requirements of real-time foreign object detection on underground coal mine conveyor belts. The structure of YOLOv11 is shown in Fig. 1.
YOLOv11-ASCL detection model
In this paper, the YOLOv11 model is further enhanced based on its original framework. It retains the inherent strengths of the YOLO series and is specifically optimized to address the unique challenges of detecting foreign objects on conveyor belts in underground coal mines, thereby achieving more efficient detection of such objects. Specifically, the following four enhancements are proposed to substantially boost the performance of foreign object detection in underground conveyor belts.
Firstly, the ADown downsampling module is integrated into the model’s backbone network as a substitute for certain conventional Conv layers. This modification enhances the model’s ability to efficiently extract higher-level image features while significantly reducing computational load, thereby markedly boosting its operational efficiency in resource-limited settings. Secondly, to augment the model’s feature extraction capacity, particularly for enhancing the detection precision of small targets, the SegNext_Attention module was incorporated. This module enables the model to more effectively leverage global contextual information and focus more precisely on regions of interest, thereby improving the overall image segmentation performance. Thirdly, C3k2 is refined by incorporating the Light-weight Context Guided module from CGNet. This enhancement endows the model with greater flexibility in capturing contextual information across all stages, while significantly reducing the number of parameters and memory footprint through optimized design. Finally, in the Detection Head part, the LSCD (Lightweight Shared Convolutional Detection Head) was adopted. By leveraging a shared convolutional mechanism, this detection head substantially reduces the number of model parameters, thereby further enhancing the model’s lightweight nature. After the above improvements, the performance of the YOLOv11-ASCL model in the task of detecting foreign objects in underground coal mine conveyor belts has been significantly enhanced. Its structure is shown in Fig. 2.
ADown
During transportation on the underground conveyor belt, there are not only large-volume foreign objects such as large stones, but also small-volume foreign objects such as small stones and slender anchor rods. When the traditional convolution operation (Conv) is adopted for downsampling, it is difficult to effectively capture the features of small targets because the module relies solely on the convolution operation; the image information of fine foreign objects may therefore not be accurately detected, resulting in missed detections. Thus, ADown is incorporated into the enhancements of YOLOv11 to achieve a lightweight design while boosting the model's detection accuracy16. The ADown downsampling module initially applies a 2 × 2 average pooling operation with a stride of 1 to the input features. This approach helps to retain more spatial information and capture fine-grained features, thereby preventing the loss of small-target features during downsampling. Subsequently, the feature map is divided into two parts, with the number of channels in each part halved, which effectively decreases the number of parameters in each pathway. Distinct processing methods are then applied to these two parts to enhance the feature representation. One portion undergoes maximum pooling first, to preserve the maximum value within the local region and accentuate the salient features of small targets, followed by further feature extraction using a 1 × 1 convolution. The remaining portion directly executes a convolutional operation to capture local features and achieve moderate downsampling, thereby reducing the computational load. Finally, the two sub-feature maps are concatenated along the channel dimension; the concatenated feature map provides richer context and detail for the subsequent network. The structure of the ADown module is shown in Fig. 3.
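This flow can be written compactly in PyTorch. The sketch below follows the public ADown implementation from YOLOv9; the Conv helper is a simplified stand-in for the YOLO Conv-BN-SiLU unit, and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv(nn.Module):
    """Simplified Conv-BN-SiLU block standing in for the YOLO Conv unit."""
    def __init__(self, c1, c2, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ADown(nn.Module):
    """ADown downsampling: 2x2 average pooling (stride 1), channel split,
    a strided 3x3 conv branch, and a max-pool + 1x1 conv branch."""
    def __init__(self, c1, c2):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = Conv(c1 // 2, self.c, k=3, s=2, p=1)  # conv branch: moderate downsampling
        self.cv2 = Conv(c1 // 2, self.c, k=1, s=1, p=0)  # 1x1 conv after max pooling

    def forward(self, x):
        # 2x2 average pooling with stride 1 retains fine spatial detail
        x = F.avg_pool2d(x, kernel_size=2, stride=1, padding=0)
        x1, x2 = x.chunk(2, dim=1)                       # halve channels per branch
        x1 = self.cv1(x1)                                # local features + downsampling
        x2 = F.max_pool2d(x2, kernel_size=3, stride=2, padding=1)  # keep salient small-target responses
        x2 = self.cv2(x2)
        return torch.cat((x1, x2), dim=1)                # richer concatenated context
```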
SegNext_attention
In order to enable the improved model to better utilize global context information, improve the performance of image segmentation, and further enhance the model's performance in foreign object detection on the underground conveyor belt, the SegNext Attention mechanism is added to the last part of the Backbone. SegNeXt is a novel convolutional neural network architecture for semantic segmentation that enables efficient and high-performance semantic segmentation by rethinking convolutional attention design. The essence of SegNeXt lies in its convolutional encoder, MSCAN (Multi-Scale Convolution Attention Network), which employs a hierarchical structure to create a pyramid-like framework for capturing multi-scale contextual feature information. The architecture of MSCAN is illustrated in Fig. 4 (a). The decoder utilizes the Hamburger17 structure, which facilitates the extraction of multi-scale contextual features from local to global by further capturing global context information. Among these components, MSCA (Multi-scale Convolutional Attention) serves as the core module of the architecture. It is composed of a depth-wise convolution, multi-branch depth-wise convolutions, and a 1 × 1 convolution. Specifically, the depth-wise convolution is employed to extract local feature information, the multi-branch depth-wise convolutions capture multi-scale contextual feature information, and the 1 × 1 convolution models the inter-channel correlations. Ultimately, the output of the 1 × 1 convolution serves as the attention weighting factor, which modulates the input features of the MSCA module to produce the final output. The mathematical formulations of the MSCA are presented in Eqs. (1) and (2).
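$$Att = Conv_{1 \times 1}\left( \sum_{i=0}^{3} Scale_i\left( DW\text{-}Conv(F) \right) \right) \quad (1)$$

$$Out = Att \otimes F \quad (2)$$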
Here, F represents the input feature, Att indicates the attention weight parameter, and Out signifies the output feature. ⊗ represents the element-wise multiplication operation, while DW-Conv indicates the depth-wise convolution operation. \(Scale_{i} ,i \in \{ 0,1,2,3\}\) represents the i-th branch depicted in Fig. 4 (b), where Scale0 denotes the direct connection. In the remaining three branches, a standard depth-wise convolution with a large kernel is approximated by employing two consecutive depth-wise strip convolutions in different orientations. The kernel sizes for these three depth-wise strip convolutions are set to 7, 11, and 21, respectively. The Attention block is constructed by concatenating the MSCA block with two 1 × 1 convolutional layers and GELU18 activation layers at both the head and tail. This block is then integrated with the rest of the MSCAN module as shown in Fig. 4 (a). The complete block is formed by stacking L identical such blocks, resulting in the MSCAN module19.
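For illustration, a minimal PyTorch sketch of the MSCA block described above follows; the 5 × 5 depth-wise kernel and the branch layout follow the public SegNeXt implementation, and variable names are illustrative.

```python
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-Scale Convolutional Attention, per Eqs. (1)-(2)."""
    def __init__(self, dim):
        super().__init__()
        # Depth-wise convolution extracts local features
        # (5x5 in the public SegNeXt implementation)
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # Three branches approximate large 7/11/21 kernels with pairs of
        # depth-wise strip convolutions in orthogonal orientations
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            )
            for k in (7, 11, 21)
        )
        # 1x1 convolution models inter-channel correlations
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.dw(x)
        # Scale_0 is the identity branch; strip-conv branches are summed onto it
        attn = attn + sum(branch(attn) for branch in self.branches)
        attn = self.pw(attn)   # Eq. (1): attention weights Att
        return attn * x        # Eq. (2): Out = Att elementwise-multiplied with F
```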
C3k2_ContextGuided
To enhance the model's deployment flexibility and detection speed in complex underground environments, the Light-weight Context Guided module from CGNet was integrated into C3k2, thereby optimizing and upgrading it. An examination of the C3k2 module reveals that it adeptly combines the speed advantages of the C2f module with the adaptability of the C3k module. The C3k2 module is characterized by its dynamic runtime selection feature, which allows it to determine whether to activate the C3k layer for feature processing based on specific requirements. This endows the module with exceptional configurability. Compared with the fixed structure of the C2f module, the C3k2 module leverages the C3k parameter to flexibly adjust the convolutional kernel size as needed. This flexibility enables it to excel in situations that demand versatile feature extraction; for instance, when facing different receptive field requirements, the C3k parameter can be adjusted to precisely match the specific needs of feature extraction. The C2f module, with its fixed structure and speed advantage, is more suitable for scenarios with strict computing resource constraints.
Next, the structure of the Context Guided Block (CGBlock), as shown in Fig. 5, consists of four main parts: the local feature extractor floc(*), the surrounding context extractor fsur(*), the joint feature extractor fjoi(*), and the global context extractor fglo(*); here (·) denotes element-wise multiplication. The input feature map initially passes through a 1 × 1 convolutional layer to adjust the channel count or transform the features. The local feature extractor floc(*) captures local features through a standard 3 × 3 convolutional layer, whereas the surrounding context extractor fsur(*) obtains surrounding features using a 3 × 3 dilated convolutional layer, which enlarges the receptive field. Subsequently, the local and surrounding features are concatenated and then processed through batch normalization (BN) and a PReLU activation function to generate the fused feature map fjoi(*). The fused feature map is then input into the global feature extraction module fglo(*), which includes Global Average Pooling (GAP)20 and a Fully Connected Layer (FC)21 to extract and refine the global features. Finally, the output feature map is generated by element-wise multiplication of the global features with the fused feature map22. In summary, the Context Guided Block enriches the feature map representation by integrating local and surrounding features with global features, thereby substantially enhancing the model's performance.
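A minimal PyTorch sketch of this block follows. The channel split, the dilation rate (2), and the reduction ratio (16) are illustrative assumptions; CGNet varies such settings by stage.

```python
import torch
import torch.nn as nn

class ContextGuidedBlock(nn.Module):
    """Sketch of the CGBlock described above (assumed hyperparameters)."""
    def __init__(self, c_in, c_out, dilation=2, reduction=16):
        super().__init__()
        c = c_out // 2
        # 1x1 conv adjusts the channel count / transforms features
        self.reduce = nn.Sequential(
            nn.Conv2d(c_in, c, 1, bias=False),
            nn.BatchNorm2d(c),
            nn.PReLU(c),
        )
        # f_loc: standard 3x3 channel-wise conv captures local features
        self.f_loc = nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False)
        # f_sur: 3x3 dilated conv captures surrounding context
        self.f_sur = nn.Conv2d(c, c, 3, padding=dilation, dilation=dilation,
                               groups=c, bias=False)
        # f_joi: BN + PReLU fuse the concatenated local/surrounding features
        self.f_joi = nn.Sequential(nn.BatchNorm2d(2 * c), nn.PReLU(2 * c))
        # f_glo: GAP + FC layers (as 1x1 convs) yield per-channel global weights
        self.f_glo = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * c, max(2 * c // reduction, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(max(2 * c // reduction, 1), 2 * c, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.reduce(x)
        joi = self.f_joi(torch.cat([self.f_loc(x), self.f_sur(x)], dim=1))
        return joi * self.f_glo(joi)  # element-wise refinement by global context
```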
A new type of C3k2_Context Guided module is proposed by integrating the Light-weight Context Guided module in CGNet into the C3k2 module. The experimental results highlight the notable benefits of the new module: it retains the flexibility of C3k2 while integrating the efficient feature fusion capabilities of Context Guided. This fusion preserves the model’s high-accuracy segmentation performance while also achieving substantial reductions in the number of model parameters and computational complexity. These enhancements enable the model to operate efficiently on mobile devices with limited resources, while boosting its adaptability to various scenarios and conditions, thereby strengthening its generalization capabilities. These improvements have notably elevated the efficiency of detecting foreign objects on underground conveyor belts and have delivered substantial progress to associated applications.
LSCD detection head
The Lightweight Shared Convolutional Detection Head (LSCD) serves as an efficient detection mechanism tailored for target detection tasks. The mechanism employs shared convolutional layers to minimize computational load and parameter size, thereby ensuring efficient detection outcomes. To elevate the performance of LSCD, the structure integrates detail-enhanced convolution (DEConv23) to optimize the original convolutional framework. DEConv integrates parallel standard convolution and differential convolution, with the latter being specifically designed to tackle issues such as image defogging. Furthermore, leveraging the reparameterization technique24, DEConv can be seamlessly converted into a standard convolution, eliminating the need for extra parameters or increased computational load. The formula is provided in Eq. (3) (bias is omitted for simplicity).
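$$Out = DEConv(F) = \sum_{i=1}^{5}\left(F * K_i\right) = F * \left(\sum_{i=1}^{5} K_i\right) = F * K_{cvt} \quad (3)$$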
The notation DEConv(·) denotes the DEConv operation. The kernels \(K_i, i = 1,\ldots,5\) correspond to the convolution kernels of VC, CDC, ADC, HDC, and VDC, respectively. The symbol "∗" indicates the convolution operation, while Kcvt represents the converted convolution kernel, which enables the parallel convolutions to be merged. Traditional convolutional neural networks employ a static convolutional kernel for feature extraction, thereby constraining their capacity to identify nuanced defects. In particular, in scenarios characterized by complex backgrounds and the need for multi-scale target detection, the performance of conventional networks may be substantially compromised. By integrating multi-scale features and merging local and global information, the DEC module can effectively capture the nuances of subtle defects, thereby enhancing detection accuracy and system robustness.
Detail Enhanced Convolution (DEConv) comprises five parallel convolutional layers, including: vanilla convolution (VC), center difference convolution (CDC)25, angular difference convolution (ADC), horizontal difference convolution (HDC), and vertical difference convolution (VDC). The detailed structure is illustrated in Fig. 6.
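Assuming the five difference convolutions have already been rewritten as ordinary 3 × 3 kernels, the reparameterization of Eq. (3) reduces to a simple kernel sum. The helper below is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_deconv(branches):
    """Fold the five parallel 3x3 DEConv branches (VC, CDC, ADC, HDC, VDC)
    into one equivalent convolution, per Eq. (3); bias is omitted.
    Each branch is an nn.Conv2d whose difference operation has already
    been expressed as an ordinary kernel of identical shape."""
    k_cvt = torch.stack([b.weight for b in branches], dim=0).sum(dim=0)
    fused = nn.Conv2d(
        branches[0].in_channels, branches[0].out_channels,
        kernel_size=3, padding=1, bias=False,
    )
    fused.weight.copy_(k_cvt)  # K_cvt: the converted kernel
    return fused
```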
In the LSCD architecture, the application of GNConv26 has been demonstrated in the research of FCOS27 to enhance the performance of the detection head in positioning and classification tasks. This technique is effectively employed in practice to mitigate the potential loss of feature extraction accuracy that may arise during the lightweight design process. LSCD also incorporates a shared convolution mechanism, which substantially reduces the model’s parameter count, rendering it more lightweight and particularly well-suited for deployment on resource-constrained devices. To address the issue of inconsistent target scales that different detection heads may encounter, LSCD utilizes a Scale layer to adjust the feature scales accordingly, thereby enhancing the detection outcomes. The detailed structure of LSCD is depicted in Fig. 7.
Firstly, each input feature map is processed through a shared convolutional layer with a 1 × 1 convolutional kernel. This step aims to facilitate the exchange of information across channels to accelerate computation. Subsequently, the feature maps are fed into a unified 3 × 3 convolutional layer (denoted as Conv_GN 3 × 3). This layer processes all input feature maps to integrate features and share information to reduce the model’s parameter count and computational demands. Finally, the information extracted from the shared convolutional layer will be sent to the classification and regression module and undergo feature adjustment through the Scale layer to enhance the model’s performance in handling multi-scale features.
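The flow above can be sketched as follows. The hidden width, the GroupNorm group count, the class and regression output sizes, and the choice to apply the Scale layer to the regression branch are assumptions for illustration; DEConv is replaced by a plain Conv_GN unit for brevity.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable per-level scalar that rebalances feature magnitudes."""
    def __init__(self, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale

def conv_gn(c_in: int, c_out: int, k: int) -> nn.Sequential:
    """Conv + GroupNorm + SiLU: the Conv_GN unit referenced above."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.GroupNorm(16, c_out),
        nn.SiLU(),
    )

class LSCDHead(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), hidden=64, nc=2, reg_max=16):
        super().__init__()
        # Per-level 1x1 Conv_GN layers exchange channel information
        self.align = nn.ModuleList(conv_gn(c, hidden, 1) for c in in_channels)
        # One 3x3 Conv_GN shared by all pyramid levels cuts parameters
        self.shared = conv_gn(hidden, hidden, 3)
        self.cls = nn.Conv2d(hidden, nc, 1)            # classification branch
        self.reg = nn.Conv2d(hidden, 4 * reg_max, 1)   # box regression branch
        self.scales = nn.ModuleList(Scale() for _ in in_channels)

    def forward(self, feats):
        outs = []
        for f, align, scale in zip(feats, self.align, self.scales):
            f = self.shared(align(f))
            # Scale layer adjusts per-level feature scales before regression
            outs.append((self.cls(f), scale(self.reg(f))))
        return outs
```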
Experiments and results
Datasets
The foreign object dataset of the coal mine belt conveyor used in this experiment is derived from the coal mine-specific video AI analysis dataset developed by the Intelligent Detection and Pattern Recognition Research Center of China University of Mining and Technology (China University of Mining and Technology—Belt Conveyor, CUMT-BeIT)28. This dataset was collected from actual underground belt conveyor operation scenarios in mines, covering the operation images of belt conveyors under various typical working conditions. A total of 4,832 images were collected, covering large rocks, anchor rods, coal slurry on the conveyor belt, and normal coal flow. These images span a broad spectrum of complex scenarios, including different lighting conditions (such as dark and bright environments), differences in the distance between foreign objects and the conveyor belt, and diversity in the shapes and sizes of foreign objects, aiming to comprehensively cover all types of foreign object features. Foreign objects are categorized into two primary types: stones and anchor rods. The stones exhibit a wide range of sizes, from small pebbles to larger rocks; the anchor rods differ in length and diameter and appear in the images at various angles. The images were annotated with a labeling tool, mainly using rectangular bounding boxes to match the shape characteristics of the foreign objects.
However, preliminary analysis revealed that the sample size of the dataset is limited and the category distribution is unbalanced, making it difficult to meet the requirements of model training and validation. Therefore, various data augmentation techniques were employed to expand the dataset. Specific techniques include horizontal flipping, random rotation (primarily by 90°), scaling the image with a quarter overlap, and adjusting the brightness, contrast and color of the image to emulate scenes under different lighting conditions. After this processing, the dataset was augmented to a total of 10,646 images, uniformly resized to a resolution of 640 × 640 pixels for subsequent processing. Finally, the expanded self-built dataset was divided into a training set, a validation set and a test set in a ratio of 8:2:1 to support the training, validation and testing of the model.
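The Albumentations-based sketch below approximates the augmentation operations listed above; the probability and jitter values are assumed, not the authors' exact settings, and bounding boxes (YOLO format) are transformed together with the images.

```python
import albumentations as A

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                  # horizontal flip
        A.RandomRotate90(p=0.5),                  # rotation by a multiple of 90 degrees
        A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.5),
        A.HueSaturationValue(p=0.5),              # color adjustment
        A.Resize(640, 640),                       # unify resolution
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = augment(image=img, bboxes=boxes, class_labels=labels)
```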
Experimental equipment and evaluation indicators
The model is developed in Python on the PyTorch framework, with training accelerated through CUDA 11.3. The hardware configuration is an AMD Ryzen 9 7945HX CPU and an NVIDIA GeForce RTX 3070 GPU with 8 GB of video memory. During training, the input image size is 640 × 640, the optimizer is SGD, the training spans 200 epochs, the batch size is 16, and the initial learning rate is 0.01.
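Under the Ultralytics training interface, these settings correspond to a call such as the following; "yolov11-ascl.yaml" and "cumt-belt.yaml" are hypothetical file names standing in for the modified model definition and the dataset configuration.

```python
from ultralytics import YOLO

# Hypothetical model YAML describing the modified network
model = YOLO("yolov11-ascl.yaml")
model.train(
    data="cumt-belt.yaml",  # hypothetical dataset configuration
    imgsz=640,              # input image size
    epochs=200,             # training epochs
    batch=16,               # batch size
    optimizer="SGD",
    lr0=0.01,               # initial learning rate
)
```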
In the model performance evaluation, mean average precision (mAP) serves as the primary metric for recognition accuracy, while GFLOPs (billions of floating-point operations) is a key parameter for assessing model complexity. The higher the GFLOPs value, the greater the computational load of the model, usually requiring more computing resources and possibly longer inference time, which in turn increases time complexity. Thus, GFLOPs play a crucial role in evaluating the computational complexity of models. Time complexity, which is directly linked to the model's real-time performance, is particularly critical in real-time detection tasks. Moreover, the number of parameters indicates the model's scale and reflects its spatial complexity. Spatial complexity directly impacts the model's storage demands and deployment efficiency, which is particularly significant for devices with limited memory. The model's computational demands are closely tied to the number of parameters and GFLOPs. Precision measures the proportion of objects identified as targets by the model that are truly targets, whereas recall reflects the proportion of actual positive samples that the model correctly predicts as positive.
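For reference, these metrics follow their standard definitions, where TP, FP and FN denote true positives, false positives and false negatives, and N is the number of classes:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$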
To validate the rationality of the enhanced algorithm, and in line with the method's lightweight goal, model size is included among the evaluation metrics. Mean Average Precision (mAP), Parameters, computational cost (GFLOPs), Precision, and Recall are likewise used to assess the improved network.
Results of the experiment
This section mainly includes: ① Analysis of the effectiveness of the improved modules: the experimental results of the enhancements in each component of YOLOv11-ASCL are individually compared to validate the efficacy of each improvement. ② Ablation study: to further validate the cumulative enhancement effects on the model after sequentially integrating each improvement. ③ Comparative experiment: by comparing current mainstream models with YOLOv11-ASCL, the superiority of YOLOv11-ASCL is demonstrated. ④ Visual analysis of detection results: to verify the effectiveness of YOLOv11-ASCL in real-world applications, the model's actual detection performance is visualized, thereby further substantiating the practicality of the enhanced model. The comparative outcomes of these four experiments jointly substantiate the superiority of YOLOv11-ASCL.
Effectiveness analysis of improved module
To verify the effectiveness of each improvement module, comparative experiments were conducted between the original YOLOv11n model and each individual enhancement component. During training, each improved module was evaluated in the same experimental environment, and the detection results are compared with those of the YOLOv11n model in Table 1.
As shown in Table 1, the improved structural model, compared with YOLOv11n, not only achieves the goal of model lightweighting but also maintains stable detection accuracy. Different degrees of optimization are realized across the various indicators, effectively enhancing the detection speed of the model while reducing its complexity. YOLOv11-ADown enhances the backbone by substituting the traditional convolutional layer with the ADown downsampling module. While this results in a slight loss of precision, it significantly optimizes the other parameters: the mAP increased by 0.7%, the number of parameters decreased by 0.4 M, the computational load decreased by 1G, the model size shrank by 1 MB, and recall improved by 1%. These changes meet the algorithm's requirement for lightness while preserving accuracy. YOLOv11-SegNext integrates the SegNext_Attention mechanism into the core of the original model. Although this addition increases the model's computational load to some extent, it enhances the model's detection accuracy: the mAP rose by 0.3%, precision by 0.2%, and recall by 1%, demonstrating that incorporating the SegNext_Attention mechanism effectively improves detection accuracy. YOLOv11-ContextGuided integrates the Light-weight Context Guided module to enhance and replace the original C3k2. The experimental results confirm that this improvement achieves model lightweighting, reducing the number of parameters by 0.4 M, computation by 0.9G, and model size by 0.8 MB, while also improving precision by 0.6%. Finally, incorporating the lightweight detection head LSCD results in a parameter reduction of 0.1 M, a computational decrease of 0.7G, and a model size reduction of 0.4 MB; precision is enhanced by 0.5% and recall increases by 1%. Despite a slight dip in mAP, the enhancements to the detection head significantly bolster overall model performance. Compared to the original YOLOv11n algorithm, the final YOLOv11-ASCL algorithm demonstrates a 1.5% increase in mAP, a 0.7 M decrease in parameters, a 2.1G decrease in computational load, a 1.6 MB decrease in model size, a 1.2% improvement in precision, and a 2% increase in recall. These enhancements confirm that the final improved model is comprehensively optimized relative to the original model.
To deeply explore the impact of each improvement module on target detection performance under typical harsh conditions such as edge blurring, low resolution and complex lighting, this paper further compares the detection results before and after the improvements through heatmaps. Fig. 8 shows the comparison of heatmaps before and after the application of the improved modules to the YOLOv11n model under conditions of edge blurring, low resolution and complex lighting. The heatmaps generated by each of the improved modules clearly illustrate the key areas the model focuses on in the input image, reflecting the effectiveness of the improved modules in enhancing the model's feature expression ability. Comparing the heatmaps before and after the improvement, it can be clearly observed that the model's response to the target object is more concentrated and accurate. Especially under blurred edges, low resolution, and complex lighting, the improved modules effectively enhance the model's detection robustness and accuracy. These visual results provide intuitive and compelling support for the effectiveness of each module in practical applications.
Ablation experiments
To further validate the extent of performance optimization achieved by the individual improvements and by incremental combinations of the different methods in the YOLOv11-ASCL model, a series of ablation experiments was designed for comparative analysis. To ensure the accuracy and fairness of the experiments, consistent parameters are employed throughout the training process.
The experiments cover the original YOLOv11n algorithm, YOLOv11-ADown (replacing the traditional convolution), YOLOv11-SegNext (the attention mechanism), YOLOv11-ContextGuided (enhancing the backbone), and YOLOv11-LSCD (improving the detection head). The four enhancements are integrated in sequence and progressively refined on top of the original YOLOv11n model to obtain the final optimized model. The results are presented in Table 2, with "√" denoting the incorporation of the respective method.
The results of the ablation experiments show that, compared with the YOLOv11n model, replacing the traditional convolution with ADown increased the mAP by 0.7% while decreasing the number of parameters by 0.4 M, the computational load by 1G, and the model size by 1 MB. On this basis, the ContextGuided-improved C3k2 was further adopted. Although the average accuracy declined slightly, the model's lightweighting was further enhanced: the number of parameters was reduced by 0.3 M, the computational load by 0.4G, and the model size by 0.4 MB. Overall, the fusion of this improved part is conducive to enhancing the performance of the model. Then, the LSCD detection head was added. Compared with the fusion in the previous step, the mAP increased by 0.3%, the number of parameters decreased by 0.1 M, the computational load decreased by 0.8G, and the model size decreased by 0.4 MB. The final step integrates the SegNext attention mechanism to form the final enhanced model, YOLOv11-ASCL. Compared with the previous iteration, although this addition is somewhat less favorable for model lightweighting, it yields a 0.7% increase in mAP, significantly boosting the model's detection accuracy. Overall, this integration is advantageous for the comprehensive optimization of the detection model.
To provide a more intuitive comparison of the impact of different improvement modules on detection performance in the ablation study, the mAP@0.5 convergence curves of each model during training are plotted, as shown in Fig. 9. The experiments include the following models:
- YOLOv11n: the original baseline model.
- YOLOv11-ADown: replaces the traditional convolution in YOLOv11n with the ADown module.
- YOLOv11-AC: integrates the ADown module and the ContextGuided module (i.e., YOLOv11-ContextGuided).
- YOLOv11-ACL: based on YOLOv11-AC, further improves the detection head by introducing LSCD.
- YOLOv11-ASCL: adds the SegNext attention mechanism to YOLOv11-ACL, yielding the final optimized model.
As shown in Fig. 9, all improved models exhibit a rapid increase in performance during the early stages of training and converge after approximately 50 epochs. Compared with the original YOLOv11n, the YOLOv11-ASCL model, which integrates multiple improvement modules, demonstrates superior convergence speed and a higher final mAP value, consistent with the comparative results presented in Table 2.
The experiments demonstrate that the YOLOv11-ASCL network model satisfies the requirements for lightweighting, and that its four primary enhancements for detecting foreign objects on coal mine conveyor belts exhibit a progressive improvement trend. This result substantiates the rationality and efficacy of the enhanced method, which notably improves the accuracy of foreign object detection on conveyor belts while ensuring lightweight deployment and maintaining robust algorithm performance.
Comparative experiments
To thoroughly assess the efficacy of the enhanced YOLOv11-ASCL algorithm for target detection in underground coal mine environments, this study selected eight prominent target detection models (YOLOv5, YOLOv6, YOLOv8, YOLOv9t, YOLOv10, YOLOv11n, YOLOv12n and YOLOv13n) and conducted comparative experiments against the improved YOLOv11-ASCL. During the experiments, the hyperparameters and training parameters of all models were set to their defaults, and the models were trained on the standardized dataset. The experimental results show that the improved YOLOv11-ASCL outperforms the other models in terms of the number of parameters, computational load and model size, fully demonstrating the advantages of its lightweight design. The relevant results are detailed in Table 3.
As can be seen from Table 3, the parameter quantity of YOLOv11-ASCL is only 1.8 M and the computational load is 4.2G. Compared with YOLOv6 and YOLOv11n, the parameter quantity is 56% and 28% lower respectively, and the computational load is 63% and 33.33% lower, demonstrating a significant lightweight advantage. The model size is 3.9 MB, notably smaller than the other models, and YOLOv11-ASCL also outperforms YOLOv12n and YOLOv13n in all metrics. This indicates that YOLOv11-ASCL is more suitable for resource-constrained scenarios and significantly more efficient than the other models. As can be seen more clearly from Fig. 10, YOLOv11-ASCL also outperforms the other models in accuracy: the mAP@0.5 reaches 94.5%, 1.5% higher than that of YOLOv11n, giving better overall accuracy.
In the object detection task, Box-Loss is the core metric for measuring the bounding box regression ability of the model; it directly reflects the degree of deviation between the predicted box and the ground-truth box. Compared with the classification loss or other auxiliary indicators, Box-Loss better reflects the model's target localization accuracy. Comparing the Box-Loss trends of different models therefore not only reveals differences in convergence speed and training stability, but also visually reflects the models' relative detection accuracy and generalization ability. Accordingly, the experiment compares the improved model with YOLOv11 on Box-Loss, as shown in Fig. 11.
Over the complete 200-epoch training process, the overall performance of the improved YOLOv11-ASCL model on Box-Loss is clearly superior to that of the original YOLOv11. Specifically, in the early stage of training, the initial Box-Loss of YOLOv11-ASCL is lower and declines faster, demonstrating a more efficient convergence ability. As training progresses, both models enter a stable convergence phase in the mid-term, but YOLOv11-ASCL always remains slightly below YOLOv11, showing a more stable optimization trend. In the later stage of training, the advantage of YOLOv11-ASCL becomes still more apparent: at the 200th epoch, its Box-Loss reaches 0.568, significantly lower than YOLOv11's 0.607. This result indicates that the proposed improvements not only accelerate the convergence of the model but also effectively enhance target localization accuracy and generalization ability, confirming the method's application potential in complex scenarios.
Visualization and analysis of test results
To visually evaluate the foreign object detection performance on conveyor belts in complex underground mine environments, this paper selected models from the YOLOv5 to YOLOv13 series as well as the improved YOLOv11-ASCL algorithm. The detection results were comprehensively compared under five typical working conditions: edge blurring, low resolution, complex lighting, small-target foreign objects, and slender foreign objects (corresponding to subfigures a, b, c, d and e in Fig. 12). In scenes with complex lighting conditions and distant targets, the contours of stones are blurred and the image resolution is significantly reduced. Especially for smaller or slender stones, traditional models show varying degrees of decline in detection accuracy and missed detections, making it difficult to effectively capture the details of the targets. In contrast, YOLOv11-ASCL, by introducing an adaptive context-aware module and an enhanced feature fusion mechanism, effectively improves the model's sensitivity and discrimination ability towards low-quality image features. It can accurately identify stones of different sizes and distances on the conveyor belt, markedly reducing false and missed detections and improving the overall detection performance.
To deeply analyze the differences in detection performance among various models, the corresponding heatmaps were further generated (refer to Fig. 12). In the heatmap, areas with lighter colors indicate lower model confidence, while areas with darker colors correspond to parts with higher confidence. Under conditions such as insufficient light, reduced resolution and blurred target edges, the attention distribution of some models from YOLOv5 to YOLOv13 is limited, failing to fully cover the target area, resulting in insufficient capture of key details and differences in detection stability. The improved YOLOv11-ASCL algorithm, through an optimized attention mechanism, achieves broader and more uniform attention to the key areas of the target, significantly enhancing the confidence level. This effectively improves the accuracy of target positioning and the stability of detection, fully demonstrating its advantages in the detection of foreign objects on conveyor belts in complex mine environments.
To verify the real-time detection capability of YOLOv11-ASCL under high-speed belt operation, video multi-frame detection under typical working conditions was carried out in addition to the original single-frame static results. Under the same parameter settings, consecutive frames of the video (i.e., at different time points) are detected, showing the position changes, detection stability and continuity of the same target over the time dimension. Multiple independent experiments demonstrate the stability of the model in target tracking and recognition over continuous time series. Fig. 13a and b show the multi-frame detection results of YOLOv11n and YOLOv11-ASCL under high-speed belt operation. Even when the target position changes rapidly over time and the background texture continuously moves, YOLOv11-ASCL can still accurately locate and identify the foreign objects without obvious missed or incorrect detections, and the detection box moves smoothly with the target. This indicates that YOLOv11-ASCL not only offers high detection accuracy on single static frames, but also maintains good detection stability and robustness in high-speed dynamic scenarios.
Conclusions
To address problems such as the difficulty of extracting foreign object features during transportation on coal mine conveyor belts, caused by the complex underground environment, a foreign object detection algorithm for underground conveyor belts named YOLOv11-ASCL was designed based on the YOLOv11n architecture. This algorithm introduces the ADown downsampling module, enabling the model to efficiently capture higher-level image features and thereby enhancing operational efficiency. Meanwhile, the C3k2_ContextGuided module, an enhancement of C3k2, is introduced to empower the model to more adaptively capture contextual information at each stage. Additionally, the SegNext_Attention module is incorporated to further refine detection accuracy for small targets. Finally, the lightweight detection head LSCD is employed to further optimize the model's lightweight characteristics by significantly reducing the number of parameters through the shared convolution mechanism.
The refined YOLOv11-ASCL algorithm demonstrates outstanding performance in detecting foreign objects on conveyor belts against complex coal mine backgrounds, achieving a recognition accuracy of 94.5%, 1.5% higher than that of the original model, and significantly reducing false and missed detections of small targets. Moreover, the algorithm achieves notable optimization in model performance: the number of parameters is reduced by 0.7 M, the computational load by 2.1G, and the model size by 1.6 MB, while Precision increases by 1.2% and Recall by 2%. These enhancements not only achieve model lightweighting but also significantly boost the model's deployment flexibility and detection efficiency in complex underground environments, providing an efficient, accurate, lightweight, and easy-to-deploy solution for research and applications in related fields.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
References
1. Martinez-Rau, L. S., Zhang, Y., Oelmann, B. & Bader, S. On-device anomaly detection in conveyor belt operations. Preprint at arXiv:2411.10729 (2024).
2. Yuan, X., Wu, Y., Sun, L. & Wang, X. Research on efficient construction paths for intelligent coal mines in China from the configuration perspective. Appl. Sci. 13(1), 673. https://doi.org/10.3390/app13010673 (2023).
3. Yao, R. et al. A foreign object detection method for belt conveyors based on an improved YOLOX model. Technologies 11(5), 114. https://doi.org/10.3390/technologies11050114 (2023).
4. Haiqiang, L., Yecheng, G., Xiaojing, C. et al. Foreign object detection algorithm for coal mine conveyor belt based on improved YOLOv7. Inst. Techn. Sensor (10), 95–99 (2024) (in Chinese).
5. Zonglin, L. et al. Research on the detection of foreign objects in coal mine belt conveying based on improved YOLOv8n. Min. Saf. Environ. Prot. 51(4), 41–48 (2024) (in Chinese).
6. Yan, H. et al. Foreign object detection of coal mine conveyor belt based on improved YOLOv8. J. Mine Autom. 50(6), 61–69 (2024) (in Chinese).
7. Han, G. et al. Coal mine conveyor belt foreign object detection based on feature enhancement and Transformer. Coal Sci. Technol. 52(7), 199–208 (2024) (in Chinese).
8. Li, B. & Li, S. Improved YOLOv11n small object detection algorithm in UAV view. Comput. Eng. Appl. 1–11. http://kns.cnki.net/kcms/detail/11.2127.TP.20241223.1319.020.html (2025) (in Chinese).
9. Guo, M. H. et al. SegNeXt: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 35, 1140–1156 (2022).
10. Sun, Y. et al. MSCA-Net: Multi-scale contextual attention network for skin lesion segmentation. Pattern Recogn. 139, 109524 (2023).
11. Wu, T. et al. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 30, 1169–1179 (2020).
12. Sharma, A., Kumar, V. & Longchamps, L. Comparative performance of YOLOv8, YOLOv9, YOLOv10, YOLOv11 and Faster R-CNN models for detection of multiple weed species. Smart Agric. Technol. 9, 100648. https://doi.org/10.1016/j.atech.2024.100648 (2024).
13. Zhao, H., Zhang, Y., Liu, S. et al. PSANet: Point-wise spatial attention network for scene parsing. In Proc. European Conference on Computer Vision (ECCV) 267–283 (2018).
14. Sharma, P., Patel, R. & Kumar, M. Enhanced atrous spatial pyramid pooling feature fusion for small ship instance segmentation. J. Imag. 10(2), 42. https://doi.org/10.3390/jimaging10020042 (2024).
15. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B. & Belongie, S. Feature pyramid networks for object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2117–2125 (2017).
16. Howard, A. G. et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. Preprint at arXiv:1704.04861 (2017).
17. Geng, Z., Sun, J., Li, W., Liu, S., Ji, X. & Ding, E. Is attention better than matrix decomposition? Preprint at arXiv:2109.04553 (2021).
18. Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). Preprint at arXiv:1606.08415 (2016).
19. Li et al. SegNeXt: Rethinking convolutional attention design for semantic segmentation. In NeurIPS (2023).
20. Lin, M., Chen, Q. & Yan, S. Network in network. In Proc. International Conference on Learning Representations (ICLR) (2014).
21. Basha, S. H. S. et al. Impact of fully connected layers on performance of convolutional neural networks for image classification. Neurocomputing 378, 112–119 (2020).
22. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).
23. Chen, Z., He, Z. & Lu, Z. M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. (2024).
24. Ding, X., Zhang, X., Ma, N. et al. RepVGG: Making VGG-style ConvNets great again. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 13733–13742 (2021).
25. Yu, Z., Zhao, C., Wang, Z. et al. Searching central difference convolutional networks for face anti-spoofing. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 5294–5304 (2020).
26. Wu, Y. & He, K. Group normalization. In Proc. European Conference on Computer Vision (ECCV) 3–19 (2018).
27. Tian, Z., Shen, C., Chen, H. & He, T. FCOS: Fully convolutional one-stage object detection. IEEE Trans. Pattern Anal. Mach. Intell. 44(4), 1426–1441 (2022).
28. Deqiang, C. et al. Lightweight network based on residual information for foreign body classification on coal conveyor belt. J. China Coal Soc. 47(3), 1361–1369 (2022) (in Chinese).
Author information
Contributions
Conceptualization: J.R.L. Data curation: J.R.L. Formal analysis: X.P.Y. Funding acquisition: Z.B.F. Investigation: Z.B.F. Methodology: J.R.L. Project administration: J.R.L. Resources: X.P.Y. Software: Z.B.F. Supervision: X.P.Y. Validation: J.R.L. Visualization: Z.B.F. Writing - original draft: J.R.L. Writing - review & editing: R.N.L.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ling, J., Fu, Z., Yuan, X. et al. Development of a deep learning-based foreign object detection algorithm for coal mine conveyor belts. Sci Rep 15, 42291 (2025). https://doi.org/10.1038/s41598-025-22636-5