Introduction

Coal is a significant energy resource; however, during the mining process, the inclusion of coal gangue reduces the calorific value of coal and compromises its quality. Coal gangue contains heavy metals and other hazardous substances, and its combustion releases many harmful gases, leading to severe environmental pollution. The separation of coal gangue not only enhances coal quality but also mitigates environmental pollution and promotes the efficient utilization of resources.

The traditional manual gangue sorting method is inefficient, labour-intensive, and prone to misjudgment and omission. Mechanical sorting methods, such as heavy-medium sorting and jigging, can significantly reduce misjudgment, but they involve complex equipment1,2, high costs, and stringent requirements on the particle size of the gangue3.

Machine vision-based inspection technology does not require complex mechanical equipment4,5. It captures images of coal and gangue using image acquisition devices and extracts features through image processing algorithms6,7, enabling the identification and localization of gangue8. However, the detection accuracy of this technology for coal gangue remains insufficient. With the rapid advancement of artificial intelligence, accurate detection of coal gangue has become feasible9,10,11. By employing artificial intelligence algorithms, such as convolutional neural networks (CNN)12,13,14, you only look once (YOLO) series15,16,17, and pyramid scene parsing network (PSPNet)18, the detection accuracy can be enhanced by training models on large datasets of images. However, in the coal production process, coal gangue detection must be completed quickly to meet production efficiency requirements. Therefore, a lightweight network architecture is essential to optimize the detection algorithm, reduce computational complexity, and improve detection speed.

Traditional deep learning models can overfit the training data, especially when the dataset is small or not representative of real-world conditions, leading to poor generalization on new data; they therefore require large amounts of labeled data to perform well19,20,21. However, collecting and labeling sufficient high-quality coal gangue images is expensive and challenging. In addition, deploying models for real-time image recognition in mining operations demands efficient algorithms and sufficient processing power to ensure timely and accurate recognition22,23,24. YOLOv5 offers very fast inference while maintaining high accuracy, making it well suited to the real-time requirements of coal gangue detection25,26,27. Compared with other versions, the YOLOv5 model can be quickly deployed on resource-constrained devices without sacrificing much accuracy, meeting the lightweight requirements of coal gangue detection.

The identification of coal gangue targets based on deep learning requires first recognizing and then locating the target. The work therefore proceeds in the following steps: (1) Investigate the YOLOv5 model in deep learning to propose a method for coal gangue image recognition. (2) Preprocess the data, handle anomalies, and ensure that the images meet the requirements of the coal gangue recognition model; establish an image dataset from the coal gangue image data and the preprocessed image data. (3) Train the target recognition model on the dataset to locate coal gangue and annotate it with rectangular bounding boxes. (4) Use optimization methods to enhance target recognition and process the data images obtained through experiments. (5) Validate the model. The overview of the workflow is shown in Fig. 1.

Fig. 1

The flowchart of this work.

Problems were encountered during the experiment, such as selecting data images when building the dataset and choosing the optimization module to add to the recognition model. Some approaches were theoretically feasible and logically rigorous but impractical: the initially selected optimization module had no significant effect on recognition of the selected data images and increased the prediction time. To solve these problems, the parameters were iteratively corrected, and the multiple channel attention (MCA) mechanism28,29 and the lightweight content-aware re-assembly of features (CARAFE) up-sampling operator30,31 were added.

The YOLOv5 optimal model improves the precision (P) value for recognizing coal and rock from a baseline of 0.963 to 0.966, the recall (R) value from 0.954 to 0.959, and the mean average precision (mAP) value from 0.975 to 0.977. The results show that the confidence of the optimal model is significantly higher than that of the basic model, and the recognition effect is significantly improved.

Model

The experiment was based on the YOLOv5 model with improvements introduced in the feature enhancement stage, incorporating the MCA mechanism and the lightweight CARAFE up-sampling operator.

Principles of the YOLOv5 model

YOLOv5 is an object detection algorithm based on deep learning technology. It utilizes components such as the CSPDarknet53 backbone network, feature pyramid structure, lightweight detection head, and anchor boxes, among others, to achieve efficient object detection32,33. The model performs forward propagation to compute bounding boxes and class confidence scores, optimized through improved activation functions and loss functions, ultimately achieving high detection speed and accuracy.

Backbone network: YOLOv5 uses CSPDarknet53 as its backbone network, known for its lightweight Darknet architecture with high performance and computational efficiency. The network employs the cross-stage partial (CSP) network structure to split and process input feature maps in parallel, enhancing information propagation efficiency and feature reuse.

Feature pyramid: a structure for multiscale object detection and image segmentation. YOLOv5 incorporates a feature pyramid to fuse feature maps from different levels, enabling the detection of objects at different scales. Detecting objects across different feature map levels improves the model’s capability to handle objects of varying sizes.

Detection head: YOLOv5 adopts a lightweight detection head structure responsible for generating detection bounding boxes and class confidence scores. The detection head consists of convolutional layers and activation functions that predict bounding box positions and class probabilities.

Anchor boxes: predefined bounding boxes from which the model predicts offsets and scale adjustments to fit object shape and size. YOLOv5 integrates anchor boxes to enhance the model’s detection capabilities across different object shapes and scales34.
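As an illustration of how anchor boxes are used, the decoding step below maps raw network outputs and an anchor to a box, following the scheme commonly used in YOLOv5 releases; this is a sketch, not the authors’ code, and exact formulas may vary between versions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """Map raw offsets (tx..th), grid cell (cx, cy) and an anchor to a box,
    following the decoding commonly used in YOLOv5."""
    bx = (2.0 * sigmoid(tx) - 0.5 + cx) * stride    # centre x in pixels
    by = (2.0 * sigmoid(ty) - 0.5 + cy) * stride    # centre y in pixels
    bw = (2.0 * sigmoid(tw)) ** 2 * anchor_w        # width, scaled from anchor
    bh = (2.0 * sigmoid(th)) ** 2 * anchor_h        # height, scaled from anchor
    return bx, by, bw, bh

# Zero offsets place the box at the cell centre with exactly the anchor size.
print(decode_box(0, 0, 0, 0, cx=10, cy=10, anchor_w=32, anchor_h=32, stride=8))
# → (84.0, 84.0, 32.0, 32.0)
```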

MCA attention mechanism

The MCA mechanism is an attention mechanism used in deep learning models. It aims to enhance the model’s learning ability to correlate features across different channels, thereby improving its performance on specific tasks. The core idea is to dynamically learn the importance of each channel in the feature map using attention mechanisms. This mechanism then integrates these weighted features to extract richer and more effective feature representations. By effectively capturing channel correlations in image features, MCA enhances the representation capability of deep learning models.

In traditional attention mechanisms, attention weights are typically computed in the spatial dimension. In contrast, MCA focuses on weighting attention across channel dimensions. This approach allows the model to flexibly learn correlations between different channels, thereby improving its ability to represent input data. In practice, MCA typically involves the following steps: (1) Channel segmentation: First, the input features are segmented into multiple channels, each containing a set of features. (2) Compute attention weights: For each channel, attention weights are computed using an attention mechanism. Typically, this involves linear transformations of features within the channel to obtain the attention distribution. (3) Weighted feature fusion: Multiply the attention weights of each channel by the features within that channel. Sum these weighted features across all channels to obtain the final weighted feature representation. (4) Parallel computation with multiple heads: Multiple attention heads are often introduced to enhance representation capability. These heads compute attention weights and weighted feature representations in parallel. The outputs from these multiple heads are then concatenated or aggregated to obtain the final output feature representation.
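The steps above can be sketched in NumPy. This is an illustrative stand-in, not the exact MCA formulation of the cited papers28,29: the learned linear transforms are replaced by seeded random matrices.

```python
import numpy as np

def channel_attention(x, num_heads=2):
    """Illustrative multi-head channel attention over a (C, H, W) feature
    map: split channels into heads, weight each channel by a softmax
    attention score, and concatenate the head outputs."""
    rng = np.random.default_rng(0)                  # placeholder for learned weights
    c, h, w = x.shape
    group = c // num_heads
    outputs = []
    for i in range(num_heads):                      # (4) parallel heads
        xi = x[i * group:(i + 1) * group]           # (1) channel segmentation
        desc = xi.mean(axis=(1, 2))                 # global descriptor per channel
        W = rng.standard_normal((group, group)) * 0.1
        attn = np.exp(W @ desc)                     # (2) attention weights
        attn /= attn.sum()                          # softmax over the group
        outputs.append(xi * attn[:, None, None])    # (3) weighted feature fusion
    return np.concatenate(outputs, axis=0)

y = channel_attention(np.ones((8, 4, 4)))
print(y.shape)  # (8, 4, 4)
```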

Lightweight CARAFE up-sample operator

The lightweight CARAFE up-sampling operator is a content-aware up-sampling algorithm aimed at enhancing the efficiency and accuracy of deep learning models during up-sampling35. Compared with traditional up-sampling methods such as bilinear interpolation or transposed convolution, CARAFE offers low computational overhead and higher up-sampling quality. Its core idea is to predict a reassembly kernel for each up-sampled position from the content of the feature map itself and to reassemble features within a local neighbourhood, preserving detailed information in the feature maps.

Specifically, CARAFE achieves more accurate and detailed up-sampling results by reassembling features from local receptive fields during the up-sampling process. This content-aware reassembly allows CARAFE to better preserve the semantic information and spatial structure of feature maps, avoiding the blurring and distortion that traditional up-sampling methods can introduce. In practice, CARAFE typically involves the following steps: (1) Channel compression: the input feature map is compressed along the channel dimension with a 1 × 1 convolution to reduce the cost of kernel prediction. (2) Kernel prediction: a content encoder predicts a reassembly kernel for each up-sampled position from the compressed features. (3) Kernel normalization: each predicted kernel is normalized with a softmax function so that its weights sum to one, ensuring accuracy and fidelity in reassembly. (4) Content-aware reassembly: the feature at each up-sampled position is computed as the weighted sum, under its normalized kernel, of the features in a local neighbourhood around the corresponding source position.
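The content-aware reassembly at the heart of CARAFE can be sketched in NumPy. The learned kernel-prediction module is out of scope here, so the softmax-normalised reassembly kernels are taken as an input:

```python
import numpy as np

def carafe_reassemble(x, kernels, scale=2, k=3):
    """Reassemble a (C, H, W) feature map into (C, scale*H, scale*W).
    `kernels` holds one normalised k-by-k kernel per up-sampled position,
    shape (scale*H, scale*W, k, k); in the real operator these are
    predicted from the feature content by a small learned module."""
    c, h, w = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros((c, scale * h, scale * w))
    for i in range(scale * h):
        for j in range(scale * w):
            si, sj = i // scale, j // scale          # source position in x
            patch = xp[:, si:si + k, sj:sj + k]      # local neighbourhood
            out[:, i, j] = (patch * kernels[i, j]).sum(axis=(1, 2))
    return out

# Uniform kernels reduce CARAFE to plain local averaging.
x = np.ones((1, 4, 4))
kern = np.full((8, 8, 3, 3), 1.0 / 9.0)
print(carafe_reassemble(x, kern).shape)  # (1, 8, 8)
```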

Optimized model based on YOLOv5

The YOLOv5 model has been enhanced to address the challenges of coal gangue image recognition tasks. First, the lightweight CARAFE up-sampling operator was adopted. It effectively enlarges the receptive field and improves the utilization of semantic information from feature maps, allowing the model to maintain detection accuracy while reducing computational complexity and accelerating recognition.

Next, the MCA attention mechanism is introduced during the feature enhancement phase. The MCA mechanism aids the model in integrating high and low-level feature information more effectively, thereby enhancing feature representation and robustness. By incorporating MCA modules into the backbone network, spatial position information encoding is shared, facilitating the fusion of high and low-level feature information. This enhancement improves the network’s feature extraction capability, enabling more accurate localization of target information and further enhancing recognition capability36.

The optimized model37 comprises the backbone network, neck network, and prediction network, as illustrated in Fig. 2. MCA modules are integrated into the neck network, while CARAFE modules replace the original up-sampling modules in the backbone. The placement of MCA modules is meticulously adjusted to enhance feature extraction by integrating information across channels, horizontal spatial dimensions, and vertical spatial dimensions. This refinement assists the backbone network in precisely locating target information and enhances its recognition capability.

Fig. 2

The flowchart of the optimized model based on YOLOv5.

Experiment and analysis

Collecting and preprocessing of coal gangue image data

The pictures of coal and gangue in the manuscript were captured with our own camera. We collected 3200 different pictures of coal and gangue for model training and testing. The samples are illustrated in Fig. 3.

Fig. 3

The raw images of coal gangue.

Augmentation methods were applied to the original coal gangue images to enlarge the dataset. Data labeling involves annotating image data by adding information about the location and category of the targets in each image, facilitating model training and evaluation. Before training, the coal gangue images need to be labeled. The LabelImg tool was used for annotation, as shown in Fig. 4.
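The specific augmentation methods are not enumerated in this section; a minimal sketch of common geometric augmentations in NumPy is shown below (when flips or rotations are used, the bounding-box labels must be transformed consistently):

```python
import numpy as np

def augment(img):
    """Return simple augmented variants of an (H, W, C) image array.
    Flips and 90-degree rotations are typical choices; the exact set
    used in this work is an assumption."""
    return [
        img,
        np.fliplr(img),      # horizontal flip
        np.flipud(img),      # vertical flip
        np.rot90(img, k=1),  # rotate 90 degrees
        np.rot90(img, k=2),  # rotate 180 degrees
    ]

variants = augment(np.zeros((64, 48, 3), dtype=np.uint8))
print(len(variants))  # 5
```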

Fig. 4

Annotation of coal and gangue dataset using LabelImg tool.

After annotation, the training-set and validation-set images were placed into the ‘images’ folders under the ‘train’ and ‘val’ directories, respectively, and the corresponding label files into the ‘labels’ folders. This completes the creation of the target dataset, as shown in Fig. 5.
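The folder layout can be created with a short script. The sketch below uses the images/{train,val} nesting that YOLOv5 also accepts; the root path and the class names in data.yaml are illustrative assumptions:

```python
from pathlib import Path

root = Path("gangue_dataset")  # hypothetical dataset root
for sub in ("images/train", "images/val", "labels/train", "labels/val"):
    (root / sub).mkdir(parents=True, exist_ok=True)

# YOLOv5 pairs each image with a same-named .txt label file and locates
# labels by replacing 'images' with 'labels' in the path.
(root / "data.yaml").write_text(
    "train: images/train\n"
    "val: images/val\n"
    "nc: 2\n"
    "names: ['coal', 'gangue']\n"
)
print(sorted(p.name for p in root.iterdir()))  # ['data.yaml', 'images', 'labels']
```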

Fig. 5

The dataset classification.

Experiment details

The operating system used for this experiment is Windows 10, with an Intel(R) Core(TM) i7-8700 CPU as the core processor and an NVIDIA RTX 2070 as the graphics processor (GPU). The development framework includes Python 3.8.5, CUDA 10.2.89, cuDNN 7.6.5, and PyTorch 1.6.0. Using transfer learning, the dataset is employed to train the model and obtain pre-trained weight parameters. The batch size for model training is set to 16, and the momentum and weight decay are set to 0.934 and 0.0005, respectively. The optimizer is SGD, with an initial learning rate of 1 × 10−2. The parameters are detailed in Table 1.

Table 1 Model training parameters.
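The settings above can be gathered into a single configuration. Values not stated in the text (e.g. epochs, image size) are omitted, and the command-line flags are typical of YOLOv5’s train.py rather than quoted from the authors’ setup:

```python
# Training settings taken from the text; SGD with lr0 = 1e-2.
hyp = {
    "optimizer": "SGD",
    "batch_size": 16,
    "lr0": 1e-2,            # initial learning rate
    "momentum": 0.934,
    "weight_decay": 0.0005,
}

# Illustrative YOLOv5-style invocation built from the settings above.
cmd = (f"python train.py --batch-size {hyp['batch_size']} "
       f"--optimizer {hyp['optimizer']}")
print(cmd)  # python train.py --batch-size 16 --optimizer SGD
```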

This paper evaluates the detection performance of each model on the coal gangue test set. For each test image, precision (P) and recall (R) are calculated by comparing the detection results with the ground-truth labels. Further metrics include the F1 score for each class (the harmonic mean of P and R) and the average precision (AP) for each class, which is the area under the precision-recall curve. The mean average precision (mAP) is the average of AP across all classes, providing an overall measure of the model’s detection performance in complex scenarios. Computational complexity is indicated by the number of algorithm parameters (Par, in Mb) and FLOPs (floating-point operations, in G): higher Par values imply longer training and inference times, and FLOPs give the total number of floating-point operations required to process one input instance; reducing FLOPs improves the model’s speed and efficiency. Inference efficiency is measured as the average inference time per image on the test set (in ms), all computed on an RTX 2070.

Here TP represents true positive predictions, TN represents true negative predictions, FP represents false positive predictions, and FN represents false negative predictions, the calculations are as follows:

$$P = \frac{TP}{{TP + FP}}$$
(1)
$$R = \frac{TP}{{TP + FN}}$$
(2)
$$F_{1} = 2 \times \frac{P \times R}{{P + R}}$$
(3)
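Equations (1)-(3) computed directly from detection counts:

```python
def precision_recall_f1(tp, fp, fn):
    """P, R and F1 as defined in Eqs. (1)-(3)."""
    p = tp / (tp + fp)            # Eq. (1)
    r = tp / (tp + fn)            # Eq. (2)
    f1 = 2 * p * r / (p + r)      # Eq. (3)
    return p, r, f1

# e.g. 90 correct detections, 10 false alarms, 5 missed targets
print(precision_recall_f1(90, 10, 5))  # ≈ (0.9, 0.947, 0.923)
```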

Experimental results

This experiment is divided into four sets of data: YOLOv5 basic model experiment, YOLOv5-MCA experiment, YOLOv5-CARAFE experiment, and YOLOv5 optimal model experiment. Each experiment tested 310 images, with 2472 images used for training and 309 for validation.

During validation, different types of coal gangue were identified under varying backgrounds. Each image contained numerous coal gangue pieces of different sizes and arrangements. Through the experiments, all targets were successfully detected, validating the model’s ability to simultaneously detect multiple types of coal gangue. Figure 6 demonstrates the recognition performance of the YOLOv5 optimal model.

Fig. 6

Recognition results of YOLOv5 optimal model.

Results analysis

This study conducted a statistical analysis of experimental results, including P, R, and mAP values for three categories: coal, rock, and all. Additionally, it evaluated Par values, FLOPs, and time values (prediction time per single image). The experimental data results from the four experiments are summarized, with partial weight results shown in Table 2 and Fig. 7.

Table 2 Partial weight results.
Fig. 7

The bar chart of partial weight results.

The experimental results are shown in Table 3 and Fig. 8. From top to bottom are the four groups ranging from the basic model to the optimal model. The YOLOv5 optimal model improves the P value for recognizing coal and rock from a baseline of 0.963 to 0.966, with a relative improvement of 0.31% ((0.966 − 0.963)/0.963 × 100% ≈ 0.31%). The R value improves from 0.954 to 0.959, corresponding to an improvement of 0.52% ((0.959 − 0.954)/0.954 × 100% ≈ 0.52%), indicating a reduced omission rate in recognizing coal and rock, thereby detecting all relevant targets more comprehensively. The mAP value improves from 0.975 to 0.977, with a relative improvement of 0.2% ((0.977 − 0.975)/0.975 × 100% ≈ 0.20%). However, due to the increased model complexity, the prediction time per single image has slightly increased. The experimental results demonstrate that the design of this optimal model structure is reasonable, and its recognition performance has been improved.
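The relative improvements quoted above follow directly from the table values:

```python
def rel_improvement(baseline, new):
    """Relative improvement in percent."""
    return (new - baseline) / baseline * 100

print(round(rel_improvement(0.963, 0.966), 2))  # P:   0.31
print(round(rel_improvement(0.954, 0.959), 2))  # R:   0.52
print(round(rel_improvement(0.975, 0.977), 2))  # mAP: 0.21 (≈ 0.2)
```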

Table 3 Experimental data results.
Fig. 8

The bar chart of experimental results.

In four verification experiments, the recognition results for the same image are compared, as shown in Fig. 9. The YOLOv5 basic, YOLOv5-MCA, YOLOv5-CARAFE, and YOLOv5 optimal models are shown in Fig. 9a–d, respectively. The results show that the confidence of the optimal model is significantly higher than that of the basic model and that the recognition effect is significantly improved, indicating that the structure of the optimized model is reasonable and its recognition ability is feasible.

Fig. 9

Experimental verification comparison chart showing results of (a) YOLOv5 basic, (b) YOLOv5-MCA, (c) YOLOv5-CARAFE, and (d) YOLOv5 optimal.

Conclusions

Based on the construction of a coal gangue image dataset, this research integrates deep learning theories and methodologies to achieve accurate identification of targets within coal gangue images. For target recognition, convolutional neural networks, particularly the YOLOv5 optimal model, are employed. Ample, high-quality data for model training is ensured through data preprocessing and annotation. The novel YOLOv5 optimal model is proposed by adding the MCA attention mechanism and the lightweight CARAFE up-sampling operator. Experimental tests, including training, testing, and prediction on the dataset, show that the optimal model achieves the expected design goals. The YOLOv5 optimal model improves the precision (P) value for recognizing coal and rock from a baseline of 0.963 to 0.966, a relative improvement of 0.31%; the recall (R) value from 0.954 to 0.959, an improvement of 0.52%; and the mean average precision (mAP) value from 0.975 to 0.977, a relative improvement of 0.2%. The results can be utilized to identify coal and coal gangue accurately and quickly, with notable improvements in recognition effectiveness.