Introduction

At present, intelligent coal gangue sorting robots have achieved sorting of different types of coal gangue to a certain extent, but some challenges remain in practical applications1,2,3,4. During sorting, the coal gangue sorting robot recognizes the target coal gangue and obtains its pose information through the recognition system5,6,7. The robot control system combines the pose information, belt speed, and time to calculate the real-time position of the target coal gangue, thereby achieving robotic sorting. However, the actual working environment is complex. As the gangue moves from the recognition area to the grasping area, factors such as belt slippage, deviation, and speed fluctuations change the position and posture of the gangue. Existing methods cannot precisely locate the coal gangue, resulting in empty grasps, missed grasps, and even grasping failures of the manipulator, all of which seriously affect the sorting efficiency of the robot. To solve this problem, the target coal gangue must be accurately located before the robot performs the sorting action. Therefore, by using visual methods to match the target coal gangue recognition image provided by the recognition system with the sorting image collected during sorting, precise positioning of the target coal gangue can be achieved. This can further improve the sorting accuracy and efficiency of the robot.

The research contributions of our work can be summarized as follows:

  1. A coal gangue image matching method combining the SuperPoint and SuperGlue networks is proposed. Applied to the intelligent robotic coal gangue sorting scenario, it provides rich feature information and strong feature matching capability.

  2. The improved MSRCR image enhancement algorithm is used to improve the stability and applicability of SuperPoint feature point detection under complex conditions.

  3. SuperGlue significantly enhances the feature representation and matching capabilities of the network. The matched features are further refined and filtered with PROSAC, thereby ensuring the precision of image matching even under complex conditions.

Next, the related work on image matching in robot sorting scenarios is introduced in Section “Related work”. Section “Proposed method” presents the proposed framework. Section “Experiment” introduces the performance indicators and comparative analysis of the image matching methods, and Section “Discussion and conclusion” presents the conclusions.

Related work

This section introduces related work on the application of image matching in sorting scenarios.

With the widespread application of machine vision in various industrial scenarios, scholars have made progress in research on image matching methods for sorting systems. Yang et al. proposed an improved edge feature template matching algorithm for the rapid sorting of multi-objective workpieces, which achieves high real-time performance when combined with a pyramid layering strategy8. Yao et al. proposed a fast part matching method based on Hu invariant moment features and improved Harris corner points to address the long matching time and low precision of part image matching; this method improves matching speed, but the Harris corner detection operator performs well only at a fixed image scale9. Wang et al. proposed an improved ORB image matching algorithm for the fast recognition and sorting of workpieces with complex surface textures10. It replaces the BRIEF descriptor with the SURF descriptor, enhancing robustness to lighting and image scale changes and improving the matching accuracy of workpiece images under scale variation.

The above traditional image matching methods based on SIFT11,12, SURF13,14,15, BRIEF16,17, and ORB18,19,20 all rely on manually designed descriptors. Although they are effective to a degree, they still have shortcomings in the on-site environment of an intelligent coal gangue sorting robot. For example, for coal gangue images under complex lighting and scale changes, these methods cannot effectively extract stable feature points or feature vectors. Moreover, template-based matching is relatively inefficient, whereas deep learning can extract high-level semantic information from images.

To address the above problems, DeTone et al. proposed the SuperPoint network in 2018, an end-to-end learning approach that takes images as input and outputs feature points and descriptors21. SuperPoint is not only accurate but also has a simple structure that can extract key points in real time. However, in the feature-matching stage, traditional methods often rely on minimizing or maximizing specific hand-crafted metrics, such as squared differences and correlations. The typical Fast Library for Approximate Nearest Neighbors (FLANN) matching method has poor robustness when matching image pairs lacking texture. SuperGlue, a graph neural network feature matching method, provided a new approach to the matching problem22,23,24,25. The SuperGlue network is a mid-level matching network that must be combined with a feature extraction network to fully leverage its performance.

The SuperPoint network extracts highly robust image feature points, while the SuperGlue network is robust to lighting changes, blurring, and texture-poor scenes. We combine these two networks to achieve a significant improvement in the matching accuracy of complex images, and apply this combination to robotic coal gangue sorting in complex scenarios.

Proposed method

Motivation

We conducted research on coal gangue image matching under complex conditions by combining the SuperPoint and SuperGlue networks. During this research, based on the actual working conditions of coal gangue sorting, we found that further optimization can be carried out in the following two aspects.

  1. Due to differences in object distance, scale, and rotation angle of the sorting images acquired in different regions, it is difficult to detect enough key points for matching. To improve the ability to detect feature points, the sorting images are enhanced using the improved MSRCR algorithm.

  2. The sorting process of coal gangue is complex. Coal slag, water stains, and other factors on the belt can seriously interfere with the matching results, reducing the number of matching points and producing incorrect matching pairs. Therefore, the matching results are optimized with PROSAC to improve the matching accuracy.

Therefore, to solve the problem of image matching of coal gangue under complex conditions, we propose a more robust image matching network called IMSSP-Net.

Model construction

The coal gangue sorting robot obtains the recognition result of the target coal gangue from the recognition system as the recognition image. After the coal gangue enters the sorting area, a camera captures the current image as the sorting image. Figure 1 illustrates the overall structure of our network. Firstly, the improved MSRCR algorithm enhances the two images separately. Then, the SuperPoint algorithm extracts feature points and their descriptors from the recognition image and the sorting image. These are then used as input to the SuperGlue algorithm to perform feature matching between the images. Finally, the obtained matching points are filtered using PROSAC to obtain the best matching point pairs. The precise location of the target coal gangue is obtained by framing the matching results with their minimum bounding rectangle.

Fig. 1
figure 1

Overview of the IMSSP-Net method. Images 1 and 2 in the figure represent the original recognition image and the original sorting image, respectively. Iretinex is the image enhanced by MSRCR. Equation (1) normalizes Iretinex and outputs Iout1. Equation (2) implements the power-law transformation of Iretinex and outputs Iretinexγ. Normalizing Iretinexγ using Eq. (1) gives Iout2. Equation (4) performs the brightness adjustment.

Improved MSRCR algorithm and feature extraction

To improve the network's ability to detect feature points in coal gangue images, we use the improved MSRCR for image enhancement and then apply the SuperPoint algorithm to detect the features of the image.

Improved MSRCR algorithm

The MSRCR is a commonly used image enhancement technique that enhances the details and contrast of images while preserving edge information. Its principle involves decomposing the image into different scales, followed by local contrast adjustment and enhancement; finally, the enhanced image is reconstructed. The normalization formula is employed for intensity mapping, normalizing the Retinex components to scale the intensity range of the image. This achieves overall adjustment of image contrast and brightness. The relationship is expressed as follows:

$$I_{\text{out1}} = \frac{I_{\text{retinex}} - \min (I_{\text{retinex}})}{\max (I_{\text{retinex}}) - \min (I_{\text{retinex}})} \times (H_{\text{clip}} - L_{\text{clip}}) + L_{\text{clip}}$$
(1)

In the equation, Iout1 and Iretinex represent the pixel values of the output image and of the image enhanced by the MSRCR algorithm, respectively. max(Iretinex) and min(Iretinex) represent the maximum and minimum pixel values of the MSRCR-enhanced image, and Lclip and Hclip respectively indicate the lower and upper limits of the grayscale levels.
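A minimal NumPy sketch of the normalization in Eq. (1) is given below; it is an illustration only, the variable names are ours, and the MSRCR output is assumed to be a float array.

```python
import numpy as np

def normalize_retinex(i_retinex: np.ndarray, l_clip: float = 0.0,
                      h_clip: float = 255.0) -> np.ndarray:
    """Eq. (1): linearly map the MSRCR output into the [L_clip, H_clip] range."""
    i_min, i_max = i_retinex.min(), i_retinex.max()
    return (i_retinex - i_min) / (i_max - i_min) * (h_clip - l_clip) + l_clip
```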

Power-law transformation is a non-linear image enhancement method that alters the pixel value distribution in order to highlight details and features in the image26. It is based on a simple mathematical principle: original pixel values are mapped through an exponential function to a new pixel value range, thereby changing the shape of the original pixel value distribution. The relationship is expressed as follows:

$$I_{\text{retinex}}^{\gamma} = \left( I_{\text{retinex}} \right)^{\gamma}$$
(2)

In the equation, Iretinex represents the image enhanced by MSRCR, Iretinexγ represents the result of its power-law transformation, and γ is the exponent of the power-law transformation. When γ > 1, the pixel values of the output image become more concentrated, emphasizing details. When 0 < γ < 1, the pixel values become more dispersed, resulting in smoother details. When γ = 1, the power-law transformation has no effect.

Integrating the power-law transformation of Eq. (2) into the normalization formula of MSRCR in Eq. (1), the improved MSRCR algorithm is as follows:

$$I_{\text{out2}} = \frac{I_{\text{retinex}}^{\gamma} - \min (I_{\text{retinex}}^{\gamma})}{\max (I_{\text{retinex}}^{\gamma}) - \min (I_{\text{retinex}}^{\gamma})} \times (H_{\text{clip}} - L_{\text{clip}}) + L_{\text{clip}}$$
(3)

Then Eq. (4) is used to adjust the brightness level range of the output image.

$$I_{{{\text{out}}}} = \left( {I_{{{\text{out2}}}} } \right)^{\beta }, \beta = \frac{{I_{{{\text{out1}}}} }}{{max(I_{{{\text{out1}}}} )}}$$
(4)

In the equation, β represents the enhancement factor, which is used to adjust the brightness level range of the output image; its expression is given in Eq. (4).

The IMSRCR and MSRCR methods were used to enhance the recognition and sorting images respectively, and the results are shown in Fig. 2. Hclip and Lclip in Eq. (1) are written in general form; in the experiment, they are set to 255 and 0, respectively.
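A sketch of the full IMSRCR post-processing chain (Eqs. (1)–(4)) is shown below, reusing normalize_retinex from the earlier sketch. The MSRCR output itself is assumed to be computed elsewhere, and γ = 1.5 is only an illustrative value, not the setting used in this paper.

```python
import numpy as np

def imsrcr_postprocess(i_retinex: np.ndarray, gamma: float = 1.5,
                       l_clip: float = 0.0, h_clip: float = 255.0) -> np.ndarray:
    """Improved MSRCR post-processing sketch following Eqs. (1)-(4)."""
    i_retinex = np.clip(i_retinex, 0, None)                 # power law assumes non-negative values
    i_out1 = normalize_retinex(i_retinex, l_clip, h_clip)   # Eq. (1)
    i_gamma = np.power(i_retinex, gamma)                    # Eq. (2): power-law transformation
    i_out2 = normalize_retinex(i_gamma, l_clip, h_clip)     # Eq. (3): normalize the transformed image
    beta = i_out1 / i_out1.max()                            # Eq. (4): per-pixel enhancement factor
    return np.power(i_out2, beta)                           # Eq. (4): brightness adjustment
```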

Fig. 2
figure 2

The visualization of image enhancement results. (a) The original recognition image. (b) The MSRCR enhanced recognition image. (c) The IMSRCR enhanced recognition image. (d) The original sorting image. (e) The MSRCR enhanced sorting image. (f) The IMSRCR enhanced sorting image.

To evaluate the quality of the enhanced image, this article uses two widely used image quality evaluation metrics, the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM). The results are shown in Table 1.

Table 1 The comparison of image enhancement.

It can be seen that compared with the MSRCR method, the improved method has significantly improved PSNR and SSIM. The IMSRCR image has a more balanced global grayscale, clearer contrast, and richer details.
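Both metrics can be computed with scikit-image, for example as below; this is a sketch, the file names are hypothetical, and the images are assumed to be 8-bit and of equal size.

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Hypothetical file names for the original and IMSRCR-enhanced recognition image.
original = cv2.imread("recognition.png", cv2.IMREAD_GRAYSCALE)
enhanced = cv2.imread("recognition_imsrcr.png", cv2.IMREAD_GRAYSCALE)

psnr = peak_signal_noise_ratio(original, enhanced, data_range=255)
ssim = structural_similarity(original, enhanced, data_range=255)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```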

Feature extraction

The SuperPoint algorithm differs from traditional feature detection algorithms in that it simultaneously detects image key points and extracts their descriptors27. It uses a self-supervised convolutional neural network implemented with an encoder–decoder architecture. In the SuperPoint algorithm, the input image first passes through a VGG-style encoder to obtain a feature map. The feature map is then decoded by two branches, one for key point detection and the other for descriptor extraction. The convolutional layers of the SuperPoint network are listed in Table 2.

Table 2 The convolutional layer in SuperPoint network.

The key point detection branch adopts a sliding window approach to calculate, pixel by pixel on the feature map, the probability of being a key point. Each window is typically 8 × 8 pixels; every pixel is evaluated and a key point probability map is generated. This probability map is a 3D tensor of size H/8 × W/8 × 65, where H and W are the height and width of the image, and each cell corresponds to the key point probabilities of an 8 × 8 pixel area (the 65th channel indicates "no key point"). The final key point positions are determined through thresholding and connectivity analysis. Its structure is shown in Fig. 3.

Fig. 3
figure 3

The key point decoder.
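A sketch of how the 65-channel detection output can be turned into a full-resolution key point probability map is given below (channel-first tensors are used here for convenience; the 65th channel is treated as the "no key point" bin, following the published SuperPoint design).

```python
import torch
import torch.nn.functional as F

def keypoint_heatmap(semi: torch.Tensor) -> torch.Tensor:
    """semi: (65, H/8, W/8) raw output of the key point detection branch.
    Returns an (H, W) key point probability map to be thresholded."""
    probs = F.softmax(semi, dim=0)[:-1]                    # drop the "no key point" channel
    hc, wc = probs.shape[1], probs.shape[2]
    probs = probs.permute(1, 2, 0).reshape(hc, wc, 8, 8)   # each cell covers an 8x8 pixel block
    return probs.permute(0, 2, 1, 3).reshape(hc * 8, wc * 8)


heatmap = keypoint_heatmap(torch.rand(65, 45, 60))         # e.g. a 360 x 480 input image
print(heatmap.shape)                                       # torch.Size([360, 480])
```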

The descriptor extraction branch is responsible for extracting a descriptor for each key point from the feature map. This branch converts the feature map into a three-dimensional tensor of size H/8 × W/8 × D, where D is the dimension of the descriptor, extracting local features through a series of convolution and pooling operations. To obtain the final descriptors, the coarse descriptor map is upsampled by interpolation and normalized, transforming it into an H × W × D tensor that contains the descriptor of each pixel. Its structure is shown in Fig. 4.

Fig. 4
figure 4

The descriptor decoder.
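A corresponding sketch of the descriptor branch post-processing: the coarse H/8 × W/8 × D map is upsampled (bicubic interpolation is used here) and L2-normalized to give one unit-length descriptor per pixel.

```python
import torch
import torch.nn.functional as F

def dense_descriptors(coarse_desc: torch.Tensor, img_h: int, img_w: int) -> torch.Tensor:
    """coarse_desc: (D, H/8, W/8) output of the descriptor branch.
    Returns a (D, H, W) map of L2-normalized per-pixel descriptors."""
    desc = F.interpolate(coarse_desc.unsqueeze(0), size=(img_h, img_w),
                         mode="bicubic", align_corners=False)
    desc = F.normalize(desc, p=2, dim=1)                   # unit length along the descriptor axis
    return desc.squeeze(0)


desc = dense_descriptors(torch.rand(256, 45, 60), 360, 480)
print(desc.shape)                                          # torch.Size([256, 360, 480])
```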

The SuperPoint network was used to extract feature points from the original, MSRCR-enhanced, and improved-MSRCR-enhanced images, as shown in Fig. 5. The colored dots in Fig. 5 represent the detected feature points. By comparison, the improved MSRCR method further enriches the color information of the image and yields significantly more feature points than the other two methods, which benefits the subsequent matching.

Fig. 5
figure 5

The Visualization of feature point detection. (a) Feature detection in original recognition image. (b) Feature detection in the MSRCR enhanced recognition image. (c) Feature detection in the IMSRCR enhanced recognition image. (d) Feature detection in original sorting image. (e) Feature detection in the MSRCR enhanced sorting image. (f) Feature detection in the IMSRCR enhanced sorting image.

Feature matching and optimization

After completing the feature point extraction of coal gangue recognition images and sorting images, we need to match the feature points and optimize the matching results.

SuperGlue feature matching

After using SuperPoint to complete the feature detection of coal gangue recognition images and sorting images, it is necessary to match the feature points of the two images through feature matching methods to obtain the target coal gangue position. This article adopts the SuperGlue algorithm, which comprehensively considers the position information of feature points and visual appearance information to improve the accuracy and robustness of matching.

SuperGlue is a feature matching algorithm based on a Graph Neural Network (GNN), which constructs a graph over the feature points and uses graph convolution for feature aggregation and matching. Figure 6 illustrates its network structure, which mainly consists of two parts. The first half combines self-attention and cross-attention mechanisms to extract more discriminative matching descriptors; it significantly improves the accuracy and robustness of feature matching by repeatedly strengthening the feature vector f (Eq. (7)). The optimal matching layer in the second half uses inner products to obtain the score matrix and iteratively optimizes it with the Sinkhorn algorithm to obtain the optimal assignment.

Fig. 6
figure 6

The architecture of SuperGlue.

The first part of the attention GNN is the key point encoder, which raises the dimensionality of the feature points. Each key point combines its position and score information, and a multi-layer perceptron (MLP) embeds the position information into a high-dimensional vector, as shown in Eq. (5). Specifically, the input is the key point p and its corresponding descriptor d extracted by the feature extraction network, where p contains the coordinates and confidence of the feature point. The key points are converted into 256-dimensional data through five fully connected layers, with channel numbers of 3, 32, 64, 128, and 256, respectively, as shown in Fig. 7.

$$x_{i}^{(0)} = d_{i} + MLP(p_{i} )$$
(5)

Among them, di and pi are the i-th descriptor and feature point, respectively.

Fig. 7
figure 7

The Key point encoder.
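Based on Eq. (5) and the layer sizes in Fig. 7, a minimal PyTorch sketch of the key point encoder could look like the following; the exact layer configuration of the released SuperGlue code may differ.

```python
import torch
import torch.nn as nn

class KeypointEncoder(nn.Module):
    """Sketch of the key point encoder in Eq. (5): x_i = d_i + MLP(p_i)."""

    def __init__(self) -> None:
        super().__init__()
        dims = [3, 32, 64, 128, 256]                    # channel sizes from Fig. 7
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(inplace=True)]
        self.mlp = nn.Sequential(*layers[:-1])          # no activation after the last layer

    def forward(self, kpts_xy: torch.Tensor, scores: torch.Tensor,
                desc: torch.Tensor) -> torch.Tensor:
        # kpts_xy: (N, 2) coordinates, scores: (N,) confidences, desc: (N, 256) descriptors
        p = torch.cat([kpts_xy, scores.unsqueeze(-1)], dim=-1)  # (N, 3)
        return desc + self.mlp(p)                                # Eq. (5)


# Example: 100 key points with 256-dimensional SuperPoint descriptors.
enc = KeypointEncoder()
x0 = enc(torch.rand(100, 2), torch.rand(100), torch.rand(100, 256))
print(x0.shape)  # torch.Size([100, 256])
```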

The second part of the attention GNN is the graph attention mechanism, which involves two types of undirected edges: intra-image edges εself and cross-image edges εcross. The former connect feature points within the same image, while the latter connect feature points i across the two images to be matched (denoted as A and B). Here (ℓ)xiA denotes the intermediate representation of the i-th element on image A at layer ℓ. The message mε→i aggregates information from all feature points {j : (i, j) ∈ ε}, where ε ∈ {εself, εcross}. From this, each feature i in image A is updated with residual information:

$${}^{(\ell + 1)}x_{i}^{A} = {}^{(\ell)}x_{i}^{A} + {\text{MLP}}\left( \left[ {}^{(\ell)}x_{i}^{A} \, \middle\| \, m_{\varepsilon \to i} \right] \right)$$
(6)

Among them, [· ‖ ·] denotes the concatenation operation and MLP denotes a multi-layer perceptron. When the layer index ℓ is odd, ε = εcross; when ℓ is even, ε = εself. By alternating between the self- and cross-attention messages mε, the network simulates the process of a human repeatedly looking back and forth between two images until differences or similarities are discovered. The final feature description vector used for matching is given in Eq. (7).

$$f_{i}^{A} = W \cdot {}^{(L)}x_{i}^{A} + b$$
(7)

In the above equation, W represents the weight and b the bias.
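The residual update of Eq. (6) followed by the final projection of Eq. (7) can be sketched as below; the attention message m is assumed to come from a self- or cross-attention layer (alternating with the layer index), and the attention computation itself is omitted.

```python
import torch
import torch.nn as nn

class ResidualMessageUpdate(nn.Module):
    """Sketch of Eq. (6): x_i <- x_i + MLP([x_i || m_i])."""

    def __init__(self, dim: int = 256) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim), nn.ReLU(inplace=True),
            nn.Linear(2 * dim, dim),
        )

    def forward(self, x: torch.Tensor, message: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(torch.cat([x, message], dim=-1))   # Eq. (6)


# Final projection of Eq. (7): f_i = W x_i^(L) + b.
final_proj = nn.Linear(256, 256)

x = torch.rand(100, 256)    # intermediate representations of 100 key points
m = torch.rand(100, 256)    # aggregated attention messages (placeholder values)
f = final_proj(ResidualMessageUpdate()(x, m))   # matching descriptors f_i
print(f.shape)  # torch.Size([100, 256])
```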

The main function of the optimal matching layer is to generate a matching score matrix and output the final matching pairs. After the feature description vectors fiA and fjB are obtained, their inner products are calculated to form the score matrix S ∈ RH×W, where H and W represent the number of feature points in the recognition image and the image to be matched, respectively. The inner product is calculated as follows:

$$S_{i,j} = \left\langle f_{i}^{A}, f_{j}^{B} \right\rangle, \quad \forall (i,j) \in A \times B$$
(8)

Among them, ⟨·,·⟩ is the inner product. The score matrix is expanded by one row and one column (a "dustbin" channel) to obtain ‾S, as shown in Eq. (9), and unmatched points are assigned directly to that channel. The score of the newly added row and column is a fixed value, which can also be learned during training.

$$\overline{S}_{i,W + 1} = \overline{S}_{H + 1,j} = \overline{S}_{H + 1,W + 1} = z \in R$$
(9)

where S ∈ RH×W is the assignment matrix and ‾S ∈ R(H+1)×(W+1) is the augmented matching matrix; each row represents the matching probabilities between one point and all candidate points. Since each point can have at most one match, the values in each row of ‾S sum to 1. Finally, the Sinkhorn algorithm is used to solve for the maximum of the ‾S assignment, yielding the optimal assignment result.
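The score matrix, dustbin augmentation, and Sinkhorn normalization of Eqs. (8)–(9) can be sketched as follows. This is a simplified version of the SuperGlue formulation: the log-domain implementation and the learned dustbin score are omitted, and z is a placeholder constant.

```python
import torch

def match_scores(f_a: torch.Tensor, f_b: torch.Tensor, z: float = 1.0,
                 iters: int = 20) -> torch.Tensor:
    """Sketch of Eqs. (8)-(9) followed by Sinkhorn normalization."""
    S = f_a @ f_b.t()                          # Eq. (8): S_ij = <f_i^A, f_j^B>
    H, W = S.shape
    S_aug = torch.full((H + 1, W + 1), z)      # Eq. (9): append dustbin row/column with score z
    S_aug[:H, :W] = S
    P = torch.exp(S_aug)
    for _ in range(iters):                     # Sinkhorn iterations:
        P = P / P.sum(dim=1, keepdim=True)     # normalize rows,
        P = P / P.sum(dim=0, keepdim=True)     # then columns.
    return P                                   # soft assignment matrix


P = match_scores(torch.rand(80, 256), torch.rand(90, 256))
print(P.shape)  # torch.Size([81, 91]); the argmax of each row gives the match
```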

PROSAC optimization

Currently, traditional feature optimization methods often adopt the Random Sample Consensus (RANSAC) algorithm. However, since RANSAC selects data randomly without considering the difference in data quality, it may lead to difficulties in convergence under certain circumstances. To overcome this limitation, this paper introduces the PROSAC algorithm. The PROSAC algorithm adopts a semi-random approach. It evaluates the quality of matching point pairs, sorts them in descending order based on their quality, and gives priority to selecting high-quality point pairs. PROSAC can effectively reduce the computation time of the algorithm. Moreover, in cases where RANSAC fails to converge, its semi-random selection method can still ensure the convergence of the algorithm, thereby obtaining more accurate feature point-matching results. Figure 8 shows the results without optimization and the results further optimized with PROSAC. The matching relationship between the recognition image feature points and the sorting image feature points in the figure is connected by colored lines.

Fig. 8
figure 8

The match experiment.

From Fig. 8, we can see that there are some mismatched feature point pairs in Fig. 8(c), which leads to a deviation in the matching result indicated by the blue box in Fig. 8(e). After optimization with the PROSAC algorithm, the mismatched feature point pairs are filtered out, as shown in Fig. 8(d), and the blue box in Fig. 8(e) outputs the correct result.
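A sketch of this filtering step is shown below, assuming an OpenCV build (≥ 4.5) that exposes the USAC framework with the cv2.USAC_PROSAC flag; kpts_rec, kpts_sort, and matches are hypothetical names for the key point arrays and the (i, j, score) match list produced by the previous stages.

```python
import cv2
import numpy as np

def prosac_filter(kpts_rec, kpts_sort, matches, thresh=3.0):
    """Keep only match pairs consistent with a homography estimated by PROSAC."""
    # PROSAC samples the best-ranked correspondences first, so the pairs are
    # sorted by SuperGlue confidence in descending order before estimation.
    matches = sorted(matches, key=lambda m: m[2], reverse=True)
    src = np.float32([kpts_rec[i] for i, _, _ in matches])
    dst = np.float32([kpts_sort[j] for _, j, _ in matches])
    H, mask = cv2.findHomography(src, dst, cv2.USAC_PROSAC, thresh)
    inliers = [m for m, keep in zip(matches, mask.ravel()) if keep]
    return H, inliers
```

The surviving inlier points in the sorting image can then be framed with cv2.minAreaRect to obtain the minimum bounding rectangle of the target gangue, as described in the model construction.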

Experiment

Experimental environment and image acquisition

The experiment adopts our team's existing coal gangue sorting robot experimental platform; the sorting robot used in the experiment has a truss structure. The entire system consists of a belt conveyor, a coal gangue recognition system, a robotic arm sorting system, a robot control system, and a visual servo system, as shown in Fig. 9.

Fig. 9
figure 9

The coal gangue sorting robot system.

The experimental hardware environment is a PC with an i7-10700 processor, 16 GB of memory, and an NVIDIA GeForce RTX 2060 GPU, together with an MV-HS050GC Vickers camera. The industrial camera obtained 1000 images of coal and gangue from the gangue recognition system, including 500 images each of coal and gangue. The resolutions of the coal gangue recognition image and sorting image are 455 × 360 and 1440 × 1080 pixels, respectively. We expanded the original samples to 10,000 by translating, rotating, and scaling the sample images, and constructed a coal gangue recognition sample library to ensure sample diversity. Partial recognition images are shown in Fig. 10. The Gangue Dataset contains mixed images of coal gangue under object distance, rotation angle, and scale transformations. As shown in Fig. 11, they are respectively referred to as the GD-OT, GD-RT, and GD-ST datasets.

Fig. 10
figure 10

The recognition images.

Fig. 11
figure 11

The dataset images.

Experimental results and analysis

The eye-in-hand camera installation mode was used in the experiment. Because images captured in different regions are affected by differences in object distance, scale, rotation angle, and other factors, matching experiments were conducted on the recognition and sorting images under these conditions. Precision (P), Recall (R), the PR curve, and matching time (T) are used as evaluation indicators28.

$${\text{Precision (P)}} = \frac{{N_{{{\text{TP}}}} }}{{N_{{{\text{TP}}}} + N_{{{\text{FP}}}} }} \,$$
(10)
$${\text{Recall (R)}} = \frac{{N_{{{\text{TP}}}} }}{{N_{{\text{P}}} }}$$
(11)

NTP, NFP, and NP represent the number of true positive matching points, the number of false positive matching points, and the total number of matching points, respectively.
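Under these definitions, the indicators reduce to simple counts over the match results; the sketch below uses illustrative numbers only, and the true/false labels are assumed to come from ground truth.

```python
def precision_recall(n_tp: int, n_fp: int, n_p: int) -> tuple[float, float]:
    """Eq. (10) precision over returned matches; Eq. (11) recall over all matches."""
    return n_tp / (n_tp + n_fp), n_tp / n_p


p, r = precision_recall(n_tp=98, n_fp=2, n_p=100)   # illustrative numbers only
print(f"P = {p:.1%}, R = {r:.1%}")                   # P = 98.0%, R = 98.0%
```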

The matching results of our algorithm under different conditions are shown in Figs. 12, 13, 14 and 15. The colored dots in the images are the detected feature points, and the matched points are connected by colored lines.

Comparison experiment of matching different object distances

We set the image scale in our experiment to α and the rotation angle to 0°. To test the matching performance of the algorithm at different object distances, sorting images were collected in the sorting area with vertical distances between the robotic manipulator and the belt of 80 cm, 70 cm, 60 cm, and 50 cm, respectively. Figure 12 shows the matching results of our algorithm on coal gangue recognition images and sorting images at different object distances.

Fig. 12
figure 12

The Match results of different object distances.

From the matching results in Fig. 12, it can be seen that our method can correctly match the target coal gangue at different object distances. To further demonstrate the effectiveness of the algorithm proposed in this paper, the D2-Net dense feature matching method29, the MatchNet matching method30, the SuperPoint-SuperGlue (SP-SG) matching method31, the LR-SuperGlue (LR-SG) matching method32, and the DBSCAN-SuperGlue (DBSCAN-SG) matching method33 were used to match the coal gangue recognition image with the coal gangue sorting image.

Table 3 records the experimental data of matching rates and matching times for different methods, and takes the average value as the result.

Table 3 Matching experimental data results for different object distances.

The experimental results indicate that as the object distance increases, the matching rates of the six methods decrease, while the matching time does not change significantly. The matching rates of the five comparison methods decrease markedly, whereas our method maintains a matching rate of over 97% and a matching time of less than 96 ms. Our method requires slightly more time than the SP-SG and DBSCAN-SG methods, but achieves higher matching accuracy.

Comparison experiment of matching different scales

We set the object distance of the coal gangue recognition image and sorting image to 50 cm and the rotation angle to 0°. The scale factors are set to 0.3α, 0.5α, 0.7α, and 0.9α, respectively. Figure 13 demonstrates that our algorithm can match the target coal gangue under these scaling factors.

Fig. 13
figure 13

The Match results of different scales.

To further demonstrate the effectiveness of the algorithm proposed in this paper, the D2-Net feature matching method, MatchNet matching method, LR-SG matching method, DBSCAN-SG matching method and SP-SG matching method were used to match the coal gangue recognition image with the coal gangue sorting image. Table 4 records the experimental data of matching rates and matching times for different methods, and takes the average value as the result.

Table 4 Matching experimental data results for different scales.

According to the results in Table 4, the following conclusions can be drawn. As the scale increases, the matching time of each method becomes longer, but the matching rate also gradually increases. Our method has clear advantages in speed and matching rate over methods such as D2-Net and MatchNet. The LR-SG method is more sensitive to scale transformations. The DBSCAN-SG method requires less time, but its matching accuracy is poor. Our method achieves a higher matching rate than the SP-SG algorithm and maintains a matching rate of around 98%.

Comparative experiments on different rotation angles

We set the object distance to 50 cm and the scale factor to α; only the rotation angle is changed, with the other experimental conditions unchanged. Coal gangue sorting image samples were obtained at rotation angles of -90°, -45°, 45°, and 90°, with clockwise rotation taken as positive. Matching experiments at the different rotation angles were conducted with the proposed algorithm, as shown in Fig. 14.

Fig. 14
figure 14

The Match results of different rotation angles.

To further demonstrate the effectiveness of the algorithm proposed in this paper, the D2-Net feature matching method, MatchNet matching method, LR-SG matching method, DBSCAN-SG matching method and SP-SG matching method were used to match the coal gangue recognition image with the coal gangue sorting image. Table 5 records the experimental data of matching rates and matching times for different methods, and takes the average value as the result.

Table 5 Matching experimental data results for different rotation angles.

According to Table 5, when the rotation angle varies between -90° and 90°, the matching rates of the six methods first increase and then decrease, with the highest matching rate occurring at a rotation angle of 0°. The matching times of the six methods first decrease and then increase, with the shortest matching time also occurring at 0°. Our method still maintains a matching rate of over 98.1% and a matching time of less than 95 ms, showing a clear advantage over the other methods.

The matching results of complex conditions

The scene of coal gangue sorting is dynamically changing: there may be water stains, coal slag, and foreign objects on the belt. Therefore, the algorithm was additionally validated in these situations, and the results are shown in Fig. 15. The experiment includes matching results under different object distances, angles, and scales.

Fig. 15
figure 15

The comparison of matching results.

The experimental results show that, although the matching environment is complex, our method can still correctly match the target gangue. The Precision, Recall, and matching time are 98.2%, 98.3%, and 84.6 ms, respectively, demonstrating good performance.

In summary, we have compiled the average matching results under different experimental conditions, as shown in Table 6. From the table, it can be seen that the average matching rate of this method for coal gangue recognition and sorting images is 98.2%, with an average matching time of 84.6 ms. The method features a high matching rate, good real-time performance, and strong robustness. This performance advantage can further improve the accuracy of coal gangue sorting robots.

Table 6 The comparison of matching results.

In addition, the comparison of our method's matching performance with that of the other methods is shown in Fig. 16. The larger the area enclosed by the P-R curve (where P is precision and R is recall), the better the matching performance.

Fig. 16
figure 16

P-R Curves.

Discussion and conclusion

In light of alterations in the pose of the target gangue during transport, this paper puts forth an IMSSP-Net fast gangue image matching method to re-obtain the pose information of the target coal gangue. The proposed method facilitates greater precision and efficiency in the robot's grasping. To address the low matching rate of conventional methods in complex contexts, the SuperPoint and SuperGlue networks are combined to achieve precise and effective matching between gangue recognition and sorting images. To address the limited feature extraction capability of traditional methods, the improved MSRCR is adopted to enhance the feature point detection capability of the SuperPoint algorithm. The PROSAC method is then used to refine and filter the matched features, thereby improving the matching precision. The experimental results demonstrate that the proposed IMSSP-Net method is well suited to coal gangue image matching and, compared with other algorithms, can be applied to coal gangue sorting in complex environments. The average matching precision of the algorithm is 98.2%, and the average matching time is 84.6 ms. The matching time and precision currently meet the requirements of coal gangue sorting tasks, but further research and improvement are needed to address gangue stacking under high coal throughput.