Introduction

Due to factors such as construction work and accidental dropping, small and inconspicuous nails are often present on roads, posing safety risks to pedestrians and vehicles. Traditional manual road cleaning is labor intensive and inefficient. During the day, with heavy traffic on the roads, road sweepers designed for large, lightweight debris such as leaves, branches, and bottles are often ineffective at clearing small, heavy nails, especially those partially embedded in the pavement by vehicle pressure. To avoid disrupting traffic, we conduct inspections on clear nights. We therefore design a robot system that recognizes nails on the road surface and then attempts to locate and retrieve them; if retrieval fails, the nails are marked for further action. The core task is to recognize road nails under uneven lighting conditions. Non-deep-learning image recognition algorithms are sensitive to the key features and sizes of objects, so they struggle to accurately identify road nails in the low-quality images obtained at night. Deep-learning-based image recognition algorithms learn the mapping from images to labels and thus make better use of the information in low-quality images. However, end-to-end training alone cannot cope with occlusion, blur, and similarity between objects, especially under uneven lighting. Therefore, this paper proposes an improved YOLOv5 algorithm that combines an improved C3 module, RepGFPN, and OTA loss, improving the detection accuracy of the network under uneven lighting from the perspectives of feature extraction, multi-scale fusion, and loss design. To ease deployment of the improved model on an NVIDIA detection device, the network is lightweighted by reducing the number of parameters, which also increases detection speed.

Related work

With the rapid development from traditional techniques to deep learning, object detection networks have become both faster and more accurate. Traditional object detection algorithms based on hand-crafted features, such as the Scale Invariant Feature Transform (SIFT)1, the Histogram of Oriented Gradients (HOG)2, and the Deformable Parts Model (DPM)3, rely heavily on manually designed image features and object contours. Fukushima4 proposed the neocognitron, an early precursor of Convolutional Neural Networks (CNNs)5. Since Krizhevsky, Sutskever, and Hinton proposed the AlexNet6 neural network in 2012, object detection has been elevated to a new level. D. Arora and K. Kulkarni used Faster R-CNN to address the difficulty of efficient shelf object recognition7. However, two-stage object detection models such as R-CNN8, Fast R-CNN9, Faster R-CNN10 and others exhibit larger model sizes and longer training times. Despite their limitations for end-to-end training, their fundamental ideas, such as the Region Proposal Network (RPN)11, continue to find widespread application in contemporary object detection models. Nowadays, the YOLO series is widely used in object detection. The initial iterations of the YOLO family, namely YOLOv112, YOLOv213, and YOLOv314, incorporated prior (anchor) boxes, residual networks, feature fusion, and multi-scale training, outperforming other object detection algorithms of the same period. R. Mi, Z. Hui et al. presented an improved YOLOv3-SPP algorithm with DIoU-NMS and dilated convolution, achieving a 1.79% mAP improvement and reducing missed detections in dense vehicle scenarios15. The first three versions of the YOLO series are relatively complex. YOLOv416 optimized the training strategy and data augmentation by using Mosaic augmentation and introducing the SPP17 structure; however, as the model structure was further expanded and optimized, its demand for computational resources increased. YOLOv518 integrates Conv, C3, and SPPF modules to enhance the network's learning ability, and it provides model variants of different sizes, significantly expanding its applicability. G. Ma, Y. Zhou et al.19 improved the detection accuracy of lightweight models in complex backgrounds by adding lightweight convolutions to the neck and using CIoU loss for regression. T. Jiang, Y. Xian et al. proposed an improved YOLOv5 algorithm for detecting traffic signs in complex environments by integrating SE modules, CoT modules, and a small-object detection layer20. P. Singh, K. Gupta et al. applied machine vision and deep learning to unmanned aerial vehicles, evaluating the performance of YOLOv5, RetinaNet, and Faster R-CNN in challenging environments21. Following YOLOv5, the YOLO family has introduced YOLOv622, YOLOv723, and YOLOv824, continuing to explore and innovate on the basis of previous versions. Q. Wang, C. Li et al. integrated an improved YOLOv7 model into substation inspection robots25; however, the large model size and the lack of a pre-trained backbone made deployment difficult. As the latest version of the YOLO series, YOLOv926 introduced new concepts to address the changes deep networks require to achieve multiple objectives, but its large model size still made deployment challenging. Besides the YOLO series, many other algorithms have made significant contributions to object detection.
RetinaNet27 improves detection performance by addressing the class imbalance issue through the introduction of the focal loss function. EfficientDet28 achieves efficient object detection by improving and extending EfficientNet29.

Among the detection algorithms mentioned above, the key issue is balancing speed and accuracy. In terms of parameters and FLOPs, YOLOv5 compares favorably with YOLOv6 and YOLOv8. Weighing detection accuracy against speed, YOLOv5n is currently the most suitable model for deployment on resource-constrained embedded systems. Even so, the fixed sizes of YOLOv5n's detection boxes make it harder to apply to objects of varying sizes, so it may not fully meet the requirements for high-precision detection of road nails.

Robot structure

As shown in Fig. 1, the robot's mechanical structure is divided into four parts: the walking mechanism, the visual inspection mechanism, the electromagnetic retrieval mechanism, and the ring marker mechanism. The first part is the walking mechanism. It is designed based on traditional road sweepers and uses a four-wheel structure to support and drive the entire inspection robot system, ensuring that the robot can operate under various complex road conditions and reducing its maintenance costs. The second part is the visual inspection mechanism. To meet the large field-of-view requirement during inspection, a spherical gimbal is used to mount a USB binocular industrial camera. The gimbal adjusts the angle and direction of the binocular camera, enabling full-range camera movement.

Fig. 1
figure 1

Mechanical structure of the robot.

The third part is the electromagnetic retrieval mechanism. As shown in Fig. 2a, the device consists of a guiding block, gears, timing pulleys, and a magnetic block. The gears and motor move the device horizontally along the guide rail, while the timing pulleys and guiding block together adjust its vertical height. The magnetic block retrieves road nails by switching its power on and off. The fourth part is the ring marker mechanism. As shown in Fig. 2b, it consists of a slider, a sleeve, a ferrule sleeve, and an ejector wheel. It is connected to the electromagnetic retrieval mechanism via an optical axis, enabling synchronized movement of the ring marker and the electromagnetic retrieval device. Fluorescent markers are stored in the sleeve; the ejector wheel ejects the bottom baffle of the sleeve, dropping a fluorescent marker onto the nail. Fluorescent markers are easier to detect in the dark under uneven lighting, which simplifies later processing.

Fig. 2
figure 2

Partial device mechanical structure.

The work process of the robot is shown in Fig. 3. We constructed a stereo vision system based on the robot. To maximize the binocular camera's field of view during the robot's movement, the camera was mounted at the top center of the robot, angled 45° diagonally downward. During inspection, images were collected with the stereo camera to construct training and testing datasets of nighttime road nails. Considering the characteristics of these datasets, we designed the improved YOLOv5n network and carried out subsequent processing. The combined algorithm was deployed on an NVIDIA Jetson Orin Nano device to provide real-time localization of road nails to the electromagnetic retrieval system. The robot is controlled by an STM32F103 chip; through the coordination of its components, it completes the inspection of road nails.

Fig. 3
figure 3

The work process of the robot.

Algorithms

Nails occupy only a small proportion of the images captured by the robot. Because of the lighting equipment mounted on the robot, the images exhibit uneven illumination, which severely degrades the recognition accuracy of existing algorithms. In addition, under complex road conditions nails are easily obscured by other objects, further lowering nighttime detection accuracy. To strengthen the network's ability to learn fine detail, we improved the original model and then lightweighted it, aiming to raise the recognition accuracy of road nails while reducing the model's parameters.

C3_improved

The C3 module is a key component of the backbone; it extracts features through a hierarchical structure, connecting low-level to high-level feature maps through a series of convolutional layers and bottleneck structures. However, as convolutional layers and bottlenecks stack up, information about road nails, which occupy only a small part of the original image, is gradually lost, especially under uneven lighting. As shown in Fig. 4, we directly fuse low-level and high-level feature maps through residual connections to compensate for the lost detail. We combine the C3_improved and C3 modules in the backbone to reduce loss during layer-by-layer transmission, making the network pay more attention to the detailed information of road nails during training.
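As a minimal sketch of this idea (not the exact layer configuration used in the paper), the PyTorch-style module below adds a residual shortcut from the block's input to its output so that shallow detail bypasses the stacked bottlenecks; the module names and channel handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Standard YOLOv5-style Conv -> BN -> SiLU block."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """1x1 -> 3x3 bottleneck with an optional shortcut."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 1)
        self.cv2 = ConvBNSiLU(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3Improved(nn.Module):
    """C3-style block with an extra residual connection from the block input
    to its output, so low-level detail bypasses the stacked bottlenecks."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, c_hidden, 1)
        self.cv2 = ConvBNSiLU(c_in, c_hidden, 1)
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))
        self.cv3 = ConvBNSiLU(2 * c_hidden, c_out, 1)
        # Projection used only when the input/output channel counts differ.
        self.skip = nn.Identity() if c_in == c_out else ConvBNSiLU(c_in, c_out, 1)

    def forward(self, x):
        y = self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
        return y + self.skip(x)  # residual fusion of low- and high-level maps
```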

Fig. 4
figure 4

The improved model structure.

RepGFPN

Although the backbone structure combining the C3_improved and C3 modules enhances the learning of road-nail details during training, the model's recognition capability is still limited. To improve feature fusion, we incorporated the reparameterized generalized feature pyramid network (RepGFPN) into the original structure. As shown in Fig. 4, the improved structure uses multiple sampling operations with different channel dimensions for feature maps at different scales, integrating high-level semantic and low-level spatial features of road nails to assist C3_improved in learning details. Its re-parameterization mechanism enhances the network's feature representation by adaptively adjusting features across levels, effectively capturing detail affected by lighting changes. This improves the model's ability to recognize objects in unevenly lit environments and reduces the information loss caused by inconsistent lighting.
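RepGFPN itself has a fairly involved topology; as a minimal illustration of the re-parameterization idea it relies on (not the full fusion network), the block below trains with parallel 3×3 and 1×1 branches and folds them into a single 3×3 convolution for inference. The names and exact branch layout are illustrative assumptions, and the identity branch used when input and output channels match is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepConv(nn.Module):
    """Re-parameterizable conv: trains with parallel 3x3 and 1x1 branches,
    then fuses them into a single 3x3 conv for deployment."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, 1, 1, bias=False), nn.BatchNorm2d(c_out))
        self.branch1 = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False), nn.BatchNorm2d(c_out))
        self.act = nn.SiLU()
        self.deploy_conv = None  # filled by fuse()

    def forward(self, x):
        if self.deploy_conv is not None:
            return self.act(self.deploy_conv(x))
        return self.act(self.branch3(x) + self.branch1(x))

    @staticmethod
    def _fuse_conv_bn(conv, bn):
        # Fold BN statistics into the preceding conv's weight and bias.
        std = (bn.running_var + bn.eps).sqrt()
        w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
        b = bn.bias - bn.running_mean * bn.weight / std
        return w, b

    def fuse(self):
        """Collapse both branches into one 3x3 conv (inference only)."""
        w3, b3 = self._fuse_conv_bn(self.branch3[0], self.branch3[1])
        w1, b1 = self._fuse_conv_bn(self.branch1[0], self.branch1[1])
        w1 = F.pad(w1, [1, 1, 1, 1])  # pad 1x1 kernel to 3x3 so kernels can be summed
        fused = nn.Conv2d(w3.shape[1], w3.shape[0], 3, 1, 1, bias=True)
        fused.weight.data = w3 + w1
        fused.bias.data = b3 + b1
        self.deploy_conv = fused
```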

OTA loss

In the unevenly lit images captured during robot inspection, road nails still occupy a small proportion. Stacked convolutional layers and repeated sampling operations can lose detailed information, sharply reducing the road-nail information available to the network. Although the newly added C3_improved and RepGFPN significantly reduce information loss and improve feature fusion, they still cannot recover more nail information from the limited data. We therefore adopt a dynamic label assignment method based on an optimal transport strategy (OTA) and replace the original loss, as shown in Fig. 4. By treating the weighted classification and regression losses between each nail and each prediction as transportation costs, the network learns the optimal label assignment by minimizing the total cost. The pseudo code of the proposed method is listed in Algorithm 1.
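Algorithm 1 is not reproduced here; as a rough sketch of the dynamic-assignment idea behind OTA (closer to the simplified top-k SimOTA variant than to the full Sinkhorn-based solver), the function below picks, for each ground-truth nail, a dynamic number of lowest-cost predictions. The names, the cost definition, and the conflict-resolution rule are illustrative assumptions.

```python
import torch

def simota_assign(cost, ious, q=10):
    """Simplified dynamic label assignment in the spirit of (Sim)OTA.

    cost : (num_gt, num_anchors) pair-wise cost, e.g. cls_loss + lambda * reg_loss
    ious : (num_gt, num_anchors) pair-wise IoU between GT boxes and predictions
    q    : number of top-IoU candidates used to estimate each GT's dynamic k
    Returns a (num_gt, num_anchors) 0/1 assignment matrix.
    """
    num_gt, num_anchors = cost.shape
    assign = torch.zeros_like(cost)

    # Each ground truth "supplies" a dynamic number of positive labels,
    # estimated from the sum of its top-q IoUs (at least 1).
    topq_ious, _ = ious.topk(min(q, num_anchors), dim=1)
    dynamic_k = topq_ious.sum(dim=1).int().clamp(min=1)

    # Each GT takes its k lowest-cost predictions as positives.
    for g in range(num_gt):
        _, idx = cost[g].topk(int(dynamic_k[g]), largest=False)
        assign[g, idx] = 1.0

    # Resolve predictions claimed by multiple GTs: keep the lowest-cost GT.
    multiple = assign.sum(dim=0) > 1
    if multiple.any():
        cols = torch.nonzero(multiple).flatten()
        best_gt = cost[:, cols].argmin(dim=0)
        assign[:, cols] = 0.0
        assign[best_gt, cols] = 1.0
    return assign
```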

Lightweight

Model lightweighting and acceleration is a crucial research area in deep learning, focusing on reducing model size and computational complexity so that network models run efficiently on resource-constrained devices. The improved detection network has more parameters than the original model. We lightweight it through sparsity training, model pruning, and knowledge distillation. These techniques reduce the model size and parameter count, resulting in faster inference and improved detection accuracy for road nails.

A BN layer is typically placed after a convolutional layer; its primary role is to accelerate convergence and reduce the difficulty and complexity of training. The backbone of the network contains a large number of parameters, so in this paper we lightweight the backbone as shown in Fig. 5.
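A common way to realize the sparsity step described below is an L1 penalty on the BN scale factors (as in network slimming); the following sketch illustrates that assumption, with the coefficient s chosen purely for illustration.

```python
import torch

def bn_sparsity_penalty(model, s=1e-3):
    """L1 penalty on the scale factors (gamma) of every BatchNorm layer.

    Added to the detection loss during sparsity training so that unimportant
    channels are driven toward zero; s is the sparsity coefficient.
    """
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()
    return s * penalty

# During training (sketch):
#   loss = detection_loss + bn_sparsity_penalty(model, s=1e-3)
#   loss.backward()
```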

Fig. 5
figure 5

Lightweight design.

We introduce a sparsity factor into the BN layers and perform sparsity training, during which the BN scale factors gradually become sparse. After sparsity training, we analyze the BN weights and prune the channels whose BN weights approach zero. We then fine-tune the pruned model to restore its original fit. The fine-tuned model serves as the student model, while the improved model acts as the teacher. Knowledge distillation transfers knowledge from the large teacher model into the smaller student model. By adjusting the distillation weight, we raise the detection accuracy of the student model until it reaches or exceeds the teacher's accuracy, thereby obtaining a lightweight model. The pseudo code of the lightweight method is listed in Algorithm 2.
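As a minimal sketch of the pruning-channel selection and the distillation objective (the exact pruning criterion, the distillation loss form, and the weight scale are assumptions, not the paper's exact formulation):

```python
import torch

def bn_prune_mask(model, prune_rate=0.2):
    """Select channels to prune: the prune_rate fraction of channels whose BN
    scale factors are closest to zero (global threshold over all BN layers)."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, torch.nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_rate)
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            masks[name] = m.weight.detach().abs() > threshold  # True = keep
    return masks

def distillation_loss(student_out, teacher_out, hard_loss, alpha=40.0):
    """Weighted sum of the normal detection loss and a soft loss that pulls the
    student's predictions toward the teacher's (MSE used here for simplicity;
    alpha plays the role of the 10-100 distillation weights explored later)."""
    soft_loss = torch.nn.functional.mse_loss(student_out, teacher_out.detach())
    return hard_loss + alpha * soft_loss
```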

Experiments

Camera calibration

Camera calibration is an important technique in computer vision used to determine the intrinsic and extrinsic parameters of a camera. It enables accurate conversion between pixel coordinates in an image and world coordinates. For the road-nail recognition application, the stereo camera is calibrated with Zhang's calibration method. The calibration procedure is summarized as follows (a minimal code sketch is given after the list):

  1. (1)

    A calibration board with 8 × 11 corner points and a square size of 19 mm is selected.

  2. (2)

    As shown in Fig. 6, keep the stereo camera stable and capture several images of the calibration board from different orientations and angles.

  3. (3)

    Use the binocular calibration algorithm to detect the feature points of the calibration board.

  4. (4)

    Estimate and refine the intrinsic and the extrinsic parameters.
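Under the assumption that OpenCV is used for corner detection and parameter estimation (the paper does not name a specific toolbox; the file paths, image format, and flags below are illustrative), the procedure can be sketched as:

```python
import glob
import numpy as np
import cv2

# 8 x 11 inner corners, 19 mm squares (as used in this paper).
pattern = (8, 11)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 19.0

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob("left/*.png")), sorted(glob.glob("right/*.png"))):
    gl = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    okl, cl = cv2.findChessboardCorners(gl, pattern)
    okr, cr = cv2.findChessboardCorners(gr, pattern)
    if okl and okr:
        obj_pts.append(objp)
        left_pts.append(cl)
        right_pts.append(cr)

size = gl.shape[::-1]  # (width, height) of the last processed image
# Per-camera intrinsics and distortion coefficients (k1, k2, p1, p2, k3).
_, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
# Stereo extrinsics: rotation R and translation T between the two cameras.
_, K1, D1, K2, D2, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, D1, K2, D2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```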

Fig. 6
figure 6

The calibration process.

The intrinsic matrix, extrinsic matrix, and distortion coefficients can be calculated from each image. The extrinsic matrix includes the rotation matrix and translation vector. For the stereo camera, the distortion coefficients include the radial distortion coefficients k and the tangential distortion coefficients p. The parameters and results obtained from the calibration of the stereo camera are shown in Table 1 and Fig. 7a,b below.
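For reference, the standard radial-tangential distortion model behind these coefficients (a textbook formulation, not reproduced from the paper) maps normalized undistorted coordinates (x, y) to distorted coordinates (x_d, y_d):

$$
\begin{aligned}
x_d &= x\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2 p_1 x y + p_2\left(r^2 + 2x^2\right),\\
y_d &= y\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + p_1\left(r^2 + 2y^2\right) + 2 p_2 x y,
\end{aligned}
\qquad r^2 = x^2 + y^2 .
$$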

Table 1 Binocular camera parameters.
Fig. 7
figure 7

Calibration results.

Datasets

In this experiment, as shown in Fig. 8, we chose seven different types of nails, named d1 to d7, covering long nails, iron nails, sharp nails, thumbtacks, and push pins. To simulate real road scenes, seven road surfaces were selected: smooth, marble, pavement, dirt, gravel, asphalt, and grassy roads; factors such as shadow occlusion, puddles, obstructions, overlapping nails, and visually similar objects were also considered. We collected a total of 2100 images of road nails with the binocular camera to construct a dataset for the YOLOv5n network. The dataset was annotated and cross-validated by multiple experts using LabelMe to ensure the accuracy of the annotations. 75% of the images were randomly selected as the training set and the remaining 25% were used for testing. The dataset collection process and some samples are shown in Fig. 9.
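The 75/25 split can be reproduced with a simple random shuffle; the directory layout and file names below are hypothetical, not the paper's actual organization.

```python
import random
from pathlib import Path

# Hypothetical layout: every annotated image lives in images/ with a LabelMe
# JSON of the same stem; 75% go to training, 25% to testing.
random.seed(0)
images = sorted(Path("images").glob("*.jpg"))
random.shuffle(images)
split = int(0.75 * len(images))
Path("train.txt").write_text("\n".join(str(p) for p in images[:split]))
Path("test.txt").write_text("\n".join(str(p) for p in images[split:]))
```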

Fig. 8
figure 8

Different types of nails.

Fig. 9
figure 9

Datasets collection process and a partial sample of the datasets.

Results analysis

To verify the effectiveness of the model designed in this paper, we established a training platform and deployed the trained model for validation. The experimental environment settings for model training and deployment are detailed in Table 2. Unless otherwise specified, the baseline model remains unchanged.

Table 2 Experimental environment.

Improved experiments results

To avoid ineffective recognition caused by the large black areas in the original nighttime images, we cropped the original images from 640 × 480 to 640 × 300. The cropped image dataset is shown in Fig. 10, and the comparative experimental results are shown in Fig. 11.
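The crop itself is a simple row selection; which rows are discarded depends on where the dark region lies in the robot's view, so the band chosen below is only an assumption for illustration.

```python
import cv2

img = cv2.imread("frame.png")       # 480 x 640 x 3 nighttime frame (hypothetical file)
cropped = img[180:480, 0:640]       # keep a 640 x 300 band; here the lower rows, as an example
cv2.imwrite("frame_cropped.png", cropped)
```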

Fig. 10
figure 10

A partial of the cropped datasets.

Fig. 11
figure 11

Comparison experimental results.

From the results, we found that because the cropped images contain a smaller proportion of unevenly lit regions, the precision, recall, and mAP of the network are significantly improved, generally by 6 to 11%. In terms of curve trends, after 100 epochs the curves for the cropped images are higher than those for the original images, further confirming that image cropping reduces the impact of uneven lighting on detection accuracy. To further evaluate the contribution of our improvements to mitigating the effect of the dark environment, we designed ablation experiments. The cropped dataset is used in all subsequent training and evaluation. The names of the different improved models are listed in Table 3, and the results are shown in Table 4 and Figs. 11 and 12.

Table 3 Different improved name.
Table 4 Ablation experiments results.
Fig. 12
figure 12

Ablation experiments results.

As shown in Table 4, the improved C3 module, combined with RepGFPN through residual connections, enables the network to better fuse and share information from multi-level feature maps under uneven lighting. This allows the network to capture features of small objects such as nails across multiple scales, enhancing detection accuracy. Meanwhile, the proposed loss function, with its object-aware mechanism, dynamically adjusts the loss for different objects during training, letting the model focus more on small objects and further improving the network's recall and mAP. To analyze the detection performance of the proposed model in more detail, we enlarged the region between 300 and 400 epochs in Fig. 12. In the enlarged region, the recall and mAP curves of our model are clearly higher than those of the other models. However, integrating three distinct improvements simultaneously leads to an overemphasis on detailed image features under uneven lighting and to adaptation inaccuracies during multi-level feature fusion, producing errors between the predicted and ground-truth boxes. Consequently, the model's detection precision is slightly lower than that of Models B and F, and its box_loss is higher than that of Model B, although this slight increase in box_loss has little effect on recognition accuracy. Training loss is also a direct criterion for evaluating model performance. The training loss is a weighted sum of three components: classification loss, objectness loss, and bounding-box loss. As shown in Fig. 13, even though the network becomes more complex after adding the improved C3 module and RepGFPN, its loss is still significantly lower than that of the other models, at only about 0.05.
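For clarity, the composite training loss referred to here has the standard YOLOv5 form, a weighted sum of the three components (the weighting coefficients λ are framework hyperparameters, not values reported in this paper):

$$
L_{\text{total}} = \lambda_{\text{box}} L_{\text{box}} + \lambda_{\text{obj}} L_{\text{obj}} + \lambda_{\text{cls}} L_{\text{cls}} .
$$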

Fig. 13
figure 13

Comparison of training loss.

Lightweight experiments results

In this subsection, we call the improved model CRO-YOLOv5n. First, we sparsified the model by adding a regularization term on the BN parameters to the loss function, forcing the BN scale factors to converge toward 0 during training. After sparsity training, we cut the channels whose BN parameters are closest to 0. The parameters set during training are shown in Table 5, and the results are shown in Fig. 14.

Table 5 Parameters set during sparsify train.
Fig. 14
figure 14

Sparsify trained results.

The BN-weights histogram visualizes the sparsification process: its vertical axis represents the training epoch, increasing from top to bottom. As shown in Fig. 14, as training progresses the peak of the histogram moves steadily toward 0 along the horizontal axis, indicating that most BN scale factors have become sparse. The purple curve shows the peak approaching the x-axis smoothly. During BN sparsification, both mAP@0.5 and mAP@0.5:0.95 improve compared to CRO-YOLOv5n, indicating promising training results.

Then, we pruned the backbone. We designed comparative experiments with different pruning rates. The parameters set during training are shown in Table 6, and the pruning results are shown in Table 7 and Fig. 15.

Table 6 Parameters set during pruned train.
Table 7 Inference results of the model pruned.
Fig. 15
figure 15

Prune trained results.

From Fig. 15 we can see that when the pruning rate is 0.2, precision, recall, and mAP all achieve their best values. At rates of 0.4 and 0.6 the results are similar, while at 0.8 they drop significantly. For model inference, we set the batch size to 1; the inference results are shown in Table 7. FPS indicates the number of images processed per second, while inference time refers to the time needed to process a single image. With reductions in both model parameters and size, at a pruning rate of 0.2 the inference time is 9.4 ms and the FPS is 92.611, both better than those of the unpruned model. Although this represents a slight decrease compared to the 0.6 pruning rate, considering the detection performance of the pruned model we conclude that a pruning rate of 0.2 offers the best overall performance.

Finally, we used sparsified.pt as the teacher network weights and 0.2pruned.pt as the student network weights, distilling knowledge from the teacher model into the student model. The knowledge distillation weight was varied from 10 to 100, and the training results are shown in Fig. 16.

Fig. 16
figure 16

Distilled train results.

As shown in Fig. 16, different knowledge distillation weights have a certain impact on the results. During training, the model achieved its highest precision with the weight set to 40%; its recall and mAP decrease slightly compared to the others but remain generally consistent. We compared the model trained with the 40% distillation weight against the original and improved models; the comparison results are shown in Fig. 17. With the batch size again set to 1, the inference results are shown in Table 8.

Fig. 17
figure 17

Trained results of the model comparison.

Table 8 Inference results of the model comparison.

As shown in Fig. 17, the model achieves the best precision and mAP after sparsity training, prune training, and knowledge distillation. As can be seen from Table 8, the lightweight model has 16% fewer parameters than the improved model and only 0.18 M more parameters than the original model. In addition, the computational complexity decreases from 5.2 GFLOPs to 4.5 GFLOPs, increasing inference speed. Overall, the lightweight model shows clear advantages in recognizing road nails at night.

Localization retrieval and ring mark experiments

We combined the lightweight model with SGBM and deployed it on an NVIDIA Jetson Orin Nano device, enabling accurate recognition and localization of road nails during robotic inspections. We designed multiple localization experiments for different nails under various conditions, including different times and road surfaces. The experimental results are shown in Fig. 18; the horizontal axis indicates the group number, the vertical axis represents the actual distance to the road nail, and the colored dashed lines indicate the ± 2 cm range.
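As a minimal sketch of how SGBM-based depth can be coupled to a detection box (the matcher parameters, helper names, and the use of the box center are assumptions, not the deployed configuration):

```python
import cv2
import numpy as np

# Q is the 4x4 reprojection matrix produced by cv2.stereoRectify during
# calibration; the disparity range and block size here are illustrative.
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96, blockSize=7)

def locate(box, left_gray, right_gray, Q):
    """Estimate the 3-D position of a detected nail from its bounding box."""
    disp = sgbm.compute(left_gray, right_gray).astype(np.float32) / 16.0
    points = cv2.reprojectImageTo3D(disp, Q)   # H x W x 3, camera coordinates
    x1, y1, x2, y2 = box                       # detection box in pixels
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    return points[cy, cx]                      # (X, Y, Z) at the box center
```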

Fig. 18
figure 18

Nighttime road nail localization results.

From the above results, it can be seen that the combined algorithm achieves high localization accuracy for different nails across various road surfaces, with errors generally staying within ± 2 cm. Building on the high precision localization, we designed retrieval and ring mark experiments for road nails. We tested the robot’s retrieval and ring mark results over 200 inspection cycles. Table 9 lists the nail retrieval results on different road surfaces. Table 10 lists the ring mark results on different road surfaces. Figure 19 shows the retrieval rate for different road surfaces and different road nails.

Table 9 Electromagnetic retrieval results.
Table 10 Ring mark results.
Fig. 19
figure 19

Successful retrieval results.

Combining the results from Tables 9 and 10, we find that the inspection robot achieves a high nail retrieval rate and a high ring-marking success rate on every type of road surface. Sand and weeds on dirt and grass roads reduce the magnetic mechanism's effectiveness, leading to somewhat lower rates, whereas other road types have minimal impact on the magnetic retrieval system and thus yield higher retrieval success rates. For the same reasons, gravel, dirt, and grass roads also exhibit some instances where nails are not effectively ring-marked, whereas other road surfaces allow effective marking. From these results, the retrieval rate for each type of nail and the ring-marking rate for each condition are both maintained above 99%, and the overall retrieval rate for each condition remains above 98%. Compared with the other nail types, the d7 nail is relatively short and wide, giving it a higher concentration of iron and making it easier to retrieve, while the d3 nail is relatively small, making retrieval more challenging. The overall retrieval rates of the remaining nail types are similar. The experimental results thus indicate that nail shape has a certain impact on retrieval accuracy.

Conclusion

This paper presents a robotic system designed for the localization, retrieval, and ring marking of road nails on nighttime road surfaces. To improve the accuracy of road-nail recognition at night, we proposed an improved YOLOv5 object detection algorithm that integrates an improved C3 module, RepGFPN, and OTA loss. The combination of the improved C3 and RepGFPN modules significantly strengthens the network's ability to capture multi-scale nail details under uneven lighting. Additionally, OTA loss dynamically adjusts the loss for different objects, enabling the network to focus more effectively on small objects. To optimize deployment, we applied sparsity training, fine-tuning, and distillation training, reducing the network's parameters. Experimental results demonstrate that, with a 16% reduction in parameters, the network achieves an mAP of 91.5%, an 11.3% improvement over the original network. Experiments on seven different road surfaces show that the nail localization error remains within 2 cm, with the retrieval and ring-marking success rates for each type of nail exceeding 99% and the retrieval success rate for each road type surpassing 98%.