Abstract
In response to the issues of false detection, missed detection, and large model parameter volume when detecting road surface cracks in complex backgrounds, a lightweight road crack detection model named YOLO-DGVG, based on deformable convolution, is proposed. Firstly, the deformable convolution DCNv2 is introduced into the backbone network and the C2f-DCNv2 structure is designed to enhance the network's ability to adapt to the shapes of road surface cracks. Secondly, the lightweight convolution technique GSConv is introduced into the neck network to replace the Conv layers used for feature extraction, and the original C2f module is replaced with the VoVGSCSP module, which improves the model's detection accuracy while reducing computational complexity. Finally, a grouped-convolution detection head module, GCH, is constructed in the head network to further reduce the model's parameter count. To verify the effectiveness of the improvements, ablation experiments were conducted on the PID dataset against the YOLOv8 model: Recall and mAP increased by 0.3% and 1.6%, respectively, while the parameter count was reduced by 22.28%. Additionally, generalization experiments were carried out using the UAPD, RDD2022, and self-built datasets. To further validate the overall effectiveness of the model, comparisons were made with models such as RT-DETR and YOLOv10, with YOLO-DGVG outperforming the comparison models. The model was also deployed on an edge computing device to detect cracks in static road surface images.
Introduction
Fractures are a common structural defect, widely observed on various structures such as residential buildings1, bridges2, and pavements3. These defects primarily stem from insufficient structural integrity. In the realm of road infrastructure, the presence of such cracks can significantly reduce both safety and longevity. Consequently, prompt identification and repair of these cracks are crucial. Traditionally, crack detection has been performed manually, a process that is not only time-consuming but also susceptible to human error, often leading to missed detections and false alarms. The advancement of visual sensing technology has made vision-based crack detection on road surfaces more prevalent. This approach optimizes the use of limited maintenance resources and mitigates the risk of catastrophic incidents. In particular, deep learning has made significant strides in automating the detection of road surface cracks with improved accuracy. Previous crack detection research primarily relied on traditional image processing techniques4,5, which typically involved a three-stage workflow: first, reducing image noise through smoothing and filtering techniques; second, identifying cracks using edge detection and threshold segmentation; and finally, classifying the detected cracks using a classifier. However, the performance of these traditional methods heavily depended on the assumptions of the noise model, and their detection accuracy significantly decreased when dealing with complex backgrounds and diverse crack morphologies.
With the development of computer vision and Convolutional Neural Networks (CNNs), researchers have begun to explore deep learning-based crack detection techniques. For instance, Cha et al.6 proposed a method based on the ZF-Net and Faster R-CNN framework, which replaced the softmax layer and fully connected layers to better suit defect detection. However, this method has a large computational load, requires high hardware specifications, and suffers from some missed detections and false alarms when dealing with complex backgrounds and small targets. Maeda et al.7 introduced the Pavement-Crack-Maeda13 dataset and developed an efficient crack detection algorithm based on the SSD model. The algorithm performs particularly well when using Inception-V2 and MobileNet as the backbone networks, but it has some limitations in crack detection accuracy. Qiu and Lau8 introduced the YOLOv2-tiny model and demonstrated that improved models based on ResNet50 and YOLOv4-tiny achieve high accuracy and robustness in detecting fine cracks; they also innovatively used drones to collect sidewalk video data at low altitudes, further enhancing detection capabilities. Despite the significant progress made in these studies, crack detection still faces numerous challenges: (1) detecting cracks in complex backgrounds, where background noise can severely interfere with crack identification and localization; (2) the uneven distribution and diverse shapes of cracks, which increase the difficulty of recognizing different types of cracks; and (3) the small pixel area occupied by cracks in images, which leads to poor detection performance for fine cracks.
We propose the YOLO-DGVG road surface crack detection method to address these problems. First, we improve the C2f module in the backbone network by integrating Deformable ConvNets version 2 (DCNv2) into it, yielding the C2f-DCNv2 module. This allows the convolution kernel to learn an offset so that it better covers the target, and to use a weight coefficient to decide whether a learned region is of interest. Second, we apply a lightweight convolution technique, Grouped Spatial Convolution (GSConv), in the neck network to replace the original Conv layers used for feature extraction, and we substitute the VoVGSCSP module, a Cross Stage Partial network based on GSConv, for the original C2f module. This lowers the model's computational complexity while simultaneously increasing its detection accuracy. Finally, we propose the Group Conv Head (GCH) detection head module, which lowers the model's parameter count without sacrificing speed or accuracy, further improving the model's performance. We expect the improved model to achieve higher detection accuracy with fewer parameters and lower computational complexity than the benchmark model YOLOv8. The proposed approach achieves good accuracy on a variety of self-built and open-source road defect datasets, and the experimental results confirm the effectiveness of the YOLO-DGVG algorithm.
In conclusion, our work has made the following contributions:
- Our proposed YOLO-DGVG road surface crack detection network model performs well in identifying road surface cracks by striking a balance between model complexity and accuracy.
- Considering the notable geometric variability of road surface cracks, we created the C2f-DCNv2 module by integrating DCNv2 into the C2f module. By adapting the geometry of the receptive field to better cover targets, this module reduces missed detections caused by the varied geometric shapes of objects.
- We added the GSConv and VoVGSCSP modules to the model's neck network to decrease the number of parameters, reducing computational complexity without sacrificing detection accuracy.
- We proposed the GCH detection head module, which builds a grouped detection head by improving the module structure, minimizing model parameters and complexity while maintaining accuracy and speed.
- Our improved model was deployed on an edge computing device to detect static pavement crack images.
The remainder of this paper is organized as follows: Section 2 describes the related work; Section 3 gives a thorough explanation of the above-mentioned methods; Section 4 presents the datasets used in the experiments, experimental details, ablation studies, comparative experimental results, and the edge computing device deployment; and Section 5 discusses our conclusions and future work.
Related work
Traditional image processing methods have been used by several researchers in the past to identify cracks. To account for the brightness and geometric features of crack images, Zou et al.9 presented the CrackTree model, which combines local patterns with global viewpoints; by reducing output noise, this method improves the continuity of the identified cracks. A road surface crack detection system based on histograms of oriented gradients was developed by Kapela et al.10. The project was divided into two parts, data collection and image database analysis, with the latter using the histogram-of-oriented-gradients algorithm to identify cracks. Using grayscale transformation, median filtering, and image enhancement techniques, Qingbo et al.11 improved the processing algorithm, leveraging the characteristics of road surface crack images to increase accuracy. Using drones and a crack center-point approach, Lei et al.12 made it possible to quickly and accurately identify cracks in collected footage with little data. Kong and Li13 introduced an image-overlay-based non-contact crack detection technique that allows images to be superimposed for crack identification. Despite the wide range of conventional image-processing-based crack detection techniques, these methods often have limited robustness and perform poorly in complex crack scenarios.
With the development of computer vision and the introduction of convolutional neural networks (CNNs), the field of crack detection has seen tremendous progress. Li et al.14 presented a hybrid method for fracture identification that combines deep learning and digital image processing. An edge extraction strategy is used to improve the identification of fine-grained crack edges after the Faster R-CNN algorithm has been used to identify and categorize coarse fracture areas. This method demonstrates the potential of the two-stage detection model in crack detection, but it is computationally complex and difficult to deploy directly on edge devices. Two deep learning techniques were created by Kalfarisi et al.15 for crack segmentation and detection. The first approach combines Structured Random Forest Edge identification (SRFED) with Faster R-CNN for the first step of crack identification inside bounding boxes, followed by crack segmentation inside these boxes. Mask R-CNN is used in the second approach for both crack detection and segmentation tasks. While these methods excel in accuracy, their complex network structures and computational demands limit their application on resource-constrained edge devices. Majidifard et al.16 created a road condition index system using a dataset of 7237 photos of road defects. They developed a number of road condition indexes by using the YOLO model for fracture localization and U-Net for crack segmentation and quantification. However, the traditional YOLO model still has limitations when dealing with small targets and complex backgrounds. In order to further optimize the performance of the model, the researchers introduced the attention mechanism and the feature fusion module to improve the detection accuracy. Alavi et al.17,18 used frequency domain deep learning to offer a real-time concrete bridge deck fracture detection system. 
To achieve better detection results, this system uses a one-dimensional CNN and Long Short-Term Memory (LSTM) for image processing, and it also proposes a damage quantification approach to estimate the length of discovered cracks. By reducing data dimensionality and computation, this approach improves the model's efficiency and opens up the possibility of edge device deployment. To streamline all multi-layer perceptron (MLP) layers for the automated detection of long and complex road cracks, Guo et al.19 introduced the Crack Transformer (CT) model, which uses Swin Transformer as the encoder and decoder. This demonstrates the accuracy and robustness of Transformers for road crack detection; however, the computational complexity of the Transformer is high, which limits its application on edge devices. The CrackFormer model, which combines SegNet and attention mechanism modules, was proposed by Liu et al.20 and allows for accurate crack identification. For the purpose of detecting road defects, Fang Wan et al.21 presented YOLO-LRDD, a lightweight detection technique. This technique uses Shuffle-ECANet, a novel backbone network that maintains accuracy while reducing model size, making it appropriate for mobile deployment. A road damage segmentation model that combines the Pavement Defect Segmentation Capsule Network (PDS-CapsNet) and the Similar Feature Extraction Siamese Network (SFE-SNet) was suggested by Dong et al.22. SFE-SNet adapts to the dynamic appearance changes of road defects in each frame by extracting comparable features of the target and reference flows in road videos, while PDS-CapsNet eliminates superfluous data according to the target flow's low-order features. This model played a key role in an automatic road defect segmentation system for vehicular embedded deployment, enabling autonomous detection; however, the complexity of the network also increases the computational burden.
Zhu et al.23 used drones to gather 3151 photos of road problems, creating the UAPD dataset, which includes six different kinds of road defects. They proposed that road fault evolution can be better understood by using full-size drone-captured road defect photos for training and prediction. The potential of drones in road defect detection was demonstrated.
Although deep learning-based crack detection has proven significantly superior to traditional methods in accuracy and robustness, existing methods still face computational challenges when deployed on edge devices. Inspired by these studies, the YOLO-DGVG model aims to overcome the computational bottleneck of traditional CNN models on edge devices by optimizing the network structure and introducing efficient feature extraction modules. While inheriting the efficient detection mechanism of the YOLO series, YOLO-DGVG introduces deformable convolution (DCNv2) to enhance the detection of small targets. In addition, YOLO-DGVG draws on lightweight network design to reduce model complexity, making it more suitable for real-time operation on edge devices.
Proposed method
This study optimizes and improves upon the YOLOv824 model, proposing a lightweight road crack detection model based on deformable convolution, named YOLO-DGVG. The specific structure of the YOLOv8 model is shown in Fig. 1.
The YOLOv8 model mainly consists of three parts: the Backbone, the Neck, and the Head. The Backbone is primarily responsible for feature extraction of the target. The extracted features are then fused through the Neck, and finally, the detection results are outputted by the Head.
C2f-DCNv2 Module
In the YOLOv8 model, the C2f module fuses low-level and high-level feature maps. Low-level feature maps contain more detailed information but lack semantic and contextual information; in contrast, high-level feature maps contain richer semantic and contextual information but may lose some details. By fusing these two types of feature maps, the C2f module can utilize both detailed and semantic information across different scales, thereby improving the accuracy and robustness of object detection. Specifically, the C2f module connects low-level and high-level feature maps and adjusts their channel numbers and sizes through appropriate convolutional operations. This enables the transfer and fusion of information between feature maps at different levels, enhancing the object detection algorithm's ability to detect objects at various scales. Figure 2 shows the specific details of the C2f structure. However, traditional convolutional modules can only extract features from fixed locations. For the fine structural features of road surface cracks in complex backgrounds, feature extraction therefore faces certain difficulties and is prone to introducing interfering factors, resulting in missed detections, false alarms, and low detection accuracy for road surface cracks.
Incorporating deformable convolutions can mitigate some of the aforementioned challenges by enabling the network to adapt to varying target sizes and shapes through dynamic receptive field sizes and configurations. Deformable convolution is an improved version of the traditional convolution operation, designed to enhance the ability of convolutional neural networks (CNNs) to adapt to geometric deformations. It achieves this by introducing learnable offsets that dynamically adjust the sampling positions of the convolution kernel, thereby better capturing complex structures and shape variations in images. This approach not only boosts the precision and reliability of CNNs but also equips them to handle a broader spectrum of real-world scenarios. A visual representation of deformable convolutions is provided in Fig. 3. The initial iteration, DCNv125, introduced two innovative modules to bolster the CNN's capacity to model transformations: deformable convolution and deformable ROI pooling. These enhancements are predicated on refining spatial sampling locations within the network based on task-specific learned displacements, without the need for additional supervisory signals. However, the incorporated offsets could potentially introduce extraneous information, thereby impacting the accuracy of detection outcomes. Consequently, this research opts for DCNv226, an advanced version of DCNv1 designed to minimize the impact of irrelevant data. DCNv2 expands upon the learning of offsets for each sampling location and incorporates a weighting mechanism to discern regions of interest; if a sampling location does not fall within a region of interest, its corresponding weight is set to zero, as delineated in Eq. (1).
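For reference, the modulated deformable convolution of DCNv2 is commonly written as follows (a reconstruction from the DCNv2 paper for a kernel with K sampling locations; the exact notation of Eq. (1) may differ):

```latex
y(p) = \sum_{k=1}^{K} w_k \cdot x\left(p + p_k + \Delta p_k\right) \cdot \Delta m_k
```

Here \(p\) is the output location, \(p_k\) is the k-th fixed offset of the regular kernel grid, \(\Delta p_k\) is the learned offset, \(\Delta m_k \in [0, 1]\) is the learned modulation weight (zero for sampling locations outside regions of interest), and \(w_k\) is the kernel weight.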
Although the C2f module greatly enhances object detection algorithms' ability to detect objects of different sizes, it has trouble correctly identifying the thin, elongated structures of road cracks, which occupy a small percentage of the image and have complex morphologies. When detecting fine cracks, this leads to a high rate of false positives and false negatives. To overcome these challenges, we have introduced the C2f-DCNv2 module, which integrates DCNv2 (Deformable Convolutional Networks v2) into the C2f structure. This integration improves the model's ability to identify small targets and accounts more efficiently for the varied forms of cracks, allowing the model to capture the fine characteristics of cracks more accurately and thereby improving the accuracy of road crack identification. Figure 4 presents a detailed representation of the C2f-DCNv2 module.
Feature pyramid optimization
YOLOv8 employs a neck network structure combining FPN (Feature Pyramid Network)27 and PAN (Path Aggregation Network)28 to enhance the model’s multiscale feature fusion capabilities. FPN transfers deep semantic information to shallower layers through a top-down pathway, fusing it with high-resolution shallow features to construct a feature pyramid that combines high semantics and high resolution. This significantly improves the model’s detection performance for small objects while being computationally efficient and adding only a minimal number of parameters. PAN builds upon FPN by adding a bottom-up enhancement pathway, forming bidirectional feature fusion. This further optimizes information flow and gradient propagation, simultaneously conveying semantic and detailed information and enriching the feature representation. In traditional convolutional operations, spatial information is progressively transformed into channel information, and this process leads to partial loss of semantic information with each spatial compression of feature maps and channel expansion.
To address this issue, the traditional convolutional operations in the neck network are replaced with GSConv29 convolutional layers, which preserve the hidden connections between channels as much as possible while maintaining a low time complexity. The specific structure of GSConv is shown in Fig. 5.
The GSConv module is primarily composed of Conv, DSConv, Concat, and Shuffle modules. First, a Conv layer downsamples the input through a regular convolution operation. Then, DSConv performs a depthwise convolution, and the results of the two convolutions are concatenated together. Finally, the Shuffle operation interleaves the corresponding channels of the two branches. This allows the information generated by the regular convolution to permeate every part of the information generated by DSConv, thereby reducing model complexity while maintaining accuracy. The mathematical expression of the GSConv module is shown in Eq. (2).
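As a concrete illustration, the Conv → DSConv → Concat → Shuffle pipeline described above can be sketched in PyTorch roughly as follows. The channel split, the 5×5 depthwise kernel, and the SiLU activation here are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of a GSConv block: a standard convolution produces half the
    output channels, a depthwise convolution refines them, and a channel
    shuffle interleaves the two branches."""

    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        c_ = c2 // 2
        # Standard convolution branch (downsamples when s > 1)
        self.conv = nn.Sequential(
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        # Depthwise convolution branch (groups == channels)
        self.dwconv = nn.Sequential(
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x):
        x1 = self.conv(x)
        x2 = torch.cat((x1, self.dwconv(x1)), dim=1)
        # Channel shuffle: interleave channels of the two branches so that
        # dense-conv information permeates the depthwise-conv output
        b, c, h, w = x2.shape
        return x2.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```

A call such as `GSConv(16, 32)` maps a 16-channel feature map to 32 channels while spending far fewer multiply-adds in the second half of the block than a full convolution would.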
The GS bottleneck design, introduced based on GSConv and depicted in Fig. 6, aggregates the GSConv module into a one-shot design, resulting in the VoVGSCSP29 module, as shown in Fig. 7. The VoVGSCSP module, comprised of a GSConv module and a single Conv convolution linked through residual connections, extracts features through the cross structure of Conv and GSConv, fusing them with the output of the single Conv, and ultimately connecting to the output through Conv convolution. The VoVGSCSP module maintains enough accuracy while streamlining computation and network design in contrast to the C2f module. According to experimental data, the improved model maintains sufficient accuracy while significantly lowering computational and network structural complexity.
GCH module
In its design, YOLOv8 follows the approach of decoupling the classification and detection heads. This design includes target detection box filtering and loss computation. The TaskAlignedAssigner30 is used to determine the distribution of positive and negative samples during the loss computation, and positive samples are chosen based on a weighted sum of the classification and regression scores. Since the objectness branch has been removed, the loss computation is divided into two parts: classification and regression. Binary cross-entropy (BCE) loss is used in the classification branch, whereas Distribution Focal Loss (DFL)31 and CIoU loss functions are used in the regression branch. By splitting the heads, YOLOv8 creates prediction boxes while predicting regression coordinates and classification scores at the same time. The probability of an object's presence at each pixel is represented by the classification scores, which form a two-dimensional matrix; the displacement of the object's center from each pixel is represented by the regression coordinates, which form a four-dimensional matrix. Finally, the task alignment measure is calculated by YOLOv8 using the TaskAlignedAssigner, which combines classification scores with Intersection over Union (IoU) values. This measure lessens the effect of poor-quality prediction boxes while enabling the simultaneous optimization of classification and localization accuracy. Equation (3) describes IoU, a common measure in object detection. Figure 8 shows the design and complexity of YOLOv8's detection head.
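Since Eq. (3) defines the standard IoU metric, a minimal implementation for axis-aligned boxes might look like this:

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one pixel in each direction overlap on a 1×1 region, giving an IoU of 1/7.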
In the original model, the head network accounted for a large proportion of the model's parameters. To minimize the model parameters while maintaining speed and accuracy, the Group Conv Head (GCH) detection head module was constructed to replace the original detection head. The Conv-BN-SiLU (CBS) modules in the classification and detection head branches of YOLOv8 were merged, and grouped convolution32 was used to replace the original standard convolution operations; this strategy of combining grouped convolutions with module merging curtails the model's parameter count. The idea behind grouped convolution is to divide the input feature map and the convolutional kernels into g groups. As shown in Fig. 9, grouped convolution (Group Conv) convolves the input feature map with the convolutional kernels within each group. The input feature map has a size of H\(\times\)W\(\times\)c1, and there are c2 convolutional kernels in total; here h1 and w1 stand for the height and width of the convolutional kernel, and c1 and c2 for the number of channels in the input and output feature maps, respectively. Each convolutional kernel has a dimension of h1\(\times\)w1\(\times\)(c1/g), and the kernels are divided into g groups of c2/g kernels each. The parameter count of regular convolution is g times that of grouped convolution, as seen by comparing the parameter counts of the two types of convolution in formulae (4) and (5).
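The parameter comparison in formulae (4) and (5) can be checked with a short calculation. For a kernel of size h1×w1 mapping c1 input channels to c2 output channels, grouped convolution with g groups uses 1/g of the weights of a standard convolution (bias terms omitted for simplicity):

```python
def conv_weight_params(h1, w1, c1, c2, g=1):
    """Weight count of a (grouped) convolution: each of the c2 kernels
    sees only c1/g input channels."""
    assert c1 % g == 0 and c2 % g == 0, "channels must divide evenly into groups"
    return h1 * w1 * (c1 // g) * c2

standard = conv_weight_params(3, 3, 64, 128)      # standard convolution
grouped = conv_weight_params(3, 3, 64, 128, g=4)  # grouped, g = 4
```

With these illustrative channel counts, the standard convolution holds exactly four times the weights of its grouped counterpart, matching the g-fold reduction stated above.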
The Group Conv Head (GCH) detection head module is engineered to maintain the model's accuracy and speed, ensuring that performance remains unaffected despite the decrease in the number of parameters. This is achieved by merging the Conv-BN-SiLU (CBS) modules across the classification and detection head branches, thereby eliminating redundant parameters, while the grouped convolution operations serve to diminish computational complexity. The detailed architecture of the GCH is depicted in Fig. 10.
Results and analysis
This study proposes a road crack detection model named YOLO-DGVG based on the optimization and design of the YOLOv8 network structure. Firstly, in the Backbone part of the model, the original C2f module is replaced with the C2f-DCNv2 module to enhance the model’s feature extraction capability for road cracks. Secondly, in the Neck part, all Conv layers are replaced with GSConv convolutions, and the C2f module is replaced with the VoVGSCSP module. This not only improves the model’s accuracy but also reduces the number of model parameters and its complexity. Finally, in the Head part, the original detection head module is replaced with the GCH detection head module, which further reduces the number of model parameters while maintaining model accuracy. The complete structure of the YOLO-DGVG model is shown in Fig. 11.
Experimental environment
PyTorch served as the deep learning framework and Ubuntu 18 as the operating system for the experiments conducted to confirm the efficacy of the proposed approach. YOLOv8s was used as the base network model. The configuration of the experimental environment is detailed in Table 1.
All of the experiments were trained using the same hyperparameters. The precise hyperparameters utilized during training are displayed in Table 2.
Datasets
The open-source PID (Pavement Image Dataset)33, which includes pictures of the same road segments captured by two cameras (a top-down view and a wide-angle view), was utilized in this study. We selected 7237 wide-angle-view images covering nine types of pavement defects (Fig. 12). For this dataset, we applied data augmentation operations (no processing, random rotation, and increased brightness) in a 4:3:3 ratio; the augmented dataset is shown in Fig. 13. Table 3 displays the total number of each kind of road defect in the dataset. The images are split into training, validation, and test sets in a 7:1:2 ratio.
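A 7:1:2 split of this kind can be reproduced with a simple shuffled partition of the image list. This is a generic sketch (the paper does not describe its splitting script), with the seed and ratios as adjustable assumptions:

```python
import random

def split_dataset(paths, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle file paths deterministically and partition them into
    train / validation / test subsets according to the given ratios."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # seeded for reproducibility
    n_train = int(len(paths) * ratios[0])
    n_val = int(len(paths) * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])  # remainder goes to the test set
```

Applied to a list of 7237 image paths, this yields roughly 5065 training, 723 validation, and 1449 test images.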
Evaluation indicators
A set of evaluation metrics has been chosen in order to provide an unbiased assessment of the effectiveness of the road defect detection algorithm. The number of parameters measures the size and complexity of the model, while mAP (mean average precision), detailed in Eq. (6), is used to evaluate the model's accuracy.
The variable N in Eq. (6) represents the total number of categories considered. AP is the area under the precision–recall curve, plotted with precision on the vertical axis and recall on the horizontal axis. The notation mAP@0.5 denotes the mean average precision at an IoU threshold of 0.5. Recall is an indicator of the model's ability to identify positive samples; as further explained in Eq. (7), a higher recall value indicates a more accurate identification of positive cases.
In the context of performance metrics, TP denotes the true positives, which are the instances where the model correctly identifies positive samples. On the other hand, FN denotes the instances in which the model misidentifies positive data as negative. The model’s ability to recognize positive samples is gauged by the recall rate, which ranges from 0 to 1. A greater recall rate indicates that the model is more reliable in identifying real positive examples.
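In code, the recall of Eq. (7) and the class-averaged mAP of Eq. (6) reduce to straightforward calculations. Here the per-class AP values are assumed to have been computed beforehand from the precision–recall curves:

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): fraction of true positives recovered."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def mean_average_precision(per_class_ap):
    """mAP: mean of the per-class average-precision values (Eq. 6),
    where N = len(per_class_ap) is the number of categories."""
    return sum(per_class_ap) / len(per_class_ap)
```

For instance, a model that finds 80 of 100 real cracks has a recall of 0.8, regardless of how many false alarms it raises; precision captures the latter.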
Ablation experiments
This study used ablation experiments to verify the effectiveness of the improvement techniques. Ablation experiments evaluate each component's contribution to the model's overall performance, which is very helpful for refining the optimization methods and the model's network architecture. Building upon YOLOv8s as the baseline model, ablation experiments were performed by sequentially adding C2f-DCNv2, GSConv+VoVGSCSP to optimize the feature pyramid, and the GCH module, and the corresponding evaluation metrics were recorded. The ablation experiment data for this section is presented in Table 4.
The experimental data analysis indicates (Table 4) that the impact of each improved module on model performance shows differentiated characteristics: Firstly, the C2f-DCNv2 module increased mAP@0.5 by 1.2% while essentially keeping the number of Parameters unchanged. Secondly, optimizing the neck network with GSConv and VoVGSCSP led to a slight decrease in mAP@0.5 by 0.6%, but significantly reduced the number of Parameters. Lastly, the GCH detection head module not only increased Recall and mAP@0.5 by 0.3% and 1.6%, respectively, but also reduced the number of Parameters by 22.28%. The YOLO-DGVG model, by integrating the C2f-DCNv2, GSConv+VoVGSCSP, and GCH modules, outperformed the baseline model in terms of Recall, mAP@0.5, and Parameters. For specific precision comparisons, please refer to Fig. 14.
Performance experiments with other datasets
In 2022, Zhu et al. presented the Unmanned Aerial Vehicle Asphalt Pavement Damage dataset (UAPD), encompassing six distinct categories of pavement damage: Transverse Cracks (TC), Longitudinal Cracks (LC), Alligator Cracks (AC), Oblique Cracks (OC), potholes, and repair areas. This dataset was captured by drones at a specific height, which may introduce biases in perspective and lighting conditions. Although the dataset covers six common types of road damage, the sample size for certain categories (such as oblique cracks) may be relatively small, making the model less sensitive to these categories during training. Moreover, the images in the dataset are primarily from specific regions, which may limit the model's ability to generalize to road damage detection tasks in other areas. The Road Damage Dataset 2022 (RDD2022)34 compiles 47,420 road images from diverse locations including Japan, India, China, the United States, Norway, and the Czech Republic, focusing on four types of road damage: longitudinal cracks, transverse cracks, alligator cracks, and potholes. For the experiments, road damage data from China were chosen for model training to affirm the model's efficacy. Due to the large scale of the dataset, the annotation work may have been completed by multiple individuals, which could lead to inconsistencies in annotation standards and thereby affect the accuracy of model training. Certain types of damage (such as potholes) may be overrepresented in the dataset, while others (such as oblique cracks) may be relatively underrepresented; this imbalance could leave the model with insufficient detection capability for minority classes. The Self-LCD dataset, comprising a total of 398 images, was employed to assess the model's performance on real-world road imagery.
These images were captured at Northeast Electric Power University using an iPhone 13, with each image measuring 720×960 pixels, and they represent a variety of sample sets used to evaluate the model's capabilities. The dataset was captured by fixed equipment at a specific location, which may introduce biases due to equipment performance and shooting conditions. Additionally, the relatively small size of the dataset may lead to overfitting during model training, limiting its ability to generalize to broader real-world scenarios.
The UAPD, RDD2022, and Self-LCD datasets were used for training, validation, and testing to further confirm the improvements of the YOLO-DGVG model; the outcomes are displayed in Table 5. The data in the table further support the superiority of the improved model, showing that the proposed YOLO-DGVG has fewer parameters and greater accuracy. Figure 15 displays the results of testing the trained model. It is evident that the enhanced pavement crack detection model clearly outperforms the original YOLOv8s model.
Test results: (a) shows the benchmark model’s test results, and (b) shows our model’s test results. In the UAPD dataset, the pink box represents Transverse Crack, the red box represents Longitudinal Crack, and the yellow box represents Repair; in the RDD2022 dataset, the pink box represents Longitudinal Crack, the red box represents Transverse Crack, and the yellow box represents Repair; in the Self-LCD dataset, the red box represents Reflective Crack and the green box represents Longitudinal Crack.
Comparative experiments
Beyond verifying the effectiveness and generalization ability of the individual improvements, the overall performance of the model must also be evaluated. Therefore, this paper compares the improved YOLO-DGVG model with other mainstream open-source object detection models on the PID dataset. The experiments were conducted at an image resolution of 640×640, using Recall, mAP@0.5, and Parameters as the comparison metrics. Throughout the experiments, the training environment, experimental parameters, and dataset were kept strictly identical. The specific comparison results are shown in Table 6.
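Both Recall and mAP@0.5 hinge on matching predictions to ground-truth boxes by Intersection-over-Union, with 0.5 as the acceptance threshold. The following minimal sketch (not taken from the paper’s code) shows the IoU computation that underlies these metrics; the box format `(x1, y1, x2, y2)` is an illustrative assumption.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle; clamp to zero when boxes do not overlap
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction counts toward Recall and mAP@0.5 only when IoU >= 0.5
print(iou((0, 0, 10, 10), (2, 0, 12, 10)))  # ≈ 0.667, accepted at the 0.5 threshold
```

At stricter thresholds (e.g. mAP@0.5:0.95) the same function is simply evaluated against higher cutoffs.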
The YOLO-DGVG network model proposed in this study demonstrates superior detection accuracy. Compared with the two-stage object detection networks Cascade R-CNN and Double-Head Faster R-CNN, mAP@0.5 is improved by 19.1% and 16.2%, respectively. Compared with the single-stage object detection networks ATSS, RT-DETR, and RTMDet, mAP@0.5 is improved by 20.1%, 3.8%, and 15.1%, respectively. Compared with the recently introduced single-stage detection models YOLOv9s and YOLOv10s, mAP@0.5 is improved by 12.3% and 3.2%, respectively. These comparative experiments validate the performance advantages of the proposed YOLO-DGVG detection model for road crack detection.
The PID road crack dataset includes a total of 9 categories. The mAP values for these 9 categories were collected for the comparison models. It can be seen from Table 7 that the improved model YOLO-DGVG in this study achieved the best performance in 8 of the categories. In Fig. 16, this study uses a scatter plot to more intuitively demonstrate the superiority of the improved model for these 9 categories.
Some of the detection results from the comparative experiments are shown in Fig. 17. The YOLO-DGVG model proposed in this paper enhances the detection capabilities of the original YOLOv8s model. The original images in the figure are from the open-source PID dataset. Our model, YOLO-DGVG, achieves better detection accuracy than the other models. This is due to its adaptive adjustment to the shape of the targets: its flexible receptive field allows it to better extract features from complex, variable, and fine cracks in cluttered backgrounds, yielding accurate crack features for subsequent localization and classification.
Based on the results shown in the above figure, our model exhibits lower false negative rates compared to other models. Additionally, it excels in detecting fine cracks with relatively high precision. This further validates the contribution of the proposed YOLO-DGVG model in enhancing detection accuracy and lightweight performance.
Edge computing device deployment
The trained model is ported to a development board for deployment using the NVIDIA Xavier NX Developer Kit. The GPU is equipped with 384 NVIDIA CUDA cores and 48 Tensor Cores, combining high computing performance with low power consumption. For memory, the module carries 8 GB of 128-bit LPDDR4 with a bandwidth of up to 59.7 GB/s. The NVIDIA Xavier NX also provides HDMI and DP interfaces for external displays, as well as a Gigabit Ethernet interface for data communication with application modules. Table 8 lists the specifications of the developer kit.
This paper runs the YOLO-DGVG algorithm on this platform. The PyTorch deep learning framework is used to detect pavement crack images, outputting the class and precise location of each crack in the image. Finally, the detection results are shown on the HDMI display for further observation and analysis. Figure 18 depicts the hardware environment.
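Before inference on the device, each static image must be brought to the 640×640 network input size while preserving its aspect ratio (YOLO-style letterboxing). The sketch below illustrates this preprocessing step in NumPy under assumed conventions (gray padding value 114, nearest-neighbor resize to avoid an OpenCV dependency); it is not the paper’s deployment code.

```python
import numpy as np

def letterbox(img, size=640, pad_value=114):
    """Resize an HxWx3 image to size x size, preserving aspect ratio and
    padding the remainder with neutral gray, as in YOLO-style preprocessing."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbor resize via index sampling
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    # Center the resized image on a gray canvas
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

frame = np.zeros((720, 960, 3), dtype=np.uint8)  # same size as the Self-LCD images
blob = letterbox(frame)                          # (640, 640, 3), ready for the detector
```

The inverse of the same scale and padding offsets maps the predicted boxes back onto the original image before they are drawn on the HDMI display.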
The RDD2022 pavement crack dataset was used for training, and the effectiveness and performance of the improved algorithm on the embedded platform were analyzed experimentally. Several still images were imported on the NVIDIA Xavier NX and detected using the YOLO-DGVG and YOLOv8s algorithms; the specific test results are shown in Fig. 19, where (a) shows the deployed YOLOv8s detection results and (b) shows the deployed YOLO-DGVG detection results. The algorithm proposed in this article lays a solid foundation for subsequent real-time crack detection on edge computing devices.
Conclusions
YOLO-DGVG is a novel object detection network based on YOLOv8s, designed for the precise identification and classification of pavement cracks. It is capable of accurately detecting tiny cracks against a variety of complex backgrounds. The model was trained and evaluated using the PID, UAPD, RDD2022, and self-constructed datasets. Experimental results indicate that YOLO-DGVG outperforms the original YOLOv8s model in detecting pavement cracks. Moreover, YOLO-DGVG surpasses other well-known models in both detection performance and parameter count, as evidenced by the test results on the PID-Pavement-Image-Dataset: the highest mAP value achieved was 84.2%, with 8.64 million parameters. Additionally, the trained model was deployed on edge computing devices to detect static pavement crack images, laying the foundation for real-time detection by drones and other equipment. The experimental results indicate that YOLO-DGVG achieves a 1.6% improvement in detection accuracy for road cracks over the original YOLOv8s model. In road crack detection, even a small increase in accuracy can have a significant practical impact: more accurate detection in road maintenance and safety assessment helps identify potential hazards in a timely manner, enabling repair measures to be taken before accidents occur. From a technical standpoint, this improvement reflects the optimization of YOLO-DGVG in feature extraction and target localization. In particular, the introduction of DCNv2, GSConv, and the GCH module plays a crucial role. DCNv2 enhances the model’s adaptability to complex shapes and irregular cracks through deformable convolutional kernels, enabling the network to better capture the geometric features and texture information of cracks.
Meanwhile, GSConv and the GCH module improve the richness and accuracy of feature extraction while maintaining computational efficiency, thereby further enhancing the model’s detection performance. However, we also need to critically evaluate this 1.6% improvement in mAP. Although it does demonstrate the advantages of YOLO-DGVG to some extent, it is important to note that mAP is not the only indicator of model performance. For example, in practical applications, the model’s detection speed and robustness to different lighting and weather conditions are equally important. Compared with previous models in related works, YOLO-DGVG demonstrates significant advantages in both detection performance and the number of model parameters. For example, compared with the hybrid method of deep learning and digital image processing proposed by Li (Reference 14), it has fewer model parameters to adapt to the deployment on edge devices.
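To make the lightweighting mechanisms concrete, the sketch below illustrates a simplified GSConv (after Li et al.’s Slim-neck design: a standard convolution producing half the output channels, a cheap depthwise convolution on that result, concatenation, then a channel shuffle) and the parameter saving from grouped convolution that motivates the GCH head. Batch normalization and activations are omitted, and the channel sizes are illustrative assumptions, not YOLO-DGVG’s exact configuration.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Simplified GSConv: dense conv -> depthwise conv -> concat -> shuffle."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False)
        # Depthwise conv: one filter per channel, hence groups=c_half
        self.dwconv = nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False)

    def forward(self, x):
        x1 = self.conv(x)
        x2 = torch.cat((x1, self.dwconv(x1)), dim=1)
        b, c, h, w = x2.shape  # channel shuffle interleaves the two branches
        return x2.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

def param_count(m):
    return sum(p.numel() for p in m.parameters())

# Grouped convolution (the idea behind the GCH head): with groups=g,
# a 3x3 conv needs roughly 1/g of the parameters of its dense counterpart.
dense = nn.Conv2d(128, 128, 3, padding=1, bias=False)
grouped = nn.Conv2d(128, 128, 3, padding=1, groups=4, bias=False)
print(param_count(dense), param_count(grouped))  # 147456 36864
```

This 4× reduction per head convolution is consistent in spirit with the overall 22.28% parameter reduction reported in the ablation study, though the exact figures depend on where the grouped layers are placed.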
The YOLO-DGVG network will be further optimized in the future and combined with a segmentation model to identify and categorize pavement cracks, with the crack skeleton segmented to assess pavement damage. However, this integration may affect computational efficiency, as segmentation tasks typically require processing pixel-level detail, significantly increasing the model’s computational burden. To mitigate this, a multi-task learning framework could leverage prior information from the detection task to assist the segmentation task, thereby improving overall efficiency.
Data availability
The datasets generated and/or analysed during the current study are available in the UAPD repository, https://github.com/tantantetetao/UAPD-Pavement-Distress-Dataset. The datasets generated and/or analysed during the current study are available in the Pavement-Image-Dataset (PID) repository, https://github.com/Nan2020/PID-Pavement-Image-Dataset/tree/master. The datasets generated and/or analysed during the current study are available in the RDD2022 repository, https://github.com/sekilab/RoadDamageDetector. The fourth dataset used and/or analysed during the current study is available from the corresponding author on reasonable request.
References
Ai, D., Jiang, G., Lam, S.-K., He, P. & Li, C. Computer vision framework for crack detection of civil infrastructure–a review. Eng. Appl. Artif. Intell. 117, 105478. https://doi.org/10.1016/j.engappai.2022.105478 (2023).
Guo, F., Qian, Y., Liu, J. & Yu, H. Pavement crack detection based on transformer network. Autom. Constr. 145, 104646. https://doi.org/10.1016/j.autcon.2022.104646 (2023).
Li, R. et al. Automatic bridge crack detection using unmanned aerial vehicle and faster R-CNN. Constr. Build. Mater. 362, 129659. https://doi.org/10.1016/j.conbuildmat.2022.129659 (2023).
Ogawa, S., Matsushima, K. & Takahashi, O. Crack detection based on gaussian mixture model using image filtering. In: 2019 International Symposium on Electrical and Electronics Engineering (ISEE), 79–84, https://doi.org/10.1109/ISEE2.2019.8921060 (2019).
Hsieh, Y.-A. & Tsai, Y. J. Machine learning for crack detection: Review and model performance comparison. J. Comput. Civ. Eng. 34, 04020038. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000918 (2020).
Cha, Y.-J., Choi, W., Suh, G., Mahmoudkhani, S. & Büyüköztürk, O. Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types. Comput. Aided Civ. Infrastr. Eng. 33, 731–747. https://doi.org/10.1111/mice.12334 (2018).
Maeda, H., Sekimoto, Y., Seto, T., Kashiyama, T. & Omata, H. Road damage detection and classification using deep neural networks with smartphone images. Comput. Aided Civ. Infrastr. Eng. 33, 1127–1141. https://doi.org/10.1111/mice.12387 (2018).
Qiu, Q. & Lau, D. Real-time detection of cracks in tiled sidewalks using yolo-based method applied to unmanned aerial vehicle (UAV) images. Autom. Constr. 147, 104745. https://doi.org/10.1016/j.autcon.2023.104745 (2023).
Zou, Q., Cao, Y., Li, Q., Mao, Q. & Wang, S. Cracktree: Automatic crack detection from pavement images. Pattern Recognit. Lett. 33, 227–238. https://doi.org/10.1016/j.patrec.2011.11.004 (2012).
Kapela, R. et al. Asphalt surfaced pavement cracks detection based on histograms of oriented gradients. In: 2015 22nd International Conference Mixed Design of Integrated Circuits & Systems (MIXDES), 579–584, https://doi.org/10.1109/MIXDES.2015.7208590 (2015).
Qingbo, Z. Pavement crack detection algorithm based on image processing analysis. In 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 01, 15–18, https://doi.org/10.1109/IHMSC.2016.96 (2016).
Lei, B., Wang, N., Xu, P. & Song, G. New crack detection method for bridge inspection using UAV incorporating image processing. J. Aerosp. Eng. 31, 04018058. https://doi.org/10.1061/(ASCE)AS.1943-5525.0000879 (2018).
Kong, X. & Li, J. Non-contact fatigue crack detection in civil infrastructure through image overlapping and crack breathing sensing. Autom. Constr. 99, 125–139. https://doi.org/10.1016/j.autcon.2018.12.011 (2019).
Li, C. et al. Tunnel crack detection using coarse-to-fine region localization and edge detection. Wiley Interdisciplinary Rev. Data Mining Knowl. Discovery 9, e1308. https://doi.org/10.1002/widm.1308 (2019).
Kalfarisi, R., Wu, Z. Y. & Soh, K. Crack detection and segmentation using deep learning with 3d reality mesh model for quantitative assessment and integrated visualization. J. Comput. Civ. Eng. 34, 04020010. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000890 (2020).
Majidifard, H., Adu-Gyamfi, Y. & Buttlar, W. G. Deep machine learning approach to develop a new asphalt pavement condition index. Constr. Build. Mater. 247, 118513. https://doi.org/10.1016/j.conbuildmat.2020.118513 (2020).
Zhang, Q., Barri, K., Babanajad, S. K. & Alavi, A. H. Real-time detection of cracks on concrete bridge decks using deep learning in the frequency domain. Engineering 7, 1786–1796. https://doi.org/10.1016/j.eng.2020.07.026 (2021).
Zhang, Q. & Alavi, A. H. Automated two-stage approach for detection and quantification of surface defects in concrete bridge decks. In: Nondestructive Characterization and Monitoring of Advanced Materials, Aerospace, Civil Infrastructure, and Transportation XV, vol. 11592, 108–117, https://doi.org/10.1109/COMPSAC54236.2022.00289 (SPIE, 2021).
Guo, J.-M., Markoni, H. & Lee, J.-D. Barnet: Boundary aware refinement network for crack detection. IEEE Trans. Intell. Transp. Syst. 23, 7343–7358. https://doi.org/10.1109/TITS.2021.3069135 (2022).
Liu, H., Miao, X., Mertz, C., Xu, C. & Kong, H. Crackformer: Transformer network for fine-grained crack detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3783–3792, https://doi.org/10.1109/ICCV48922.2021.00376 (2021).
Wan, F. et al. YOLO-LRDD: A lightweight method for road damage detection based on improved YOLOv5s. EURASIP J. Adv. Signal Process. 2022, 98. https://doi.org/10.1186/s13634-022-00931-x (2022).
Dong, J. et al. Automatic damage segmentation in pavement videos by fusing similar feature extraction siamese network (SFE-SNet) and pavement damage segmentation capsule network (PDS-CapsNet). Autom. Constr. 143, 104537. https://doi.org/10.1016/j.autcon.2022.104537 (2022).
Zhu, J. et al. Pavement distress detection using convolutional neural networks with images captured via UAV. Autom. Constr. 133, 103991. https://doi.org/10.1016/j.autcon.2021.103991 (2022).
Ultralytics. YOLOv8. GitHub Repository https://github.com/ultralytics (2023).
Dai, J. et al. Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) https://doi.org/10.1109/ICCV.2017.89 (2017).
Zhu, X., Hu, H., Lin, S. & Dai, J. Deformable convnets v2: More deformable, better results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR.2019.00953 (2019).
Lin, T.-Y. et al. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR.2017.106 (2017).
Li, H., Xiong, P., An, J. & Wang, L. Pyramid attention network for semantic segmentation. arXiv preprint arXiv:1805.10180. https://doi.org/10.48550/arXiv.1805.10180 (2018).
Li, H. et al. Slim-neck by gsconv: A better design paradigm of detector architectures for autonomous vehicles. arXiv preprint arXiv:2206.02424. https://doi.org/10.48550/arXiv.2206.02424 (2022).
Feng, C., Zhong, Y., Gao, Y., Scott, M. R. & Huang, W. Tood: Task-aligned one-stage object detection. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 3490–3499, https://doi.org/10.1109/ICCV48922.2021.00349 (IEEE Computer Society, 2021).
Li, X. et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inform. Process. Syst. 33, 21002–21012 (2020).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Commun. ACM 60, 84–90. https://doi.org/10.1145/3065386 (2017).
Majidifard, H., Jin, P., Adu-Gyamfi, Y. & Buttlar, W. G. Pavement image datasets: A new benchmark dataset to classify and densify pavement distresses. Transp. Res. Record 2674, 328–339. https://doi.org/10.1177/0361198120907283 (2020).
Arya, D. et al. Global road damage detection: State-of-the-art solutions. In: 2020 IEEE International Conference on Big Data (Big Data), 5533–5539, https://doi.org/10.1109/BigData50022.2020.9377790 (2020).
Cai, Z. & Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1483–1498. https://doi.org/10.1109/TPAMI.2019.2956516 (2021).
Wu, Y. et al. Rethinking classification and localization for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.48550/arXiv.1904.06493 (2020).
Zhang, S., Chi, C., Yao, Y., Lei, Z. & Li, S. Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.48550/arXiv.1912.02424 (2020).
Zhao, Y. et al. Detrs beat yolos on real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16965–16974, https://doi.org/10.1109/CVPR52733.2024.01605 (2024).
Lyu, C. et al. Rtmdet: An empirical study of designing real-time object detectors, https://doi.org/10.48550/arXiv.2212.07784 (2022).
Wang, C.-Y., Yeh, I.-H. & Mark Liao, H.-Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision, 1–21, https://doi.org/10.48550/arXiv.2402.13616 (Springer, 2025).
Wang, A. et al. Yolov10: Real-time end-to-end object detection, https://doi.org/10.48550/arXiv.2405.14458 (2024).
Acknowledgements
This research was supported by National Natural Science Foundation of China (approval number: 52106080); Jilin Province Science and Technology Development Plan Project (approval number: YDZJ202401640ZYTS); Jilin Provincial Department of Education Science and Technology Research Project (approval number: JJKH20230135KJ); Jilin City Science and Technology Innovation Development Plan Project (approval number: 20240302014); Northeast Electric Power University Teaching Reform Research Project (approval number: J2427).
Author information
Authors and Affiliations
Contributions
Zhuang Li: Conceptualization, Methodology, Validation; Junjie Yang: Writing, Review and Editing; Heqi Wang: Formal Analysis, Data Curation; Xingcan Li: Conceptualization, Preparation of the First Draft; Dan Li: Methodology, Validation; Xinhua Wang: Formal Analysis.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, Z., Yang, J., Wang, H. et al. Lightweight pavement crack detection model for edge computing devices. Sci Rep 15, 38179 (2025). https://doi.org/10.1038/s41598-025-22092-1
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-22092-1
Keywords
This article is cited by
- YOLO11-WLBS: an efficient model for pavement defect detection. Scientific Reports (2026)