Introduction

As a key transportation infrastructure, the road network is of strategic importance for national economic development. However, during the service period, road surface damage is prone to cracks, potholes, and other damages due to the combined effects of vehicle loads and environmental factors, which seriously affect road performance and safety. Therefore, road surface damage detection technology plays an irreplaceable role in road preventive maintenance and safety assurance1.

Traditional manual inspection methods have limitations of high cost, poor efficiency, and high subjectivity, and the implementation process affects traffic flow and safety hazards2. Therefore, the development of intelligent automatic road surface damage detection technology has an important practical significance for improving detection efficiency, reducing maintenance costs, and ensuring traffic safety. Recently, the development of deep learning technology has provided a new technical paradigm for intelligent pavement detection. Currently, deep learning has made breakthroughs in many fields3,4,5, and in the field of computer vision, which has formed three core task systems of image classification6, object detection7, and semantic segmentation8. This study focuses on the problem of object detection in road surface damage detection and proposes a novel network architecture for road surface damage object detection.

Object detection technology can synchronize the localization and recognition function of the target object, and has been widely researched and applied in road surface damage detection. According to the differences in the recognition process, object detection models can be categorized into two-stage models and one-stage models. Among the two-stage models, the most representative ones include classical architectures such as R-CNN9, Fast R-CNN10, and Faster R-CNN11. Based on these models, researchers have carried out several innovative works. For example, Cha et al.12 proposed a multi-damage detection method based on Faster R-CNN, which realized the simultaneous detection of five different types of damage. Ju et al.13 developed a CrackDN network model for the crack detection task, which was improved based on Fast R-CNN. Li et al.14 utilized the R-CNN model to successfully achieve effective detection of six road surface damages. Sun et al.15 proposed a pavement crack detection network using different networks such as VGG1616, ZFNet17, and Resnet5018 as the backbone and optimizing the aspect ratio of the crack candidate frame. Hascoet et al.19 systematically investigated the road damage detection method of Faster R-CNN and deeply discussed the impact of different depth backbones and multiple testing techniques on the detection performance. However, such algorithms generally suffer from the inherent defect of slow inference speed, which limits their application in practical engineering to some extent.

One-stage detection models effectively overcome the limitation of slow inference speed. Although early one-stage models have a gap in detection accuracy compared with two-stage models, the detection accuracy of one-stage models has improved significantly today. Among them, the YOLO series of models20,21,22,23, as a representative network that combines inference speed and detection accuracy, has attracted wide attention in road surface damage detection. Specifically, Zhang et al.24 realized road surface damage detection based on UAV-collected data by improving YOLOv3. Xiang et al.25 innovatively combined Transformer with YOLOv5 to enhance the network model’s ability to extract contextual information about cracks. Diao et al.26, based on YOLOv5, designed a lightweight road damage detection model LE-YOLOv5. Guo et al.27 used a lightweight backbone instead of YOLOv5’s backbone and introduced the attention mechanism. Wang et al.28 proposed an improved YOLOv8 model, which achieves the dual goals of accuracy enhancement and lightweight by optimizing the network modules. In addition, in various road surface damage detection competitions, numerous researchers29,30,31,32 made innovative improvements based on the YOLO model, which significantly improved the model’s detection accuracy of complex road surface damage. These research works fully demonstrate the great potential of one-stage inspection models in practical applications.

Although significant progress has been made in road surface damage detection, due to the wide distribution, large number of damages, and complex detection environment, the existing models are mostly limited to a single scenario, which lacks robustness and generalization ability. Moreover, tiny damages (e.g., cracks 1–2 pixels wide and potholes less than 20 pixels wide) are prone to insufficient feature extraction and feature loss in traditional models, and at the same time, they are susceptible to the interference of noisy features, which leads to high leakage and misdetection rates. In addition, considering the real-time requirements of actual detection scenarios, the computational efficiency of the model still needs to be improved. To address the challenges, a novel road surface damage object detection model (RSDD) is proposed in this study, based on the structural characteristics of road surface damage. Overall, the contributions of this study include:

  1. (1)

    To address the lack of a large-scale image dataset of road surface damage, this study conducted comprehensive image acquisition of various types of road surfaces in a variety of climatic regions, using multiple camera angles and covering different weather conditions and periods. Through a rigorous data collection and annotation process, a total of 10,440 road surface images were acquired, and the refined annotation of 36,579 road surface damage areas was completed. Compared with the publicly available dataset, the scenarios in this dataset are more complex, including more weather conditions and more occluding interfering objects, and involve more fine-grained and small-scale damage objects, which provides data support for the research of road surface damage detection algorithms.

  2. (2)

    Aiming at the leakage and misdetection problems in road surface damage detection, especially the challenges of insufficient characterization, feature loss, and noise interference in tiny damage extraction, this study proposes a novel feature extraction backbone based on the structural properties of the damage. Enhancing the model’s understanding of the road surface detection environment while improving the extraction of tiny damage features effectively improves the detection accuracy of road surface damage, while reducing the number of parameters and computation of the model.

  3. (3)

    The variability of feature information at different stages in the feature pyramid generated by the backbone is addressed, as well as the key issues faced in multi-scale damage object detection. We propose a neck network that combines feature selection and feature fusion to achieve more efficient feature fusion, and a multi-scale detection head is used to realize effective detection of multi-size damage. Experiments demonstrate that the method effectively optimizes the detection performance of the model.

Road surface damage detection network

Figure 1 illustrates the general structure of the RSDD, which consists of a backbone, a neck, and a detection head. Among them, the backbone is enhanced by dual scale convolutional downsampling module (DSCD), multi-strategy feature extraction module (MFE), and sensory field enhancement module (SPPF)33 effectively avoids the problems of feature information loss and feature extraction insufficiency, extracts rich detail information, semantic information and contextual information, and realizes the full understanding of the road environment and damage. The feature fusion neck enhances the information of different stages of the feature maps output from the backbone and enhances the expressive ability of each feature map in the output feature pyramid. Finally, multi-scale detection heads are used on the output feature pyramid for multi-size road damage detection.

Fig. 1
Fig. 1The alternative text for this image may have been generated using AI.
Full size image

The structure of RSDD.

Backbone for feature extraction

The backbone consists of three modules, whose structure is shown in Fig. 2, and the detailed implementation is shown in Table 1. Overall, the backbone consists of four stages, stage 1 consists of two DSCD modules and one MFE_SCR module. This stage is used for initial feature extraction to capture primary features in the input image that are essential for subsequent more advanced feature extraction. Stages 2 and 3 have the same structure, both consisting of a DSCD module and an MFE_PCR module. Stage 4 consists of a DSCD module, an MFE_PCR module, and an SPPF module. Stages 2, 3, and 4 utilize a multi-branch-multi-scale structure to capture multiple sensory field features in the input feature maps, allowing the backbone to extract rich detail, contextual, and global information. The input is downsampled at the beginning of each stage, and except for stage 1, which downsamples the input image twice, all other stages downsample the input feature map once, and the corresponding downsampling multiples of each stage are 4, 8, 16, and 32 times, respectively. Moreover, the channels of the feature maps increase sequentially after the completion of downsampling in each stage, and the channels of the feature maps extracted from the four stages of the backbone are 64, 128, 256, and 512 in that order.

Fig. 2
Fig. 2The alternative text for this image may have been generated using AI.
Full size image

The overall structure of the backbone.

Table 1 Backbone realization details.

Double scale convolution downsampling module

Downsampling is an essential component of the backbone, and the most widely used downsampling methods are pooling and convolution. Since the widths of cracks are usually smaller, the boundaries of potholes are fuzzy, and there are instances of small sizes, this leads to the fact that the feature information of road damage is easy to be lost during the traditional downsampling operation. Moreover, the distribution of cracks is widely spread, the proportion of large-size potholes in the image is also large, and the sensory field of using a 2 × 2 pooling or a 3 × 3 convolution is too small to obtain enough contextual information features, leading to insufficient extraction of features. Convolution using a large convolution kernel can effectively expand the sensory field, but simply utilizing a large kernel convolution can lead to the loss of tiny damage features.

Therefore, this study proposes a series-connected DSCD module, which is applied in down-sampling operation to simultaneously acquire both long-distance-dependent and short-distance-dependent information on road damage features and to efficiently acquire the features of the damage region. As shown in Fig. 3, DSCD adds a series of k × k depthwise convolution (s = 1) to the traditional 3 × 3 convolution (s = 2), where the value of k is greater than 3. The 3 × 3 convolution can obtain short-distance-dependence information to form the feature maps with a smaller sensory field while down-sampling. The k × k depthwise convolution uses the convolution of a larger convolution kernel on the feature map with a smaller sensory field to obtain the long-distance-dependence information of the road damage to get the feature map with a larger sensory field and does not bring too much computational cost. Then, the different information feature maps are channel connection and the information interactions performed using channel shuffle. In DSCD, k × k depthwise convolution can effectively expand the sensory field, but the increase in convolution kernel size will lead to an increase in the computational complexity of the network and the risk of overfitting. Therefore, in this study, an appropriate depthwise convolution kernel size k is selected according to the requirements of the road surface damage detection task.

Fig. 3
Fig. 3The alternative text for this image may have been generated using AI.
Full size image

Structure of DSCD.

Multi-strategy feature extraction module

By downsampling, the sensory field can be increased and the calculation cost can be reduced, but the spatial resolution, spatial information, and part of the detail information of the feature map are also reduced. In the road damage detection task, it is difficult to detect the damage accurately due to the high background noise and low contrast of the pavement. Therefore, a feature information extraction module will follow after each stage of downsampling of the network model is required. For the structural characteristics of cracks and potholes, a parallel multi-scale convolution residual module (PCR) is designed. In this module, multiple parallel convolutions are utilized for feature extraction, and an efficient residual feature extraction method is used. This module significantly enhances the capability of the model to extract multi-scale sensory field information, enabling the model to better understand object damage from multiple contextual information.

Figure 4b shows the specific structure of PCR, where an efficient two-step method is used to extract multiple contextual information inside the residuals. First, a 1 × 1 standard convolutional layer is used to implement the initial feature extraction and generate concise region feature maps. In the second step, 3 × 3 convolution with multiple dilation rates (d) is used to feature extraction with multiple sensory fields for different region features, respectively. After extracting the multi-scale sensory field features, all the features are channel-connected and subjected to a 1 × 1 pointwise convolution process to realize the information interaction in the channel direction and generate the final residuals. Finally, a more robust and comprehensive output feature is constructed by adding residuals to the input, which contains a rich set of detailed and contextual features.

Fig. 4
Fig. 4The alternative text for this image may have been generated using AI.
Full size image

Structure of the multi-strategy feature extraction module. (a) MFE; (b) PCR; (c) SCR.

However, the over-reliance on large-scale sensory fields to obtain more contextual information exceeds the practical requirements. Therefore, it is necessary to reasonably design the sensory fields of each stage module to optimize the feature extraction efficiency. To this end, this study synchronously designs the serial convolution residual module (SCR), whose structure, shown in Fig. 4c, the module utilizes a single branch serial convolution structure. Specifically, it uses two 3 × 3 convolutions for feature extraction and generates residuals, which are then combined with the input. Compared with the PCR module, these two modules have different sizes of sensory fields and are applied to the shallow and deep layers of the backbone, respectively.

In addition, to realize the lightweight of the network model, this study, from the perspective of network structure design, divides the input features into two for different processing. Figure 4a demonstrates the specific structure of MFE, firstly, a channel split is used to divide the input equally into two features of the same size, half of which is subjected to feature extraction by PCR or SCR, and then the extracted feature maps are concatenated with the original input in the channel direction. Finally, a 1 × 1 convolution is used to interact with the channel information and realize the compression of the channel dimension. This method significantly improves the efficiency of feature extraction and speeds up model training and inference. The MFE is named MFE_PCR and MFE_SCR according to the feature extraction module.

At the shallow stage of the backbone, the input feature maps are larger and contain more detailed information. At this point, the feature extraction of MFE_SCR is more effective than MFE_PCR. As the network deepens, the resolution of the extracted features will decrease step by step, and the feature information also changes from detail information to semantic information. To ensure efficient feature extraction efficiency, PCR with different dilation rates is used at other stages of the backbone to extract feature information. Deeper networks are also more receptive to larger sensory fields, as semantic enhancement allows the convolution to make larger spatial connections. Therefore, MFE_SCR is used in stage 1 of backbone and MFE_PCR is used for feature extraction in the last three stages of the network. Where the dilation rate (x, y, z) of PCR is 1, 2, 3 in stage 2. In stages 3 and 4, the dilation rate of PCR is 1, 3, 5 respectively. After research, it has been found that it is difficult for convolution to directly establish connections between spatially distant information, and long-distance connections need the help of short-distance connections, and small-size sensory fields are important in MFE_PCR. Therefore, the convolution branch in MFE_PCR with a dilation rate of 1 has five times as many output channels as the other branches.

Sensory field enhancement module

To extract the global information, the sensory field enhancement module uses SPPF, whose specific structure is illustrated in Fig. 5. Firstly, the feature channel is compressed by a 1 × 1 convolution, which is usually used with a compression ratio of 2. The compressed feature maps are then subjected to three series of pooling operations with the same stride, and each pooling operation extracts feature information that is different from the sensory field. Finally, the pooling results of different sensory fields are fused by channel concatenation and convolution operations to avoid information conflicts caused by simple splicing to aggregate feature representations at different scales. The SPPF effectively improves the global information extraction ability of the backbone, enabling the RSDD model to provide a more comprehensive and deeper understanding of the pavement environment and road surface damage.

Fig. 5
Fig. 5The alternative text for this image may have been generated using AI.
Full size image

Structure of the SPPF.

Feature fusion neck

After the backbone extracts the features of different scales, it is also necessary to construct a feature fusion neck to realize feature enhancement. The neck includes a feature selection module (FSM) and a feature fusion module (FFM), in which the FSM applies different attention mechanisms to select the different scale features extracted by the backbone, and the FFM mainly realizes the information interaction between the different scale features through the operations of downsampling, upsampling, channel connection, and feature fusion basic block (FFB). In this study, the features C2, C3, and C4 extracted from stage 2, stage 3, and stage 4 of the backbone are utilized for selection and fusion, and the final output is a feature pyramid containing several scales.

Feature selection module

To achieve more efficient feature fusion, C2, C3, and C4 features are selected through FSM respectively. By selecting and extracting more significant road surface damage features and applying them to the feature fusion process, the interference information in the background can be effectively suppressed, reducing the impact of background noise on the accuracy of road surface damage detection. The FSM in this study is realized by multiple attention mechanisms, and the attention used in different stages of the backbone varies because the feature maps extracted in different stages focus on different information.

Channel attention

C4 as the deepest feature has the smallest resolution and the richest semantic information, in which the features on each channel can be regarded as a specific category of responses, each of which affects the final semantic prediction to a different extent. Therefore, efficient channel attention (ECA)34 is employed to efficiently capture the relationships among the channels in a feature, thus enhancing the characterization of C4. As shown in Fig. 6, ECA obtains the global average of each channel through global average pooling (GAP), and then generates the channel weights through a 1 × 1 1D convolutional layer. The k in the figure represents the coverage of cross-channel interactions, and k is adapted indeed by the mapping of the channel dimensions. Finally, the features are normalized by a scaling factor (σ) to generate weights to be applied to each channel of the input, enabling the weighting of features from different channels.

Fig. 6
Fig. 6The alternative text for this image may have been generated using AI.
Full size image

Structure of the ECA.

Mixed attention

C3 as an intermediate feature contains both considerable semantic and detail information, which can be efficiently selected by the TripleAttention (TA)35 for both channel features and spatial features. Z-pooling in Fig. 7 reduces the tensor in the channel dimension to two dimensions, performs average pooling and maximum pooling on the tensor in the channel direction, respectively, and connects the obtained features in the channel direction. This operation preserves the rich representation of the actual tensor while reducing the depth and lightening the calculation. As shown in Fig. 7, input a C × H × W feature map. In a branch of the TA, establish the relationship between dimension H and dimension C. First, the feature map is rotated 90° counterclockwise on the H-axis, and the shape of the rotated feature map is changed to W × H × C. Then a 2 × H × C feature is obtained by Z-pooling, subsequently, a 1 × H × C attention weight is generated by k × k standard convolution, normalization process, and activation function. Finally, the weights are weighted with the input and rotated 90° clockwise on the H-axis. In another branch, the relationship between dimension C and dimension W is established, the input is rotated 90° counterclockwise on the W-axis and then processed by Z-pooling, standard convolution, normalization, and activation function to obtain a 1 × C × W attentional weight, which is weighted to the input, and then it is rotated 90° clockwise along the W-axis. In the third branch of TA, an information interaction between the H and W dimensions is established.

Fig. 7
Fig. 7The alternative text for this image may have been generated using AI.
Full size image

Structure of the TA.

Spatial attention

C2 contains rich detail information, and through fully utilizing the shallow detail information can achieve effective fine-grained crack and small-scale pothole detection. However, only limited context information near the damage object is often considered in C2, and without reference to sufficient long-range context, it is possible that fine-grained but large-distributed damage objects such as cracks and fuzzy-boundary damage objects such as potholes may be incorrectly detected. Moreover, the required long-range contexts may vary from one damage to another, requiring different scales of sensory fields to be accurately localized and identified. This study addresses this problem through the large selective kernel attention (LSKA) mechanism36, which consists of a large kernel convolution and a space kernel selection mechanism, as shown in Fig. 8. The large convolution kernels and the expansion rate ensure that the sensory field is sufficiently expanded, and multiple feature maps of the large sensory field are acquired by successive convolutions in LSKA. The space kernel selection mechanism automatically and adaptively assigns convolutional kernels to different objects based on multi-scale features, aiming to enhance the network’s ability to capture the space contextual information associated with damaged objects. The mechanism uses a spatial selectivity strategy to filter out the most discriminative spatial regions from the feature maps generated by the multi-scale large convolutional kernel. Specifically, the feature maps of different sensory fields are first channel-concatenated, and then average pooling and maximum pooling in the channel directions are applied to extract spatial relationships efficiently. To realize the information exchange between different spatial descriptive information, the pooled spatial ensemble features are concatenated in the channel direction, and the ensemble features are transformed into separate spatial attention maps of different sensory field feature maps using convolutional layers and sigmoid activation functions. Then, they are weighted with the feature maps of different sensory fields and fused by point-wise summation to obtain features containing spatial attention. Finally, the initial input is elementwise multiplied with the spatial attention features to get the final output.

Fig. 8
Fig. 8The alternative text for this image may have been generated using AI.
Full size image

Structure of the LSKA.

Feature fusion module

The features after feature selection have better characterization ability, but features with different resolutions have different information, to achieve an excellent detection effect, it is necessary to fuse and communicate the information of different features. However, there are semantic gaps between feature maps at different scales, and a simple fusion approach achieves very limited improvements. Therefore, the main body of feature fusion in this study adopts both top-down and bottom-up feature fusion architectures, as illustrated in Fig. 9, where high-level semantic information is first propagated into shallow features through the top-down feature fusion path to enhance the type recognition ability of all features. Then, the detailed information such as texture and edges is transferred to the deeper features through the bottom-up feature fusion path, which further enhances the localization ability and the detection ability of tiny damages in the whole feature hierarchical structure. In the process of information transfer, FFB is very important, which determines the effect of feature information fusion, for the study of this part of the structure we experimented with six structures as in Fig. 10. The structure d was finally determined as the final FFB, which first goes through a 1 × 1 convolution layer for channel information interaction, and then through a 3 × 3 convolution so that different information can be further communicated.

Fig. 9
Fig. 9The alternative text for this image may have been generated using AI.
Full size image

Structure of the FFM. (a) Top-down feature fusion; (b) Bottom–up feature fusion; (c)FFM.

Fig. 10
Fig. 10The alternative text for this image may have been generated using AI.
Full size image

FFB for different structures.

Multi-scale decoupled detection head

The detection head in this study performs detection in a decoupled manner so that the classification and regression tasks can focus on their respective features separately, effectively improving the performance of road surface damage detection. As shown in Fig. 11, two independent branches are designed to perform classification and regression separately, and the respective information is extracted by two 3 × 3 convolutions and one 1 × 1 convolution on the same output feature map, and finally the regression loss function and classification loss function are calculated separately. This design allows the model to better learn the features and laws of different tasks improve its ability to adapt to complex scenarios, and enhance the flexibility of the model during training and optimization. However, while decoupled detection heads separate tasks, they still share some features during the early stages of feature extraction to balance task independence and feature utilization efficiency. In addition, we perform detection on multiple scales of feature maps output from the neck to realize the detection of different sizes of road surface damage.

Fig. 11
Fig. 11The alternative text for this image may have been generated using AI.
Full size image

Structure of the decoupled detection head.

Dataset and evaluation indicators

Dataset

In this study, we use our own collected dataset for model training, as shown in Fig. 12, which contains road images from nine cities in China and Cambodia, involving highways, urban roads, and township roads. It includes a total of 10,440 road images covering diverse shooting angles, weather conditions, lighting environments, and pavement types, and is labeled with 36,579 instances of road surface damage. Figure 13 illustrates the manual annotation process of the collected road images using the LabelImg image annotation software. First, the target is selected by drawing a bounding box, and then the selected target is classified. The dataset records a total of four types of road surface damage: longitudinal cracks (D00), transverse cracks (D10), alligator cracks (D20), and potholes (D40). Compared to the publicly available dataset, the scenarios in this dataset are more complex, richer in interferences, and involve more tiny damage. Figure 14 summarizes the distribution of various detection objects in the dataset, and the final dataset is grouped into training, validation, and testing sets with a ratio of 7:2:1.

Fig. 12
Fig. 12The alternative text for this image may have been generated using AI.
Full size image

Partial presentation of this study’s dataset.

Fig. 13
Fig. 13The alternative text for this image may have been generated using AI.
Full size image

Labeling process of road surface damage region.

Fig. 14
Fig. 14The alternative text for this image may have been generated using AI.
Full size image

Number of examples of each type of damage in each city.

Evaluation indicators

The evaluation indicators for the object detection model mainly include F1 score, average precision (AP), number of parameters (Params), and computational complexity (GFLOPs). In addition, precision (P) and recall (R) as the basic indicators can also be used as the basis for the evaluation of the model, and the F1 score and AP calculated by P and R are the final indicators of the accuracy of the model detection.

$$\text{P}=\frac{\text{TP}}{\text{TP}+\text{FP}}$$
(1)
$$\text{R}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
(2)
$$\text{AP}=\underset{0}{\overset{1}{\int }}\text{P}(\text{R})d\text{R}$$
(3)
$$\text{mAP}=\frac{1}{n}\sum_{i=1}^{n}\text{AP}$$
(4)
$$\text{F}1=2\frac{\text{P}\times \text{R}}{\text{P}+\text{R}}$$
(5)

where TP denotes that the prediction result is assigned as a positive sample and the true value is a positive sample, i.e., the number of positive samples that were correctly classified. FP denotes that the prediction result is a positive sample but the true value is a negative sample, i.e., the number of negative samples that were incorrectly classified. FN denotes that the prediction result is a negative sample but the true value is a positive sample, i.e., the number of positive samples that were incorrectly classified. P is the percentage of correctly predicted positive samples out of all positively predicted samples. R is the percentage of positively predicted samples out of all positively predicted samples. AP denotes the area under the P-R curve. mAP denotes the average AP value of the n categories. the F1 scores are a combination of P and R, and are used as a measure of the overall performance and stability of the model.

Model training

In this study, the deep learning framework PyTorch is utilized as the network framework, and the software environment is Windows 10, Python3.9, CUDA11.3, and cuDNN8.2.1.32, and the hardware environment has an Intel i7 10700k as the CPU and an NVIDIA GeForce RTX3060 (12 GB RAM) as the GPU. During training, the input image is set to 640 × 640 and SGD is used as the optimizer for model training. The model training elapsed time is set to 300, the batch size is 16, and the initial learning rate is 0.01. Moreover, data enhancement techniques such as Mosaic data enhancement, random cropping, flipping, rotating, scaling, and brightness adjustment are applied to the training set to enhance the model’s generalization ability, prevent overfitting, and improve its adaptability to various scenarios.

Results

Research on DSCD

In this experiment, the value of k in the DSCD was investigated and the experiment in Table 2 was designed, where the value of k in the DSCD was changed while the other units of the network model remained unchanged. In this case, YOLOv8s was used as the base model (model 0) and DSCD was used as the downsampling module in the backbone to replace the original downsampling. Different k-values were used in different stages of the network, which are given in the table; in the first stage there are two downsampling operations, so two corresponding k-values were set in stage 1. By comparison, it is found that model 8 has the best detection result when the k value of the first two DSCDs in the backbone is 7 and the k value of the last three DSCDs is 5. The average precision mAP50 and mAP50:95 are 66.4% and 35.4%, respectively, and the F1 score reaches 66.5. From the analysis, it can be found that when the size of the input feature map is large, the k-value in DSCD should be larger, but not more than 7. When the size of the input feature map is small, the k-value in DSCD should be smaller, but not less than 5.

Table 2 The influence of k-value on model detection accuracy in different stages of the DSCD.

Research on MFE

This experiment identified the feature extraction strategies for the different stages in the backbone. The experiments in Table 3 show that when only MFE_PCR is used as the feature extraction module, it does not allow the network model to achieve the best detection results. In the first two stages of the network, the convolution using a sensory field that is too large is far beyond the sensory field size requirement. In the last two stages of the backbone, the acceptance of a larger feeling field is enhanced as the depth of the network deepens. Ultimately, the model is optimized for road surface damage detection when MFE_SCR is used in stage 1 of the backbone, MFE_PCR with dilation rates of 1, 2, and 3 in stage 2, and MFE_PCR with dilation rates of 1, 3, and 5 in stages 3 and 4, at which time the average accuracies of mAP50, mAP5050:95, and F1 are 66.1%, 35.0%, and 66.5, respectively.

Table 3 The influence of different feature extraction strategies at each stage on model detection accuracy.

After the feature extraction strategies for each stage of the backbone were determined, the hyperparameters in the MFE_SCR and MFE_PCR modules also needed to be analyzed. It includes whether residual connections are used in MFE_SCR and MFE_PCR, the ratio of the number of output channels of the parallel convolutional layers in MFE_PCR, and the input–output ratio of the first 1 × 1 convolution in MFE_PCR. As shown in Table 4, where Model 1 and Model 2 identify residual connections that contribute to the detection performance of the model. Models 2 through 7 specify the ratio of the number of output channels of the parallel convolution in MFE_PCR, and since it is relatively difficult for the convolution to establish information connections directly over a large space, and long-distance connections require the help of short-distance connections, a small sensory field is important at all stages, the model performs best when the number of convolutional output channels with a dilation rate of 1 is five times the number of output channels of the other concurrent convolutional layers. Models 6, 8, and 9 determine the input–output ratio of the first 1 × 1 convolution in MFE_PCR, and the model has the best detection when its ratio is 2:1. With this experiment, we determined the specific structure of the MFE.

Table 4 The influence of hyperparameters in MFE on model detection accuracy.

Research on FFB

In the feature fusion stage feature maps containing different information are fused through top-down and bottom-up structures, in this experiment we analyzed the effect of different feature fusion base blocks on road surface damage detection performance. Firstly, we used Darknet53 as a backbone to extract road surface image features, generate multi-scale feature maps through feature fusion, and finally detect and localize the damage on the output three-scale feature maps. In the experiments of Table 5, the specific structure of the FFB is discussed, and a variety of FFBs have been introduced in the above paper, it is found by comparison that the best detection accuracy is achieved when the structure of the model is d. This structure can effectively realize the fusion of different feature information and improve the model’s identification and localization accuracy of the damage.

Table 5 The influence of different FFB structures on the evaluation indicators of model performance.

Research on FSM

The FSM in this study mainly selects and enhances the feature maps extracted at different stages in the backbone through different attention, and this part of the study analyzed the effect of different attention on different information features. Experimental results are shown in Table 6, for deep features C4, both channel attention ECA and CAM can effectively enhance the accuracy of the model, and the characterization ability of the feature map can be enhanced by assigning different weights to different channels so that the model focuses on the features of the more important channels. For the intermediate feature C3 that contains both semantic and detailed information, TripleAttention can effectively enhance the characterization ability of C3 by cross-dimensional information interaction that captures the information exchange between spatial and channel dimensions. In contrast, the use of spatial or channel attention alone, as well as the simple combination of spatial and channel attention in tandem or in parallel to enhance the feature map, fail to significantly improve the characterization of the feature map. For shallow feature C2, MSCA and LSKblock can enhance the characterization ability of this feature map. Through the analysis, it can be found that the basic principles of MSCA and LSKblock are similar, both of them acquire multiple long-range contextual information of the object through the convolution of large size, and form different weights in the space. Considering the effect of attention on detection accuracy and model scale, the final choice was to use LSKA for C2, TripleAttention for C3, and ECA for C4 for feature map enhancement.

Table 6 The influence of using different attention at different stages of the backbone on the model performance evaluation indicators.

Comparison of detection performance of different necks

To validate the effectiveness of the neck proposed in this study in road surface damage detection performance, this paper compared the detection performance of different feature fusion neck network models, including FPN, BiFPN, GFPN, EVC, Gold-YOLO, HSFPN, and PAN. This experiment selected Darknet53 as the backbone, adopted a network structure with 3 decoupled detection heads, and performed transformations on the neck structure. As shown in Table 7, this study improved the neck network based on PAN by enhancing the fusion module and the selection model. Ultimately, the detection effect is significantly improved, resulting in a more comprehensive fusion of feature information at all stages. Compared to PAN, the improved method improves mAP50 by 2.4%, precision by 1.4%, and recall by 2.3%, while using less computation and number of parameters.

Table 7 Comparison of performance evaluation indicators of different necks on the test set of this study.

Ablation experiments

To analyze the enhancement of the road surface damage detection performance by each module proposed in this study, systematic ablation experiments were conducted in this part and the experiments are presented in Table 8. Each module was validated based on YOLOv8s, and the experiments show that DSCD improves the model’s accuracy in detecting road surface damage and reduces the computational and parametric quantities of the model. MFE has a similar effect, and when both are applied simultaneously, better detection results can be achieved and the model can be further lightweight. SPPF can effectively enhance the global sensing capability of the model, enabling the detection model to have a more comprehensive understanding of the road surface environment and damage. The introduction of FSM can effectively enhance the detection accuracy of the model with only a slight increase in the number of parameters and computational costs, the FFM simultaneously realizes model lightweight and accuracy improvement. And the MSD-head increases the computational costs but greatly improves the detection accuracy. Through the comprehensive application of each module, this study proposed RSDD, whose model parametric quantity is only 16.5 M, computational complexity is 30.9G, and the average accuracy for road surface damage detection reaches 70.8%. Compared with the baseline YOLOv8s, RSDD effectively achieves higher road surface damage detection accuracy while reducing the cost of model deployment.

Table 8 The influence of each module on the performance evaluation indicators of the model in ablation experiments.

Comparison experiments

To demonstrate the advantages of RSDD, it is analyzed in comparison with four state-of-the-art object detection models (different versions of YOLOv5, YOLOv6, YOLOv8, and YOLOv9). Table 9 shows a detailed comparison of the detection performances of each model on the test set of this study, compared to which RSDD achieved the best performance in terms of mAP50, mAP50:95, and F1-Score. Although it is not the smallest of all models, its detection accuracy is much higher than other smaller detection models.

Table 9 Comparison of performance evaluation indicators of different object detection models on the test set of this study.

In Fig. 15 a comparison between RSDD and each model in terms of the tradeoffs between speed and accuracy and scale and accuracy is shown, from which it is found that RSDD has a better ability to tradeoff between accuracy, scale, and speed. RSDD can achieve higher detection accuracy with fewer parameters and fast inference speed with higher detection accuracy. RSDD is more suitable to be deployed on devices with limited memory and computational power and is more suitable for some task scenarios that require fast detection. In summary, the detection model proposed in this study has a stronger trade-off between accuracy, speed, and scale, and performs best in road surface damage detection tasks.

Fig. 15
Fig. 15The alternative text for this image may have been generated using AI.
Full size image

Comparison of different models in terms of speed-accuracy and scale-accuracy trade-offs. (a) Speed-accuracy; (b) scale-accuracy.

Visualization of detection results

To more visually validate the advantages of RSDD, we collected different road surface damage images and detected them on some roads in Xi’an, China, and visualized the detection results. For the fairness of comparison, YOLO models with a similar scale as the RSDD model, i.e., YOLOv5m, YOLOv6m, YOLOv8m, and YOLOv9, which are the top-ranked versions of the corresponding models in terms of the detection results, were selected for the visualization of the results, respectively.

As shown in Figs. 16, 17, 18, 19, and 20, the visualization of the detection results of different models for a variety of damages in different scenarios is demonstrated. As can be seen in Fig. 16, each model can recognize the location of small-sized potholes in simple scenarios, but RSDD shows the optimal performance in terms of confidence. From Fig. 17a, it can be found that without too many interfering factors, only RSDD and YOLOv5m can effectively detect the location of small-sized pothole, and the confidence level of RSDD is 26% higher than that of YOLOv5m. Moreover, RSDD and YOLOv6m are more accurately localized in tiny longitudinal crack detection. While the sampling time point of Fig. 17b is a little earlier than Fig. 17a, it has a wider view than Fig. (a), which both contain the same pothole, but the location of the pothole is not marked in the labeling of Fig. 17b. However, RSDD can accurately predict the location of potholes and has excellent performance for tiny crack detection. Figure 18 shows the results of the different models on the detection of damage in complex scenarios. The other models have omitted or misdirected the damage, however, the RSDD can still accurately identify these small transverse cracks. In Fig. 19, only the RSDD can recognize the damage that is far away and has thin features. Figure 20 illustrates a complex scenario with the presence of environmental disturbances and at the same time serious road surface damage. From the individual detection results, it can be seen that each model is effective in recognizing road damage for obvious features, but for the small and not obvious features, only the RSDD can detect them effectively. Comparing the visualization of each model in the detection results further proves the advantages of the model proposed in this study in road surface damage detection, which greatly avoids misdetection and leakage detection. In particular, the detection of tiny damage has an obvious improvement effect, which better meets the detection needs in the actual complex background environment.

Fig. 16
Fig. 16The alternative text for this image may have been generated using AI.
Full size image

Visualization of detection results of different detection models for a single category of damage in a scenario-simple environment.

Fig. 17
Fig. 17The alternative text for this image may have been generated using AI.
Full size image

Visualization of detection results of different detection models for multiple categories of damage in a scenario-simple environment. (a) Images after the sampling time; (b) images before the sampling time.

Fig. 18
Fig. 18The alternative text for this image may have been generated using AI.
Full size image

Visualization of detection results of different detection models for a single category of damage in a complex environment.

Fig. 19
Fig. 19The alternative text for this image may have been generated using AI.
Full size image

Visualization of detection results of different detection models for multiple categories of damage in a complex environment.

Fig. 20
Fig. 20The alternative text for this image may have been generated using AI.
Full size image

Visualization of detection results of different detection models for multiple categories of damage in a more complex environment.

Generalizability experiments

To evaluate the generalization ability of RSDD, this study used the publicly available dataset RDD2022 to train each model, which includes road images from six countries, i.e., China, Japan, the Czech Republic, Norway, the United States, and India. In this dataset, we randomly selected 10,000 road images, similar to the previous experimental design, where road damages were categorized as longitudinal cracks, transverse cracks, alligator cracks, and potholes, and the image data were divided into training, validation, and test sets in the ratio of 7:2:1. Table 10 demonstrates the performance evaluation indicators of each model on the test set, and it can be found that RSDD still achieves the best detection accuracy on the RDD2022 test set, with a mAP50 of 61.2%. Moreover, the detection accuracy of alligator cracks and potholes is also the highest among all models. By comparing the detection performance on different datasets, the generalization and effectiveness of RSDD for the road surface damage detection task are further validated.

Table 10 Comparison of performance evaluation indicators of different object detection models on the RDD2022 test set.

Discussion

Road surface damage detection plays a crucial role in highway maintenance, which can reduce maintenance costs and ensure road safety by timely detecting and repairing the damage. The RSDD network proposed in this study realizes more accurate detection of tiny damages on road surfaces at an early stage, which effectively improves the problem of leakage and misdetection of traditional models in the detection of tiny damage. Specifically, the feature extraction backbone in this study innovatively combined the advantages of small sensory field convolution and large sensory field convolution, so that the model can simultaneously capture detail features, multi-scale contextual information, and global semantic information to comprehensively understand the road surface environment and damage features, and to avoid the problems of insufficient feature extraction and feature loss that are commonly found in the traditional methods. In addition, this study adopts different attention mechanisms for feature screening for the multilevel features extracted from the backbone and enhances the characterization ability of feature maps at different scales through the two-way feature fusion strategy of top-down and bottom-up. Finally, this study designed a multi-scale detection head, which effectively solves the problem of large size differences in road surface damage and scale changes due to scene perspective by detecting different sizes of damage on each of the four output layers, and significantly improves the detection performance.

The RSDD network was subjected to comprehensive performance evaluation experiments on both self-constructed and publicly available datasets. The experiments show that RSDD exhibits significant performance advantages over state-of-the-art object detection models in the road surface damage detection task, especially in the detection of tiny damages. The model achieves a better trade-off between accuracy, speed, and scale, and can significantly reduce the number of model parameters and improve the computational efficiency while ensuring higher detection accuracy. Specifically, the RSDD network achieves 70.8% of mAP50 value with only 16.5 M model parameters and a low latency of 4.5 ms. In addition, RSDD shows superior performance in critical evaluation indicators such as Precision, Recall, mAP, and FLOPs, fully verifying its effectiveness and practicality in road surface damage detection tasks.

Although the RSDD network achieves a better balance between accuracy, speed, and scale, in practical application scenarios, some of the devices are limited by computational resources, and is difficult to meet the demand for high computational power, which poses a challenge to its deployment and application. Therefore, to further enhance the practicality of RSDD, future research needs to focus on the lightweight design of the model for a wider range of practical engineering scenarios.

Conclusion

In this study, a new road surface damage detection model is proposed to address the problems of leakage and misdetection that tend to occur in the detection of tiny road surface damage. The model effectively solves the problems of insufficient extraction and loss of tiny damage features caused by downsampling and feature extraction in the traditional method by introducing DSCD and MFE in the backbone. At the same time, FSM and FFM are utilized to enhance the characterization ability of the feature maps at each stage in the feature pyramid and combined with the multi-scale detection head to achieve accurate detection of different sizes of damage, which significantly improves the adaptability of the model to scale changes. The experimental comparison reveals that compared with the existing object detection models, RSDD has obvious advantages in detection accuracy, inference speed, and model scale, especially making breakthroughs in the tiny damage detection task. The model is more suitable to be deployed in resource-constrained devices and provides important technical support for road maintenance. Although RSDD has made significant progress in terms of accuracy, generalization, and real-time performance, further light-weighting of the model is still a focused direction for future research. Follow-up work will focus on optimizing the computational efficiency of the model to enhance its deployment performance on mobile devices.