Abstract
Road cracks pose a significant safety hazard to transportation, making timely detection crucial for traffic safety. Traditional crack segmentation methods face three main issues: (1) Tree shadow background affects crack recognition in real-world environments. (2) Conventional convolutional neural networks fail to detect complete cracks. (3) Direct deconvolution during upsampling results in unclear crack details. To address these challenges, this paper proposes the WFU-Unet model for road crack detection and segmentation. First, the WCM module, constructed with wavelet transform, ConvNext, and MobileNet, reduces shadow interference, enabling the network to distinguish between cracks and tree shadows. Second, the Fuse module replaces traditional convolutional blocks, enhancing the network’s ability to extract crack features. Finally, the Up module substitutes conventional upsampling techniques to minimize spatial information loss of cracks. Experimental results show that the WFU-Unet model achieves a Miou of 81.68%, precision of 91.43%, recall of 86.63%, and F1-Score of 88.93%. Compared to other models, WFU-Unet demonstrates superior generalization ability and segmentation accuracy, making it more suitable for crack detection in shadowed environments.
Similar content being viewed by others
Introduction
As the most fundamental and widespread transportation infrastructure, roads directly affect the efficiency of travel and the functioning of society. However, due to various factors such as temperature fluctuations, foundation settlement, and traffic loads, cracks gradually develop, causing severe damage to road integrity1. Over time, these cracks expand, increasing traffic safety risks, damaging the road structure, and raising maintenance costs. To extend the lifespan of roads, it is essential to detect and repair road cracks promptly. However, due to the uncertainties in road environments, various disturbances complicate the detection process, especially the interference caused by tree shadows during detection. Tree shadows share similarities with cracks, making crack recognition challenging and reducing detection accuracy. Therefore, developing an accurate and effective road crack detection method is crucial for both road structural safety and traffic safety.
In earlier research, many scholars employed digital image processing methods for road crack detection. Liu et al.2 proposed an automatic crack detection method based on adaptive digital image processing, using median filtering subtraction and the Niblack method for local binarization to extract cracks from images. Li et al.3proposed a threshold segmentation method based on adjacent difference histogram statistics, effectively segmenting cracks from the background. Yu et al.4 developed a new unsupervised texture segmentation algorithm that detects feature changes and feature edges to perform segmentation, based on abrupt boundary changes between different textures. Bo et al.5 studied an automatic image segmentation method using region merging, where initially over-segmented images with uniform color regions were processed by iteratively merging regions using statistical tests. Peggy et al.6 proposed an automatic crack detection method for pavement images by analyzing wavelet coefficients to detect whether cracks existed on the image through propagation analysis at different scales. Digital image processing methods achieve high accuracy only when the road background is simple, and crack shapes are uniform. However, in practical scenarios, the uncertainty of road environments and the complexity of road cracks make it challenging for these methods to detect cracks precisely.
With the development of deep learning, research in crack detection has gradually transitioned from traditional digital image processing techniques to methods based on deep learning. Cha et al.7 used convolutional neural networks (CNN) and the sliding window approach to detect surface cracks in concrete without extracting defect features. Xu et al.8 proposed a Mask R-CNN network model, which added a feature pyramid and edge detection branches to improve detection accuracy. Yang et al.9 proposed a fully convolutional network (FCN) that performs pixel-level semantic recognition and segmentation of cracks at multiple scales, with predicted crack segments represented by single-pixel-width skeletons. Ren et al.10 proposed an improved deep fully convolutional neural network (CrackSegNet), for dense pixel-level crack segmentation, significantly enhancing the network’s overall crack segmentation capability. Liu et al.11 used a U-Net network to detect surface cracks in concrete and optimized it using the Adam algorithm. Compared to the DCNN method, this approach demonstrated greater robustness and higher detection accuracy. Zhou et al.12 proposed a new network architecture. It uses a spatial attention mechanism to capture spatial information from low-level feature maps, a channel attention mechanism to obtain high-level contextual features, and replaces traditional spatial pooling with mixed pooling layers, which are fused together for crack prediction. However, the neural networks mentioned above have not yet adequately addressed the following issues in crack segmentation: In real-world road environments, the similarity between tree shadows and cracks can interfere with accurate crack recognition; Cracks appear in irregular shapes, and conventional CNNs struggle to capture all the relevant features of these irregular cracks; During the upsampling process, the direct use of deconvolution leads to unclear segmentation details and poor boundary and fine detail segmentation performance.
To address the issue of interference from other uncertain factors in crack detection under real-world road conditions, Ronneberger et al.13 introduced ResNet into the U-Net model, enhancing the feature extraction capability of U-Net. However, ResNet does not sufficiently handle the interference from tree shadow backgrounds, resulting in the blending of crack edges with the background and affecting detection accuracy. Yuan et al.14 proposed a new residual detail attention module that enhances network performance. This module can accurately localize the spatial position of cracks, refine crack features, and reduce interference from noise. However, it does not incorporate a differentiated module for tree shadow and crack features, leading to insufficient suppression of the tree shadow background. To address the tree shadow interference as a source of noise, and considering that tree shadows are low-frequency components while cracks are high-frequency components, we introduce the WCM module. The module uses wavelet transform to decompose tree shadows and cracks, passing the low-frequency component into the MobileNet module for attenuation, and the high-frequency component into the ConvNext module for enhancement. The results from both modules are then combined to improve the detection accuracy.
To overcome the challenge of incomplete feature extraction for irregular cracks, Liu et al.15 replaced traditional convolutional blocks with depthwise separable convolutions to reduce network parameters and proposed a new activation function, ReMish, to improve the model’s robustness. Yang et al.16 used feature map fusion to combine information from different levels of feature maps into lower-level features for crack detection. In this paper, the Fuse module combines crack features from the encoder with those from the upsampling module to enhance crack feature representation. The InvertedResidual module is embedded in the Fuse module to enhance the network’s crack feature extraction capability.
To tackle the issue of unclear crack segmentation results during the upsampling process, Huang et al.17 employed an iterative upsampling method to refine the coarse feature maps, generating high-resolution crack segmentation maps with high accuracy. However, during the iterative process, some spatial information of the cracks is lost. Shan et al.18 used the UA module, which employs progressive local cross-attention during upsampling, improving boundary definition and surpassing traditional dynamic upsampling methods. Nevertheless, optimization of the module is still required to enhance edge clarity. Additionally, Wang et al.19 proposed Carafe, which performs upsampling by weighted aggregation of surrounding features, enabling adaptive upsampling operations based on content information, thereby improving the quality of the upsampling results, particularly in terms of detail preservation. However, it involves a relatively complex feature reorganization process, which may lead to higher computational costs. In this paper, the Up module is used to replace the traditional upsampling module. The Up module utilizes depthwise separable convolutions and transposed convolutions for upsampling the feature maps, supplementing semantic information, reducing the loss of spatial crack information, and alleviating edge artifacts.
To address challenges such as tree shadow interference, incomplete crack features, and unclear segmentation results in road crack detection, we propose the WFU-Unet model, which integrates wavelet transform with the U-Net network for improved crack segmentation. The key contributions of this paper are:
-
(1)
A road crack dataset with tree shadow backgrounds was constructed, containing 1,384 images of road cracks with complex backgrounds and irregular crack shapes. The cracks in these images were labeled using the Labelme tool, and the dataset was utilized for training network models to address road crack segmentation issues.
-
(2)
Wavelet transform is used to decompose road crack images with tree shadow backgrounds. The high-frequency components were passed through the ConvNext module for enhancement, while the low-frequency components were sent to the MobileNet module for attenuation and max-pooling. The results were then combined, reducing the interference caused by the tree shadow background in crack detection.
-
(3)
The traditional convolutional blocks were replaced by the Fuse module. The Fuse module fused the crack features output by the encoder with those output by the upsampling module, enhancing the expression of crack features. Additionally, the InvertedResidual module was embedded within the Fuse module to improve the network’s ability to extract crack features.
-
(4)
The traditional upsampling method was replaced by the Up module. The Up module used transposed convolutions to enhance the resolution of the feature maps, incorporating depthwise separable convolutions within two paths to capture the spatial information of cracks. The outputs from both paths were fused to generate the final upsampling result. The use of the Up module helped supplement semantic information, reduce spatial information loss of cracks, and alleviate edge artifact effects.
Materials and methods
Data collection
The dataset for this experiment was collected by researchers from the research team in the surrounding areas of Changchun, Jilin Province. To ensure the diversity of the dataset, data collection was conducted during the summer and autumn seasons, covering both morning and afternoon periods. These two seasons are characterized by significant tree shadows, allowing for the capture of data under more complex lighting conditions. The data collection device is a camera mounted on the top rear of a vehicle, as shown in Fig. 1a. The camera used is the Hikvision MV-CA013-20GM industrial camera, with a resolution of \(2048 \times 1024\) pixels, and no other sensors were used. The camera is fixed by a rigid bracket, with the lens vertically oriented towards the road surface, and the height from the ground is consistently maintained at 1.5 m. During data collection, the inspection vehicle travels at a constant speed of 30–40 km/h along the lane to ensure complete crack images are captured, as shown in Fig. 1b. To ensure the geometric accuracy of the images, the camera was pre-calibrated using the Zhang Zhengyou calibration method for intrinsic parameter calibration, with an average re-projection error of 0.15 pixels after calibration. The captured photos were then carefully selected, with 1,384 images featuring irregular crack shapes and inconsistent crack types under shaded backgrounds chosen for the dataset. However, due to the presence of the vehicle’s rear bumper and the excessively high resolution of the images, the photos were cropped. After cropping, the resolution was set to 1024 × 672, as shown in Fig. 1c. The cropped images were then manually annotated using the Labelme tool, with cracks marked in red and the background in black, and saved in JSON format. For training, the dataset was divided into training, validation, and test sets in an 8:1:1 ratio.
Crack image collection.
Methods
WFU-Unet
Due to the similarity between tree shadows and cracks in the road crack images captured by inspection vehicles, tree shadows often act as an interfering factor, hindering crack detection. Wavelet transform, a method that analyzes signals in both time and frequency domains, is used to decompose tree shadows and cracks in the images. Since road cracks are complex and small, convolutional neural networks are employed for feature extraction. U-Net, a CNN-based image segmentation model, has a U-shaped structure that efficiently integrates multi-scale features through upsampling, downsampling, and skip connections.
This paper proposes a road crack segmentation model, WFU-Unet, which integrates wavelet transform and U-Net networks. Its overall structure is shown in Fig. 2a. To address tree shadow interference in road crack detection, the WCM module is introduced. In this module, wavelet transform is applied to decompose the road crack images, with the high-frequency components fed into the ConvNext module and the low-frequency components input into the MobileNet module. Max pooling is then applied, and the results are combined. This operation reduces the impact of tree shadows on crack detection. The ConvNext and MobileNet modules are shown in Fig. 2b. To tackle the issue of incomplete crack feature extraction when detecting irregular cracks, we propose the Fuse module, shown in Fig. 2d. In the Fuse module, the feature map output by upsampling is concatenated with the feature map from the encoder. The InvertResidual module is embedded within the Fuse module to enhance the network’s ability to extract features, which aids in capturing detailed crack information. To address unclear segmentation results during the upsampling process, we use the Up module to replace traditional upsampling operations. The Up module, shown in Fig. 2c, utilizes transposed convolution to enhance the resolution of the feature map, achieving upsampling, and includes depthwise separable convolution to capture the spatial characteristics of the cracks, reducing spatial information loss and mitigating edge artifact effects. The following sections will provide a detailed introduction to the WCM module, Fuse module, and Up module.
The structure of the WFU-Unet network.
WCM module
In actual road crack images, due to the variability in the road environment, tree shadows often serve as interfering backgrounds. The similar shapes of tree shadows and cracks make it difficult for the network to differentiate between them. To reduce misdetection rates in crack segmentation, it is necessary to decompose the crack and tree shadow features during the encoding process, enhance the crack features, and suppress the tree shadow features, enabling accurate segmentation of cracks amidst tree shadow backgrounds.
However, in the U-Net network, there is an inability to distinguish between similar tree shadows and cracks, causing false crack detections. To address this issue, this paper proposes the WCM module, in which wavelet transform is applied to decompose tree shadows and cracks. The high-frequency and low-frequency components from the decomposition are subsequently fed into the ConvNext and MobileNet modules, respectively. The ConvNext module utilizes a multi-level design with \(3 \times 3\) small convolution kernels and Batch Normalization. Its dense, small receptive fields are particularly effective in capturing local details and textures. Additionally, the properties of ReLU activation promote sparse activation, thereby enhancing the retention of crack information. The MobileNet module is employed to process low-frequency information due to its use of large \(7 \times 7\) convolution kernels, which cover a broader receptive field and enable the extraction of global contextual information from the tree shadow background. This allows for the identification of distribution patterns in smooth regions, effectively suppressing low-frequency noise. Additionally, LayerNorm is more suitable for regional feature normalization, preventing the residual tree shadow background caused by normalization. Therefore, both the ConvNext and MobileNet modules work together to enhance high-frequency information and suppress low-frequency information. Finally, the feature maps are concatenated, resulting in a feature map that enhances crack information while weakening tree shadow information, thereby improving the network’s accuracy in crack segmentation.
As shown in Fig. 3, after applying wavelet transform to a road image containing tree shadows and cracks, the result shows that the shadow features are prominent in the low-frequency part, while crack details are distinctly visible in the high-frequency components. This indicates that in the wavelet decomposed crack image, the shadow appears in the low-frequency components, while the cracks are clearly visible in the high-frequency components. Therefore, the low-frequency components obtained from wavelet decomposition are input into the MobileNet module to weaken the tree shadow information, while the high-frequency components are fed into the ConvNext module to enhance the crack information.
Crack image after wavelet decomposition.
The wavelet transform formula is given below:
where I represents the original image, \(cA\) denotes the low-frequency component after wavelet decomposition, \(cH\)represents the high-frequency horizontal component, \(cV\)denotes the high-frequency vertical component, and \(cD\) represents the high-frequency diagonal component. Additionally, \(.=\)denotes the relationship between decomposition and reconstruction in the wavelet transform, not in the mathematical sense of equality, while \(.+\) indicates the union in the context of wavelet transform, not as mathematical addition.
The specific implementation steps of the WCM module are as follows:
(1) Apply first-level wavelet decomposition to the input image:
After the first-level decomposition, cracks remain in the approximation component. Therefore, a second-level wavelet transform is applied to the approximation component \(c{A_1}\) to further extract the cracks. The formula is as follows:
(2) Apply a \(3 \times 3\) separable depthwise convolution followed by a \(1 \times 1\) regular convolution to the low-frequency feature map \(c{A_2}\) to extract crack spatial features and suppress the shadow features.
(3) Normalize the convolved feature map using batch normalization (BN), forcing the data distribution to match a standard normal distribution. The BN normalization formula is as follows:
where \({y_i}\) represents an element in the output map after the BN layer, \({x_i}\) represents an element in the input feature map, \(\mu\) represents variance, \({\sigma ^2}\) represents the mean, \(\gamma\) and \(\beta\)are learnable scaling and shifting parameters that update during training.
(4) Apply the ReLU activation function to the normalized features for a nonlinear transformation. The ReLU activation function formula is as follows:
Where \(y_{i}^{\prime }\) represents the output after ReLU activation.
(5) Repeat steps (2), (3), and (4) to obtain the feature map with weakened shadow information, denoted as \(cA_{1}^{\prime }\).
(6) Apply a \(7 \times 7\) depthwise separable convolution to the high-frequency feature maps \(c{H_2}\),\(c{V_2}\),\(c{D_2}\) to extract crack information, expand the receptive field, and reduce the parameter count.
(7) Normalize the convolved feature map using LayerNorm, then apply a fully connected layer to adjust its dimensionality and capture linear relationships in the input feature map. Finally, perform a nonlinear transformation using the GELU activation function. The GELU activation formula is as follows:
where x represents an element in the input feature map.
(8) After GELU activation, add two fully connected layers with residual connections to enhance the crack features and mitigate the vanishing gradient problem, stabilizing training.
(9) By applying \(2 \times 2\) convolution to adjust the output channel number, the enhanced crack feature maps \(cH_{2}^{\prime }\), \(cV_{2}^{\prime }\), \(cD_{2}^{\prime }\).
(10) Apply the inverse wavelet transform formula 7 to obtain the final feature map \(cA_{1}^{\prime }\):
Fuse module
In the process of road crack segmentation, the irregular shape of cracks often leads to incomplete feature extraction, resulting in suboptimal segmentation performance. To enhance the network’s crack segmentation ability, a new module is introduced to replace the traditional convolutional block. This new module is used to fuse the crack features extracted from both the encoder output and the upsampling module, strengthening the network’s ability to extract features from irregular cracks and enabling the recognition of cracks with various shapes.
To address the issue of incomplete road crack segmentation, this paper proposes the Fuse module. As shown in Fig. 2d, the module adopts a dual-path feature fusion mechanism, which performs cross-scale fusion by combining the crack features output by the encoder with the spatial detail features restored by upsampling in the decoder. This effectively enhances the network’s ability to express crack features. Furthermore, the InvertedResidual module is embedded during the fusion process. This design not only preserves the local feature extraction advantages of traditional convolutional blocks but also significantly improves the network’s capability to represent fine cracks and fractured edges through multi-level feature interactions.
The specific implementation steps of the Fuse module are as follows:
-
(1)
First, the feature map 1 from the encoding module and feature map 2 from the Up module are concatenated along the channel dimension to enhance the representation of crack information.
-
(2)
A \(1 \times 1\) convolution operation is then applied to the concatenated feature map to extract spatial features, reduce the number of channels, and capture crack-specific spatial characteristics.
-
(3)
The feature map is then split into two parts using the split operation, resulting in feature map 3 and feature map 4.
-
(4)
The residual connection further extracts crack features. Feature map 4 is passed through the InvertedResidual block to extract deep spatial features. The output from the InvertedResidual block is added to feature map 3 via a residual connection to form a new feature map.
-
(5)
The updated feature maps are then passed through the InvertedResidual module to extract additional crack features.
-
(6)
Finally, the results from both residual paths are combined to form the final feature map.
The specific implementation steps of the InvertedResidual module are as follows:
-
(1)
First, a \(1 \times 1\) convolution operation is applied to the input feature map to increase its dimensionality, followed by batch normalization. This normalization forces the data distribution to conform to a standard normal distribution.
-
(2)
The ReLU activation function is then used to perform a nonlinear transformation on the normalized features.
-
(3)
Next, the spatial features of the cracks are extracted using ReflectionPad2d for padding, preserving the boundary information of the crack image. A \(3 \times 3\) depthwise separable convolution is applied to extract the spatial features, followed by batch normalization and a ReLU activation for nonlinear transformation.
-
(4)
Finally, a \(1 \times 1\) convolution operation reduces the number of channels back to the original dimensionality, followed by batch normalization to output the final feature map.
Up module
In the U-Net model, deconvolution or interpolation methods are typically used to restore feature maps. However, during crack segmentation, simple deconvolution or interpolation techniques have limited capability in restoring high-resolution details, which can lead to the loss of spatial information and result in unclear crack segmentation. To achieve clear crack segmentation, it is necessary to improve the traditional upsampling methods to reduce the loss of spatial information related to cracks.
To address this issue, the Up module is proposed. This module increases the resolution of the feature maps using transposed convolutions and incorporates depthwise separable convolutions. These operations are applied to both paths to capture the spatial information of cracks. The results from the two paths are then fused to produce the final upsampling outcome. The use of the Up module helps supplement semantic information, reduces the loss of spatial information for cracks, and alleviates edge artifacts. The structure of the Up module is shown in Fig. 2c.
The specific implementation steps of the Up module are as follows:
-
(1)
First, a \(7 \times 7\) depthwise separable convolution is applied to the input feature map to capture the spatial features of the cracks.
-
(2)
Then, LayerNorm normalization is performed on the convolved feature map to adjust the feature distribution to a standard normal distribution.
-
(3)
Next, two linear transformations followed by a GELU activation function are applied to the normalized features to perform nonlinear transformations, further enhancing the representation of the cracks.
-
(4)
Through a residual connection, the feature map output from the previous step is added to the initial feature map to generate a new feature map.
-
(5)
The spatial resolution of the feature map is restored using ConTranspose2d, improving the boundary accuracy of the feature map and enhancing the network’s ability to express crack details.
-
(6)
Steps (1), (2), (3), and (4) are repeated to generate the final feature map.
Experimental environment and setup
Experimental environment
All experiments in this study were conducted on the same hardware and software platform. Table 1 presents the hardware and software environments used in this experiment.
Training method
To avoid mismatches in image height and width, we resize all input images to a uniform size of \(512 \times 512\). During training, the batch size is set to 2, the momentum parameter is 0.9, and the learning rate is \({e^{ - 4}}\). Adam is used as the optimization algorithm, which combines the advantages of momentum and RMSProp, enabling fast convergence and adapting to learning rates with varying gradients. Additionally, the cross-entropy loss function is employed as the loss function, and the model is trained for a total of 200 epochs. In the experiment, the road crack detection dataset is split into training, validation, and test sets in an 8:1:1 ratio. Table 2 summarizes the experimental parameters and settings.
The Fig. 4 shows that the loss value decreases sharply during the initial 0–5 epochs and gradually stabilizes around the 100th epoch, indicating that the network model achieves rapid convergence.
Training loss and validation loss.
Experimental results and analysis
Experimental evaluation metrics
Since this study focuses on crack segmentation, Miou, a commonly used evaluation metric in semantic segmentation, is adopted as the primary metric to evaluate model performance. Additionally, as crack segmentation is a binary classification problem, Precision, Recall, and F1-Score are also employed to assess the model’s performance. The definitions of these evaluation metrics are as follows:
Assuming crack pixels are considered positive samples and non-crack pixels are negative samples, the definitions are as follows: TP (True Positive) denotes the number of crack pixels correctly predicted as crack pixels; FP (False Positive) represents the number of non-crack pixels incorrectly predicted as crack pixels; FN (False Negative) indicates the number of crack pixels incorrectly predicted as non-crack pixels; and TN (True Negative) refers to the number of non-crack pixels correctly predicted as non-crack pixels.
Based on these definitions, Miou represents the ratio of the intersection to the union between the predicted and ground truth values for both crack and non-crack pixels, averaged over all classes. Precision is defined as the ratio of correctly predicted crack pixels to the total number of pixels predicted as crack pixels. Recall measures the ratio of correctly predicted crack pixels to the total number of actual crack pixels. Finally, the F1-score is the harmonic mean of Precision and Recall.
Ablation experiment
To validate the effectiveness of the proposed model, ablation experiments were conducted. Under the same experimental conditions, we only varied the combinations of network modules to compare the performance of each module. In these experiments, a custom road crack dataset was used, with the U-Net network serving as the backbone architecture. The WCM, Fuse, and Up modules were paired in different combinations to evaluate the performance of various network models.
The segmentation prediction results are shown in Table 3, U-Net, as the backbone network, has better segmentation effect for the crack images with obvious crack features and less interference from the background tree shade, such as columns (a) and (b), however, when the cracks are complex and fine, such as columns (c), (d), and (e), the U-Net cannot segment the cracks accurately, and there are a large number of misdetections and miss-detections. After adding the WCM module to the U-Net network, the false positives in crack detection are reduced due to the influence of the WCM module. When the Fuse module is incorporated into the U-Net network, a noticeable reduction in missed detections of cracks is observed; however, false positives still exist when compared to the W + F module, indicating that the WCM module can reduce false positives. Upon adding the Up module to the U-Net network, crack details are better presented under the effect of the Up module, indicating that the Up module helps to minimize the loss of spatial information in cracks. Adding Up module and Fuse module to the U-Net network enhances the optimization ability of the network under the effect of ConTranspose2d and depth-separable convolution, which reduces the loss of crack information to a certain extent, but the tree shade still interferes with the detection of cracks to a certain extent in the detection results, as shown in column (a), for the edges of the road cracks and the fine cracks, it is better than the U-Net detection, but it will have the problem of false detection. Adding the coding module and Fuse module to the U-Net network, with the WCM module, the network is able to accurately segment the cracks for the interference of tree shade, and the problem of misdetection becomes less, but the lack of the Up module, in the segmentation process of the cracks, there will be problems such as unclear segmentation, as shown in column (e). The WCM module and Up module are added to the U-Net network. With the WCM module, the network is able to accurately segment the cracks for the interference of the tree shade, and the misdetection problem becomes less, but the Up module is missing, and in the segmentation process of the cracks, problems such as incomplete segmentation will occur, as shown in columns (d) and (e). Finally, by adding WCM module, Fuse module and Up module to the U-Net network, it can be clearly seen that the segmentation of cracks is more effective for different road environments.
As shown in Table 4, the modules added in this study contribute to the improvement of network performance. In the U-Net network, the addition of the WCM, Fuse, and Up modules results in significant improvements in Miou, Precision, Recall, and F1-Score, indicating that these modules enhance the accuracy of crack detection. Among these, the performance improvement is most notable after adding the Up module. The WCM module helps reduce the interference from tree shadow backgrounds. When the WCM module is absent (U + F + Unet), compared to the U-Net network, the Miou increases by 5.57%, Precision by 2.13%, Recall by 2.69%, and F1-Score by 3.64%. The Fuse module aids in the extraction of small crack features. In the absence of the Fuse module (WCM + U + Unet), compared to the U-Net network, Miou improves by 7.75%, Precision by 6.68%, Recall by 3.26%, and F1-Score by 3.72%. The Up module helps minimize the loss of crack features during upsampling. When the Up module is absent (WCM + F + Unet), compared to the U-Net network, Miou increases by 6.35%, Precision by 6.01%, Recall by 3.95%, and F1-Score by 2.77%. When all three modules are incorporated into the U-Net network, except for Recall, the network achieves the highest performance across all evaluation metrics compared to the previous four groups, demonstrating that the network can accurately detect road cracks under tree shadow backgrounds.
Comparison with other methods
To more effectively validate the performance and advantages of WFU-Unet, the model was compared to several other network architectures, including U-Net20, U-Net + + 21, R2U-Net22, Improved U-Net Model 123, and Improved U-Net Model 224. All models were trained in the same environment, with a batch size of 4 and 200 epochs.
The trained models were used to predict images from the custom dataset under various road conditions, and the performance metrics are presented in Table 5. In the table, we calculated the Miou, Precision, Recall, and F1-Score for each model. Upon comparison, only the recall of Improved U-Net Model 1 exceeds that of the model proposed in this paper. However, the proposed model outperforms the other networks in terms of Miou, Precision, and F1-Score. Overall, the proposed model is better suited for road crack detection on roads with shadowed backgrounds.
Generalization experiment
To validate that the model proposed in this paper has good generalization ability, the Crack50025 crack dataset and the CFD26 dataset were combined into a mixed dataset to train the U-Net, DeepLab27, PSPNet28, U-Net + + 21, R2U-Net22, FU-Unet and WFU-Unet. The mixed dataset was split into training, validation, and test sets at a ratio of 8:1:1. The experimental environment and configuration were the same as those used in the previous experiments, and the experimental results are shown in Table 6. From the data, it can be observed that WFU-Unet outperforms almost all metrics of the U-Net network compared to the other six models, demonstrating that WFU-Unet has strong generalization capability. Among them, FU-Unet shows performance metrics similar to WFU-Unet, indicating that the Fuse and Up modules in this study contribute to more complete and clearer crack segmentation. Due to the relatively low tree shadow interference in the mixed dataset, compared to previous experiments on a self-constructed dataset, this further validates the significant role of the WCM module in crack detection under tree shadow backgrounds.
To better demonstrate the strong generalization capability of the WFU-Unet model, some sample results from this experiment were selected for comparison, as shown in Table 7. From the experimental results, it can be observed that in the absence of a tree shadow background and in backgrounds similar to cracks, the U-Net network is able to roughly detect the shape of the cracks but suffers from missed detections, as shown in (d), (e), and (f) of Table 7. PSPNet performs the worst, with incomplete segmentations in all cases (a)-(f) of Table 7. The DeepLab network performs relatively well but also has missed detections, as shown in (b) and (e) of Table 7. The performance of U-Net + + and R2U-Net are roughly the same, with both having fewer missed and false detections. Compared to U-Net + + and R2U-Net, FU-Net provides more detailed crack delineation, demonstrating that the Up and Fuse modules contribute to better crack characterization. The performance of the WFU-Unet network is the best, with most predicted maps closely matching the ground truth labels. Furthermore, the WFU-Unet network is able to accurately segment images with backgrounds similar to cracks, demonstrating strong capability in handling misclassification issues, as shown in (f) of Table 7. This further confirms the excellent generalization ability of the proposed model across different datasets.
Practical application testing
To verify the generalization ability of the proposed method, application tests for road crack detection were carried out in sunny weather with tree shadows on road segments in Changchun City, including the Daxiang, Gongnan, Gongshuang, and Baquan to Lianhua Mountain sections. First, the WFU-Unet network was integrated into a road crack detection system on an inspection vehicle. This system comprises a road crack collection module, a processing module, and an upper-level computer. Road cracks are captured by an inspection vehicle equipped with a camera, model Hikvision MV-CA013-20GM, with a resolution of \(2048 \times 1024\). The vehicle captures an image every 2 s along each designated route, and the collected images are then uploaded to a server for road crack detection. The upper-level system can display the crack detection results. A schematic diagram of the system is shown in Fig. 5. The system does not involve the absolute geographic coordinate mapping of the cracks. In practical applications, if geographic localization is required, crack locations can be estimated by synchronizing the vehicle’s GPS coordinates with the timestamp of the crack images and combining this data with the vehicle’s speed to estimate the crack positions.
System schematic.
Table 8 presents the results of three sets of WFU-Unet network applications. Group (a) represents transverse cracks under shadow, group (b) represents longitudinal cracks under shadow, and group (c) represents network-like cracks under shadow. The results show that the proposed method can effectively segment road cracks under shadow with a low false negative rate, demonstrating strong generalization ability and suitability for various types of road crack detection tasks.
Discussion
This paper established a road crack dataset obtained by an inspection vehicle, which contains images of various types of road cracks under tree shadow backgrounds. Through a series of comparative experiments, the effectiveness of the proposed WFU-Unet network for road crack detection on inspection vehicles under tree shadow conditions was demonstrated. This network addresses issues such as false detections caused by tree shadows, incomplete segmentation of irregular cracks, and unclear crack boundaries. However, there are still several challenges in road crack detection that require further research:
-
(1)
The WFU-Unet network is quite large, resulting in longer training times. Future research should focus on reducing the network’s parameter size while maintaining detection accuracy.
-
(2)
As shown in Table 9, in scenarios where sunlight shines on the road surface, both tree shadow and overexposure interference often occur simultaneously. The WFU-Unet network performs poorly in segmenting cracks in overexposed areas. Therefore, an efficient module needs to be incorporated into the network in the future to improve crack segmentation under overexposed conditions.
-
(3)
The method proposed in this paper is primarily designed for pixel-level crack detection and does not support the extraction of geometric features such as crack width, length, or depth. In the future, further research is needed to measure these geometric features and integrate them into the existing system.
-
(4)
The method proposed in this paper is specifically designed for road crack detection under tree shadow backgrounds and is not applicable to all road scenarios. Regarding the issues of false positives or missed detections that may arise when the model encounters different scenarios, further improvements to the model will be necessary in the future to enhance its robustness.
These issues need to be addressed in future research.
Conclusion
To enable fast and accurate road crack detection under tree shadow conditions, this paper proposes the WFU-Unet road crack segmentation and detection network by combining deep learning with road surface images captured by an inspection vehicle. The road crack dataset for this experiment was created using an inspection vehicle equipped with a camera and labeled with the Labelme tool. The dataset was then input into the WFU-Unet network for training. The U-Net architecture serves as the backbone of the WFU-Unet, with a wavelet transform added in the WCM module, and the ConvNext and MobileNet modules incorporated. These enhancements enable the network to distinguish between cracks and tree shadows, reducing the interference from tree shadow backgrounds and decreasing false detections. Additionally, replacing the traditional convolution block with Fuse improves the network’s ability to extract crack features. The Up module was used to replace traditional upsampling methods, supplementing crack information, reducing spatial information loss, and alleviating edge artifacts. Experimental results show that the WFU-Unet network achieves a Miou of 81.68%, precision of 91.43%, recall of 86.63%, and F1-Score of 88.93%. Moreover, in generalization experiments, the network’s performance remains stable, demonstrating its good applicability. Finally, we integrated the WFU-Unet network with the inspection vehicle to create an IoT-based inspection system, which was tested in real-world applications and performed excellently. Thus, the WFU-Unet combined with the inspection vehicle can be applied to large-scale road crack detection tasks, enabling rapid and accurate identification of road cracks and early-stage repairs.
Data availability
The publicly available dataset used in this paper can be accessed at the following link: https://github.com/cuilimeng/CrackForest-datasetand GitHub - css1025gwwww/road-crack. The custom dataset can be obtained from the corresponding author via email: mzliuyunqing@163.com.
References
Meng, S. Standards for Quality Inspection and Evaluation of Highway Engineering. (ed. Meng, S.) 29–30 (China Communications Press, 2017).
Liu, Y., Cho, S., Spencer, B. F. & Fan, J. Automated assessment of cracks on concrete surfaces using adaptive digital image processing. Smart Struct. Syst. 14, 719–741. https://doi.org/10.12989/sss.2014.14.4.719 (2014).
Li, Q. & Liu, X. Novel approach to pavement image segmentation based on neighboring difference histogram method. 2008 Congress Image Signal. Process. 2, 792–796. https://doi.org/10.1109/CISP.2008.13 (2008).
Yu, X., Juha, Y. & Yuan, B. A new algorithm for texture segmentation based on edge detection. Pattern Recognit. 24, 1105–1112. https://doi.org/10.1016/0031-3203(91)90125-O (1991).
Peng, B., Zhang, L. & Zhang, D. Automatic image segmentation by dynamic region merging. IEEE Trans. Image Process. 20, 3592–3605. https://doi.org/10.1109/TIP.2011.2157512 (2010).
Subirats, P., Dumoulin, J., Legeay, V. & Barba, D. Automation of pavement surface crack detection using the continuous wavelet transform. Proc. 2006 Int. Conf. Image Process. 3037-3040 https://doi.org/10.1109/ICIP.2006.313007 (2006).
Cha, Y. J., Choi, W. & Büyüköztürk, O. Deep learning-based crack damage detection using convolutional neural networks. Comput. Aided Civ. Infrastruct. Eng. 32, 361–378. https://doi.org/10.1111/mice.12263 (2017).
Xu, Y., Li, D., Xie, Q., Wu, Q. & Wang, J. Automatic defect detection and segmentation of tunnel surface using modified mask R-CNN. Measurement 178, 109316. https://doi.org/10.1016/j.measurement.2021.109316 (2021).
Yang, X. et al. Automatic pixel-level crack detection and measurement using fully convolutional network. Comput. Aided Civ. Infrastruct. Eng. 33, 1090–1109. https://doi.org/10.1111/mice.12412 (2018).
Ren, Y. et al. Image-based concrete crack detection in tunnels using deep fully convolutional networks. Constr. Build. Mater. 234, 117367. https://doi.org/10.1016/j.conbuildmat.2019.117367 (2020).
Liu, Z., Cao, Y., Wang, Y. & Wang, W. Computer vision-based concrete crack detection using U-net fully convolutional networks. Autom. Constr. 104, 129–139. https://doi.org/10.1016/j.autcon.2019.04.005 (2019).
Zhou, Q., Qu, Z. & Cao, C. Mixed pooling and richer attention feature fusion for crack detection. Pattern Recognit. Lett. 145, 96–102. https://doi.org/10.1016/j.patrec.2021.02.005 (2021).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention 9351, 234–241. https://doi.org/10.48550/arXiv.1505.04597 (2015).
Yuan, G. et al. CurSeg: A pavement crack detector based on a deep hierarchical feature learning segmentation framework. IET Intell. Transp. Syst. 16, 782–799. https://doi.org/10.1049/itr2.12173 (2022).
Liu, X. et al. DS-MENet for the classification of citrus disease. Front. Plant. Sci. 13, 884464. https://doi.org/10.3389/fpls.2022.884464 (2022).
Yang, J. et al. Concrete crack segmentation based on UAV-enabled edge computing. Neurocomputing 485, 233–241. https://doi.org/10.1016/j.neucom.2021.03.139 (2022).
Huang, J. & Liu, G. An iterative up-sampling convolutional neural network for glass curtain crack detection using unmanned aerial vehicles. J. Build. Eng. 97, 110814. https://doi.org/10.1016/j.jobe.2024.110814 (2024).
Shan, J., Huang, Y., Jiang, W. & DCUFormer enhancing pavement crack segmentation in complex environments with dual-cross/upsampling attention. Expert Syst. Appl. 264, 125891. https://doi.org/10.1016/j.eswa.2024.125891 (2024).
Wang, J. et al. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3007–3016. https://doi.org/10.1109/ICCV.2019.00310 (2019).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. Med. Image Comput. Comput. Assist. Intervent. 9351, 234–241. https://doi.org/10.48550/arXiv.1505.04597 (2015).
Zhou, Z., Siddiquee, M. R., Tajbakhsh, N. & Liang, J. U-Net++: A nested U-Net architecture for medical image segmentation. Deep Learn. Med. Image Anal. (DLMIA) MICCAI 2018 11045, 13–11. https://doi.org/10.48550/arXiv.1807.10165 (2018).
Mubashar, M. et al. R2U++: A multiscale recurrent residual U-Net with dense skip connections for medical image segmentation. Neural Comput. Appl. 34, 17723–17739. https://doi.org/10.48550/arXiv.2206.01793 (2022).
Liu, Y. et al. Road crack detection based on the fusion of separable convolution and wavelet transform. Comput. Sci. 51, 314–322. https://doi.org/10.11896/jsjkx.240100141 (2024).
Zhang, Q. et al. Improved U-net network asphalt pavement crack detection method. PLoS ONE 19 https://doi.org/10.1371/journal.pone.0300679 (2024).
Yang, F. et al. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 21, 1525–1535. https://doi.org/10.1109/TITS.2019.2910595 (2019).
Shi, Y., Cui, L., Qi, Z., Meng, F. & Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 17, 3434–3445. https://doi.org/10.1109/TITS.2016.2552248 (2016).
Chen, L. C. et al. Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834–848. https://doi.org/10.1109/TPAMI.2017.2699184 (2018).
Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. Proc. 2017 IEEE Conf. Comput. Vis. Pattern Recognit. 6230-6239 https://doi.org/10.1109/CVPR.2017.660 (2017).
Funding
This work is supported by National Natural Science Foundation of China (grant no. 42204144).
Author information
Authors and Affiliations
Contributions
Q.Z.: Designed methodological framework and manuscript writing. S. H.: Analyzing the issues; perform corresponding experiments; finish writing the manuscript together. H. W. and Z. J.: Validation of the proposed method; Completion of a large number of comparative tests. S. Z. and Y. L.: Provide resources and assist with the advancement of thesis work.All the authors contributed extensively to the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, Q., Huang, S., Wang, H. et al. Segmentation detection method in tree-shaded environment for road cracks collected by inspection vehicle on WFU-Unet. Sci Rep 15, 11760 (2025). https://doi.org/10.1038/s41598-025-96219-9
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-96219-9







