Introduction

With the continuous development of infrastructure and transportation in China, the scale of concrete bridge construction has expanded steadily. By the end of 2023, China had 1.0793 million highway bridges with a total length of 95.2882 million meters [1]. As service time increases, concrete bridges may develop various diseases, which not only reduce the durability and reliability of the bridge but may even lead to collapse. Timely detection of diseases, followed by appropriate reinforcement, is therefore crucial for ensuring the safety of concrete bridges. Traditional bridge inspection relies mainly on manual inspection, which is subjective, risky, labor-intensive, and inefficient. Drones and wall-climbing robots can collect images of bridge surfaces, but the diseases in these images still have to be identified manually, so the overall process remains inefficient.

Image processing methods can partly replace manual recognition and identify diseases in images automatically. Perry et al. [2] used the Canny edge detection operator to segment cracks in digital images of concrete. Liu et al. [3], Humpe et al. [4], Wang et al. [5], and Jiang et al. [6] used the Otsu threshold segmentation method and its variants to segment cracks on the surface of bridge structures. In addition, Morgenthal et al. [7] used a multi-scale centerline detection method to extract cracks from bridge inspection images, and Peng et al. [8] used the maximum entropy method to achieve pixel-level segmentation of concrete cracks. These methods are relatively easy to implement and have clear physical meanings, but they are sensitive to image noise and are only suitable for diseases with clear edges, such as cracks. Their applicability to area-type diseases and other complex diseases is poor.

In recent years, deep learning has been widely applied to bridge surface disease detection because of its high performance and flexibility. Lin et al. [9] integrated the FPN [10] structure into the Fast-RCNN framework and detected multiple diseases, including concrete cracks, spalling, efflorescence, corrosion stains, and exposed reinforcement rebar. Yang et al. [11] introduced the atrous spatial pyramid pooling (ASPP) module into YOLOv3 and adopted a transfer learning strategy to identify concrete bridge surface diseases, increasing the mPA by 1.3%. Mu et al. [12] proposed an adaptive cropping shallow attention network based on the YOLOv5 object detection framework, which improved the detection accuracy of steel structure surface diseases and bolt diseases. Wan et al. [13] proposed BR-DETR, a model for bridge surface diseases based on object detection transformers, which introduced modules such as deformable convolution and convolutional project attention and improved the accuracy of disease detection.

The above methods can quickly detect surface diseases of bridges, but they cannot provide more detailed information such as disease area. To obtain such quantitative parameters, semantic segmentation has been applied to bridge surface disease identification. Rubio et al. [14] proposed a pixel-level identification method for concrete bridge delamination and rebar exposure based on FCN, using a pre-trained VGG as the feature extractor to improve the average accuracy for both diseases. Li et al. [15] proposed SCCDNet, a CNN-based pixel-level crack segmentation network that uses depthwise separable convolution to improve the accuracy of crack segmentation and reduce model complexity. Ding et al. [16] proposed an independent boundary refinement transformer based on Swin-Transformer for crack segmentation in drone-captured images, which can quantify crack widths of less than 0.2 mm.

The Google Brain team developed a series of Deeplab models [17,18,19,20]. Among them, DeeplabV3+ adopts an encoder-decoder architecture, introduces atrous spatial pyramid pooling, and expands the receptive field through parallel or serial dilated convolutions with different dilation rates, which reduces detail loss and further improves model accuracy. Fu et al. [21] proposed a bridge crack semantic segmentation algorithm based on an improved DeeplabV3+ network, which changed the ASPP module from a parallel structure into a densely connected one and improved the mIoU of the model. Zhang et al. [22] proposed the FDA-Deeplab model, which introduced a dual attention mechanism, integrated high-level and low-level features, and used a sample difficulty weight adjustment factor to address sample imbalance. Jia et al. [23] built a semantic segmentation network based on DeeplabV3+ and ResNet50, optimized the probability threshold of pixel categories, and improved pixel-level detection accuracy. Although these methods have improved segmentation accuracy, existing models still suffer from local detail loss, a large number of parameters, and slow inference. Moreover, most algorithms target a single disease, and there is relatively little research on identifying multiple diseases. In practice, a bridge usually exhibits several diseases at once, and diseases with similar characteristics can interfere with each other during identification, which makes detection more difficult.

In response to the above issues, this paper proposes a lightweight semantic segmentation method for concrete bridge surface diseases based on an improved DeeplabV3+. First, the original backbone network was replaced with an improved MobileNetV3 [24], in which the attention module was replaced and the time-consuming layers were optimized to accelerate training, reduce the number of parameters, and meet lightweight requirements. Second, the CSF-ASPP module was designed: it achieves cross-scale fusion through cascading, adds a convolution branch, modifies the dilation rates, and replaces traditional convolutions with depthwise over-parameterized convolutions (DO-Convs) [25]. These changes improve the model's multi-scale feature extraction ability, its ability to detect small-area diseases, and its anti-interference ability. Finally, the focal loss function [26] was adopted to address category imbalance by paying more attention to difficult samples. The improved model has better real-time performance and higher accuracy, lowers the performance requirements for edge artificial intelligence devices, is suitable for complex disease segmentation environments, and can provide reliable support for subsequent quantitative analysis. It has broad potential for application in embedded devices, mobile devices, and real-time systems.

Research method

Method overview

The network structure of the improved DeeplabV3+ proposed in this paper is shown in Fig. 1. The improved MobileNetV3 serves as the backbone network and extracts low-level and high-level features from the input image; the high-level features are passed to the CSF-ASPP module. The CSF-ASPP module processes the high-level features with six branches: a 1 × 1 convolution for feature extraction, four 3 × 3 DO-Convs with dilation rates of 4, 8, 12, and 16 for cross-pixel feature extraction, and image pooling. The input of each DO-Conv layer is the concatenation, along the channel dimension, of the feature layer output by the dilated convolution of the preceding branch and the feature layer output by the backbone network. The six resulting feature layers are stacked, and a 1 × 1 convolution adjusts the number of channels to obtain a high-level feature layer. After upsampling, this high-level feature layer is fused with the low-level feature layer, whose number of channels has been adjusted by a 1 × 1 convolution, giving a feature layer that contains both low-level and high-level features of the input image. A 3 × 3 convolution then extracts features, and the result is upsampled to the size of the input image before being output. Finally, the focal loss function is used to calculate the prediction error and adjust the network parameters.

Fig. 1 Improved DeeplabV3+ network structure.
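The cascade described above changes the input width of each dilated branch, which is easy to lose track of when reading the figure. The following is a minimal bookkeeping sketch, not the authors' implementation. It assumes 256 output channels per branch and a hypothetical 160-channel backbone output (neither value is stated in the text), and it assumes the first dilated branch receives only the backbone features.

```python
def csf_aspp_channel_plan(backbone_channels, branch_channels=256,
                          dilations=(4, 8, 12, 16)):
    """Return (name, in_channels, out_channels) for every CSF-ASPP layer.

    branch_channels=256 is an assumed value, not one reported in the text.
    """
    plan = [("conv1x1", backbone_channels, branch_channels)]
    previous_out = 0  # assumption: the first dilated branch sees only the backbone
    for rate in dilations:
        # Cascade: backbone features concatenated with the preceding branch output.
        plan.append((f"doconv3x3_rate{rate}", backbone_channels + previous_out,
                     branch_channels))
        previous_out = branch_channels
    plan.append(("image_pooling", backbone_channels, branch_channels))
    # The six branch outputs are stacked, then a 1x1 convolution adjusts channels.
    stacked = branch_channels * len(plan)
    plan.append(("fuse_conv1x1", stacked, branch_channels))
    return plan


if __name__ == "__main__":
    # 160 is a hypothetical backbone width used only for illustration.
    for name, c_in, c_out in csf_aspp_channel_plan(backbone_channels=160):
        print(f"{name:18s} {c_in:5d} -> {c_out}")
```

Under these assumptions, every dilated branch after the first sees the backbone features plus one extra branch output, and the fusion convolution reduces six stacked outputs back to a single branch width.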

Replace the backbone network

The backbone network Xception [27] of the DeeplabV3+ model has a large number of parameters, so applying the model to the semantic segmentation of bridge surface diseases is computationally expensive and time-consuming. Therefore, we adopt the lightweight MobileNetV3 network as the backbone network. MobileNetV3 uses the depthwise separable convolution of MobileNetV1 [28] and the inverted residual structure with linear bottleneck of MobileNetV2 [29]. On this basis, it uses the h-swish function instead of the swish activation function and adds the SENet attention module.

The swish activation function can improve the accuracy of the network, but it requires a large amount of computation and is not well suited to lightweight networks. The h-swish function has a similar effect at a much lower computational cost. Therefore, MobileNetV3 replaces the swish activation function with the h-swish function, which is also simple to differentiate. The swish and h-swish activation functions are given in Eqs. (1) and (2), respectively.

$$\mathrm{swish}\left( x \right) = x \cdot \frac{1}{1 + e^{-x}}$$
(1)
$$\mathrm{h\text{-}swish}\left( x \right) = x \cdot \frac{\mathrm{ReLU6}\left( x + 3 \right)}{6}$$
(2)
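To make the comparison concrete, the sketch below evaluates both activations in plain Python, taking swish as x multiplied by the sigmoid of x and h-swish as x multiplied by ReLU6(x + 3) / 6. It is only an illustration of the two formulas; the function names are chosen here and are not taken from any library.

```python
import math


def swish(x: float) -> float:
    """swish(x) = x * sigmoid(x); needs one exponential per call."""
    return x / (1.0 + math.exp(-x))


def relu6(x: float) -> float:
    """ReLU6 clips its input to the interval [0, 6]."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """h-swish(x) = x * ReLU6(x + 3) / 6; piecewise polynomial, no exponential."""
    return x * relu6(x + 3.0) / 6.0


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  h-swish={h_swish(x):+.4f}")
```

The two curves stay close over the whole real line, while h-swish only needs comparisons, one addition, and two multiplications.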

MobileNetV3 adopts the channel attention module SENet, which applies global average pooling to the input feature layer followed by two fully connected layers. The resulting normalized weights are then multiplied channel-wise with the original input feature layer to recalibrate it. The structure of SENet is shown in Fig. 2a. The ECA-Net module [30] improves on SENet in two respects: it avoids dimensionality reduction and it enables cross-channel information exchange. The fully connected layers of SENet are removed to avoid the impact of dimensionality reduction on the prediction results. The structure of ECA-Net is shown in Fig. 2b. We therefore replaced the SENet module with the ECA-Net module and truncated the time-consuming layers in the backbone network, which reduces the model parameters and computational complexity and, in turn, the training and prediction time required for bridge surface disease segmentation. Table 1 shows the structure of the improved MobileNetV3 model.

Fig. 2 Comparison of SENet and ECA-Net structures.
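As a rough illustration of the difference between the two attention modules, the sketch below implements an ECA-style channel attention in plain Python: global average pooling, a small one-dimensional convolution over neighbouring channels (so no dimensionality reduction), a sigmoid, and channel-wise rescaling. The kernel values, the odd kernel length, the zero padding, and the helper `se_fc_weights` are assumptions made for the example and do not come from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def eca_channel_weights(channel_means, kernel):
    """Slide a small 1D kernel over neighbouring channels, then apply a sigmoid."""
    pad = len(kernel) // 2  # assumes an odd kernel length and zero padding
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    return [sigmoid(sum(w * padded[c + j] for j, w in enumerate(kernel)))
            for c in range(len(channel_means))]


def eca(feature_map, kernel):
    """feature_map: list of C channels, each an H x W nested list of floats."""
    # Global average pooling: one scalar per channel.
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
             for ch in feature_map]
    # Channel weights come from local cross-channel interaction only.
    weights = eca_channel_weights(means, kernel)
    # Rescale every channel by its weight.
    return [[[weights[c] * v for v in row] for row in ch]
            for c, ch in enumerate(feature_map)]


def se_fc_weights(channels, reduction):
    """Weight count of the two fully connected layers of an SE block (no biases)."""
    return 2 * channels * (channels // reduction)


if __name__ == "__main__":
    fm = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
    print(eca(fm, kernel=[0.25, 0.5, 0.25]))
    # Hypothetical sizes: 96 channels, reduction ratio 4, ECA kernel of length 3.
    print(se_fc_weights(96, 4), "SE weights vs", 3, "ECA weights")
```

The weight count of the ECA-style module equals the kernel length, whereas the two fully connected layers of an SE block grow quadratically with the number of channels.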

Table 1 The structure of the improved MobileNetV3 model.

CSF-ASPP module

The structure of the original ASPP module is shown in Fig. 3a. Its convolution kernels with larger dilation rates are beneficial for extracting information from diseases that cover larger areas. In practical engineering, however, the surface diseases of concrete bridges often cover small areas, so detailed and spatial information is easily lost during feature extraction. In addition, mutual interference between multiple diseases can lead to insufficient mining of context information in the deep layers of the network. To address these issues, we improved the ASPP module; the structure of the improved CSF-ASPP module is shown in Fig. 3b.

Fig. 3 Comparison of ASPP module before and after improvement.

We added a dilated convolution branch (branch 5) and changed the dilation rates of the dilated convolutions from 6, 12, 18 to 4, 8, 12, 16. The modified module has convolution kernels at more scales and can therefore extract features at more scales, which enhances the model's ability to identify and extract diseases of different sizes. At the same time, to address the interference between multiple diseases, we borrowed the idea of the DenseNet [31] model structure, redesigned the relationship between the branches, and cascaded them, which enhances the complementarity between multi-scale features and improves the anti-interference ability of the model. Finally, we replaced the 3 × 3 standard convolutions of the original module with depthwise over-parameterized convolutions (DO-Convs). DO-Conv adds an additional depthwise convolution to an existing convolutional layer, improving the model's feature representation ability. Because a DO-Conv can be converted into a traditional convolution, replacing traditional convolutions in a model with DO-Convs does not increase the computational requirements at inference. DO-Conv combines ordinary convolution and depthwise convolution through kernel combination, as illustrated in Fig. 4.
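One way to see what the new dilation rates do is to compute the span of a dilated kernel, using the standard relation k + (k - 1)(d - 1) for a kernel of size k and dilation d. The sketch below applies it to the original and modified rates, and also gives the receptive field of a purely sequential stack as an upper bound for the cascaded path. It assumes stride 1 and is an illustration only; the actual CSF-ASPP branches also receive the backbone features directly.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span, in pixels, covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def cascaded_receptive_field(kernel_size, dilations):
    """Receptive field after each layer of a stride-1 stack of dilated convs."""
    field = 1
    fields = []
    for dilation in dilations:
        field += effective_kernel_size(kernel_size, dilation) - 1
        fields.append(field)
    return fields


if __name__ == "__main__":
    print([effective_kernel_size(3, d) for d in (6, 12, 18)])     # original ASPP
    print([effective_kernel_size(3, d) for d in (4, 8, 12, 16)])  # CSF-ASPP rates
    print(cascaded_receptive_field(3, (4, 8, 12, 16)))            # cascaded path
```

With the modified rates the smallest branch spans 9 pixels instead of 13, which is consistent with the stated aim of retaining small-area diseases, while the cascaded path can still reach a large context.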

Fig. 4 DO-Conv kernel combination.

In Fig. 4, \(\boldsymbol{D}^{\mathrm{T}}\) is the transpose of the depthwise kernel \(\boldsymbol{D}\) used in the feature composition of DO-Conv; \(\boldsymbol{P}\) is a two-dimensional tensor whose shape is determined by the size and stride of the convolution kernel; \(\boldsymbol{W}\) is the three-dimensional weight; \(C_{\mathrm{in}}\) is the number of input channels; \(C_{\mathrm{out}}\) is the number of output channels; \(M \times N\) is the size of the convolution kernel; and \(D_{\mathrm{mul}}\) is the depth multiplier, so that each input channel is converted to \(D_{\mathrm{mul}}\)-dimensional features. The symbol \(\circ\) denotes depthwise convolution and \(\ast\) denotes standard convolution. Compared with traditional convolution, DO-Conv uses more parameters during training without increasing the computational complexity of network inference, which both accelerates convergence and improves network performance. The computational cost of forward propagation with the DO-Conv kernel combination method is shown in Eq. (3), where \(H \times W\) (non-bold) denotes the spatial size of the feature map.

$$\left\{ \begin{array}{l} \boldsymbol{W}^{\prime} = \boldsymbol{D}^{\mathrm{T}} \circ \boldsymbol{W}: \; D_{\mathrm{mul}} \times \left( M \times N \right) \times C_{\mathrm{in}} \times C_{\mathrm{out}} \\ O = \boldsymbol{W}^{\prime} \ast \boldsymbol{P}: \; C_{\mathrm{out}} \times C_{\mathrm{in}} \times H \times W \times \left( M \times N \right) \end{array} \right.$$
(3)
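The claim that DO-Conv adds no inference cost rests on the fact that the depthwise kernel can be folded into the standard kernel once, after which only an ordinary convolution remains. The sketch below checks this numerically for a single kernel position using nested lists. The tensor layouts (D indexed as position, depth, channel; W as output channel, depth, channel; P as position, channel) follow the usual DO-Conv formulation and are an assumption of this example, as are all function names.

```python
import random


def compose_kernel(D, W):
    """Kernel composition: W'[o][p][c] = sum_d D[p][d][c] * W[o][d][c]."""
    n_pos, d_mul, c_in = len(D), len(D[0]), len(D[0][0])
    return [[[sum(D[p][d][c] * W[o][d][c] for d in range(d_mul))
              for c in range(c_in)] for p in range(n_pos)]
            for o in range(len(W))]


def conv_patch(kernel, patch):
    """Standard convolution at one position: sum over kernel positions and channels."""
    return [sum(k_o[p][c] * patch[p][c]
                for p in range(len(patch)) for c in range(len(patch[0])))
            for k_o in kernel]


def two_step_doconv(D, W, patch):
    """Feature composition: depthwise step (D o P), then standard step with W."""
    n_pos, d_mul, c_in = len(D), len(D[0]), len(D[0][0])
    p_prime = [[sum(D[p][d][c] * patch[p][c] for p in range(n_pos))
                for c in range(c_in)] for d in range(d_mul)]
    return [sum(W[o][d][c] * p_prime[d][c]
                for d in range(d_mul) for c in range(c_in))
            for o in range(len(W))]


if __name__ == "__main__":
    rng = random.Random(0)
    n_pos, d_mul, c_in, c_out = 9, 9, 2, 4  # 3 x 3 kernel; sizes are illustrative
    D = [[[rng.uniform(-1, 1) for _ in range(c_in)] for _ in range(d_mul)]
         for _ in range(n_pos)]
    W = [[[rng.uniform(-1, 1) for _ in range(c_in)] for _ in range(d_mul)]
         for _ in range(c_out)]
    patch = [[rng.uniform(-1, 1) for _ in range(c_in)] for _ in range(n_pos)]
    print(conv_patch(compose_kernel(D, W), patch))
    print(two_step_doconv(D, W, patch))
```

Both paths give the same output up to floating-point error, so the folded kernel can be stored after training and used as a plain 3 × 3 convolution.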

Loss function

The original DeeplabV3+ network uses the cross-entropy loss function to calculate the prediction error, which is then backpropagated to adjust the network parameters. The cross-entropy loss function is given in Eq. (4).

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{K} y_{i,c}\log\left( p_{i,c} \right)$$
(4)

In the formula, \(N\) is the total number of pixels; \(c\) is a semantic segmentation category; \(K\) is the total number of semantic segmentation categories; \(y_{i,c}\) is an indicator function that takes 1 if the true class of pixel \(i\) is \(c\) and 0 otherwise; and \(p_{i,c}\) is the predicted probability that pixel \(i\) belongs to category \(c\).

In actual bridge inspection, surface diseases vary widely in type and shape, and the disease area is far smaller than the non-disease area. As a result, most pixels in the concrete bridge surface disease dataset belong to the background category, while disease pixels are rare. This extreme sample imbalance causes the model to optimize its predictions for the majority class while neglecting the minority classes, and the cross-entropy loss function cannot effectively balance the learning of the under-represented classes. The focal loss function [26] introduces a modulating factor that reduces the loss contribution of easy-to-classify samples, so that training focuses more on difficult-to-classify or minority-class samples. The focal loss function is given in Eq. (5).

$$L_{FL} = -\alpha_{t}\left( 1 - p_{t} \right)^{\gamma}\log\left( p_{t} \right)$$
(5)

In the formula, \(p_{t}\) is the predicted probability of the true category of the sample; \(\alpha_{t}\) is a balance factor used to adjust the loss of the model on each category; and \(\gamma\) is an adjustable focusing parameter that makes the model focus more on difficult-to-classify samples and improves its generalization ability.
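A small numerical comparison shows how the factor in Eq. (5) shifts the loss towards hard samples. The sketch below evaluates the per-pixel cross-entropy term and the per-pixel loss of Eq. (5) for an easy and a hard pixel. The values alpha_t = 0.25 and gamma = 2 are commonly used defaults and are an assumption here; the settings used in the experiments are not stated in the text.

```python
import math


def cross_entropy(p_t: float) -> float:
    """Per-pixel cross-entropy for the true class with predicted probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t: float, alpha_t: float = 0.25, gamma: float = 2.0) -> float:
    """Per-pixel loss of Eq. (5); alpha_t and gamma defaults are assumed values."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for label, p_t in (("easy pixel", 0.9), ("hard pixel", 0.1)):
        ce, fl = cross_entropy(p_t), focal_loss(p_t)
        print(f"{label}: p_t={p_t}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.4f}")
```

With these assumed settings, the loss of the easy pixel is scaled by 0.25 × 0.1² = 0.0025 while that of the hard pixel is scaled by 0.25 × 0.9² = 0.2025, so the hard pixel carries roughly 80 times more relative weight than under cross-entropy.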

By paying more attention to samples that are difficult to identify, the focal loss function can mitigate the imbalance of disease types in the data. Especially when disease samples are scarce, it helps the model identify these minority classes (diseases), reduces its bias towards the majority class (normal area), and makes it more adaptable and robust against complex backgrounds, thereby improving the overall performance of bridge disease segmentation. Therefore, we replaced the loss function of the model with the focal loss function to address sample imbalance.

Experimental results and analysis

Dataset of concrete bridge surface diseases

The concrete bridge surface disease dataset consists of RGB visible-light images of bridge diseases captured manually by a traffic inspection company. To ensure the diversity of the data and the generalization of the model, the images were collected by different bridge inspection personnel without fixed shooting conditions such as focal length, object distance, or shooting equipment, and the images have no fixed resolution. Images were retained only if the disease area was clear and the image resolution was high. The images also cover different lighting conditions and shooting angles, which helps the model adapt to complex environments. In the end, 420 images of common bridge surface diseases, including spalling, exposed reinforcement rebar, efflorescence, and cracks, were manually selected, and the surface disease areas were labeled at the pixel level using the software Labelme. Examples of concrete bridge surface disease images and their corresponding labels are shown in Fig. 5.

Fig. 5 Examples of concrete bridge surface disease images and their corresponding labels.

We used built-in modules in Python to augment the images, mainly through geometric transformations and color space transformations. The former include vertical flip and rotation (90°, 180°, −90°), and the latter include contrast enhancement and brightness enhancement. With these six augmentation methods, 2520 images were obtained, six times the size of the original dataset, and the dataset was divided into a training set and a validation set in an 8:2 ratio.
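The six augmentation operations and the resulting dataset sizes can be sketched as follows. To stay self-contained, the example works on a small grayscale image stored as a nested list rather than on RGB files, and the enhancement factors of 1.2 as well as the direction assigned to the 90° and −90° rotations are assumptions, since the text does not specify them.

```python
def vertical_flip(img):
    return [row[:] for row in img[::-1]]


def rotate_90_clockwise(img):
    return [list(col) for col in zip(*img[::-1])]


def rotate_180(img):
    return [row[::-1] for row in img[::-1]]


def rotate_90_counterclockwise(img):
    return [list(col) for col in zip(*img)][::-1]


def _clamp(value):
    return max(0, min(255, int(round(value))))


def enhance_brightness(img, factor=1.2):  # factor is an assumed value
    return [[_clamp(v * factor) for v in row] for row in img]


def enhance_contrast(img, factor=1.2):  # factor is an assumed value
    pixels = [v for row in img for v in row]
    mean = sum(pixels) / len(pixels)
    return [[_clamp(mean + factor * (v - mean)) for v in row] for row in img]


def augment(img):
    """Return the six augmented variants of one image."""
    return [vertical_flip(img), rotate_90_clockwise(img), rotate_180(img),
            rotate_90_counterclockwise(img), enhance_contrast(img),
            enhance_brightness(img)]


def split_counts(n_images, train_parts=8, val_parts=2):
    """Sizes of the training and validation sets for an 8:2 split."""
    n_train = n_images * train_parts // (train_parts + val_parts)
    return n_train, n_images - n_train


if __name__ == "__main__":
    total = 420 * len(augment([[10, 20], [30, 40]]))
    print(total, split_counts(total))
```

For the 420 selected images this gives 2520 augmented images, which an 8:2 split divides into 2016 training images and 504 validation images.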

Experimental environment and parameter setting

The experiments were run on a 64-bit Windows 10 operating system with an Intel i9-14900K CPU and an NVIDIA RTX 4080 GPU with 24 GB of memory. The model was built with PyTorch 1.11, and the GPU platform was NVIDIA CUDA 11.3. The input image size was set to 512 × 512, the batch size to 10, and the learning rate range to [0.00001, 0.001]. The optimizer was stochastic gradient descent with momentum (MSGD), with the momentum parameter set to 0.9 and the weight decay set to 0.001. Cosine annealing was chosen as the learning rate decay strategy, and the number of training epochs was set to 150.
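The learning rate schedule implied by these settings can be sketched with the standard cosine annealing formula. The example assumes that the two ends of the stated range are the initial and final learning rates, that the rate is updated once per epoch, and that there is no warm-up or restart; none of these details is given in the text.

```python
import math


def cosine_annealing_lr(epoch: int, total_epochs: int = 150,
                        lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    """Learning rate at a given epoch under cosine annealing without restarts."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


if __name__ == "__main__":
    for epoch in (0, 50, 75, 100, 150):
        print(f"epoch {epoch:3d}: lr = {cosine_annealing_lr(epoch):.6f}")
```

Under these assumptions the rate starts at 0.001, passes through the midpoint of the range at epoch 75, and reaches 0.00001 at epoch 150.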

Evaluation metrics

In the experiments, we used intersection over union (IoU), mean intersection over union (mIoU), pixel accuracy (PA), mean pixel accuracy (mPA), number of parameters, and frames per second (FPS) as evaluation metrics.

IoU is the ratio of the intersection to the union of the true and predicted label sets for a specific category, and mIoU is the average IoU over all categories. The mathematical expressions for IoU and mIoU are shown in Eqs. (6) and (7).

$$IoU_{i} = \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$
(6)
$$mIoU = \frac{1}{k + 1}\sum_{i=0}^{k} IoU_{i}$$
(7)

In the formulas, \(p_{ij}\) is the number of pixels whose true label is \(i\) and whose predicted category is \(j\); \(k\) is the largest valid category label, with labels running from 0 to \(k\); and \(k + 1\) is the total number of categories.

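The overall PA in Eq. (8) is the proportion of correctly predicted pixels among all pixels. For a single category, PA is the proportion of correctly predicted pixels of that category to the total number of pixels in that category, which is the summand of Eq. (9); mPA is the average of this per-category PA over all categories. The mathematical expressions for PA and mPA are shown in Eqs. (8) and (9).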

$$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}$$
(8)
$$mPA = \frac{1}{k + 1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$
(9)
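All four accuracy metrics can be read off a confusion matrix whose entry in row i and column j counts the pixels with true label i and predicted label j. The sketch below implements Eqs. (6) to (9) in that form. The example matrix is made up for illustration, and the functions assume that every category appears at least once so that no denominator is zero.

```python
def per_class_iou(cm):
    """Eq. (6): IoU of each category from confusion matrix cm[true][predicted]."""
    n = len(cm)
    ious = []
    for i in range(n):
        true_pixels = sum(cm[i])                       # sum_j p_ij
        pred_pixels = sum(cm[j][i] for j in range(n))  # sum_j p_ji
        ious.append(cm[i][i] / (true_pixels + pred_pixels - cm[i][i]))
    return ious


def mean_iou(cm):
    """Eq. (7): average IoU over all k + 1 categories."""
    ious = per_class_iou(cm)
    return sum(ious) / len(ious)


def pixel_accuracy(cm):
    """Eq. (8): correctly predicted pixels over all pixels."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


def per_class_pa(cm):
    """Summand of Eq. (9): correct pixels of a category over its true pixels."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


def mean_pa(cm):
    """Eq. (9): average per-category pixel accuracy."""
    pas = per_class_pa(cm)
    return sum(pas) / len(pas)


if __name__ == "__main__":
    # Made-up two-category example: background (0) and one disease (1).
    cm = [[90, 10],
          [5, 45]]
    print(per_class_iou(cm), mean_iou(cm))
    print(pixel_accuracy(cm), per_class_pa(cm), mean_pa(cm))
```

In this made-up example the disease category has a lower IoU (0.75) than the background (about 0.857) even though both have the same per-category PA of 0.9, because IoU also penalizes background pixels wrongly predicted as disease.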

Comparison of training processes

To compare the performance of DeeplabV3+ and the improved DeeplabV3+ visually, Fig. 6 presents the loss and mIoU values of both models during training. The training loss of the improved model decreases significantly faster than that of the original model in the initial stage and stabilizes within fewer epochs. Its overall loss is also significantly lower: at the 150th epoch, the improved model recorded a loss of 0.032, compared with 0.074 for the original model. In addition, the improved model achieved higher mIoU values in the early stages of training and showed higher and more stable segmentation accuracy, with less fluctuation, in subsequent epochs. At the 150th epoch, the improved model reached an mIoU of 75.24%, while the original model reached only 71.51%. This indicates that the improved DeeplabV3+ converges better on the concrete bridge surface disease dataset and is better able to capture multi-scale information and detailed features.

Fig. 6 Model training process curve.

Backbone network comparison experiments

To test the impact of different backbone networks on the performance of DeeplabV3+ and the effectiveness of the proposed improvement to MobileNetV3, we selected Xception [27], EfficientNet V2 [32], Resnet101 [33], MobileNetV3 [24], and the improved MobileNetV3 as backbone networks and performed comparative experiments under the same environment and parameters. The experimental results are shown in Table 2.

Table 2 Performance comparison of different backbone networks.

As can be seen from Table 2, when the improved MobileNetV3 was used as the backbone network, all four evaluation metrics improved compared with the other backbone networks. Compared with Resnet101 and MobileNetV3, the mIoU increased by 5.87% and 2.31%, respectively, and the mPA increased by 9.50% and 5.62%, respectively. In addition, the number of parameters is the lowest and the FPS is the highest. Compared with the Xception and EfficientNet V2 backbones, the mIoU and mPA of the improved MobileNetV3 were also slightly higher, and the number of parameters was reduced by 90.44% and 88.02%, respectively. Taken together, the four metrics show that the improved MobileNetV3 significantly reduces the number of network parameters while maintaining segmentation performance, making the model lightweight.

Comparative experiments of CSF-ASPP module optimization

The CSF-ASPP module was optimized from the ASPP structure by designing a cascade structure, using DO-Conv, increasing the number of branches, and adjusting the dilation rates. To verify the improvement, comparative experiments on the CSF-ASPP module were conducted with the improved MobileNetV3 as the backbone network. The experimental results are shown in Table 3.

Table 3 Performance comparison experiments of CSF-ASPP module.

According to Table 3, both mIoU and mPA were higher with the cascaded structure than with the original ASPP structure. Performance was best when the convolution type was DO-Conv, the number of dilated branches was 4, and the dilation rates were (4, 8, 12, 16), with the mIoU and mPA reaching 74.60% and 83.99%, respectively. Compared with the original ASPP module, the CSF-ASPP module improved the mIoU and mPA by 2.81% and 3.15%, respectively, which shows that it performs better than the original ASPP module in the semantic segmentation of concrete bridge surface diseases.

Ablation experiments

To further verify the effectiveness of the improved MobileNetV3 module, the CSF-ASPP module, and the focal loss function in segmenting the four types of diseases (spalling, exposed reinforcement rebar, efflorescence, and cracks), DeeplabV3+ was defined as the baseline method and ablation experiments were conducted. The experimental results are shown in Table 4.

Table 4 Statistics of evaluation metrics for ablation experiments.

Table 4 shows that, compared with the baseline method, using the improved MobileNetV3 as the backbone network significantly improves the inference speed of the model and also improves accuracy to some extent. After the CSF-ASPP module was introduced, the inference speed decreased slightly, but the segmentation accuracy for each disease improved significantly. The largest gain was for efflorescence, whose IoU and PA increased by 6.36% and 8.10%, respectively. This is because the CSF-ASPP module gives the model stronger multi-scale feature extraction and interference resistance, which particularly benefits diseases with blurred edges such as efflorescence. After the loss function was replaced with the focal loss function, the mIoU and mPA increased by 0.64% and 0.69%, respectively, with minimal impact on inference speed. In summary, the model with all three improvements has the best overall performance, with both inference speed and accuracy significantly improved.

Comparative experiments of different semantic segmentation models

To verify the effectiveness of the improved DeeplabV3+, comparative experiments were conducted with U-Net [34], HRNet [35], PSPNet [36], SETR [37], SegFormer [38], Mask2Former [39], PIDNet [40], and DeeplabV3+ [20]. The comparison results are shown in Table 5.

Table 5 Performance comparison results of different models.

Table 5 shows that, although our method is slightly inferior to PSPNet in number of parameters and FPS, it achieves the best mIoU and mPA, at 75.24% and 84.68%, respectively. Compared with the baseline method DeeplabV3+, our method improved the mIoU and mPA by 3.73% and 4.21%, respectively, reduced the number of parameters by 90.33%, and increased the FPS by 36.22. The improved model thus has fewer parameters and faster inference while improving segmentation accuracy, giving the best overall performance. To show the differences between the segmentation results of the methods more intuitively, we selected six representative images of concrete bridge surface diseases and visualized the results, as shown in Fig. 7. Figure 7a shows the disease images, Fig. 7b the label images, and Fig. 7c–k the semantic segmentation results obtained using U-Net, HRNet, PSPNet, SETR, SegFormer, Mask2Former, PIDNet, DeeplabV3+, and our method, respectively.

Fig. 7 Visual comparison of segmentation results.

Fig. 7 shows that, after training, the improved DeeplabV3+ produces segmentation results that are closer to the ground-truth labels and segments and classifies the surface diseases of concrete bridges more accurately. Its handling of edge pixel details is better than that of the other models, and its segmentation outlines are clear, indicating strong robustness.

Conclusion

To achieve high-precision and lightweight identification of surface diseases in concrete bridges, this paper proposed a lightweight semantic segmentation method for concrete bridge surface diseases based on an improved DeeplabV3+. First, the original backbone network was replaced with the lightweight MobileNetV3, the SENet module in MobileNetV3 was replaced with the ECA-Net module, and the time-consuming layers in the network were deleted, which significantly reduced the model parameters. Next, the CSF-ASPP module was designed by adding a convolutional branch and cascading adjacent branches to expand the receptive field and enhance pixel connectivity. The dilation rates were then set to [4, 8, 12, 16] and the standard convolutions were replaced with DO-Convs to improve model performance. Finally, the cross-entropy loss function was replaced with the focal loss function to strengthen training on under-represented samples. Experimental results indicate that the improved DeeplabV3+ outperforms the baseline method in segmentation accuracy for four types of concrete bridge surface diseases (spalling, exposed reinforcement rebar, efflorescence, and cracks), with mIoU and mPA reaching 75.24% and 84.68%, respectively, and also outperforms the other semantic segmentation models. The model has \(6.97 \times 10^{6}\) parameters and achieves an FPS of 52.64. Compared with the baseline method, the improved DeeplabV3+ increases the mIoU and mPA by 3.73% and 4.21%, reduces the parameter count by 90.33%, and increases the FPS by 36.22, thereby enabling efficient and rapid detection of concrete bridge surface diseases with a relatively small model size.

The dataset used in this study comes from images collected by a single traffic inspection company during the actual bridge inspection process. The disease samples are concentrated in temperate monsoon climate regions and lack images of diseases in extreme environments such as tropical and cold zones (such as peeling caused by freeze-thaw cycles and accelerated rebar corrosion in high temperature and high humidity areas). These images may have certain limitations in covering larger areas and richer scenes, which may affect the model’s generalization ability to bridges with significantly different environmental conditions or building materials.

In future research, we plan to collaborate with more bridge inspection agencies to collect more image data from different regions and bridges, and implement an active learning framework. This will further enrich the diversity and representativeness of the dataset, enabling better adaptation to the various complex environments encountered in practical engineering applications. Furthermore, we are considering deploying the proposed model to concrete bridge inspection drones. A multi-scale inference strategy will be implemented to dynamically adjust input resolution based on disease density, thereby reducing redundant computational overhead and balancing speed with accuracy. Leveraging drone metadata and Structure-from-Motion (SfM) techniques, we will achieve precise spatial calibration to convert segmentation masks into quantitative parameters (such as crack width and spalling area). This pipeline will provide scientific metrics for bridge maintenance decision-making, ensuring structural reliability and long-term durability.