Introduction

Bolt connections serve as critical components in joining elements within steel structure bridges1. The integrity of these connections is essential to the overall structural reliability. However, the sheer number of bolts involved increases the likelihood of bolt loss, which can result from installation omissions or loosening under long-term cyclic loading. The Nujiang Four-Track Bridge in Baoshan City, Yunnan Province, China, for instance, uses over 800,000 bolt sets2.

The loss of bolts compromises the load-bearing capacity of the connection, thereby threatening the overall structural integrity. Accordingly, there is a critical need for methods capable of quickly and precisely pinpointing missing bolts and quantifying their numbers; such data provides a definitive basis for prioritizing replacements to maintain structural integrity. While traditional practices depend on periodic, labor-intensive manual checks3,4,5, these approaches suffer from high costs, poor efficiency, and variable outcomes tied directly to operator skill. More recently, the field of artificial intelligence has yielded promising new solutions, particularly in deep learning-based image detection, which offers an automated alternative for identifying bolt defects.

The application of computer vision and deep learning for automatic bolt condition assessment has been extensively investigated. Early studies explored image-based identification of bolt defects; for instance, Cha et al6. successfully demonstrated the potential of traditional image processing combined with SVMs, using the Hough transform for robust feature extraction. As deep learning emerged, CNNs became the dominant architecture. Wang et al7., Zhao et al8., and Zhou et al9. all validated CNNs’ powerful capability for feature learning, applying them to detect various defects including loosening and absence, particularly in demanding contexts like bridge infrastructure. To overcome limitations, researchers developed targeted improvements. Li et al10. improved computational efficiency and reduced data requirements by incorporating time-frequency analysis into their CNN model. Yang et al11. refined detection accuracy by combining geometric transformations with IoU-guided selection. The pursuit of practical, versatile solutions led to several key developments: Ni et al12. created a multi-classification system able to distinguish between different fault types (corrosion, loosening) using an enhanced YOLOv5s, while Chen et al13. prioritized deployment feasibility with a lightweight YOLOv5 variant designed for mobile platforms. Efforts to enable real-time operation were realized by Pan et al14., who developed an integrated system for live tracking. Crucially, Lao et al15. addressed a fundamental challenge by analyzing and mitigating the impact of variable imaging conditions (such as focal length, angle, and illumination) through adaptive preprocessing and model tuning, thereby significantly improving the robustness of visual inspection systems.

While these studies have substantially advanced the field of deep learning-based bolt defect detection and introduced viable solutions for structural health monitoring, significant challenges persist. The accuracy of any detection system is susceptible to considerable variability in bolt appearance within images, stemming from diverse shooting angles, variable lighting conditions, and partial occlusions. Moreover, real-world applications present additional complexities: steel structures are situated in diverse environments such as urban areas, mountains, and rivers, which introduce intricate and cluttered backgrounds. Compounding this issue is the common practice of coating steel components in distinctive paints during fabrication to prevent corrosion. Collectively, these factors can severely degrade detection performance. Consequently, the practical efficacy and robustness of the previously discussed methods have yet to be fully validated within complex, real-world engineering environments.

Our investigation addresses the identified limitations by first constructing a diverse bolt image dataset gathered from three sources: field images of operational suspension bridges and transmission towers, and controlled images from a custom-designed bolt joint model. Following augmentation via both traditional and deep learning techniques, we performed a comparative analysis of YOLO-series models (v5, v8, and v10). Building on the results, an enhanced YOLOv8-based architecture is proposed, which integrates a Swin-Transformer16 backbone and a multi-scale and detail enhancement module (MEDM) to boost detection robustness in cluttered backgrounds and from multi-angle perspectives. The proposed model was validated through controlled experiments and real-world applications across various scenarios.

Dataset construction

Database collection

Bolt image data were collected through three primary methods: on-site acquisition from a suspension bridge, field photography of steel transmission towers, and image capture from a custom-built bolt joint model.

The first subset comprises 757 images of bolts from various structural components of a steel suspension bridge, as illustrated in Fig. 1.

Fig. 1

Bolt image database of steel suspension bridge: (a) Cable-hanger connection; (b) Top chord joint of main truss; (c) Lateral bracing diagonal-to-top chord connection.

The second subset includes 252 images of bolts captured at transmission tower sites, shown in Fig. 2.

Fig. 2

Transmission tower bolt database: (a) Single-bolt K-joint; (b) Single-bolt T-joint; (c) Single-bolt X-joint.

The third subset contains 988 images acquired from a specially designed bolt joint assembly, depicted in Fig. 3.

Fig. 3

Database of self-assembled bolt joint model: (a) Original color; (b) Gray coating.

Image enhancement and generative adversarial networks (GANs)

To enrich the diversity and robustness of our training dataset, we implemented a two-pronged augmentation strategy utilizing conventional image enhancement techniques and advanced deep learning models based on GANs17.

Image enhancement

To improve the diversity and robustness of the bolt image dataset, several enhancement techniques were applied, including the addition of Gaussian noise to simulate real-world sensor interference18, low-pass filtering to emulate blur under motion or defocus19, and color space conversion to enhance invariance to illumination changes20. Example results of these enhancement operations are shown in Fig. 4.
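The three enhancement operations can be illustrated with a minimal, dependency-free sketch. The function names, the noise level `sigma`, the 3 × 3 mean filter, and the BT.601 luma weights are illustrative assumptions, not the settings used in this study; a practical pipeline would typically rely on an image library instead of nested lists.

```python
import random


def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image (list of rows), clipped to 0..255."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in img]


def box_blur(img):
    """3x3 mean filter, a simple low-pass filter; borders use the available neighbours."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(round(sum(vals) / len(vals)))
        out.append(row)
    return out


def rgb_to_gray(rgb):
    """Colour space conversion: RGB pixels to luma (assumed ITU-R BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb]
```

Each function returns a new image, so the original and its enhanced variants can all be kept in the training set.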

Fig. 4

Examples of image enhancement: (a) Original image; (b) Gaussian noise; (c) Low-pass filtering; (d) Color space conversion.

Images generated by generative adversarial networks

Figure 5 presents examples of bolt images generated by the GAN. The GAN was designed to vary background and contextual elements for data augmentation while preserving the key structural features of the bolts themselves.

Fig. 5

Examples of images generated by the GAN: (a) Original image; (b) GAN-generated vegetation background; (c) GAN-generated misty background.

A total of 5,057 bolt images were obtained using the aforementioned collection and augmentation methods; the dataset composition is summarized in Table 1. These images were divided into a training set and a validation set in an 8:2 ratio. The X-AnyLabeling automatic annotation software was used to label the dataset with two classes: “screw” (bolt present) and “noscrew” (bolt missing). The labeled dataset was then used to train the model.
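An 8:2 split of this kind can be sketched as a seeded shuffle followed by a cut. The file names, the seed, and the choice to round the training count down are assumptions made for illustration; the source does not state how the split was drawn.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=42):
    """Shuffle reproducibly, then cut into training and validation subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)  # training count is rounded down
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 5,057 images.
names = [f"bolt_{i:04d}.jpg" for i in range(5057)]
train, val = split_dataset(names)
print(len(train), len(val))
```

Fixing the seed keeps the validation set identical across the compared models, which is what makes their metrics comparable.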

Table 1 Composition of the Multi-Source Bolt Image Dataset.

Comparative study on detection performance of different models

The dataset compiled through the aforementioned methods was used to train several models. A comparative analysis of their detection performance was conducted to inform subsequent model improvements.

Selected comparative models

Model training was carried out using the YOLO (You Only Look Once) framework21—a convolutional neural network-based object detection algorithm. Among its various versions, YOLOv5, YOLOv8, and YOLOv10 represent the most stable and widely adopted iterations and were therefore selected for comparison.

YOLOv5 model

YOLOv5 employs a modular architecture consisting of Input, Backbone, Neck, and Prediction Head. A distinctive feature of its Backbone is the Focus module, which performs slicing operations to enhance feature extraction while preserving contextual information22. In this study, the YOLOv5s variant was adopted as one baseline model to provide a performance benchmark for subsequent comparisons. The overall structure is shown in Fig. 6.

Fig. 6

Network structure of YOLOv5s.

YOLOv8 model

Compared to YOLOv5, YOLOv8 introduces the C2f module in place of the C3 module23. The C2f incorporates more skip connections and an additional split operation, while reducing convolutional operations in branch layers. This design lowers computational cost while improving gradient flow and feature representation capability. Based on its balanced performance in accuracy and efficiency observed in preliminary experiments, YOLOv8 was selected as the foundation for our proposed improved model. A schematic of these improvements is shown in Fig. 7.

Fig. 7

Architectural improvements in YOLOv8.

YOLOv10 model

As the latest iteration in the series, YOLOv10 achieves a balance between efficiency and accuracy through architectural refinements including enhanced feature fusion and lightweight design24. It was included in this study to compare its performance against YOLOv5 and YOLOv8 in the task of bolt and bolt-missing detection.

Model training parameters

Hyperparameters play a critical role in deep learning algorithms, governing model architecture configurations and profoundly influencing computational efficiency, final performance metrics, and convergence behavior. Given that the computational complexities of the three compared models (YOLOv5, YOLOv8, and YOLOv10) are comparable under equivalent parameterization, this study establishes a valid baseline for horizontal performance comparison. To optimize training regimens adaptively, we implemented an automated optimizer selection strategy: AdamW is deployed for training scenarios with ≤ 10⁴ iterations, while Stochastic Gradient Descent (SGD) is utilized for instances exceeding 10⁴ iterations. The complete set of model-specific hyperparameters is detailed in Table 2.
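The optimizer selection rule described above reduces to a threshold on the total number of training iterations. The sketch below assumes that an iteration means one batch update; the image count, batch size, and epoch values in the usage lines are hypothetical and are not the values from Table 2.

```python
import math


def total_iterations(num_train_images, batch_size, epochs):
    """Number of batch updates over the whole training run."""
    return math.ceil(num_train_images / batch_size) * epochs


def select_optimizer(iterations, threshold=10_000):
    """AdamW for runs of at most `threshold` iterations, SGD for longer runs."""
    return "AdamW" if iterations <= threshold else "SGD"


# Hypothetical settings for illustration only.
short_run = total_iterations(4045, batch_size=16, epochs=30)
long_run = total_iterations(4045, batch_size=16, epochs=100)
print(short_run, select_optimizer(short_run))
print(long_run, select_optimizer(long_run))
```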

Table 2 Model training parameters.

Using the parameters defined in Table 2, all three models underwent identical training protocols. Their respective detection performances were subsequently evaluated using standardized metrics including recall, precision25, and mean Average Precision (mAP)26.

Recall and precision analysis

This section analyzes the performance of the trained models using standard metrics designed for object detection tasks.

Recall

Recall measures the proportion of actual positive instances correctly identified by the model. It is defined as the number of true positive predictions divided by the total number of actual positives, as expressed in Eq. (1):

$$Recall=\frac{{TP}}{{TP+FN}}$$
(1)

where TP (True Positives) denotes the number of correctly detected bolt instances, and FN (False Negatives) refers to the number of actual bolts that were missed by the model.

Precision

Precision measures the proportion of detections labeled “screw” that are correct, calculated as the number of correct “screw” detections divided by the total number of “screw” detections, as shown in Eq. (2):

$$Precision=\frac{{TP}}{{TP+FP}}$$
(2)

where FP (False Positives) indicates the number of incorrect bolt predictions, which includes both detections in background areas and false detections on non-bolt objects.
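Eqs. (1) and (2) translate directly into two small functions. The counts in the usage lines are hypothetical and are not taken from Table 3; the zero-denominator guard is an added convention.

```python
def recall(tp, fn):
    """Eq. (1): share of actual bolts that the model detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Eq. (2): share of bolt detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: 90 correct detections, 10 false alarms, 20 missed bolts.
print(f"precision = {precision(90, 10):.3f}, recall = {recall(90, 20):.3f}")
```

The two metrics move independently: false alarms lower precision only, while missed bolts lower recall only.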

The recall and precision values of the three models are summarized in Table 3. YOLOv5, YOLOv8, and YOLOv10 all reached a precision of 0.9. The recall rates of YOLOv5 and YOLOv8 differ by 0.05, while the gap between YOLOv10 and YOLOv8 is 0.08, greater than that between YOLOv10 and YOLOv5. Because precision is identical across the three models, recall is the distinguishing metric, and on this basis YOLOv8 achieves the best overall performance.

Table 3 Model recall and precision rates.

Mean average precision (mAP) comparative analysis

Mean Average Precision (mAP) serves as the primary metric in this study for quantitatively evaluating and comparing the overall detection performance of the YOLOv5, YOLOv8, and YOLOv10 models on the task of bolt and missing-bolt detection. The mAP is computed as:

$$mAP=\frac{1}{n}\sum\limits_{{k=1}}^{{n}} {A{P_k}}$$
(3)

where AP (Average Precision) for a specific category is the area under its precision-recall curve, typically computed as the average of precision values at a set of recall levels. The variable n represents the total number of categories (here two: “screw” and “noscrew”), and APₖ denotes the Average Precision of the k-th category.
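As a minimal sketch of Eq. (3), the code below computes AP as the area under a precision-recall curve and then averages over classes. The all-point interpolation with a monotone precision envelope is one common convention and is an assumption here; the evaluation code used in this study may sample the curve differently. All numeric values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve; recalls must be sorted ascending."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Eq. (3): mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical two-point curve and per-class AP values.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
m_ap = mean_average_precision({"screw": 0.8, "noscrew": 0.6})
print(ap, m_ap)
```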

To ensure a consistent and objective assessment, the Intersection over Union (IoU)27 metric was employed to determine whether a predicted bounding box correctly localized a bolt. The IoU measures the overlap between a predicted bounding box and its corresponding ground truth, calculated as:

$$IoU=\frac{{\text{Area of Overlap}}}{{\text{Area of Union}}}$$
(4)

Here, the “Area of Overlap” refers to the spatial intersection between the predicted and ground truth bounding boxes, while the “Area of Union” represents their combined area. A detection is considered a true positive only when the IoU between the predicted bounding box and the ground truth exceeds the specified threshold (0.5 for mAP@0.5, and 0.5 to 0.95 for mAP@0.5:0.95), as defined in Eq. (4).
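The overlap-over-union ratio can be sketched for axis-aligned boxes as follows. The `(x1, y1, x2, y2)` corner format and the pixel coordinates are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    overlap = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0


# Hypothetical boxes: the prediction fully covers the ground truth but is taller.
ground_truth = (10, 10, 50, 50)
predicted = (10, 10, 50, 70)
score = iou(ground_truth, predicted)
print(f"IoU = {score:.3f}, TP at 0.5: {score > 0.5}, TP at 0.75: {score > 0.75}")
```

In this example the oversized prediction still counts as a true positive at a threshold of 0.5 but fails at 0.75, which is why a correctly recognized bolt can still lower mAP@0.5:0.95.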

The compiled bolt image dataset was used to train the three YOLO models, and their mAP results are summarized in Table 4. The mAP values are reported under two common IoU threshold schemes: mAP@0.5 (IoU threshold = 0.5) and mAP@0.5:0.95 (average mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05).

Table 4 Mean average precision (mAP) comparison of YOLO models.

As illustrated in the table, YOLOv8 achieves the highest performance on both mAP@0.5 and mAP@0.5:0.95, indicating its superior capability in accurately identifying bolts and missing bolts under varying detection thresholds.

The performance gap between mAP@0.5 (0.91) and mAP@0.5:0.95 (0.61) for YOLOv8 is primarily due to reduced bounding box localization precision under stricter IoU thresholds, which is significantly influenced by the variable shooting angles in our dataset.

Figure 8(a-b) illustrates the detection results for a bolt imaged at a direct, frontal angle. In Fig. 8(a) (ground-truth bounding box) and Fig. 8(b) (model’s predicted bounding box), the boxes align closely, resulting in a high IoU value.

In contrast, Fig. 8(c-d) shows the results for a bolt captured at an oblique angle. The ground-truth box in Fig. 8(c) is annotated tightly around the nut. However, the model’s predicted box in Fig. 8(d) encloses not only the nut but also a portion of the exposed threaded rod. This enlargement causes the predicted box to be larger than the ground-truth box, leading to a lower IoU.

This effect is more pronounced at even more extreme angles, approaching 90° (essentially perpendicular to the rod), as shown in Fig. 8(e-f). The ground-truth box in Fig. 8(e) contains only the nut and the exposed end of the bolt. The predicted box in Fig. 8(f), however, incorporates additional background areas surrounding the fastener, further reducing the IoU.

Critically, across all three scenarios, the model correctly identifies the presence of the bolt. The drop in mAP@0.5:0.95 reflects the challenge of achieving precise box localization under diverse angles, not a failure in bolt recognition.

Fig. 8

Bolt detection performance under different shooting angles: (a-b) Ground-truth and predicted bounding boxes at frontal angle; (c-d) Ground-truth and predicted bounding boxes at oblique angle; (e-f) Ground-truth and predicted bounding boxes at extreme angle (90°).

Computational speed and parameter count analysis

Computational efficiency, measured by inference speed and parameter count, serves as a crucial metric for evaluating the practical deployment potential of deep learning models. This section presents a comparative analysis of the computational performance of the YOLOv5, YOLOv8, and YOLOv10 models, with detailed metrics summarized in Table 5.

Table 5 Computational performance and complexity of YOLO models.

As shown in the table, YOLOv8 achieves the shortest total processing time (3.9 ms) among the three models. It is 11.4 ms faster than YOLOv5 (15.3 ms) and 16.7 ms faster than YOLOv10 (20.6 ms) in end-to-end inference. Furthermore, YOLOv8 requires a relatively short training time while maintaining a competitive parameter count and computational complexity (GFLOPs).
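The reported processing times can be converted into throughput and pairwise gaps with simple arithmetic. The sketch assumes the times are per-image latencies, so that frames per second is 1000 divided by the latency in milliseconds; the dictionary name is illustrative.

```python
# Total processing time per image in milliseconds, as quoted in the text.
latency_ms = {"YOLOv5": 15.3, "YOLOv8": 3.9, "YOLOv10": 20.6}


def fps(ms_per_image):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / ms_per_image


for name, ms in latency_ms.items():
    gap = ms - latency_ms["YOLOv8"]
    print(f"{name}: {ms} ms, about {fps(ms):.0f} FPS, {gap:.1f} ms slower than YOLOv8")
```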

In summary, YOLOv8 demonstrates a favorable balance between detection performance and computational efficiency. It achieves higher frames per second (FPS) and reduced training time compared to YOLOv5 and YOLOv10, making it a suitable baseline for subsequent improvements. The following section describes enhancements to the YOLOv8 architecture built on this baseline.

Model improvement

To enhance the computational efficiency of the YOLOv8 model for bolt-missing detection and improve its accuracy and robustness in detecting small objects under complex scenarios, this study introduces two key modifications: the integration of the Swin-Transformer network and MEDM. These enhancements are designed to strengthen the model’s capability to recognize bolts under challenging visual conditions.

Integration of the swin-transformer network

The Swin-Transformer architecture is integrated into the YOLOv8 backbone to enhance its capability of modeling long-range dependencies and complex spatial contexts, which are critical for recognizing bolts in cluttered backgrounds. Its core innovation lies in a shifted window mechanism that efficiently computes self-attention within non-overlapping local windows while still enabling cross-window communication28. This design offers a superior balance between computational complexity and the ability to capture global features compared to standard convolutional operators or full self-attention. The overall structure of the Swin-Transformer and its fundamental building block are illustrated in Figs. 9 and 10, respectively.
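The complexity advantage of window-based attention can be made concrete with the cost estimates given in the original Swin Transformer paper: global self-attention grows quadratically with the number of tokens, whereas window attention grows linearly for a fixed window size. The feature-map size, channel count, and window size below are hypothetical and do not describe the configuration used in this study.

```python
def msa_flops(h, w, c):
    """Global multi-head self-attention cost on an h x w map with c channels."""
    hw = h * w
    return 4 * hw * c * c + 2 * hw * hw * c


def window_msa_flops(h, w, c, m):
    """Self-attention cost when restricted to non-overlapping m x m windows."""
    hw = h * w
    return 4 * hw * c * c + 2 * m * m * hw * c


# Hypothetical setting: 56 x 56 tokens, 96 channels, 7 x 7 windows.
full = msa_flops(56, 56, 96)
windowed = window_msa_flops(56, 56, 96, 7)
print(f"global: {full:,}  windowed: {windowed:,}  ratio: {full / windowed:.1f}x")
```

When a single window covers the whole feature map the two estimates coincide, so the saving comes entirely from keeping the window smaller than the map.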

Fig. 9

Overall architecture of the swin-transformer.

Fig. 10

Structure of the swin-transformer Block.

In this study, the Swin-Transformer module was integrated into the 8th layer of the YOLOv8 backbone—between the high-level feature extractor and the Spatial Pyramid Pooling Fusion (SPPF) module. This placement was determined by analyzing the network’s feature hierarchy. The early convolutional layers (1–7) extract low-level local features efficiently, but inserting the Swin-Transformer here would be computationally costly due to large feature maps, with limited benefit as global context is not yet needed. The 8th layer, however, represents a mid-to-high-level stage where local features are well integrated. Adding the Swin-Transformer here enriches these semantically meaningful features through self-attention, thereby improving multi-scale fusion in the subsequent SPPF module.

The configuration [−1, 3, SwinTransformer, [1024, True]] was employed, where 1024 maintains dimensional consistency with adjacent layers, and True activates the window-based multi-head attention mechanism. Stacking the module three times enhances its representational capacity. This hybrid design allows the network to leverage convolutional layers (Conv and C2f) for local feature extraction and spatial hierarchy modeling, while the Swin-Transformer captures long-range dependencies and global context.
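The layer entry can be read field by field with a small helper. This sketch assumes the entry follows the `[from, repeats, module, args]` layout commonly used in YOLO model definition files; the `LayerSpec` type and `parse_layer` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class LayerSpec(NamedTuple):
    source: int    # index of the input layer; -1 means the previous layer
    repeats: int   # how many times the module is stacked
    module: str    # module name as written in the configuration
    args: list     # module arguments, here [channels, attention flag]


def parse_layer(entry):
    """Unpack one four-field layer entry into named fields."""
    source, repeats, module, args = entry
    return LayerSpec(int(source), int(repeats), str(module), list(args))


spec = parse_layer([-1, 3, "SwinTransformer", [1024, True]])
print(spec)
```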

Integration of the multi-scale and detail-enhanced module

To improve the detection of bolts under challenging visual conditions, a MEDM is introduced. This module enhances multi-scale feature representation and emphasizes fine structural details through a dedicated edge reinforcement mechanism, enabling more accurate localization and detection of small and partially obscured objects.

The MEDM consists of three parallel branches, each performing convolution followed by average pooling. An edge enhancer, implemented as a residual connection combining multiple average pooling layers and a convolutional layer, is applied within each branch. Features from all branches are merged via a 1 × 1 convolution to form a unified multi-scale representation, which is then refined using a SimAM attention mechanism for adaptive feature weighting. This design strengthens the network’s ability to capture and fuse information across scales without introducing significant computational overhead. The structure of MEDM is depicted in Fig. 11.
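Of the MEDM stages, the final SimAM weighting is parameter-free and can be sketched on its own. The code below follows the commonly published SimAM formulation for a single channel, with the regularization constant `e_lambda` set to an assumed default; it does not reproduce the three branches, the edge enhancer, or the 1 × 1 fusion convolution, and a real implementation would operate on tensors.

```python
import math


def simam(feature_map, e_lambda=1e-4):
    """Apply SimAM weighting to one channel given as a list of rows of floats."""
    values = [v for row in feature_map for v in row]
    n = len(values) - 1  # h * w - 1; the map must hold at least two values
    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in feature_map]
    var = sum(d for row in sq_dev for d in row) / n
    out = []
    for row, dev_row in zip(feature_map, sq_dev):
        weighted = []
        for v, d in zip(row, dev_row):
            energy = d / (4.0 * (var + e_lambda)) + 0.5
            weighted.append(v * (1.0 / (1.0 + math.exp(-energy))))  # sigmoid gate
        out.append(weighted)
    return out


# Hypothetical 3 x 3 activation map with one distinctive response in the centre.
fmap = [[1.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 1.0]]
result = simam(fmap)
print(result[1][1] / 5.0, result[0][0] / 1.0)
```

Activations that deviate most from the channel mean receive the largest gate value, which is how the mechanism emphasizes distinctive responses without adding learnable parameters.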

Fig. 11

Architecture of the MEDM.

The MEDM module is inserted between the Neck and Head components of YOLOv8. This placement enhances multi-scale feature fusion, preserves spatial and semantic information, and improves gradient flow, particularly beneficial for detecting small objects in cluttered environments. As a result, the module significantly boosts the model’s accuracy and robustness across diverse and complex scenarios.

The overall architecture of the proposed improved model, incorporating both the Swin-Transformer block and the MEDM module into YOLOv8, is illustrated in Fig. 12.

Fig. 12

Architecture of the improved YOLOv8 network.

Model evaluation and ablation experiments

To comprehensively evaluate the performance of the improved model, we analyzed its training convergence, compared the classification performance of the baseline and improved models using confusion matrices, and conducted ablation experiments to quantify the contribution of each proposed component.

Model training and convergence analysis

The training and validation loss curves, along with the mAP progression, are shown in Fig. 13. The loss curves (Fig. 13a) for both training and validation sets decrease rapidly and stabilize after approximately 20 epochs. The validation loss remains consistently lower than the training loss, indicating stable model convergence without signs of overfitting. The mAP curves (Fig. 13b) show a steady increase with the training epochs, where mAP@0.5 approaches 1.0 and mAP@0.5:0.95 plateaus around 0.6, demonstrating the model’s effective learning on the complex bolt dataset.

Fig. 13

Training and validation performance curves: (a) Loss; (b) mAP.

Confusion matrix analysis

The class-wise confusion matrices for the baseline YOLOv8 and the proposed improved model on the test set are presented in Fig. 14. The matrices detail the model performance across three categories: “screw” (intact bolt), “noscrew” (missing bolt), and “background”.

Fig. 14

Confusion matrices on the test set: (a) Baseline YOLOv8; (b) Proposed Improved YOLOv8.

A comparison reveals that the improved model (Fig. 14b) correctly identifies 6,521 “screw” instances, a significant increase from the 6,286 identified by the baseline (Fig. 14a). More critically, the improved model substantially reduces key misclassifications essential for safety inspection: the number of “noscrew” instances misclassified as “background” (a critical miss) drops sharply from 61 to 19. This demonstrates that the proposed architectural enhancements are particularly effective in improving the discrimination between intact and missing bolts, thereby enhancing inspection reliability.
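The size of these changes can be expressed as absolute and relative differences computed from the counts quoted above. The helper name is illustrative.

```python
def relative_change(before, after):
    """Signed change relative to the baseline value."""
    return (after - before) / before


screw_gain = 6521 - 6286                 # additional correct "screw" detections
screw_rel = relative_change(6286, 6521)  # relative gain over the baseline
miss_rel = relative_change(61, 19)       # change in "noscrew" missed as background

print(f"screw: +{screw_gain} ({screw_rel:+.1%}), critical misses: {miss_rel:+.1%}")
```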

Ablation experiments

Ablation experiments were conducted to validate the individual contributions of the Swin-Transformer and the Multi-scale and Detail Enhancement Module (MEDM). The results are summarized in Table 6.

Table 6 Results of ablation experiments.

The results demonstrate the distinct role of each component:

  1. The Swin-Transformer module primarily improves Precision (from 0.90 to 0.94), indicating its effectiveness in reducing false positives by modeling global contextual relationships.

  2. The MEDM helps maintain high precision while slightly improving the mAP@0.5 (from 0.91 to 0.92), which is attributed to its enhanced capability in detecting multi-scale and small objects.

  3. The full model, integrating both components, achieves the best performance across the key metrics of Precision (0.97), Recall (0.90), and mAP@0.5 (0.94). This confirms that the Swin-Transformer and MEDM complement each other, with the former enhancing global feature representation and the latter refining multi-scale local detail fusion.

Experimental verification

Accuracy evaluation under multiple angles

This subsection presents an evaluation of the detection accuracy for the proposed improved model, based on YOLOv8 and enhanced with the Swin-Transformer Network and MEDM, across various viewing angles. Quantitative results are summarized in Table 7, while Fig. 15 illustrates example detection outcomes for bolts captured at 0°, 15°, 45°, and 60°.

Fig. 15

Detection results of bolts under multiple viewing angles: (a) 0°; (b) 15°; (c) 45°; and (d) 60°.

As shown in Table 7, the model maintains high accuracy at moderate angles, achieving 100% detection rates at 15°, 30°, and 45°, though performance slightly decreases at more extreme angles such as 60° (92%), 75° (90%), and 85° (91%). These results demonstrate the robustness of the proposed method across a range of realistic viewing conditions.

Table 7 Detection accuracy of the proposed model under multiple angles.

Accuracy evaluation under different lighting conditions

This study evaluates the detection accuracy of the improved model under varying lighting conditions. Brightness levels of 30%, 50%, 120%, and 300% were tested to validate model robustness. Example detection results under these conditions are illustrated in Fig. 16, and detailed quantitative accuracy metrics are provided in Table 8.

Table 8 Detection accuracy under varying illumination conditions.

As summarized in Table 8, the model consistently achieves 100% detection accuracy across most brightness levels, including 30%, 50%, 100%, 120%, and 300%. A slight performance decrease to 94% is observed at the highest tested luminance level of 350 cd/m². These results indicate strong adaptability of the proposed method to significant variations in illumination.

Fig. 16

Detection examples under different illumination levels: (a) 30%; (b) 50%; (c) 120%; (d) 300%.

Accuracy evaluation under complex backgrounds

To validate the model’s performance under realistic engineering scenarios, we simulated the common practice of applying anti-corrosion coatings to bridge steelwork29. Model structures were painted in red, white, and blue to represent variously coated bolts. These were photographed against complex natural backgrounds (grassy terrain and mixed grass-ground environments) to evaluate bolt detection performance. Figure 17 illustrates the detection effects under these challenging visual conditions, and Table 9 provides a quantitative summary of the accuracy rates across the different background types.

Table 9 Detection accuracy under complex backgrounds with different coating colors.

The experimental results demonstrate that the proposed model achieves consistently high detection accuracy across diverse coating colors and complex natural backgrounds, with all values exceeding 97.2%. This performance highlights the model’s strong generalization capability and practical applicability in real-world bridge inspection scenarios.

Fig. 17

Detection results under complex backgrounds with different coating colors: (a-c) grass background; (d-f) mixed grass and ground background.

Engineering application

Engineering overview

This study applies the trained model to a steel suspension bridge that has been in service for 15 years since its completion in 2010, in order to validate its effectiveness in real-world engineering scenarios. The key structural parameters are as follows: a span of 252.0 m between tower centers, a main cable sag of 25.2 m, and a deck width of 7.0 m30. The east bank tower (29.94 m in height) is situated in Beichuan County, while the west bank tower (32.80 m in height) is located in the Qianyuan Mountain Tourist Scenic Area. The bridge spans the river, effectively connecting these two regions.

The structural system includes 61 pairs of hangers spaced at 4.0 m intervals and a steel truss stiffening girder. The girder consists of two main trusses spaced 7.2 m apart center-to-center, with a vertical height of 2.2 m between the top and bottom chords. Cross-frames are installed at hanger attachment points with a longitudinal spacing of 4.0 m. The stiffening girder is composed of 60 standard segments, each 4.0 m in length, and two non-standard end segments, each measuring 4.78 m. The bridge elevation is shown in Fig. 18.

Fig. 18

Elevation view of the bridge.

All truss members are connected at nodal plates using Grade 10.9 S M20 high-strength bolts, with over 50,000 such bolts used throughout the entire bridge. After 15 years of service, a visual inspection conducted in 2023 identified signs of loosening in certain bolts, suggesting a potential risk of bolt loss or dislodgment. Manually examining such a large number of bolts for defects such as missing fasteners is highly challenging.

To address this issue, a DJI drone was deployed to capture high-resolution images of the bridge, and the model developed in this study was applied to automatically detect bolts and identify missing fasteners. This approach is critical because the inspection environment poses two major challenges: complex visual backgrounds and multi-angle viewing requirements, which align with the core focus of this study.

The visual complexity arises from the varying surface conditions and coatings, as shown in Fig. 19: the hangers are painted white for visibility, while the stiffening girder has a gray coating and is further complicated by extensive water stains, rust marks, and other environmental residues. These factors create a highly heterogeneous background that complicates automated detection.

Fig. 19 The varying surface conditions and coatings: (a) hangers with white coating; (b) stiffening girder with gray coating.

Furthermore, the inspection inherently involves multi-angle image acquisition. The complex three-dimensional geometry of the bridge structure—including the diverse orientations of hangers, girder surfaces, and nodal connections—requires imagery to be captured from various viewpoints to ensure comprehensive coverage. Operational constraints, such as obstacle avoidance and flight safety regulations, further necessitate that the UAV capture imagery from a range of angles and perspectives, rather than from a single, ideal viewpoint.

Image acquisition

A DJI drone was deployed for image acquisition in this study. Its key specifications are summarized in Table 10. Some performance parameters were set below their rated values to accommodate constraints in the field environment.

Table 10 DJI drone performance specifications.

Owing to limitations in satellite signal stability and obstacle avoidance capabilities, the DJI drone was unable to operate safely within the internal truss structure. As a result, the inspection was focused on externally accessible regions, particularly the outer surfaces of the truss and the cable suspension zones.

Inspection results and statistical analysis

The on-site inspection results are summarized in Table 11, with representative detection visualizations provided in Fig. 20.

Table 11 Detection performance across structural components.

In the critical hanger section, all 868 designated bolt sets were successfully detected, achieving a perfect detection rate. The favorable imaging conditions in this region—characterized by high contrast between the bolts and a relatively uniform background, larger bolt size (M22), and wider spacing—collectively contributed to this optimal performance, as visually evidenced in Fig. 20 (a) and (b).

For the more challenging stiffening girder section, which features smaller bolts (M20), denser arrangements, and cluttered backgrounds, the system identified 11,768 out of 11,904 bolt sets, yielding a detection rate of 98.86%. One instance of a missing bolt was successfully flagged within this segment, as shown in Fig. 20.

Fig. 20 Bolt detection examples in two structural regions: (a) Hanger region original image; (b) Hanger region detection result; (c) Stiffening girder original image; (d) Stiffening girder detection result.
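Once the detector has processed the drone imagery, its outputs have to be tallied and any flagged missing bolts traced back to their source images. The sketch below shows one possible way to do this and is not the authors' pipeline. The record layout `(image_id, class_name, confidence)`, the class names `bolt` and `bolt_missing`, the 0.5 confidence threshold, and the sample records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical class names and record layout; adapt to the detector's output.
BOLT = "bolt"
MISSING = "bolt_missing"


def summarize(detections, conf_threshold=0.5):
    """Count detections per class and list images with a flagged missing bolt.

    detections: iterable of (image_id, class_name, confidence) tuples.
    conf_threshold: assumed cut-off below which boxes are ignored.
    """
    counts = Counter()
    flagged = defaultdict(list)
    for image_id, class_name, confidence in detections:
        if confidence < conf_threshold:
            continue  # discard low-confidence boxes
        counts[class_name] += 1
        if class_name == MISSING:
            flagged[image_id].append(confidence)
    return counts, dict(flagged)


# Made-up records for illustration; not data from the inspection.
sample = [
    ("girder_001.jpg", BOLT, 0.93),
    ("girder_001.jpg", BOLT, 0.88),
    ("girder_001.jpg", MISSING, 0.81),
    ("hanger_004.jpg", BOLT, 0.97),
    ("hanger_004.jpg", BOLT, 0.32),
]
counts, flagged = summarize(sample)
print(counts[BOLT], counts[MISSING])
print(flagged)
```

Keeping the image identifier with each flagged detection lets an inspector go directly to the affected node for manual confirmation.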

To provide a comprehensive and objective assessment of the model’s capability across all conditions, the confusion matrix for the complete test set is presented in Fig. 21. This set encompasses both the simple hanger region and the complex stiffening girder area. The matrix confirms the model’s robust but not perfect performance in the general case, offering a realistic representation of its overall reliability.

Fig. 21 Comparison of confusion matrices: (a) Hanger cable clamp region only; (b) Comprehensive test set (includes hanger region and other structural components).
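Per-class precision and recall can be read off a confusion matrix of this kind. The following sketch illustrates the calculation under stated assumptions: rows are taken as true classes and columns as predicted classes (the orientation should be checked against the figure), the class labels are hypothetical, and the counts are invented placeholders rather than the values in Fig. 21.

```python
def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} for a square confusion matrix.

    matrix[i][j] counts samples whose true class is labels[i] and whose
    predicted class is labels[j] (assumed orientation).
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Hypothetical counts for illustration only; NOT the values in Fig. 21.
labels = ["bolt", "bolt_missing", "background"]
matrix = [
    [95, 1, 4],
    [2, 18, 0],
    [3, 0, 0],
]
for label, (p, r) in per_class_metrics(matrix, labels).items():
    print(f"{label}: precision={p:.3f}, recall={r:.3f}")
```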

Across all inspected regions, 12,636 out of 12,772 bolt sets were identified, resulting in an overall detection rate of 98.94%. One missing bolt was confirmed, indicating a structural integrity rate of 99.99% for the inspected fasteners. These results demonstrate the practical efficacy and robustness of the proposed method for automated bolt inspection in large-scale infrastructure, while the 136 undetected bolt sets, all in the stiffening girder, highlight the persistent challenges posed by complex structural geometries.
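The reported rates follow directly from the bolt-set counts given for the two regions. The sketch below reproduces that arithmetic; the class and variable names are illustrative, and the integrity rate is computed on the assumption that it is the share of inspected bolt sets not confirmed missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionResult:
    name: str
    total_sets: int     # bolt sets designated for inspection
    detected_sets: int  # bolt sets identified by the model

    @property
    def detection_rate(self) -> float:
        """Detection rate in percent."""
        return 100.0 * self.detected_sets / self.total_sets


# Counts as reported in the text for the two inspected regions.
regions = [
    RegionResult("hanger", 868, 868),
    RegionResult("stiffening girder", 11_904, 11_768),
]

total = sum(r.total_sets for r in regions)
detected = sum(r.detected_sets for r in regions)
overall_rate = 100.0 * detected / total

# Assumption: integrity rate = share of inspected sets not confirmed missing.
missing_bolts = 1
integrity_rate = 100.0 * (total - missing_bolts) / total

for r in regions:
    print(f"{r.name}: {r.detection_rate:.2f}%")
print(f"overall: {detected}/{total} = {overall_rate:.2f}%")
print(f"integrity: {integrity_rate:.2f}%")
```

Run on these counts, the sketch gives 100.00% for the hangers, 98.86% for the stiffening girder, 98.94% overall, and an integrity rate of 99.99%, matching the figures reported above.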

Conclusion

This study developed and validated an improved deep learning-based framework for automated bolt and bolt-missing detection, specifically designed to address the challenges of complex backgrounds and multi-angle perspectives inherent in real-world bridge inspection scenarios. The main conclusions are summarized as follows:

  1. A comparative analysis of YOLOv5, YOLOv8, and YOLOv10 models identified YOLOv8 as the most effective baseline, achieving an mAP@0.5 of 0.91, a recall of 0.85, and a precision of 0.9. Its superior balance of detection accuracy and computational efficiency made it the optimal choice for subsequent enhancements.

  2. An improved model was proposed based on YOLOv8 by incorporating a Swin-Transformer network for global feature extraction and a MEDM for refined detail processing. The enhanced model demonstrated robust performance across challenging conditions: it maintained over 90% accuracy at angles of 15°, 30°, and 45°; achieved 100% accuracy under lighting intensities from 30% to 100%; and attained accuracy exceeding 97.2% for bolts with blue, red, and white coatings against complex grassy and mixed backgrounds.

  3. In a full-scale engineering application on a suspension bridge, the proposed model analyzed 12,772 bolt sets, achieving a detection rate of 98.94% and identifying one missing bolt. This result validates the model’s practical effectiveness and high reliability in real-world inspection scenarios.

Despite the high overall detection rate, the missed detections highlight the persistent challenges posed by complex structural geometries and real-world variability. Future work will focus on comparative experiments with other candidate models, such as Faster R-CNN with FPN, as well as with subsequently released YOLO iterations. This evaluation will test the generalizability of our approach and help improve bolt and defect detection in complex environments.