Abstract
Bolted connections are widely adopted as primary structural joints in engineering infrastructure. However, conventional manual inspection remains labor-intensive and time-consuming. Deep learning-based automated defect detection faces significant challenges due to pronounced image variability induced by variable camera angles, lighting conditions, partial occlusions, and complex backgrounds. This study addresses these issues by constructing a diverse bolt image dataset compiled from three sources: on-site acquisitions from suspension bridges, field photography of steel transmission towers, and controlled laboratory imaging of a custom-fabricated bolt assembly model. To enhance data robustness, we employed image enhancement techniques and generative adversarial networks (GANs) for data augmentation. A comparative analysis was conducted among three mainstream object-detection models (YOLOv5, YOLOv8, and YOLOv10) using recall, precision, and mean average precision (mAP) as evaluation metrics. Building upon the superior performance of YOLOv8 (mAP = 0.91, recall = 0.85, precision = 0.9), we proposed an enhanced architecture integrating a Swin-Transformer backbone and a novel Multi-Scale and Detail-Enhanced Module (MEDM) to specifically improve missing-bolt detection in challenging visual contexts. The improved model demonstrated consistent accuracy across diverse scenarios: 100% at 15°, 30°, and 45° viewing angles; > 94% under 30%, 50%, and 100% illumination levels; and > 97.2% for colored coatings (blue, red, white) against complex grassy and mixed backgrounds. In practical engineering deployment, the model achieved a 98.94% detection rate across 12,772 bolt sets, successfully identifying one instance of a missing bolt. These findings validate the proposed approach’s effectiveness for real-world structural health monitoring.
Introduction
Bolt connections serve as critical components in joining elements within steel structure bridges1. The integrity of these connections is essential to the overall structural reliability. Nevertheless, the sheer number of bolts involved (the Nujiang Four-Track Bridge in Baoshan City, Yunnan Province, China, for instance, uses over 800,000 sets2) increases the likelihood of bolt loss, which can result from installation omissions or loosening under long-term cyclic loading.
The loss of bolts compromises the load-bearing capacity of the connection, thereby threatening the overall structural integrity. Accordingly, there is a critical need for methods capable of quickly and precisely pinpointing missing bolts and quantifying their numbers; such data provides a definitive basis for prioritizing replacements to maintain structural integrity. While traditional practices depend on periodic, labor-intensive manual checks3,4,5, these approaches suffer from high costs, poor efficiency, and variable outcomes tied directly to operator skill. More recently, the field of artificial intelligence has yielded promising new solutions, particularly in deep learning-based image detection, which offers an automated alternative for identifying bolt defects.
The application of computer vision and deep learning for automatic bolt condition assessment has been extensively investigated. Early research conducted preliminary investigations into image-based identification of bolt defects; for instance, Cha et al.6 successfully demonstrated the potential of traditional image processing combined with SVMs, using the Hough transform for robust feature extraction. As deep learning emerged, CNNs became the dominant architecture. Wang et al.7, Zhao et al.8, and Zhou et al.9 all validated CNNs’ powerful capability for feature learning, applying them to detect various defects including loosening and absence, particularly in demanding contexts like bridge infrastructure. To overcome limitations, researchers developed targeted improvements. Li et al.10 improved computational efficiency and reduced data requirements by incorporating time-frequency analysis into their CNN model. Yang et al.11 refined detection accuracy by combining geometric transformations with IoU-guided selection. The pursuit of practical, versatile solutions led to several key developments: Ni et al.12 created a multi-classification system able to distinguish between different fault types (corrosion, loosening) using an enhanced YOLOv5s, while Chen et al.13 prioritized deployment feasibility with a lightweight YOLOv5 variant designed for mobile platforms. Efforts to enable real-time operation were realized by Pan et al.14, who developed an integrated system for live tracking. Crucially, Lao et al.15 addressed a fundamental challenge by analyzing and mitigating the impact of variable imaging conditions (such as focal length, angle, and illumination) through adaptive preprocessing and model tuning, thereby significantly improving the robustness of visual inspection systems.
While these studies have substantially advanced the field of deep learning-based bolt defect detection and introduced viable solutions for structural health monitoring, significant challenges persist. The accuracy of any detection system is susceptible to considerable variability in bolt appearance within images, stemming from diverse shooting angles, variable lighting conditions, and partial occlusions. Moreover, real-world applications present additional complexities: steel structures are situated in diverse environments such as urban areas, mountains, and rivers, which introduce intricate and cluttered backgrounds. Compounding this issue is the common practice of coating steel components in distinctive paints during fabrication to prevent corrosion. Collectively, these factors can severely degrade detection performance. Consequently, the practical efficacy and robustness of the previously discussed methods have yet to be fully validated within complex, real-world engineering environments.
Our investigation addresses the identified limitations by first constructing a diverse bolt image dataset gathered from three sources: field images of operational suspension bridges and transmission towers, and controlled images from a custom-designed bolt joint model. Following augmentation via both traditional and deep learning techniques, we performed a comparative analysis of YOLO-series models (v5, v8, and v10). Building on the results, an enhanced YOLOv8-based architecture is proposed, which integrates a Swin-Transformer16 backbone and a Multi-Scale and Detail-Enhanced Module (MEDM) to boost detection robustness in cluttered backgrounds and from multi-angle perspectives. The proposed model was validated through controlled experiments and real-world applications across various scenarios.
Dataset construction
Database collection
Bolt image data were collected through three primary methods: on-site acquisition from a suspension bridge, field photography of steel transmission towers, and image capture from a custom-built bolt joint model.
The first subset comprises 757 images of bolts from various structural components of a steel suspension bridge, as illustrated in Fig. 1.
Bolt image database of steel suspension bridge: (a) Cable-hanger connection; (b) Top chord joint of main truss; (c) Lateral bracing diagonal-to-top chord connection.
The second subset includes 252 images of bolts captured at transmission tower sites, shown in Fig. 2.
Transmission tower bolt database: (a) Single-bolt K-joint; (b) Single-bolt T-joint; (c) Single-bolt X-joint.
The third subset contains 988 images acquired from a specially designed bolt joint assembly, depicted in Fig. 3.
Database of self-assembled bolt joint model: (a) Original color; (b) Gray coating.
Image enhancement and generative adversarial networks (GANs)
To enrich the diversity and robustness of our training dataset, we implemented a two-pronged augmentation strategy utilizing conventional image enhancement techniques and advanced deep learning models based on GANs17.
Image enhancement
To improve the diversity and robustness of the bolt image dataset, several enhancement techniques were applied, including the addition of Gaussian noise to simulate real-world sensor interference18, low-pass filtering to emulate blur under motion or defocus19, and color space conversion to enhance invariance to illumination changes20. Example results of these enhancement operations are shown in Fig. 4.
Examples of image enhancement: (a) Original image; (b) Gaussian noise; (c) Low-pass filtering; (d) Color space conversion.
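The three enhancement operations listed above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the noise level `sigma`, the 3 × 3 mean filter used as the low-pass filter, and the choice of HSV as the target color space are all assumptions made for illustration, and images are represented as small nested lists rather than real image arrays.

```python
import colorsys
import random


def add_gaussian_noise(gray, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image (pixel values 0-255)."""
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in gray]


def box_blur(gray):
    """3x3 mean filter (a simple low-pass filter); borders use available neighbors."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [gray[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def rgb_to_hsv_image(rgb):
    """Convert an RGB image (tuples of 0-255) to HSV with components in [0, 1]."""
    return [[colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for (r, g, b) in row]
            for row in rgb]


# Tiny illustrative inputs (hypothetical pixel values).
gray_patch = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
noisy = add_gaussian_noise(gray_patch)
blurred = box_blur(gray_patch)          # the bright center pixel is spread out
hsv = rgb_to_hsv_image([[(255, 0, 0)]])  # pure red
```

In practice each transform would be applied to full-resolution images with an image library; the sketch only shows what each operation does to pixel values.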
Images generated by generative adversarial networks
Figure 5 presents examples of bolt images generated by the GAN. The GAN was designed to vary background and contextual elements for data augmentation while preserving the key structural features of the bolts themselves.
Examples of images generated by the GAN: (a) Original image; (b) GAN-generated vegetation background; (c) GAN-generated misty background.
A total of 5,057 bolt images were obtained using the aforementioned methods. The dataset information is summarized in Table 1. The X-AnyLabeling automatic annotation software was employed to label the images with two types of labels for analysis: “screw” (bolt present) and “noscrew” (bolt missing). The labeled images were then divided into a training set and a validation set in an 8:2 ratio and used to train the models.
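As a rough illustration of the 8:2 division, the sketch below shuffles a list of image paths and splits it. The file names, the random seed, and the use of a random shuffle are assumptions; the text does not state how images were assigned to each set.

```python
import random


def split_dataset(paths, train_ratio=0.8, seed=42):
    """Shuffle the paths reproducibly and split them into train and validation lists."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the 5,057 images.
all_images = [f"bolt_{i:04d}.jpg" for i in range(5057)]
train_set, val_set = split_dataset(all_images)
print(len(train_set), len(val_set))
```

With 5,057 images this yields 4,046 training and 1,011 validation images.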
Comparative study on detection performance of different models
The dataset compiled through the aforementioned methods was used to train several models. A comparative analysis of their detection performance was conducted to inform subsequent model improvements.
Selected comparative models
Model training was carried out using the YOLO (You Only Look Once) framework21—a convolutional neural network-based object detection algorithm. Among its various versions, YOLOv5, YOLOv8, and YOLOv10 represent the most stable and widely adopted iterations and were therefore selected for comparison.
YOLOv5 model
YOLOv5 employs a modular architecture consisting of Input, Backbone, Neck, and Prediction Head. A distinctive feature of its Backbone is the Focus module, which performs slicing operations to enhance feature extraction while preserving contextual information22. In this study, the YOLOv5s variant was adopted as one baseline model to provide a performance benchmark for subsequent comparisons. The overall structure is shown in Fig. 6.
Network structure of YOLOv5s.
YOLOv8 model
Compared to YOLOv5, YOLOv8 introduces the C2f module in place of the C3 module23. The C2f incorporates more skip connections and an additional split operation, while reducing convolutional operations in branch layers. This design lowers computational cost while improving gradient flow and feature representation capability. Based on its balanced performance in accuracy and efficiency observed in preliminary experiments, YOLOv8 was selected as the foundation for our proposed improved model. A schematic of these improvements is shown in Fig. 7.
Architectural improvements in YOLOv8.
YOLOv10 model
As the latest iteration in the series, YOLOv10 achieves a balance between efficiency and accuracy through architectural refinements including enhanced feature fusion and lightweight design24. It was included in this study to compare its performance against YOLOv5 and YOLOv8 in the task of bolt and bolt-missing detection.
Model training parameters
Hyperparameters play a critical role in deep learning algorithms, governing model architecture configurations and profoundly influencing computational efficiency, final performance metrics, and convergence behavior. Given that the computational complexities of the three compared models (YOLOv5, YOLOv8, and YOLOv10) are comparable under equivalent parameterization, this study establishes a valid baseline for horizontal performance comparison. To optimize training regimens adaptively, we implemented an automated optimizer selection strategy: AdamW is deployed for training scenarios with ≤ 10⁴ iterations, while Stochastic Gradient Descent (SGD) is utilized for instances exceeding 10⁴ iterations. The complete set of model-specific hyperparameters is detailed in Table 2.
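The optimizer selection rule described above reduces to a single threshold test on the total number of training iterations. The sketch below is an illustration only: the batch size and epoch count are hypothetical placeholders (the actual values are those in Table 2), and the iteration count is assumed to be batches per epoch times epochs.

```python
import math


def total_iterations(num_images, batch_size, epochs):
    """Number of optimizer steps: batches per epoch times number of epochs."""
    return math.ceil(num_images / batch_size) * epochs


def select_optimizer(iterations, threshold=10_000):
    """AdamW for runs of at most `threshold` iterations, SGD for longer runs."""
    return "AdamW" if iterations <= threshold else "SGD"


# Hypothetical settings: 4,046 training images, batch size 16, 100 epochs.
iters = total_iterations(num_images=4046, batch_size=16, epochs=100)
print(iters, select_optimizer(iters))
```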
Using the parameters defined in Table 2, all three models underwent identical training protocols. Their respective detection performances were subsequently evaluated using standardized metrics including recall, precision25, and mean Average Precision (mAP)26.
Recall and precision analysis
This section analyzes the performance of the trained models using standard metrics designed for object detection tasks.
Recall
Recall measures the proportion of actual positive instances correctly identified by the model. It is defined as the number of true positive predictions divided by the total number of actual positives, as expressed in Eq. (1):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$
where TP (True Positives) denotes the number of correctly detected bolt instances, and FN (False Negatives) refers to the number of actual bolts that were missed by the model.
Precision
Precision measures the proportion of detections labeled “screw” that are correct. It is calculated as the number of detections correctly recognized as “screw” divided by the total number of detections identified as “screw”, as shown in Eq. (2):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$
where FP (False Positives) indicates the number of incorrect bolt predictions, which includes both detections in background areas and false detections on non-bolt objects.
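The two definitions above can be written directly as functions of the detection counts. The counts in the example are hypothetical, chosen only so that the results land on a recall of 0.85 and a precision of 0.9; they are not taken from Table 3.

```python
def recall(tp, fn):
    """Fraction of actual bolts that were detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of detections that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: 153 correct detections, 27 missed bolts, 17 false alarms.
print(recall(tp=153, fn=27), precision(tp=153, fp=17))
```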
The recall and precision values of the three models are summarized in Table 3. As shown in Table 3, YOLOv5, YOLOv8, and YOLOv10 all reached a precision of 0.9. The recall rates of YOLOv5 and YOLOv8 differ by 0.05, while the gap between YOLOv10 and YOLOv8 is 0.08, which is greater than that between YOLOv10 and YOLOv5. Taken together, YOLOv8 matches the other two models in precision and achieves the highest recall.
Mean average precision (mAP) comparative analysis
Mean Average Precision (mAP) serves as the primary metric in this study for quantitatively evaluating and comparing the overall detection performance of the YOLOv5, YOLOv8, and YOLOv10 models on the task of bolt and missing-bolt detection. The mAP is computed as in Eq. (3):
$$\mathrm{mAP} = \frac{1}{n}\sum_{k=1}^{n} AP_k \tag{3}$$
where AP (Average Precision) for a specific category is derived from the area under its precision-recall curve, typically computed as the average of precision values at a set of recall levels. The variable n represents the total number of categories (e.g., “screw” and “noscrew” in this work), and APₖ denotes the Average Precision of the k-th category.
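To make the relationship between the precision-recall curve, AP, and mAP concrete, the following sketch computes AP as the area under a monotone precision envelope (the all-point interpolation variant, which is one common convention and an assumption here) and then averages AP over the classes. The curve points and per-class AP values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the PR curve; `recalls` must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical two-point PR curve and hypothetical per-class AP values.
ap_example = average_precision([0.5, 1.0], [1.0, 0.5])
map_example = mean_average_precision({"screw": 0.9, "noscrew": 0.8})
print(ap_example, map_example)
```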

To ensure a consistent and objective assessment, the Intersection over Union (IoU)27 metric was employed to determine whether a predicted bounding box correctly localized a bolt. The IoU measures the overlap between a predicted bounding box and its corresponding ground truth, calculated as:
$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} \tag{4}$$
Here, the “Area of Overlap” refers to the spatial intersection between the predicted and ground truth bounding boxes, while the “Area of Union” represents their combined area. A detection is considered a true positive only when the IoU between the predicted bounding box and the ground truth exceeds the specified threshold (0.5 for mAP@0.5, and 0.5 to 0.95 for mAP@0.5:0.95), as defined in Eq. (4).
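The IoU computation described above can be sketched for axis-aligned boxes as follows. The `(x1, y1, x2, y2)` corner format is an assumption made for this example; annotation tools and detectors may store boxes differently.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 10 x 10 boxes offset by half a width share one third of their union.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```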
The compiled bolt image dataset was used to train the three YOLO models, and their mAP results are summarized in Table 4. The mAP values are reported under two common IoU threshold schemes: mAP@0.5 (IoU threshold = 0.5) and mAP@0.5:0.95 (average mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05).
As illustrated in the table, YOLOv8 achieves the highest performance on both mAP@0.5 and mAP@0.5:0.95, indicating its superior capability in accurately identifying bolts and missing bolts under varying detection thresholds.
The performance gap between mAP@0.5 (0.91) and mAP@0.5:0.95 (0.61) for YOLOv8 is primarily due to reduced bounding box localization precision under stricter IoU thresholds, which is significantly influenced by the variable shooting angles in our dataset.
Figure 8(a-b) illustrates the detection results for a bolt imaged at a direct, frontal angle. In Fig. 8(a) (ground-truth bounding box) and Fig. 8(b) (model’s predicted bounding box), the boxes align closely, resulting in a high IoU value.
In contrast, Fig. 8(c-d) shows the results for a bolt captured at an oblique angle. The ground-truth box in Fig. 8(c) is annotated tightly around the nut. However, the model’s predicted box in Fig. 8(d) encloses not only the nut but also a portion of the exposed threaded rod. This enlargement causes the predicted box to be larger than the ground-truth box, leading to a lower IoU.
This effect is more pronounced at even more extreme angles, approaching 90° (essentially perpendicular to the rod), as shown in Fig. 8(e-f). The ground-truth box in Fig. 8(e) contains only the nut and the exposed end of the bolt. The predicted box in Fig. 8(f), however, incorporates additional background areas surrounding the fastener, further reducing the IoU.
Critically, across all three scenarios, the model correctly identifies the presence of the bolt. The drop in mAP@0.5:0.95 reflects the challenge of achieving precise box localization under diverse angles, not a failure in bolt recognition.
Bolt detection performance under different shooting angles: (a-b) Ground-truth and predicted bounding boxes at frontal angle; (c-d) Ground-truth and predicted bounding boxes at oblique angle; (e-f) Ground-truth and predicted bounding boxes at extreme angle (90°).
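The effect of an enlarged predicted box on mAP@0.5:0.95 can be illustrated with a small sketch. The box sizes are hypothetical: a tight 10 × 10 ground-truth box around the nut and a 10 × 16 prediction that also covers part of the threaded rod give an IoU of 100/160 = 0.625. The detection still counts at the 0.5 threshold but is rejected at the stricter ones.

```python
# IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou_value, thresholds=IOU_THRESHOLDS):
    """Thresholds at which a detection with this IoU still counts as a true positive.

    Uses >=; a strict > would differ only when the IoU equals a threshold exactly.
    """
    return [t for t in thresholds if iou_value >= t]


oblique_iou = 100 / 160   # hypothetical enlarged prediction, see lead-in
frontal_iou = 0.96        # hypothetical closely aligned prediction

print(thresholds_passed(oblique_iou))
print(len(thresholds_passed(frontal_iou)))
```

The oblique-angle detection passes only three of the ten thresholds, while the frontal one passes all ten, which is why correct but loosely localized detections lower mAP@0.5:0.95 far more than mAP@0.5.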
Computational speed and parameter count analysis
Computational efficiency, measured by inference speed and parameter count, serves as a crucial metric for evaluating the practical deployment potential of deep learning models. This section presents a comparative analysis of the computational performance of the YOLOv5, YOLOv8, and YOLOv10 models, with detailed metrics summarized in Table 5.
As shown in the table, YOLOv8 achieves the shortest total processing time (3.9 ms) among the three models. It is 11.4 ms faster than YOLOv5 (15.3 ms) and 16.7 ms faster than YOLOv10 (20.6 ms) in end-to-end inference. Furthermore, YOLOv8 requires a relatively short training time while maintaining a competitive parameter count and computational complexity (GFLOPs).
In summary, YOLOv8 demonstrates a favorable balance between detection performance and computational efficiency. It achieves higher frames per second (FPS) and reduced training time compared to YOLOv5 and YOLOv10, making it a suitable baseline for subsequent improvements. Future work will focus on enhancing the YOLOv8 architecture to further optimize its inference speed without compromising detection accuracy.
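The processing times quoted above can be converted into approximate frame rates, assuming images are processed one at a time so that throughput is simply the reciprocal of the per-image time. This is a back-of-the-envelope sketch, not a reproduction of Table 5.

```python
def fps_from_latency(latency_ms):
    """Frames per second when each image takes `latency_ms` milliseconds."""
    return 1000.0 / latency_ms


# Total processing times per image quoted in the text (milliseconds).
latencies_ms = {"YOLOv5": 15.3, "YOLOv8": 3.9, "YOLOv10": 20.6}

for name, ms in latencies_ms.items():
    print(f"{name}: {ms} ms per image, about {fps_from_latency(ms):.0f} FPS")
```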
Model improvement
To enhance the computational efficiency of the YOLOv8 model for bolt-missing detection and improve its accuracy and robustness in detecting small objects under complex scenarios, this study introduces two key modifications: the integration of the Swin-Transformer network and MEDM. These enhancements are designed to strengthen the model’s capability to recognize bolts under challenging visual conditions.
Integration of the Swin-Transformer network
The Swin-Transformer architecture is integrated into the YOLOv8 backbone to enhance its capability of modeling long-range dependencies and complex spatial contexts, which are critical for recognizing bolts in cluttered backgrounds. Its core innovation lies in a shifted window mechanism that efficiently computes self-attention within non-overlapping local windows while still enabling cross-window communication28. This design offers a superior balance between computational complexity and the ability to capture global features compared to standard convolutional operators or full self-attention. The overall structure of the Swin-Transformer and its fundamental building block are illustrated in Figs. 9 and 10, respectively.
Overall architecture of the Swin-Transformer.
Structure of the Swin-Transformer block.
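The shifted window idea can be shown on a toy 4 × 4 grid of token indices with a window size of 2. This is a simplified sketch of the partitioning step only, under the assumption that the shift is a cyclic roll by half the window size; the attention computation itself, and the masking that a full implementation applies to windows that wrap around the border, are omitted.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid up and to the left by `shift` cells, wrapping around."""
    rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rows]


def window_partition(grid, window):
    """Split an H x W grid into non-overlapping window x window blocks."""
    h, w = len(grid), len(grid[0])
    if h % window or w % window:
        raise ValueError("grid size must be divisible by the window size")
    return [[row[x:x + window] for row in grid[y:y + window]]
            for y in range(0, h, window)
            for x in range(0, w, window)]


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # token indices 0..15

regular = window_partition(grid, 2)
shifted = window_partition(cyclic_shift(grid, 1), 2)

print(regular[0])  # tokens 0, 1, 4, 5 attend to each other
print(shifted[0])  # tokens 5, 6, 9, 10 come from four different regular windows
```

Alternating the two partitions across consecutive blocks is what lets information flow between windows while attention is still computed locally.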
In this study, the Swin-Transformer module was integrated into the 8th layer of the YOLOv8 backbone—between the high-level feature extractor and the Spatial Pyramid Pooling Fusion (SPPF) module. This placement was determined by analyzing the network’s feature hierarchy. The early convolutional layers (1–7) extract low-level local features efficiently, but inserting the Swin-Transformer here would be computationally costly due to large feature maps, with limited benefit as global context is not yet needed. The 8th layer, however, represents a mid-to-high-level stage where local features are well integrated. Adding the Swin-Transformer here enriches these semantically meaningful features through self-attention, thereby improving multi-scale fusion in the subsequent SPPF module.
The configuration [−1, 3, SwinTransformer, [1024, True]] was employed, where 1024 maintains dimensional consistency with adjacent layers, and True activates the window-based multi-head attention mechanism. Stacking the module three times enhances its representational capacity. This hybrid design allows the network to leverage convolutional layers (Conv and C2f) for local feature extraction and spatial hierarchy modeling, while the Swin-Transformer captures long-range dependencies and global context.
Integration of the multi-scale and detail-enhanced module
To improve the detection of bolts under challenging visual conditions, an MEDM is introduced. This module enhances multi-scale feature representation and emphasizes fine structural details through a dedicated edge reinforcement mechanism, enabling more accurate localization and detection of small and partially obscured objects.
The MEDM consists of three parallel branches, each performing convolution followed by average pooling. An edge enhancer, implemented as a residual connection combining multiple average pooling layers and a convolutional layer, is applied within each branch. Features from all branches are merged via a 1 × 1 convolution to form a unified multi-scale representation, which is then refined using a SimAM attention mechanism for adaptive feature weighting. This design strengthens the network’s ability to capture and fuse information across scales without introducing significant computational overhead. The structure of MEDM is depicted in Fig. 11.
Architecture of the MEDM.
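The final refinement step of the MEDM uses SimAM attention. The sketch below shows a parameter-free SimAM-style weighting applied to a single feature channel, following the commonly published formulation (squared deviation from the channel mean, normalized by the channel variance plus a small constant, passed through a sigmoid). The value of `lam` and the single-channel, pure-Python form are assumptions; this is not the authors' implementation and does not cover the three convolutional branches or the edge enhancer.

```python
import math


def simam_weights(channel, lam=1e-4):
    """Per-position attention weights for one H x W channel (needs >= 2 positions)."""
    flat = [v for row in channel for v in row]
    mu = sum(flat) / len(flat)
    sq_dev = [(v - mu) ** 2 for v in flat]
    var = sum(sq_dev) / (len(flat) - 1)
    # Positions that deviate more from the channel mean get a larger weight.
    weights = [1.0 / (1.0 + math.exp(-(d / (4.0 * (var + lam)) + 0.5)))
               for d in sq_dev]
    width = len(channel[0])
    return [weights[i:i + width] for i in range(0, len(weights), width)]


def simam(channel, lam=1e-4):
    """Scale each activation by its SimAM weight."""
    w = simam_weights(channel, lam)
    return [[v * wv for v, wv in zip(row, wrow)] for row, wrow in zip(channel, w)]


# Hypothetical 2 x 2 channel with one strong activation.
feature = [[0.0, 0.0], [0.0, 4.0]]
weights = simam_weights(feature)
refined = simam(feature)
```

Because the weighting has no learnable parameters, adding it after the 1 × 1 fusion convolution does not increase the parameter count.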
The MEDM is inserted between the Neck and Head components of YOLOv8. This placement enhances multi-scale feature fusion, preserves spatial and semantic information, and improves gradient flow, which is particularly beneficial for detecting small objects in cluttered environments. As a result, the module boosts the model’s accuracy and robustness across diverse and complex scenarios.
The overall architecture of the proposed improved model, incorporating both the Swin-Transformer block and the MEDM module into YOLOv8, is illustrated in Fig. 12.
Architecture of the improved YOLOv8 network.
Model evaluation and ablation experiments
To comprehensively evaluate the performance of the improved model, we analyzed its training convergence, compared the classification performance of the baseline and improved models using confusion matrices, and conducted ablation experiments to quantify the contribution of each proposed component.
Model training and convergence analysis
The training and validation loss curves, along with the mAP progression, are shown in Fig. 13. The loss curves (Fig. 13a) for both training and validation sets decrease rapidly and stabilize after approximately 20 epochs. The validation loss remains consistently lower than the training loss, indicating stable model convergence without signs of overfitting. The mAP curves (Fig. 13b) show a steady increase with the training epochs, where mAP@0.5 approaches 1.0 and mAP@0.5:0.95 plateaus around 0.6, demonstrating the model’s effective learning on the complex bolt dataset.
Training and validation performance curves: (a) Loss; (b) mAP.
Confusion matrix analysis
The class-wise confusion matrices for the baseline YOLOv8 and the proposed improved model on the test set are presented in Fig. 14. The matrices detail the model performance across three categories: “screw” (intact bolt), “noscrew” (missing bolt), and “background”.
Confusion matrices on the test set: (a) Baseline YOLOv8; (b) Proposed Improved YOLOv8.
A comparison reveals that the improved model (Fig. 14b) correctly identifies 6,521 “screw” instances, a significant increase from the 6,286 identified by the baseline (Fig. 14a). More critically, the improved model substantially reduces the misclassifications that matter most for safety inspection: the number of “noscrew” instances misclassified as “background” (a critical miss) drops sharply from 61 to 19. This demonstrates that the proposed architectural enhancements are particularly effective in improving the discrimination between intact and missing bolts, thereby enhancing inspection reliability.
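The size of these changes can be checked with simple arithmetic on the counts quoted above; the helper function is only a convenience for this illustration.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# Counts quoted in the text for the baseline and the improved model.
screw_gain = 6521 - 6286
screw_gain_pct = 100 * relative_change(6286, 6521)
missed_noscrew_pct = 100 * relative_change(61, 19)

print(screw_gain, round(screw_gain_pct, 1), round(missed_noscrew_pct, 1))
```

This corresponds to 235 additional correct “screw” identifications (about 3.7% more) and roughly a 69% reduction in missing bolts classified as background.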
Ablation experiments
Ablation experiments were conducted to validate the individual contributions of the Swin-Transformer and the Multi-Scale and Detail-Enhanced Module (MEDM). The results are summarized in Table 6.
The results demonstrate the distinct role of each component:
1. The Swin-Transformer module primarily improves Precision (from 0.90 to 0.94), indicating its effectiveness in reducing false positives by modeling global contextual relationships.
2. The MEDM helps maintain high precision while slightly improving the mAP@0.5 (from 0.91 to 0.92), which is attributed to its enhanced capability in detecting multi-scale and small objects.
3. The full model, integrating both components, achieves the best performance across the key metrics of Precision (0.97), Recall (0.90), and mAP@0.5 (0.94). This confirms that the Swin-Transformer and MEDM complement each other, with the former enhancing global feature representation and the latter refining multi-scale local detail fusion.
Experimental verification
Accuracy evaluation under multiple angles
This subsection presents an evaluation of the detection accuracy for the proposed improved model, based on YOLOv8 and enhanced with the Swin-Transformer Network and MEDM, across various viewing angles. Quantitative results are summarized in Table 7, while Fig. 15 illustrates example detection outcomes for bolts captured at 0°, 15°, 45°, and 60°.
Detection results of bolts under multiple viewing angles: (a) 0°; (b) 15°; (c) 45°; (d) 60°.
As shown in Table 7, the model maintains high accuracy at moderate angles, achieving 100% detection rates at 15°, 30°, and 45°, though performance slightly decreases at more extreme angles such as 60° (92%), 75° (90%), and 85° (91%). These results demonstrate the robustness of the proposed method across a range of realistic viewing conditions.
Accuracy evaluation under different lighting conditions
This study evaluates the detection accuracy of the improved model under varying lighting conditions. Brightness levels of 30%, 50%, 120%, and 300% were tested to validate model robustness. Example detection results under these conditions are illustrated in Fig. 16, and detailed quantitative accuracy metrics are provided in Table 8.
As summarized in Table 8, the model consistently achieves 100% detection accuracy across most brightness levels, including 30%, 50%, 100%, 120%, and 300%. A slight performance decrease to 94% is observed at the highest tested luminance level of 350 cd/m². These results indicate strong adaptability of the proposed method to significant variations in illumination.
Detection examples under different illumination levels: (a) 30%; (b) 50%; (c) 120%; (d) 300%.
Accuracy evaluation under complex backgrounds
To validate the model’s performance under realistic engineering scenarios, we simulated the common practice of applying anti-corrosion coatings to bridge steelwork29. Model structures were painted in red, white, and blue to represent variously coated bolts. These were photographed against complex natural backgrounds (grassy terrain and mixed grass-ground environments) to evaluate bolt detection performance. Figure 17 illustrates the detection effects under these challenging visual conditions, and Table 9 provides a quantitative summary of the accuracy rates across the different background types.
The experimental results demonstrate that the proposed model achieves consistently high detection accuracy across diverse coating colors and complex natural backgrounds, with all values exceeding 97.2%. This performance highlights the model’s strong generalization capability and practical applicability in real-world bridge inspection scenarios.
Detection results under complex backgrounds with different coating colors: (a-c) grass background; (d-f) mixed grass and ground background.
Engineering application
Engineering overview
This study applies the trained model to a steel suspension bridge that has been in service for 15 years since its completion in 2010, in order to validate its effectiveness in real-world engineering scenarios. The key structural parameters are as follows: a span of 252.0 m between tower centers, a main cable sag of 25.2 m, and a deck width of 7.0 m30. The east bank tower (29.94 m in height) is situated in Beichuan County, while the west bank tower (32.80 m in height) is located in the Qianyuan Mountain Tourist Scenic Area. The bridge spans the river, effectively connecting these two regions.
The structural system includes 61 pairs of hangers spaced at 4.0 m intervals and a steel truss stiffening girder. The girder consists of two main trusses spaced 7.2 m apart center-to-center, with a vertical height of 2.2 m between the top and bottom chords. Cross-frames are installed at hanger attachment points with a longitudinal spacing of 4.0 m. The stiffening girder is composed of 60 standard segments, each 4.0 m in length, and two non-standard end segments, each measuring 4.78 m. The bridge elevation is shown in Fig. 18.
Elevation view of the bridge.
All truss members are connected at nodal plates using Grade 10.9 S M20 high-strength bolts, with over 50,000 such bolts used throughout the entire bridge. After 15 years of service, a visual inspection conducted in 2023 identified signs of loosening in certain bolts, suggesting a potential risk of bolt loss or dislodgment. Manually examining such a large number of bolts for defects such as missing fasteners is highly challenging.
To address this issue, a DJI drone was deployed to capture high-resolution images of the bridge, and the model developed in this study was applied to automatically detect bolts and identify missing fasteners. This approach is critical because the inspection environment poses two major challenges: complex visual backgrounds and multi-angle viewing requirements, which align with the core focus of this study.
The visual complexity arises from the varying surface conditions and coatings, as shown in Fig. 19: the hangers are painted white for visibility, while the stiffening girder has a gray coating and is further complicated by extensive water stains, rust marks, and other environmental residues. These factors create a highly heterogeneous background that complicates automated detection.
The varying surface conditions and coatings: (a) hangers with white coating; (b) stiffening girder with gray coating.
Furthermore, the inspection inherently involves multi-angle image acquisition. The complex three-dimensional geometry of the bridge structure—including the diverse orientations of hangers, girder surfaces, and nodal connections—requires imagery to be captured from various viewpoints to ensure comprehensive coverage. Operational constraints, such as obstacle avoidance and flight safety regulations, further necessitate that the UAV capture imagery from a range of angles and perspectives, rather than from a single, ideal viewpoint.
Image acquisition
A DJI drone was deployed for image acquisition in this study. The key specifications of the drone are summarized in Table 10. Certain performance parameters were operated under reduced settings to accommodate constraints present in the field environment.
Owing to limitations in satellite signal stability and obstacle avoidance capabilities, the DJI drone was unable to operate safely within the internal truss structure. As a result, the inspection was focused on externally accessible regions, particularly the outer surfaces of the truss and the cable suspension zones.
Inspection results and statistical analysis
The on-site inspection results are summarized in Table 11, with representative detection visualizations provided in Fig. 20.
In the critical hanger section, all 868 designated bolt sets were successfully detected, giving a detection rate of 100%. The favorable imaging conditions in this region (high contrast between the bolts and a relatively uniform background, larger bolt size (M22), and wider spacing) collectively contributed to this result, as visually evidenced in Fig. 20(a) and (b).
For the more challenging stiffening girder section, which features smaller bolts (M20), denser arrangements, and cluttered backgrounds, the system identified 11,768 out of 11,904 bolt sets, yielding a detection rate of 98.86%. One instance of a missing bolt was successfully flagged within this segment, as shown in Fig. 20.
Fig. 20 Bolt detection examples in two structural regions: (a) hanger region, original image; (b) hanger region, detection result; (c) stiffening girder, original image; (d) stiffening girder, detection result.
To provide a comprehensive and objective assessment of the model’s capability across all conditions, the confusion matrix for the complete test set is presented in Fig. 21(b), alongside the matrix for the hanger cable clamp region alone in Fig. 21(a). The complete set encompasses both the simple hanger region and the complex stiffening girder area. The matrix confirms that the model’s performance in the general case is robust but not perfect, offering a realistic representation of its overall reliability.
Fig. 21 Comparison of confusion matrices: (a) hanger cable clamp region only; (b) comprehensive test set (includes hanger region and other structural components).
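As a reading aid for the confusion matrices in Fig. 21, the following minimal sketch shows how per-class precision and recall can be derived from such a matrix. The class names, the row/column convention (rows are true classes, columns are predicted classes), and all counts are assumptions chosen for illustration; they are not the values reported in Fig. 21.

```python
from typing import Dict, List


def per_class_metrics(labels: List[str],
                      matrix: List[List[int]]) -> Dict[str, Dict[str, float]]:
    """Per-class precision and recall.

    Assumed convention: matrix[i][j] is the number of samples whose true
    class is labels[i] and whose predicted class is labels[j].
    """
    metrics = {}
    for k, name in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # true class k, predicted otherwise
        fp = sum(row[k] for row in matrix) - tp  # predicted class k, true otherwise
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        metrics[name] = {"precision": precision, "recall": recall}
    return metrics


# Hypothetical counts for illustration only; NOT taken from Fig. 21.
LABELS = ["bolt", "missing_bolt", "background"]
MATRIX = [
    [95, 1, 4],   # true bolt
    [2, 18, 0],   # true missing_bolt
    [3, 1, 0],    # true background (false positives land in this row)
]

METRICS = per_class_metrics(LABELS, MATRIX)
for name, values in METRICS.items():
    print(f"{name}: precision={values['precision']:.2f}, recall={values['recall']:.2f}")
```

Some detection toolkits plot the transpose of this layout (predicted classes on the rows), in which case the roles of the row and column sums are swapped.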
Across the entire bridge, 12,636 out of 12,772 bolt sets were identified, resulting in an overall detection rate of 98.94%. One missing bolt was confirmed, indicating a structural integrity rate of 99.99% for the inspected fasteners. These results demonstrate the practical efficacy and robustness of the proposed method for automated bolt inspection in large-scale infrastructure, while the 136 undetected bolt sets, all located in the stiffening girder section, highlight the persistent challenges posed by complex structural geometries.
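The reported rates follow directly from the bolt-set counts given in the text. The short sketch below reproduces that arithmetic; the counts are those stated above, while the function and variable names are illustrative assumptions and not part of the authors' pipeline.

```python
# (detected, total) bolt sets per region, as reported in the text.
REGIONS = {
    "hanger": (868, 868),
    "stiffening_girder": (11_768, 11_904),
}
CONFIRMED_MISSING = 1  # one missing bolt confirmed on the bridge


def detection_rate(detected: int, total: int) -> float:
    """Percentage of bolt sets detected, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * detected / total, 2)


def summarize(regions):
    """Return per-region rates plus the overall rate and pooled counts."""
    rates = {name: detection_rate(d, t) for name, (d, t) in regions.items()}
    detected = sum(d for d, _ in regions.values())
    total = sum(t for _, t in regions.values())
    rates["overall"] = detection_rate(detected, total)
    return rates, detected, total


RATES, DETECTED_ALL, TOTAL_ALL = summarize(REGIONS)
MISSED_ALL = TOTAL_ALL - DETECTED_ALL
# Share of inspected bolt sets with no confirmed missing bolt.
INTEGRITY = round(100.0 * (TOTAL_ALL - CONFIRMED_MISSING) / TOTAL_ALL, 2)

print(RATES, MISSED_ALL, INTEGRITY)
```

Pooling the counts before dividing, rather than averaging the two regional percentages, weights each region by its number of bolt sets, which is how the overall figure of 98.94% arises.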
Conclusion
This study developed and validated an improved deep learning-based framework for automated bolt and bolt-missing detection, specifically designed to address the challenges of complex backgrounds and multi-angle perspectives inherent in real-world bridge inspection scenarios. The main conclusions are summarized as follows:
1. A comparative analysis of YOLOv5, YOLOv8, and YOLOv10 models identified YOLOv8 as the most effective baseline, achieving an mAP@0.5 of 0.91, a recall of 0.85, and a precision of 0.9. Its superior balance of detection accuracy and computational efficiency made it the optimal choice for subsequent enhancements.
2. An improved model was proposed based on YOLOv8 by incorporating a Swin-Transformer network for global feature extraction and the Multi-Scale and Detail-Enhanced Module (MEDM) for refined detail processing. The enhanced model demonstrated robust performance across challenging conditions: it achieved 100% accuracy at viewing angles of 15°, 30°, and 45°; exceeded 94% accuracy under illumination levels of 30%, 50%, and 100%; and attained accuracy exceeding 97.2% for bolts with blue, red, and white coatings against complex grassy and mixed backgrounds.
3. In a full-scale engineering application on a suspension bridge, the proposed model analyzed 12,772 bolt sets, achieving a detection rate of 98.94% and identifying one critical missing bolt. This result validates the model’s practical effectiveness and high reliability in real-world inspection scenarios.
Despite the high overall detection rate, the missed detections highlight the persistent challenges posed by complex structural geometries and real-world variability. Future work will focus on comparative experiments with other candidate models, such as Faster R-CNN with FPN, and with subsequently released YOLO versions. This broader evaluation will test the generalizability of the approach and help improve bolt and defect detection in complex environments.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Funding
This work was supported by the Sichuan Science and Technology Program (Grant No. 2025ZYDF080).
Author information
Contributions
Ying Gu (Y.G.): Conceptualization, methodology, investigation, supervision, project administration. Jingyu Song (J.Y.S.): Data curation, data augmentation, formal analysis. Dongmei Peng (D.M.P.): Model development, comparative experiments, validation. Chao Kong (C.K.) and Songbo Ren (S.B.R.): Funding acquisition, visualization, writing (review and editing). All authors have reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Gu, Y., Peng, D., Song, J. et al. Image-based detection of bolts and bolt-missing defects in multi-angle and complex background scenarios. Sci Rep 16, 11590 (2026). https://doi.org/10.1038/s41598-026-41036-x