Image-based detection of bolts and bolt-missing defects in multi-angle and complex background scenarios

Gu, Ying; Peng, Dongmei; Song, Jingyu; Ren, Songbo; Kong, Chao

doi:10.1038/s41598-026-41036-x


Article
Open access
Published: 02 March 2026


Scientific Reports volume 16, Article number: 11590 (2026)


Abstract

Bolted connections are widely adopted as primary structural joints in engineering infrastructure. However, conventional manual inspection remains labor-intensive and time-consuming. Deep learning–based automated defect detection faces significant challenges due to pronounced image variability induced by variable camera angles, lighting conditions, partial occlusions, and complex backgrounds. This study addresses these issues by constructing a diverse bolt image dataset compiled from three sources: on-site acquisitions from suspension bridges, field photography of steel transmission towers, and controlled laboratory imaging of a custom-fabricated bolt assembly model. To enhance data robustness, we employed image enhancement techniques and generative adversarial networks (GANs) for data augmentation. A comparative analysis was conducted among three mainstream object-detection models (YOLOv5, YOLOv8, and YOLOv10) using recall, precision, and mean average precision (mAP) as evaluation metrics. Building upon the superior performance of YOLOv8 (mAP = 0.91, recall = 0.85, precision = 0.9), we proposed an enhanced architecture integrating a Swin-Transformer backbone and a novel Multi-Scale and Detail-Enhanced Module (MEDM) to specifically improve missing-bolt detection in challenging visual contexts. The improved model demonstrated consistent accuracy across diverse scenarios: 100% at 15°, 30°, and 45° viewing angles; > 94% under 30%, 50%, and 100% illumination levels; and > 97.2% for colored coatings (blue, red, white) against complex grassy and mixed backgrounds. In practical engineering deployment, the model achieved a 98.94% detection rate across 12,772 bolt sets, successfully identifying one instance of a missing bolt. These findings validate the proposed approach’s effectiveness for real-world structural health monitoring.


Introduction

Bolt connections serve as critical components in joining elements within steel structure bridges¹. The integrity of these connections is essential to the overall structural reliability. Nevertheless, the vast quantity of bolts involved increases the likelihood of bolt loss, which can result from installation omissions or loosening under long-term cyclic loading; the Nujiang Four-Track Bridge in Baoshan City, Yunnan Province, China, for instance, uses over 800,000 bolt sets².

The loss of bolts compromises the load-bearing capacity of the connection, thereby threatening the overall structural integrity. Accordingly, there is a critical need for methods capable of quickly and precisely pinpointing missing bolts and quantifying their numbers; such data provide a definitive basis for prioritizing replacements to maintain structural integrity. While traditional practices depend on periodic, labor-intensive manual checks³,⁴,⁵, these approaches suffer from high costs, poor efficiency, and variable outcomes tied directly to operator skill. More recently, the field of artificial intelligence has yielded promising new solutions, particularly in deep learning-based image detection, which offers an automated alternative for identifying bolt defects.

The application of computer vision and deep learning for automatic bolt condition assessment has been extensively investigated. Early research conducted preliminary investigations into image-based identification of bolt defects; for instance, Cha et al.⁶ demonstrated the potential of traditional image processing combined with SVMs, using the Hough transform for robust feature extraction. As deep learning emerged, CNNs became the dominant architecture. Wang et al.⁷, Zhao et al.⁸, and Zhou et al.⁹ all validated CNNs’ powerful capability for feature learning, applying them to detect various defects including loosening and absence, particularly in demanding contexts like bridge infrastructure. To overcome the remaining limitations, researchers developed targeted improvements. Li et al.¹⁰ improved computational efficiency and reduced data requirements by incorporating time-frequency analysis into their CNN model. Yang et al.¹¹ refined detection accuracy by combining geometric transformations with IoU-guided selection. The pursuit of practical, versatile solutions led to several key developments: Ni et al.¹² created a multi-classification system able to distinguish between different fault types (corrosion, loosening) using an enhanced YOLOv5s, while Chen et al.¹³ prioritized deployment feasibility with a lightweight YOLOv5 variant designed for mobile platforms. Efforts to enable real-time operation were realized by Pan et al.¹⁴, who developed an integrated system for live tracking. Crucially, Lao et al.¹⁵ addressed a fundamental challenge by analyzing and mitigating the impact of variable imaging conditions (such as focal length, angle, and illumination) through adaptive preprocessing and model tuning, thereby significantly improving the robustness of visual inspection systems.

While these studies have substantially advanced the field of deep learning-based bolt defect detection and introduced viable solutions for structural health monitoring, significant challenges persist. The accuracy of any detection system is susceptible to considerable variability in bolt appearance within images, stemming from diverse shooting angles, variable lighting conditions, and partial occlusions. Moreover, real-world applications present additional complexities: steel structures are situated in diverse environments such as urban areas, mountains, and rivers, which introduce intricate and cluttered backgrounds. Compounding this issue is the common practice of coating steel components in distinctive paints during fabrication to prevent corrosion. Collectively, these factors can severely degrade detection performance. Consequently, the practical efficacy and robustness of the previously discussed methods have yet to be fully validated within complex, real-world engineering environments.

Our investigation addresses the identified limitations by first constructing a diverse bolt image dataset gathered from three sources: field images of operational suspension bridges and transmission towers, and controlled images from a custom-designed bolt joint model. Following augmentation via both traditional and deep learning techniques, we performed a comparative analysis of YOLO-series models (v5, v8, and v10). Building on the results, an enhanced YOLOv8-based architecture is proposed, which integrates a Swin-Transformer¹⁶ backbone and a Multi-Scale and Detail-Enhanced Module (MEDM) to boost detection robustness in cluttered backgrounds and from multi-angle perspectives. The proposed model was validated through controlled experiments and real-world applications across various scenarios.

Dataset construction

Database collection

Bolt image data were collected through three primary methods: on-site acquisition from a suspension bridge, field photography of steel transmission towers, and image capture from a custom-built bolt joint model.

The first subset comprises 757 images of bolts from various structural components of a steel suspension bridge, as illustrated in Fig. 1.

The second subset includes 252 images of bolts captured at transmission tower sites, shown in Fig. 2.

The third subset contains 988 images acquired from a specially designed bolt joint assembly, depicted in Fig. 3.

Image enhancement and generative adversarial networks (GANs)

To enrich the diversity and robustness of our training dataset, we implemented a two-pronged augmentation strategy utilizing conventional image enhancement techniques and advanced deep learning models based on GANs¹⁷.

Image enhancement

To improve the diversity and robustness of the bolt image dataset, several enhancement techniques were applied, including the addition of Gaussian noise to simulate real-world sensor interference¹⁸, low-pass filtering to emulate blur under motion or defocus¹⁹, and color space conversion to enhance invariance to illumination changes²⁰. Example results of these enhancement operations are shown in Fig. 4.
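The three enhancement operations can be sketched on a toy image. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the low-pass filter is taken to be a 3 × 3 mean filter, and the color space conversion is taken to be RGB to grayscale with the ITU-R BT.601 weights, since the text does not specify either choice.

```python
import random

def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image (list of rows), clipped to [0, 255]."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
            for row in img]

def box_blur(img):
    """3x3 mean filter, a simple low-pass filter (assumed; the paper does not name the kernel).
    Border pixels are averaged over their in-bounds neighbours only."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            vals = [img[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def rgb_to_gray(rgb_img):
    """Convert (R, G, B) pixels to luma using the BT.601 weights (assumed conversion)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_img]

# Toy 3x3 grayscale image with a single bright pixel in the centre.
toy = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
print(box_blur(toy)[1][1])          # centre pixel averaged over all nine cells
print(add_gaussian_noise(toy)[1])   # middle row after noise injection
```

In practice these operations would be applied to full-resolution images with an image library; the list-based version only shows what each transformation does to pixel values.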

Images generated by generative adversarial networks

Figure 5 presents examples of bolt images generated by the GAN. The GAN was designed to vary background and contextual elements for data augmentation while preserving the key structural features of the bolts themselves.

A total of 5,057 bolt images were obtained using the aforementioned methods; the dataset composition is summarized in Table 1. The images were divided into a training set and a validation set in an 8:2 ratio. They were annotated with the X-AnyLabeling automatic annotation software using two labels, “screw” (bolt present) and “noscrew” (bolt missing), and the labeled dataset was then used to train the models.
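An 8:2 split of this kind can be sketched as follows. The function name, the fixed random seed, and the rounding rule (floor on the training side) are assumptions made for illustration; the paper does not state how the split was drawn.

```python
import random

def split_dataset(items, train_parts=8, val_parts=2, seed=0):
    """Shuffle items reproducibly and split them train:val in the given ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * train_parts // (train_parts + val_parts)
    return items[:n_train], items[n_train:]

# Image indices stand in for file names here.
train, val = split_dataset(range(5057))
print(len(train), len(val))  # 4045 1012
```

With 5,057 images this yields 4,045 training and 1,012 validation images under the assumed rounding rule.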

Table 1 Composition of the Multi-Source Bolt Image Dataset.


Comparative study on detection performance of different models

The dataset compiled through the aforementioned methods was used to train several models. A comparative analysis of their detection performance was conducted to inform subsequent model improvements.

Selected comparative models

Model training was carried out using the YOLO (You Only Look Once) framework²¹—a convolutional neural network-based object detection algorithm. Among its various versions, YOLOv5, YOLOv8, and YOLOv10 represent the most stable and widely adopted iterations and were therefore selected for comparison.

YOLOv5 model

YOLOv5 employs a modular architecture consisting of Input, Backbone, Neck, and Prediction Head. A distinctive feature of its Backbone is the Focus module, which performs slicing operations to enhance feature extraction while preserving contextual information²². In this study, the YOLOv5s variant was adopted as one baseline model to provide a performance benchmark for subsequent comparisons. The overall structure is shown in Fig. 6.

YOLOv8 model

Compared to YOLOv5, YOLOv8 introduces the C2f module in place of the C3 module²³. The C2f incorporates more skip connections and an additional split operation, while reducing convolutional operations in branch layers. This design lowers computational cost while improving gradient flow and feature representation capability. Based on its balanced performance in accuracy and efficiency observed in preliminary experiments, YOLOv8 was selected as the foundation for our proposed improved model. A schematic of these improvements is shown in Fig. 7.

YOLOv10 model

As a more recent iteration in the series, YOLOv10 achieves a balance between efficiency and accuracy through architectural refinements including enhanced feature fusion and lightweight design²⁴. It was included in this study to compare its performance against YOLOv5 and YOLOv8 in the task of bolt and bolt-missing detection.

Model training parameters

Hyperparameters play a critical role in deep learning algorithms, governing model architecture configurations and profoundly influencing computational efficiency, final performance metrics, and convergence behavior. Given that the computational complexities of the three compared models (YOLOv5, YOLOv8, and YOLOv10) are comparable under equivalent parameterization, this study establishes a valid baseline for horizontal performance comparison. To optimize training regimens adaptively, we implemented an automated optimizer selection strategy: AdamW is deployed for training scenarios with ≤ 10⁴ iterations, while Stochastic Gradient Descent (SGD) is utilized for instances exceeding 10⁴ iterations. The complete set of model-specific hyperparameters is detailed in Table 2.
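The optimizer selection rule described above reduces to a single comparison against 10⁴ iterations. The sketch below is illustrative only: the function name, the way iterations are counted (batches per epoch times epochs), and the batch size and epoch values in the example are assumptions, not values taken from Table 2.

```python
def select_optimizer(num_images, batch_size, epochs, threshold=10_000):
    """Return (optimizer name, total iterations) under the rule:
    AdamW for <= threshold iterations, SGD otherwise."""
    batches_per_epoch = -(-num_images // batch_size)  # ceiling division
    iterations = batches_per_epoch * epochs
    name = "AdamW" if iterations <= threshold else "SGD"
    return name, iterations

# Hypothetical settings: 4,045 training images, batch size 16.
print(select_optimizer(4045, 16, epochs=30))   # ('AdamW', 7590)
print(select_optimizer(4045, 16, epochs=100))  # ('SGD', 25300)
```

The example shows that, for a dataset of this size, the choice between the two optimizers depends mainly on the number of epochs.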

Table 2 Model training parameters.


Using the parameters defined in Table 2, all three models underwent identical training protocols. Their respective detection performances were subsequently evaluated using standardized metrics including recall, precision²⁵, and mean Average Precision (mAP)²⁶.

Recall and precision analysis

This section analyzes the performance of the trained models using standard metrics designed for object detection tasks.

Recall

Recall measures the proportion of actual positive instances correctly identified by the model. It is defined as the number of true positive predictions divided by the total number of actual positives, as expressed in Eq. (1):

$$\text{Recall}=\frac{TP}{TP+FN}$$

(1)

where TP (True Positives) denotes the number of correctly detected bolt instances, and FN (False Negatives) refers to the number of actual bolts that were missed by the model.

Precision

Precision measures the proportion of detections labeled “screw” that are correct, calculated as the number of correct “screw” detections divided by the total number of “screw” detections, as shown in Eq. (2):

$$\text{Precision}=\frac{TP}{TP+FP}$$

(2)

where FP (False Positives) indicates the number of incorrect bolt predictions, which includes both detections in background areas and false detections on non-bolt objects.
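Equations (1) and (2) translate directly into code. The counts in the example are hypothetical and were chosen only so that the results land on values of the same magnitude as those reported for the models; they are not taken from the paper's tables.

```python
def recall(tp, fn):
    """Eq. (1): fraction of actual bolts that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Eq. (2): fraction of bolt detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical counts: 153 correct detections, 27 missed bolts, 17 false alarms.
tp, fn, fp = 153, 27, 17
print(f"recall    = {recall(tp, fn):.2f}")     # recall    = 0.85
print(f"precision = {precision(tp, fp):.2f}")  # precision = 0.90
```

The same number of true positives gives different recall and precision because the two metrics use different denominators: missed bolts lower recall, while false alarms lower precision.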

The recall and precision values of the three models are summarized in Table 3. YOLOv5, YOLOv8, and YOLOv10 all reached a precision of 0.9. The recall rates of YOLOv5 and YOLOv8 differ by 0.05, while the gap between YOLOv10 and YOLOv8 is 0.08, greater than that between YOLOv10 and YOLOv5. Because precision is identical across the three models, recall is the distinguishing metric, and YOLOv8 achieves the highest recall (0.85) and therefore the best overall balance of precision and recall.

Table 3 Model recall and precision rates.


Mean average precision (mAP) comparative analysis

Mean Average Precision (mAP) serves as the primary metric in this study for quantitatively evaluating and comparing the overall detection performance of the YOLOv5, YOLOv8, and YOLOv10 models on the task of bolt and missing-bolt detection. The mAP is computed as:

$$\text{mAP}=\frac{1}{n}\sum_{k=1}^{n} AP_k$$

(3)

where AP (Average Precision) for a specific category is derived from the area under its precision-recall curve, typically computed as the average of precision values at a set of recall levels. The variable n represents the total number of categories (e.g., “screw” and “noscrew” in this work), and APₖ denotes the Average Precision of the k-th category.
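A minimal sketch of Eq. (3) is given below. It computes AP as the area under the precision-recall curve using all-point interpolation, which is one common convention; the paper does not state which interpolation scheme was used, and the function names and precision-recall points are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve with all-point interpolation.
    `recalls` must be sorted in ascending order, with matching `precisions`."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall levels.
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))

def mean_average_precision(ap_per_class):
    """Eq. (3): mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical precision-recall points for the two classes.
ap = {
    "screw": average_precision([0.5, 1.0], [1.0, 0.5]),
    "noscrew": average_precision([0.5, 1.0], [1.0, 1.0]),
}
print(ap, mean_average_precision(ap))
```

Here n = 2, so mAP is simply the average of the AP for “screw” and the AP for “noscrew”.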

To ensure a consistent and objective assessment, the Intersection over Union (IoU)²⁷ metric was employed to determine whether a predicted bounding box correctly localized a bolt. The IoU measures the overlap between a predicted bounding box and its corresponding ground truth, calculated as:

$$\text{IoU}=\frac{\text{Area of Overlap}}{\text{Area of Union}}$$

(4)

Here, the “Area of Overlap” refers to the spatial intersection between the predicted and ground truth bounding boxes, while the “Area of Union” represents their combined area. A detection is considered a true positive only when the IoU between the predicted bounding box and the ground truth exceeds the specified threshold (0.5 for mAP@0.5, and 0.5 to 0.95 for mAP@0.5:0.95), as defined in Eq. (4).
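The overlap-over-union ratio can be computed for axis-aligned boxes as sketched below. The `(x1, y1, x2, y2)` corner format and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    overlap = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - overlap
    return overlap / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)   # tight box
prediction = (0, 0, 10, 16)     # same box extended downward by 60%
print(iou(ground_truth, prediction))  # 0.625
```

In this example the prediction fully contains the ground truth, yet the IoU is only 0.625 because the extra area enlarges the union.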

The compiled bolt image dataset was used to train the three YOLO models, and their mAP results are summarized in Table 4. The mAP values are reported under two common IoU threshold schemes: mAP@0.5 (IoU threshold = 0.5) and mAP@0.5:0.95 (average mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05).

Table 4 Mean average precision (mAP) comparison of YOLO models.


As illustrated in the table, YOLOv8 achieves the highest performance on both mAP@0.5 and mAP@0.5:0.95, indicating its superior capability in accurately identifying bolts and missing bolts under varying detection thresholds.

The performance gap between mAP@0.5 (0.91) and mAP@0.5:0.95 (0.61) for YOLOv8 is primarily due to reduced bounding box localization precision under stricter IoU thresholds, which is significantly influenced by the variable shooting angles in our dataset.

Figure 8(a-b) illustrates the detection results for a bolt imaged at a direct, frontal angle. In Fig. 8(a) (ground-truth bounding box) and Fig. 8(b) (model’s predicted bounding box), the boxes align closely, resulting in a high IoU value.

In contrast, Fig. 8(c-d) shows the results for a bolt captured at an oblique angle. The ground-truth box in Fig. 8(c) is annotated tightly around the nut. However, the model’s predicted box in Fig. 8(d) encloses not only the nut but also a portion of the exposed threaded rod. This enlargement causes the predicted box to be larger than the ground-truth box, leading to a lower IoU.

This effect is more pronounced at even more extreme angles, approaching 90° (essentially perpendicular to the rod), as shown in Fig. 8(e-f). The ground-truth box in Fig. 8(e) contains only the nut and the exposed end of the bolt. The predicted box in Fig. 8(f), however, incorporates additional background areas surrounding the fastener, further reducing the IoU.

Critically, across all three scenarios, the model correctly identifies the presence of the bolt. The drop in mAP@0.5:0.95 reflects the challenge of achieving precise box localization under diverse angles, not a failure in bolt recognition.
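The distinction between recognition and localization can be made concrete by checking a single detection against the ten IoU thresholds used in mAP@0.5:0.95. The sketch below is illustrative: the function names are hypothetical and the IoU values of 0.625 and 0.9 are assumed examples of a loosely and a tightly localized box, not measurements from Fig. 8.

```python
def iou_thresholds():
    """The ten thresholds 0.50, 0.55, ..., 0.95 used for mAP@0.5:0.95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]

def passed_thresholds(iou_value):
    """Thresholds at which a detection with this IoU counts as a true positive."""
    return [t for t in iou_thresholds() if iou_value > t]

for value in (0.625, 0.9):
    passed = passed_thresholds(value)
    print(value, passed, f"{len(passed)}/10 thresholds")
```

A box with IoU 0.625 is a true positive at mAP@0.5 but counts as a miss at seven of the ten stricter thresholds, which lowers mAP@0.5:0.95 even though the bolt was recognized.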

Computational speed and parameter count analysis

Computational efficiency, measured by inference speed and parameter count, serves as a crucial metric for evaluating the practical deployment potential of deep learning models. This section presents a comparative analysis of the computational performance of the YOLOv5, YOLOv8, and YOLOv10 models, with detailed metrics summarized in Table 5.

Table 5 Computational performance and complexity of YOLO models.


As shown in the table, YOLOv8 achieves the shortest total processing time (3.9 ms) among the three models. It is 11.4 ms faster than YOLOv5 (15.3 ms) and 16.7 ms faster than YOLOv10 (20.6 ms) in end-to-end inference. Furthermore, YOLOv8 requires a relatively short training time while maintaining a competitive parameter count and computational complexity (GFLOPs).

In summary, YOLOv8 demonstrates a favorable balance between detection performance and computational efficiency. It achieves higher frames per second (FPS) and reduced training time compared to YOLOv5 and YOLOv10, making it a suitable baseline for subsequent improvements. Future work will focus on enhancing the YOLOv8 architecture to further optimize its inference speed without compromising detection accuracy.
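The reported processing times can be converted to throughput with a one-line calculation. The sketch assumes that each figure is the end-to-end time for a single image, so that FPS = 1000 / time in ms; the function name is hypothetical.

```python
def frames_per_second(latency_ms):
    """Convert a per-image processing time in milliseconds to frames per second."""
    return 1000.0 / latency_ms

latency_ms = {"YOLOv5": 15.3, "YOLOv8": 3.9, "YOLOv10": 20.6}
for name, ms in latency_ms.items():
    speedup = ms / latency_ms["YOLOv8"]
    print(f"{name}: {frames_per_second(ms):.1f} FPS, {speedup:.2f}x the YOLOv8 time")
```

Under this assumption YOLOv8 runs at roughly 256 FPS, compared with about 65 FPS for YOLOv5 and 49 FPS for YOLOv10.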

Model improvement

To enhance the computational efficiency of the YOLOv8 model for bolt-missing detection and improve its accuracy and robustness in detecting small objects under complex scenarios, this study introduces two key modifications: the integration of the Swin-Transformer network and MEDM. These enhancements are designed to strengthen the model’s capability to recognize bolts under challenging visual conditions.

Integration of the swin-transformer network

The Swin-Transformer architecture is integrated into the YOLOv8 backbone to enhance its capability of modeling long-range dependencies and complex spatial contexts, which are critical for recognizing bolts in cluttered backgrounds. Its core innovation lies in a shifted window mechanism that efficiently computes self-attention within non-overlapping local windows while still enabling cross-window communication²⁸. This design offers a superior balance between computational complexity and the ability to capture global features compared to standard convolutional operators or full self-attention. The overall structure of the Swin-Transformer and its fundamental building block are illustrated in Figs. 9 and 10, respectively.
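The shifted-window idea can be illustrated on a toy grid. The sketch below is not the authors' implementation: it uses plain Python lists instead of tensors, hypothetical helper names, a 4 × 4 grid, and a window size of 2, all chosen only for illustration. It shows how a cyclic shift by half a window makes the next round of windows straddle the previous window boundaries.

```python
def window_partition(grid, m):
    """Split an H x W grid into non-overlapping m x m windows (H, W multiples of m)."""
    h, w = len(grid), len(grid[0])
    return [[grid[r][c0:c0 + m] for r in range(r0, r0 + m)]
            for r0 in range(0, h, m)
            for c0 in range(0, w, m)]

def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` cells, wrapping around the edges."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]

# 4 x 4 grid of token indices 0..15.
grid = [[4 * r + c for c in range(4)] for r in range(4)]

regular = window_partition(grid, 2)
shifted = window_partition(cyclic_shift(grid, 1), 2)
print(regular[0])  # [[0, 1], [4, 5]]
print(shifted[0])  # [[5, 6], [9, 10]]
```

Tokens 5, 6, 9, and 10 belong to four different regular windows but share one shifted window, so alternating the two partitions lets information flow across window boundaries while attention is still computed locally.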

In this study, the Swin-Transformer module was integrated into the 8th layer of the YOLOv8 backbone—between the high-level feature extractor and the Spatial Pyramid Pooling Fusion (SPPF) module. This placement was determined by analyzing the network’s feature hierarchy. The early convolutional layers (1–7) extract low-level local features efficiently, but inserting the Swin-Transformer here would be computationally costly due to large feature maps, with limited benefit as global context is not yet needed. The 8th layer, however, represents a mid-to-high-level stage where local features are well integrated. Adding the Swin-Transformer here enriches these semantically meaningful features through self-attention, thereby improving multi-scale fusion in the subsequent SPPF module.

The configuration [−1, 3, SwinTransformer, [1024, True]] was employed, where 1024 maintains dimensional consistency with adjacent layers, and True activates the window-based multi-head attention mechanism. Stacking the module three times enhances its representational capacity. This hybrid design allows the network to leverage convolutional layers (Conv and C2f) for local feature extraction and spatial hierarchy modeling, while the Swin-Transformer captures long-range dependencies and global context.

Integration of the multi-scale and detail-enhanced module

To improve the detection of bolts under challenging visual conditions, a MEDM is introduced. This module enhances multi-scale feature representation and emphasizes fine structural details through a dedicated edge reinforcement mechanism, enabling more accurate localization and detection of small and partially obscured objects.

The MEDM consists of three parallel branches, each performing convolution followed by average pooling. An edge enhancer, implemented as a residual connection combining multiple average pooling layers and a convolutional layer, is applied within each branch. Features from all branches are merged via a 1 × 1 convolution to form a unified multi-scale representation, which is then refined using a SimAM attention mechanism for adaptive feature weighting. This design strengthens the network’s ability to capture and fuse information across scales without introducing significant computational overhead. The structure of MEDM is depicted in Fig. 11.
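The final refinement step, SimAM, is a parameter-free attention mechanism that weights each activation by how strongly it deviates from the mean of its channel. The sketch below applies the commonly used SimAM formulation to a single channel stored as a list of rows. It is an illustration only: the function names are hypothetical, the regularization constant `lam = 1e-4` is an assumed default, and the paper does not report the setting used inside the MEDM.

```python
import math

def simam_weights(channel, lam=1e-4):
    """Per-position attention weights in (0, 1) for one feature-map channel."""
    flat = [v for row in channel for v in row]
    mean = sum(flat) / len(flat)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    n = max(len(flat) - 1, 1)
    variance = sum(d for row in sq_dev for d in row) / n
    # Activations far from the channel mean receive a larger score.
    return [[1.0 / (1.0 + math.exp(-(d / (4.0 * (variance + lam)) + 0.5)))
             for d in row] for row in sq_dev]

def simam(channel, lam=1e-4):
    """Scale each activation by its SimAM attention weight."""
    weights = simam_weights(channel, lam)
    return [[x * w for x, w in zip(x_row, w_row)]
            for x_row, w_row in zip(channel, weights)]

channel = [[0.0, 0.0], [0.0, 4.0]]   # one salient activation
w = simam_weights(channel)
print(w[0][0], w[1][1])              # the salient position gets the larger weight
```

Because the weights are computed from channel statistics alone, this step adds no learnable parameters, which is consistent with the stated aim of avoiding significant computational overhead.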

The MEDM module is inserted between the Neck and Head components of YOLOv8. This placement enhances multi-scale feature fusion, preserves spatial and semantic information, and improves gradient flow, particularly beneficial for detecting small objects in cluttered environments. As a result, the module significantly boosts the model’s accuracy and robustness across diverse and complex scenarios.

The overall architecture of the proposed improved model, incorporating both the Swin-Transformer block and the MEDM module into YOLOv8, is illustrated in Fig. 12.

Model evaluation and ablation experiments

To comprehensively evaluate the performance of the improved model, we analyzed its training convergence, compared the classification performance of the baseline and improved models using confusion matrices, and conducted ablation experiments to quantify the contribution of each proposed component.

Model training and convergence analysis

The training and validation loss curves, along with the mAP progression, are shown in Fig. 13. The loss curves (Fig. 13a) for both training and validation sets decrease rapidly and stabilize after approximately 20 epochs. The validation loss remains consistently lower than the training loss, indicating stable model convergence without signs of overfitting. The mAP curves (Fig. 13b) show a steady increase with the training epochs, where mAP@0.5 approaches 1.0 and mAP@0.5:0.95 plateaus around 0.6, demonstrating the model’s effective learning on the complex bolt dataset.

Confusion matrix analysis

The class-wise confusion matrices for the baseline YOLOv8 and the proposed improved model on the test set are presented in Fig. 14. The matrices detail the model performance across three categories: “screw” (intact bolt), “noscrew” (missing bolt), and “background”.

A comparison reveals that the improved model (Fig. 14b) correctly identifies 6,521 “screw” instances, up from the 6,286 identified by the baseline (Fig. 14a). More critically for safety inspection, the number of “noscrew” instances misclassified as “background” (a critical miss) drops sharply from 61 to 19. This demonstrates that the proposed architectural enhancements are particularly effective at preventing missing bolts from being overlooked as background, thereby enhancing inspection reliability.

Ablation experiments

Ablation experiments were conducted to validate the individual contributions of the Swin-Transformer and the Multi-scale and Detail Enhancement Module (MEDM). The results are summarized in Table 6.

Table 6 Results of ablation experiments.


The results demonstrate the distinct role of each component:

1. The Swin-Transformer module primarily improves Precision (from 0.90 to 0.94), indicating its effectiveness in reducing false positives by modeling global contextual relationships.
2. The MEDM helps maintain high precision while slightly improving the mAP@0.5 (from 0.91 to 0.92), which is attributed to its enhanced capability in detecting multi-scale and small objects.
3. The full model, integrating both components, achieves the best performance across the key metrics of Precision (0.97), Recall (0.90), and mAP@0.5 (0.94). This confirms that the Swin-Transformer and MEDM complement each other, with the former enhancing global feature representation and the latter refining multi-scale local detail fusion.

Experimental verification

Accuracy evaluation under multiple angles

This subsection presents an evaluation of the detection accuracy for the proposed improved model, based on YOLOv8 and enhanced with the Swin-Transformer Network and MEDM, across various viewing angles. Quantitative results are summarized in Table 7, while Fig. 15 illustrates example detection outcomes for bolts captured at 0°, 15°, 45°, and 60°.

As shown in Table 7, the model maintains high accuracy at moderate angles, achieving 100% detection rates at 15°, 30°, and 45°, though performance slightly decreases at more extreme angles such as 60° (92%), 75° (90%), and 85° (91%). These results demonstrate the robustness of the proposed method across a range of realistic viewing conditions.

Table 7 Detection accuracy of the proposed model under multiple angles.


Accuracy evaluation under different lighting conditions

This study evaluates the detection accuracy of the improved model under varying lighting conditions. Brightness levels of 30%, 50%, 120%, and 300% were tested to validate model robustness. Example detection results under these conditions are illustrated in Fig. 16, and detailed quantitative accuracy metrics are provided in Table 8.

Table 8 Detection accuracy under varying illumination conditions.


As summarized in Table 8, the model consistently achieves 100% detection accuracy across most brightness levels, including 30%, 50%, 100%, 120%, and 300%. A slight performance decrease to 94% is observed at the highest tested luminance level of 350 cd/m². These results indicate strong adaptability of the proposed method to significant variations in illumination.

Accuracy evaluation under complex backgrounds

To validate the model’s performance under realistic engineering scenarios, we simulated the common practice of applying anti-corrosion coatings to bridge steelwork²⁹. Model structures were painted in red, white, and blue to represent variously coated bolts. These were photographed against complex natural backgrounds (grassy terrain and mixed grass-ground environments) to evaluate bolt detection performance. Figure 17 illustrates the detection effects under these challenging visual conditions, and Table 9 provides a quantitative summary of the accuracy rates across the different background types.

Table 9 Detection accuracy under complex backgrounds with different coating colors.


The experimental results demonstrate that the proposed model achieves consistently high detection accuracy across diverse coating colors and complex natural backgrounds, with all values exceeding 97.2%. This performance highlights the model’s strong generalization capability and practical applicability in real-world bridge inspection scenarios.

Engineering application

Engineering overview

This study applies the trained model to a steel suspension bridge that has been in service for 15 years since its completion in 2010, in order to validate its effectiveness in real-world engineering scenarios. The key structural parameters are as follows: a span of 252.0 m between tower centers, a main cable sag of 25.2 m, and a deck width of 7.0 m³⁰. The east bank tower (29.94 m in height) is situated in Beichuan County, while the west bank tower (32.80 m in height) is located in the Qianyuan Mountain Tourist Scenic Area. The bridge spans the river, effectively connecting these two regions.

The structural system includes 61 pairs of hangers spaced at 4.0 m intervals and a steel truss stiffening girder. The girder consists of two main trusses spaced 7.2 m apart center-to-center, with a vertical height of 2.2 m between the top and bottom chords. Cross-frames are installed at hanger attachment points with a longitudinal spacing of 4.0 m. The stiffening girder is composed of 60 standard segments, each 4.0 m in length, and two non-standard end segments, each measuring 4.78 m. The bridge elevation is shown in Fig. 18.

All truss members are connected at nodal plates using Grade 10.9 S M20 high-strength bolts, with over 50,000 such bolts used throughout the entire bridge. A visual inspection conducted in 2023 identified signs of loosening in certain bolts, suggesting a potential risk of bolt loss or dislodgment. Manually examining such a large number of bolts for defects such as missing fasteners is highly challenging.

To address this issue, a DJI drone was deployed to capture high-resolution images of the bridge, and the model developed in this study was applied to automatically detect bolts and identify missing fasteners. This approach is critical because the inspection environment poses two major challenges: complex visual backgrounds and multi-angle viewing requirements, which align with the core focus of this study.

The visual complexity arises from the varying surface conditions and coatings, as shown in Fig. 19: the hangers are painted white for visibility, while the stiffening girder has a gray coating and is further complicated by extensive water stains, rust marks, and other environmental residues. These factors create a highly heterogeneous background that complicates automated detection.

Furthermore, the inspection inherently involves multi-angle image acquisition. The complex three-dimensional geometry of the bridge structure—including the diverse orientations of hangers, girder surfaces, and nodal connections—requires imagery to be captured from various viewpoints to ensure comprehensive coverage. Operational constraints, such as obstacle avoidance and flight safety regulations, further necessitate that the UAV capture imagery from a range of angles and perspectives, rather than from a single, ideal viewpoint.

Image acquisition

Images were acquired with a DJI drone, whose key specifications are summarized in Table 10. Some capabilities were operated at reduced settings to accommodate constraints in the field environment.

Table 10 DJI drone performance specifications.

Owing to limitations in satellite signal stability and obstacle avoidance capabilities, the DJI drone was unable to operate safely within the internal truss structure. As a result, the inspection was focused on externally accessible regions, particularly the outer surfaces of the truss and the cable suspension zones.

Inspection results and statistical analysis

The on-site inspection results are summarized in Table 11, with representative detection visualizations provided in Fig. 20.

Table 11 Detection performance across structural components.

In the critical hanger section, all 868 designated bolt sets were successfully detected, giving a detection rate of 100%. The favorable imaging conditions in this region, namely high contrast between the bolts and a relatively uniform background, larger bolt size (M22), and wider spacing, collectively contributed to this result, as shown in Fig. 20(a) and (b).

For the more challenging stiffening girder section, which features smaller bolts (M20), denser arrangements, and cluttered backgrounds, the system identified 11,768 out of 11,904 bolt sets, yielding a detection rate of 98.86%. One instance of a missing bolt was successfully flagged within this segment, as shown in Fig. 20.

To provide a comprehensive and objective assessment of the model’s capability across all conditions, the confusion matrix for the complete test set is presented in Fig. 21. This set encompasses both the simple hanger region and the complex stiffening girder area. The matrix confirms the model’s robust but not perfect performance in the general case, offering a realistic representation of its overall reliability.
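
For readers less familiar with how a confusion matrix maps to the precision and recall metrics used in this study, a minimal sketch is given below. The class and function names are our own, and the counts in the usage example are hypothetical placeholders chosen for illustration; they are not the values in Fig. 21.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryCounts:
    """Confusion-matrix cells for one class (e.g. 'missing bolt')."""
    tp: int  # class present and predicted
    fp: int  # class predicted but absent
    fn: int  # class present but not predicted


def precision(c: BinaryCounts) -> float:
    """Share of predictions for the class that are correct."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: BinaryCounts) -> float:
    """Share of true instances of the class that are found."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


if __name__ == "__main__":
    # HYPOTHETICAL counts, for illustration only (not from the article).
    example = BinaryCounts(tp=90, fp=10, fn=30)
    print(f"precision = {precision(example):.2f}")
    print(f"recall    = {recall(example):.2f}")
```

For a missing-bolt class, recall is arguably the more safety-relevant of the two, since a false negative corresponds to an undetected defect.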

Across all inspected regions of the bridge, 12,636 out of 12,772 bolt sets were identified, resulting in an overall detection rate of 98.94%. One missing bolt was confirmed, corresponding to a structural integrity rate of 99.99% for the inspected fasteners. These results demonstrate the practical efficacy and robustness of the proposed method for automated bolt inspection in large-scale infrastructure, while the 136 undetected bolt sets, all in the stiffening girder, highlight the persistent challenges posed by complex structural geometries.
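
The reported rates follow directly from the bolt-set counts given above. The short script below, a sketch with names of our own choosing, reproduces them so that readers can verify the arithmetic or substitute counts from their own inspection.

```python
# Bolt-set counts as reported in the text: (detected, total).
COUNTS = {
    "hanger": (868, 868),
    "stiffening girder": (11_768, 11_904),
}


def detection_rate(detected: int, total: int) -> float:
    """Detected share of bolt sets, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * detected / total


def overall(counts: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum detected and total bolt sets over all components."""
    detected = sum(d for d, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return detected, total


if __name__ == "__main__":
    for name, (d, t) in COUNTS.items():
        print(f"{name}: {d}/{t} = {detection_rate(d, t):.2f}%")
    d, t = overall(COUNTS)
    print(f"overall: {d}/{t} = {detection_rate(d, t):.2f}% ({t - d} sets undetected)")
```

Keeping the per-component counts separate from the pooled total makes it clear that every undetected set comes from the stiffening girder, which the single overall figure would otherwise hide.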

Conclusion

This study developed and validated an improved deep learning-based framework for automated bolt and bolt-missing detection, specifically designed to address the challenges of complex backgrounds and multi-angle perspectives inherent in real-world bridge inspection scenarios. The main conclusions are summarized as follows:

1. A comparative analysis of YOLOv5, YOLOv8, and YOLOv10 models identified YOLOv8 as the most effective baseline, achieving an mAP@0.5 of 0.91, a recall of 0.85, and a precision of 0.9. Its superior balance of detection accuracy and computational efficiency made it the optimal choice for subsequent enhancements.
2. An improved model was proposed based on YOLOv8 by incorporating a Swin-Transformer network for global feature extraction and a MEDM for refined detail processing. The enhanced model demonstrated robust performance across challenging conditions: it achieved 100% accuracy at viewing angles of 15°, 30°, and 45°; maintained over 94% accuracy under illumination levels of 30%, 50%, and 100%; and attained accuracy exceeding 97.2% for bolts with blue, red, and white coatings against complex grassy and mixed backgrounds.
3. In a full-scale engineering application on a suspension bridge, the proposed model analyzed 12,772 bolt sets, achieving a detection rate of 98.94% and identifying one missing bolt. This result supports the model’s practical effectiveness and reliability in real-world inspection scenarios.

Despite the high overall detection rate, the missed detections highlight the persistent challenges posed by complex structural geometries and real-world variability. Future work will focus on comparative experiments with other candidate models, such as Faster R-CNN with FPN, as well as with more recently released YOLO iterations. This evaluation is intended to test the generalizability of our approach and ultimately to improve bolt and defect detection in complex environments.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

Wang, T., Song, G., Liu, S., Li, Y. & Xiao, H. Review of bolted connection monitoring. Int. J. Distrib. Sens. Netw. 9, 871213. https://doi.org/10.1155/2013/871213 (2013).
Xinhua News Agency. The world’s longest-spanning railway arch bridge has been successfully closed. https://www.yidaiyilu.gov.cn/p/74297.html (2018). Accessed 05 Oct 2025.
Suda, M. et al. Development of ultrasonic axial bolting force inspection system for turbine bolts in thermal power plants. Jsme Int. J. ser. Solid Mech. Strength. Mater. 35, 216–219. https://doi.org/10.1299/jsmea1988.35.2_216 (1988).
Yang, J. & Chang, F. K. Detection of bolt loosening in C-C composite thermal protection panels: II. Experimental verification. Smart Mater. Struct. 15, 591–599 (2006). http://stacks.iop.org/SMS/15/591
Okugawa, M. Bolt loosening detection method by using smart washer adopted 4SID. Japan Soc. Mech. Eng. https://doi.org/10.2514/6.2004-1981 (2003).
Cha, Y. J., You, K. & Choi, W. Vision-based detection of loosened bolts using the Hough transform and support vector machines. Autom. Constr. 71, 181–188. https://doi.org/10.1016/j.autcon.2016.06.008 (2016).
Wang, B. L. & Yu, L. Loose fault detection for fastening bolts of medium and low speed maglev F_rail. Information Technology 43, 88–92 + 97. 10.13274/j.cnki.hdzj.08.021 (2019).
Zhao, X. X., Qian, S. S. & Liu, X. G. Image identification method for high-strength bolt missing on railway bridge based on convolution neural network. China Railway Sci. 39, 56–62. https://doi.org/10.3969/j.issn.1001-4632.2018.04.09 (2018).
Zhou, J. & Huo, L. Computer vision-based detection for delayed fracture of bolts in steel bridges. J. Sens. https://doi.org/10.1155/2021/8325398 (2021).
Li, X. X., Li, D., Ren, W. X. & Zhang, J. S. Loosening Identification of multi-bolt connections based on wavelet transform and ResNet-50 convolutional neural network. Sensors 22, 6825. https://doi.org/10.3390/s22186825 (2022).
Yang, Z., Zhao, Y. & Xu, C. Detection of missing bolts for engineering structures in natural environment using machine vision and deep learning. Sensors 23, 5655. https://doi.org/10.3390/s23125655 (2023).
Ni, Y., Mao, J. & Wang, Y. X. Z. Corroded and loosened bolt detection of steel bolted joints based on improved you only look once network and line segment detector. Smart structures and systems 32, 23–35. https://doi.org/10.12989/sss.2023.32.1.023 (2023).
Chen, X. R., Zhou, Y., Zhao, Y. T. & Yan, X. F. An improved YOLOv5-based bolt missing detection method for mobile terminals. Mod. Manuf. Eng. 11, 108–114. https://doi.org/10.16731/j.cnki.1671-3133.2022.11.018 (2022).
Pan, X., Tavasoli, S. & Yang, T. Y. Autonomous 3D vision-based bolt loosening assessment using micro aerial vehicles. Computer-Aided Civil and Infrastructure Engineering 38, 2443–2454. https://doi.org/10.1111/mice.13023 (2023).
Lao, W., Cui, C., Zhang, D., Zhang, Q. & Bao, Y. Computer vision-based autonomous method for quantitative detection of loose bolts in bolted connections of steel structures. Struct. Control Health Monit. https://doi.org/10.1155/2023/8817058 (2023).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. Proc. IEEE/CVF Int. Conf. Comput. Vis. https://doi.org/10.48550/arXiv.2103.14030 (2021).
Aggarwal, A., Mittal, M. & Battineni, G. Generative adversarial network: An overview of theory and applications. Int. J. Inform. Manage. Data Insights. https://doi.org/10.1016/j.jjimei.2020.100004 (2021).
Liao, C. et al. Benchmarking multi-modal semantic segmentation under sensor failures: Missing and noisy modality robustness. Proc. Comput. Vis. Pattern Recognit. Conf. https://doi.org/10.48550/arXiv.2503.18445 (2025).
Vasiljevic, I., Chakrabarti, A. & Shakhnarovich, G. Examining the impact of blur on recognition by convolutional networks. (2016). https://doi.org/10.48550/arXiv.1611.05760
Wang, W., Chen, Z., Yuan, X. & Wu, X. Adaptive image enhancement method for correcting low-illumination images. Inf. Sci. 496, 25–41. https://doi.org/10.1016/j.ins.2019.05.015 (2019).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition. (2016). https://doi.org/10.48550/arXiv.1506.02640
Sun, Y., Li, M., Dong, R., Chen, W. & Jiang, D. Vision-based detection of bolt loosening using YOLOv5. Sensors 22, 5184. https://doi.org/10.3390/s22145184 (2022).
Yu, F. et al. Imaging-based instance segmentation of pavement cracks using an improved YOLOv8 network. Struct. Control Health Monit. 1660649. https://doi.org/10.1155/stc/1660649 (2025).
Lei, W. et al. Vision-based real-time bolt loosening detection by identifying anti-loosening lines. Sensors 24, 6747. https://doi.org/10.3390/s24206747 (2024).
Sajjadi, M. S. M. et al. Assessing generative models via precision and recall. Advances in neural information processing systems. (2018). https://doi.org/10.48550/arXiv.1806.00035
Wang, B. A parallel implementation of computing mean average precision. ArXiv https://doi.org/10.48550/arXiv.2206.09504 (2022).
Yu, J., Jiang, Y., Wang, Z., Cao, Z. & Huang, T. Unitbox: An advanced object detection network. Proc. 24th ACM Int. Conf. Multimedia. https://doi.org/10.1145/2964284.2967274 (2016).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision. 10012–10022. (2021). https://doi.org/10.48550/arXiv.2103.14030
Du, W. F. Anti-corrosion coating treatment measures for highway and bridge guardrail. Total Corrosion Control 37, 121–124. https://doi.org/10.13726/j.cnki.11-2706/tq.2023.05.121.04 (2023).
Li, Q. S. Design for Nezha suspension bridge in Jiangyou city of Sichuan province. Transport Research 09, 47–50. https://doi.org/10.3869/j.issn.1002-4786.2011.09.016 (2011).

Funding

This work was supported by the Sichuan Science and Technology Program under Grant No. 2025ZYDF080.

Author information

Authors and Affiliations

School of Civil Engineering and Architecture, Southwest University of Science and Technology, Mianyang, 621010, China
Ying Gu, Dongmei Peng, Songbo Ren & Chao Kong
Tianfu New Area General Aviation Profession Academy, Meishan, 620000, China
Jingyu Song

Contributions

Ying Gu (Y.G.): Conceptualization, methodology, investigation, supervision, project administration. Jingyu Song (J.Y.S): Data curation, data augmentation, formal analysis. Dongmei Peng (D.M.P): Model development, comparative experiments, validation. Chao Kong (C.K.) & Songbo Ren (S.B.R): Funding acquisition, visualization, writing–review & editing. All authors have reviewed and approved the final version of the manuscript.

Corresponding author

Correspondence to Ying Gu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Gu, Y., Peng, D., Song, J. et al. Image-based detection of bolts and bolt-missing defects in multi-angle and complex background scenarios. Sci Rep 16, 11590 (2026). https://doi.org/10.1038/s41598-026-41036-x

Received: 19 November 2025
Accepted: 17 February 2026
Published: 02 March 2026
Version of record: 07 April 2026
DOI: https://doi.org/10.1038/s41598-026-41036-x
