Introduction

Steel plates are an important raw material for manufacturing-oriented industries and are used across a wide range of sectors, including shipbuilding, automotive, semiconductor, aerospace, robotics and building materials. Because these industries demand high-quality steel plates, it is critical to accurately detect defects on the plate surface1. During steel sheet production, defects such as crazing, inclusions, patches, pitted surface, rolled-in scale and scratches arise from factors such as material quality, the manufacturing process and production equipment2,3,4. These defects degrade the quality of the steel plate, making surface defect detection all the more important. To improve steel plate quality and increase production efficiency, industry has traditionally relied on manual visual inspection and statistical modelling methods to detect surface defects. Manual visual inspection is highly subjective and cannot detect defects in real time. Statistical modelling, a machine vision-based approach that replaces manual inspection with industrial cameras, addresses subjective defect judgment and enables real-time detection. However, it requires manually designed algorithms to extract defect features, and a new algorithm must be designed for every slight change. In recent years, deep learning has developed rapidly, and CNN-based object detection has been adopted across industries5. In particular, many studies in the field of defect detection have achieved results well beyond traditional methods. 
Among these CNN-based defect detection models, YOLO is by far the most utilized in the field of defect detection, and it occupies a leading position by improving its technology through continuous updates such as YOLOv5, YOLOv8 and YOLOv96. YOLO7 is a one-stage network that performs class prediction and bounding box coordinate regression on feature maps to reduce network size and balance accuracy and inference speed. Although various studies utilizing YOLO-based object detection models have laid the foundation for defect detection, they still have the following limitations. First, they focus on improving the performance of a single YOLO model, which limits the learning of defect features from different perspectives. Second, the similarity between defects and the background, as well as small defect sizes, hinders defect detection. Third, they do not effectively address fundamental issues such as the irregular shape and size of defects and the similarity between different classes8. Furthermore, one-stage detectors like YOLO have fundamental limitations when applied to complex industrial defects. While efficient, these models force a trade-off by simultaneously optimizing two conflicting objectives: localization (focusing on object boundaries) and classification (focusing on fine-grained textures). In steel surface inspection, where defects of distinct categories often share similar visual patterns, this coupled approach can lead to suboptimal performance in both tasks. To overcome this, we adopt a decoupled two-stage architecture. By dedicating a binary Fusion YOLO solely to precise localization and a separate Ensemble CNN + ViT model to classification, our framework minimizes the interference between these tasks. This design prioritizes the reduction of false negatives (missed defects) over the simplicity of a single-stage architecture. 
Implementing this approach, we propose an explainable Hybrid AI CAD framework specifically designed to target the intrinsic characteristics of steel surface defects. Within this framework, we introduce DCBS-YOLO, an enhanced detection architecture. By integrating SimAM, a parameter-free attention module, we refine feature representations without adding computational complexity, while DCNv3 enables dynamic adjustment of the receptive fields for irregular defect shapes such as crazing. In addition, a Background Suppression Module (BSM) is incorporated to emphasize defects under low-contrast steel surface backgrounds. Through our experiments, we observed that a binary YOLO detector trained to localize defects without class differentiation detects a larger number of defect regions than a multi-class detector, and that combining multiple binary YOLO models enables more precise localization of defect areas. To further maximize the synergy among multiple detectors, we employ Weighted Boxes Fusion (WBF). Unlike traditional Non-Maximum Suppression (NMS), which simply discards overlapping predictions, WBF aggregates them into a single, more reliable bounding box, thereby providing more robust defect localization. Finally, accurate defect identification is crucial, particularly when visually similar defect types are involved. Conventional ensemble strategies such as soft or hard voting rely only on aggregating final prediction scores, and thus cannot fully exploit the complementary intermediate representations learned by different models. Therefore, we adopt a hybrid ensemble of CNNs and a Vision Transformer (ViT) that jointly learns locally focused texture descriptors from CNN backbones and globally aware self-attention–based representations from ViT. This hybrid model effectively captures fine-grained textures and long-range spatial dependencies, reducing misclassification among visually similar defect categories. 
The entire framework is optimized through an MLOps pipeline and provides explainability via Grad-CAM–based importance maps. Figure 1 illustrates the industrial steel surface inspection workflow and highlights the need for a CAD framework that integrates preprocessing, detection, classification, and analysis.

Fig. 1

Industrial inspection of steel surface defects.

The main contributions of this research are as follows:

(1) A novel explainable hybrid CAD framework is proposed for simultaneous steel surface defect detection and classification, improving prediction accuracy and reducing missed detections.

(2) DCBS-YOLO is presented as a defect-specific detector integrating DCNv3, SimAM and BSM to adaptively capture irregular defect shapes and suppress low-contrast background interference on steel surfaces.

(3) A YOLO-based late fusion strategy is applied to multiple class-agnostic detectors via weighted boxes fusion (WBF) to improve the robustness of defect region detection and reduce missed detections.

(4) A hybrid classifier, an ensemble of CNNs and a ViT, is proposed for feature-level fusion, leveraging the CNNs' local texture sensitivity and the ViT's global context modeling to mitigate misclassification among visually similar defect types.

(5) The proposed preprocessing module enhances defect visibility and contrast while preserving overall image quality, leading to consistent performance gains in both detection and classification stages.

The overall organization of this study is as follows. First, the related works are briefly reviewed, followed by a detailed explanation of the proposed methodology. Next, the experimental results are presented, and the findings are discussed along with potential directions for future research. The paper concludes with a summary of the overall contributions and closing remarks.

Related works

YOLO-based methods

For steel surface defect detection in the manufacturing industry, YOLO is the most popular detection model. Based on YOLOv5, Zhao et al.9 replaced the backbone component with the Res2Net block to expand the receptive field and extract features at different scales, learning defects of different shapes. Lu et al.10 designed a C2f-DSC module with dynamic snake convolution based on YOLOv8 to adaptively adjust the receptive field and improve detection of differently shaped defects. Huang et al.11 integrated a channel attention module into the backbone convolutional network of YOLOv5s to improve feature extraction for small defects and their detection, and fused the Swin Transformer module into the neck to detect defects of different sizes and multiple scales. Xie et al.12 designed LighterMSMC, a lightweight multiscale feature extraction module based on YOLOv8, which lightens the backbone network while effectively preserving long-range feature dependencies. Liu et al.13 integrated the CoTNet Transformer module into the backbone feature extraction of YOLOv5s and applied the adaptive spatial feature fusion (ASFF) algorithm in the prediction stage to improve accuracy. Li et al.14 applied a gradient-enriched CSPCrossLayer module and Shuffle Attention to YOLOX to mitigate the performance degradation caused by the similarity between defects and the background. Guo et al.15 improved defect detection performance by adding a Transformer-based TRANS module to the YOLOv5 backbone and detection head, combining features with global information. Zhang et al.16 applied the DsPAN module with an attention mechanism to YOLOv8 to improve the detection of small-sized defects. 
Li et al.17 proposed a More Efficient Channel Attention (MECA) module that simultaneously performs maximum pooling and average pooling operations and combines the results to learn the relationship between channels. With this module, the defect detection performance is improved by combining the details of the defects with contextual features. Su et al.18 proposed DAF-CA, a plug-and-play coordinate attention that combines average pooling and maximum pooling based on YOLOv3. It can effectively extract and emphasize the features of defects. Although various studies have been conducted on steel surface defect detection using YOLO, most of the recent works try to improve the defect detection performance of a single model by changing the backbone network of YOLO or adding new attention modules. In this work, we show that in addition to improving performance with a single model, better defect detection performance can be achieved by combining multiple optimized models.

Deep ensemble classification methods

Deep ensemble classification performs very well in cancer and disease classification in the medical field, and many studies have proven its usefulness. In the medical domain, distinguishing between normal and abnormal tissue is a very difficult task, and it resembles defect detection in that the targets lack a fixed shape. Defects are likewise hard to classify because they are difficult to distinguish from the background and have no fixed shape. We therefore review previous work in the medical field and apply these methodologies to steel plate surface defect detection. Qureshi et al.19 proposed an Ensemble CNN architecture that ensembles multiple CNN models and combines auxiliary data, in the form of metadata associated with input images, through a meta-learner to improve skin cancer prediction performance. Nakata et al.20 improved the performance of liver mass classification in ultrasound images with ensemble techniques such as soft voting, weighted-average voting, weighted hard voting and stacking of multiple CNNs. Patil et al.21 improved final classification performance by ensemble learning on the extracted features of two CNN models, a shallow CNN and VGG16, to classify three types of tumors. Zheng et al.22 achieved high accuracy by ensembling the four best-performing CNN models for binary classification in breast cancer diagnosis, combining the predictions of multiple models by converting each model's accuracy into weights. Ukwuoma et al.23 designed an ensemble CNN network based on feature fusion, combined with a transformer encoder, for accurate pneumonia identification from chest X-ray images. Loddo et al.24 applied an ensemble bagging methodology to three models, AlexNet, ResNet101 and Inception-ResNet-v2, for binary classification in Alzheimer's disease diagnosis, improving classification performance. 
Kang et al.25 combined features from three CNN models, DenseNet121, ResNeXt101 and MnasNet, for brain tumor classification and then classified them using SVM, a machine learning method. The fusion of these feature ensembles with machine learning techniques improved the performance of brain tumor classification. Aurna et al.26 improved brain tumor classification by extracting features from the three best-performing CNN models among several candidates, selecting the important features through PCA and feeding them to the classifier.

Methodology

The proposed explainable hybrid AI CAD framework

We introduce an explainable hybrid AI CAD framework that encompasses several stages, including the preprocessing of surface defect images, auto hyperparameter tuning, defect region detection, classification and Grad-CAM technology. Figure 2 illustrates the entire process of the proposed CAD framework, which begins with the preprocessing of input images, followed by defect region detection using Fusion YOLO and defect classification via an Ensemble CNN and ViT based on self-attention. Additionally, Explainable AI (XAI) techniques are employed to further detect and analyze defects missed by Fusion YOLO. The proposed CAD framework is designed to handle the entire process of defect detection and analysis, from image preprocessing to detection, classification and the identification of potentially missed defects. Specifically, the framework takes an RGB image of 640 × 640 pixels as input and utilizes the preprocessing module to enhance the visibility of defects, thereby facilitating more accurate defect detection. The preprocessed image is then passed into Fusion YOLO for defect region detection, where the detected regions are combined using the Weighted Boxes Fusion (WBF) algorithm27. If defect regions are successfully detected, the image of these regions is forwarded to Stage 1 of the classification stage, where the defect region processing (DRP) algorithm is used to crop and resize the detected defect areas. The processed defect regions are then input to the proposed Ensemble CNN + ViT for defect classification. The output of Stage 1 includes the bounding box coordinates of the defect region, the defect type, a confidence score and a heat map generated using Grad-CAM27. In cases where defect region detection fails, the entire image is input into Stage 2, where the proposed Ensemble CNN + ViT is used to classify the defect type, produce the heat map via Grad-CAM and generate bounding box coordinates based on the heat map. 
This two-stage approach ensures that the proposed CAD framework not only performs defect detection and classification but also improves robustness by detecting and analyzing any undetected defect areas through Stage 2.

Fig. 2

The proposed CAD framework: End-to-end abstract overview.

Dataset

For the experimental validation of this study, the NEU-DET and GC10-DET datasets were used to train and evaluate all models involved in the experiments. The NEU-DET dataset, collected by Northeastern University, is specifically designed for detecting defects in steel strips during the manufacturing process28. It contains six common surface defect types: Crazing (Cr), Inclusion (In), Patches (Pa), Pitted Surface (Ps), Rolled-in Scale (Rs) and Scratches (Sc). Figure 3 presents a visualization of defect images from the NEU-DET dataset. The dataset consists of 1,800 grayscale images of 640 × 640 pixels, with 300 images per defect type. For this study, the dataset was split into 70% training, 10% validation and 20% testing sets, resulting in 1,260 images for training, 180 for validation and 360 for testing.

Fig. 3

Samples of NEU-DET dataset defects: (a) Crazing, (b) Inclusion, (c) Patches, (d) Pitted Surface, (e) Rolled-in Scale, (f) Scratches. The red boxes represent the ground-truth (GT) area of the defect.

The GC10-DET dataset29 consists of ten surface defect types encountered during the manufacturing process: Crease (Cr), Crescent_gap (Cg), Inclusion (In), Oil_spot (Os), Punching_hole (Pu), Rolled_pit (Rp), Silk_spot (Ss), Waist-folding (Wf), Water_spot (Ws) and Welding_line (Wl). Figure 4 presents a visualization of defect images from the GC10-DET dataset. The dataset consists of 2,280 grayscale images of 2048 × 1000 pixels. For the verification of the proposed CAD framework, the dataset was split into 70% training, 10% validation and 20% testing sets, resulting in 1,594 images for training, 230 for validation and 456 for testing.

Fig. 4

Samples of GC10-DET dataset defects: (a) Crease, (b) Crescent_gap, (c) Inclusion, (d) Oil_spot, (e) Punching_hole, (f) Rolled_pit, (g) Silk_spot, (h) Waist_folding, (i) Water_spot, (j) Welding_line. The red boxes represent the ground-truth (GT) area of the defect.

Preprocessing

Real-world steel plate surface defect images, like those in the NEU-DET dataset, often contain diverse defect shapes and subtle features that are hard to distinguish from the background, leading to poor detection performance. To address this, we propose a novel preprocessing module that enhances defect visibility without significantly altering the original image as shown in Fig. 5. The process begins by converting the RGB image to LAB color space, isolating the L channel for brightness and removing the A and B channels. The L channel is then processed using a Gaussian Filter to reduce noise, followed by a Median Filter to eliminate extreme noise while preserving boundaries. Next, Contrast Limited Adaptive Histogram Equalization (CLAHE) and Gamma Correction are applied to improve contrast and brightness, making defects more visible. A sharpening filter is then used to accentuate defect boundaries and morphological operations further clarify the defect’s structure. Finally, the processed L channel is blended with the original to create a final RGB image. This method emphasizes defects without distorting key features. Figure 6 compares the original and preprocessed images, showing improvements in defect visibility while preserving image quality, as confirmed by SSIM, PSNR and SNR metrics as shown in Table 1.
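As a rough illustration of the contrast and sharpening steps, the following NumPy sketch applies gamma correction, a box-blur-based unsharp mask and blending to an L channel. The function names and parameter values are our own; the full module additionally uses Gaussian/median filtering, CLAHE and morphological operations (typically via OpenCV), which are omitted here.

```python
import numpy as np

def gamma_correction(l, gamma=0.8):
    """Adjust brightness of the L channel; values assumed in [0, 255]."""
    norm = l / 255.0
    return np.clip(255.0 * norm ** gamma, 0, 255)

def unsharp_mask(l, amount=1.0):
    """Sharpen by adding back the difference from a 3x3 box blur (edge-padded)."""
    padded = np.pad(l, 1, mode="edge")
    blur = sum(padded[i:i + l.shape[0], j:j + l.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    return np.clip(l + amount * (l - blur), 0, 255)

def enhance_l_channel(l, gamma=0.8, amount=1.0, blend=0.6):
    """Gamma-correct, sharpen, then blend the result with the original channel."""
    processed = unsharp_mask(gamma_correction(l, gamma), amount)
    return np.clip(blend * processed + (1 - blend) * l, 0, 255)
```

The final blend step mirrors the design goal stated above: defects are emphasized without fully discarding the original pixel values.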

Fig. 5

The Proposed preprocessing module.

Fig. 6

Visual comparison between the original and pre-processed images.

Table 1 The impact of image preprocessing on image quality is analyzed in terms of SSIM, PSNR, and SNR. The table displays the average improvements per class of the NEU-DET dataset.

MLOps to auto-select the best AI models for detection and classification

In AI and machine learning, experiment management and model optimization are essential for achieving optimal results. Auto hyperparameter tuning, a key MLOps methodology, plays a crucial role in improving model performance, as model outcomes are highly influenced by hyper-parameter settings. However, with numerous possible hyper-parameter combinations, manually identifying the optimal settings is both inefficient and time-consuming. To address this, we utilize the Weights & Biases (W&B) auto hyperparameter tuning framework to determine the best defect detection and classification models for our proposed steel surface defect detection CAD. Figure 7 illustrates the auto hyperparameter tuning process in this context, followed by a detailed explanation of the detection and classification stages.

Fig. 7

Comprehensive workflow of the auto-hyperparameter tuning (MLOps).

MLOps for detection stage: automatic hyper-parameter optimization

To identify the optimal defect detection model, we select various YOLO architectures, including YOLOv5, YOLOv8 and YOLOv9, for experimentation. Each model series contains different sizes and structures, such as versions n, s, m, l and x for YOLOv5 and YOLOv8 and t, s, m, c and e for YOLOv9. We experiment with combinations of hyper-parameters like batch size, epochs, lr0 (initial learning rate), momentum, optimizer and weight decay. These hyper-parameters are crucial for the training process, as batch size affects learning speed and memory usage, lr0 impacts convergence speed, momentum enhances stability, the optimizer defines the optimization method and weight decay helps prevent overfitting by improving generalization. To efficiently explore the hyper-parameter space, we use Bayesian optimization, which reduces the number of experiments needed compared to grid search (which exhaustively checks all combinations) and random search (which selects randomly within a range).
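A Bayesian sweep over these hyper-parameters can be expressed as a configuration dictionary of the kind consumed by the W&B sweep API. The metric name, value ranges and project name below are illustrative, not the exact settings used in our experiments.

```python
# Hypothetical W&B sweep configuration (names and ranges are illustrative).
sweep_config = {
    "method": "bayes",  # Bayesian optimization over the search space
    "metric": {"name": "mAP50", "goal": "maximize"},
    "parameters": {
        "batch_size":   {"values": [8, 16, 32]},
        "epochs":       {"values": [100, 200, 300]},
        "lr0":          {"min": 1e-4, "max": 1e-1},
        "momentum":     {"min": 0.8, "max": 0.98},
        "optimizer":    {"values": ["SGD", "Adam", "AdamW"]},
        "weight_decay": {"min": 0.0, "max": 1e-3},
    },
}

# In practice the sweep would be launched with the wandb client, e.g.:
# sweep_id = wandb.sweep(sweep_config, project="steel-defect-detection")
# wandb.agent(sweep_id, function=train_yolo)
```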

MLOps for classification stage: automatic hyper-parameter optimization

Since CNN models, like defect detection models, depend on a combination of hyper-parameters for their performance, careful tuning is necessary to achieve optimal results. We use the same framework as in the auto hyperparameter tuning of the detection stage and explore key hyper-parameter combinations such as batch size, epochs, lr and weight decay. A total of 15 CNN models are used in our experiments: Resnet3429, Resnet5029, Resnet10129, Resnet15229, EfficientnetB030, EfficientnetB130, EfficientnetB230, EfficientnetB330, EfficientnetB430, EfficientnetB530, Densenet12131, Resnext50_32×4d32, Resnext101_32×8d32, SE Resnet5033 and Wide Resnet34. The models were chosen from among the most popular CNN architectures.

Detection fusion

We propose Fusion YOLO, which combines several high-performing YOLO models focused on defect region detection to maximize detection performance within the YOLO-based detection stage. The proposed Fusion YOLO performs only binary detection, without classifying defect types, focusing on effectively identifying the presence and location of defects, the most important task in a defect inspection framework. Fusion YOLO combines the best-performing YOLO models trained for binary detection with the DCBS-YOLO model, which is designed around defect characteristics on top of the optimal YOLO model selected in the MLOps detection stage, to maximize defect region detection performance. A detailed description of DCBS-YOLO is given in the next section.

DCBS-YOLO

We propose a new defect detection model, DCBS-YOLO, for detecting defects of different shapes and sizes. Figure 8 shows the architecture of the proposed model. DCBS-YOLO is built on YOLOv9c, which was selected as the optimal model through the auto hyperparameter tuning process in the MLOps detection stage, and is equipped with RepNCSPELAN4_DCNv3, which replaces the convolution in the standard RepNCSPELAN4 module with DCNv3 (Deformable Convolution v3)35 to improve the detection of defects of various shapes and sizes. DCNv3 extends the standard convolution operation, which uses a fixed kernel size in traditional CNNs, and can effectively learn various types of defects by applying spatial deformation. RepNCSPELAN4_DCNv3 modules are placed at layers 2, 4, 6 and 8 of the backbone network, where they can most precisely extract the important features of defects, from fine textures to overall structural patterns. RepNCSPELAN4_DCNv3 is organized as Conv1 × 1, Split, Bottleneck*2, Concatenate, Conv1 × 1, where Bottleneck refers to RepBottleneck_DCNv3. RepBottleneck_DCNv3 is composed of a DCNv3 block and SimAM36, enhancing the features extracted by the DCNv3 block with SimAM. SimAM is an attention mechanism inspired by neurobiological research, designed to mimic neural inhibition in the visual cortex. It infers three-dimensional attention weights for the feature map without any additional learnable parameters, determining the importance of each neuron by comparing its activation value with that of its neighboring regions. Important features are thereby enhanced and unnecessary features suppressed without incurring any additional cost during training. The SimAM energy function is given in Eq. (1),

$${e}_{t}^{*}=\frac{4({\widehat{\sigma }}^{2}+\lambda )}{{\left(t-\widehat{\mu }\right)}^{2}+2{\widehat{\sigma }}^{2}+2\lambda },$$
(1)

where \(t\) is the target neuron's value, \(\widehat{\mu }\) and \({\widehat{\sigma }}^{2}\) are the mean and variance of the neurons within the channel, \(\lambda\) is a regularization parameter and \({\left(t-\widehat{\mu }\right)}^{2}\) quantifies the difference between the target neuron and the mean of its surrounding neurons. As this difference increases (i.e., as the target neuron becomes more distinct from its surroundings), \({e}_{t}^{*}\) decreases; a low \({e}_{t}^{*}\) value therefore indicates that the neuron carries important features. By using the inverse of \({e}_{t}^{*}\) as the attention weight, important features receive higher weights and are strengthened. To suppress background areas that interfere with accurate defect detection, a background suppression module (BSM)37 is introduced in the first layer of the proposed DCBS-YOLO. BSM is designed to address the similarity between defects and the background: it computes the difference between the input feature map and its global average pooling value and applies a kernel function to this difference to assign weights. Based on the arctangent function, the kernel assigns smaller weights as the feature difference increases, thereby effectively suppressing background areas. The BSM formulas are given in Eqs. (2) and (3),

Fig. 8

Model architecture of the proposed DCBS-YOLO.

$$Out=In+\left(In-Avg\right)*K\left(\left|In-Avg\right|\right).$$
(2)
$$K\left(x\right)=-\frac{1}{\pi }\arctan\left(\left|x\right|\right)+\frac{1}{2}.$$
(3)
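A minimal NumPy sketch of Eqs. (1)-(3), assuming channel-wise statistics over spatial positions, is shown below; in the actual network both modules operate on framework tensors inside the backbone, and the sigmoid squashing of the inverse energy follows the original SimAM implementation.

```python
import numpy as np

def simam_weights(x, lam=1e-4):
    """Per-neuron SimAM attention from the energy in Eq. (1).
    x: feature map of shape (C, H, W); lower energy => more important."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    e = 4.0 * (var + lam) / ((x - mu) ** 2 + 2.0 * var + 2.0 * lam)
    # Inverse energy as attention, squashed to (0, 1) with a sigmoid.
    return 1.0 / (1.0 + np.exp(-1.0 / e))

def bsm(x):
    """Background Suppression Module, Eqs. (2)-(3)."""
    avg = x.mean(axis=(1, 2), keepdims=True)     # global average pooling
    diff = x - avg
    k = -np.arctan(np.abs(diff)) / np.pi + 0.5   # large difference -> small weight
    return x + diff * k
```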

Fusion YOLO

In this study, we propose Fusion YOLO, which combines the top three YOLO models (i.e., the proposed DCBS-YOLO, YOLOv9c and YOLOv8s) to further improve detection performance. These detection models are fused using Weighted Boxes Fusion, as summarized in Algorithm 1. All three models perform binary detection without categorizing defect types, so they can detect defect regions that traditional multi-class YOLO models miss. The proposed DCBS-YOLO has the highest defect region detection performance among the three, as it is carefully designed around defect characteristics; YOLOv9c and YOLOv8s complement it by detecting defect regions that DCBS-YOLO misses. This improves overall defect detection performance compared to using a single detection model. Each selected model is trained separately and then fed into Fusion YOLO. When a defect image is input, Fusion YOLO outputs detection results comprising bounding box coordinates of defect regions and box scores from each of the three YOLO models. The bounding boxes from each model are then combined, weighted by box score, using the WBF algorithm. Unlike NMS or Soft-NMS, which discard all overlapping boxes except the one with the highest box score, WBF uses every bounding box in a set of overlapping boxes to perform the fusion, as shown in Algorithm 1. 
Given a set of detection results \(B={\left\{\left({b}_{i},{c}_{i}\right)\right\}}_{i=1}^{n}\) consisting of bounding boxes \({b}_{i}\) and their corresponding box scores \({c}_{i}\), WBF iteratively merges overlapping boxes using an IoU threshold \(\tau\) and a box score threshold \(\sigma\). At each iteration, the algorithm selects the bounding box \({b}^{*}\) with the highest box score and identifies the set of boxes overlapping it, \(O=\left\{\left(b,c\right)\in B \mid IoU\left(b,{b}^{*}\right)>\tau \right\}\). The fusion uses the box scores as weights: \(\overline{b} = \sum \left( {c_{i} b_{i} } \right)/\sum {c_{i} }\) and \(\overline{c} = \sum \left( {c_{i}^{2} } \right)/\sum {c_{i} }\), computed over \(\left( {b_{i} ,c_{i} } \right)\in O\), where \(\overline{b}\) denotes the coordinates of the fused bounding box and \(\overline{c}\) its box score. Only fused boxes whose scores exceed \(\sigma\) are included in the final prediction. This process continues until all bounding boxes are processed. Also, unlike traditional algorithms, our WBF does not use class information to perform the fusion.

Algorithm 1

Weighted Box Fusion (WBF).
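The fusion step of Algorithm 1 can be sketched in a simplified, class-agnostic form as follows; the thresholds and function names are illustrative.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def wbf(boxes, scores, iou_thr=0.55, score_thr=0.1):
    """Class-agnostic weighted boxes fusion over the pooled model outputs."""
    order = np.argsort(scores)[::-1]                 # highest score first
    boxes = np.asarray(boxes, dtype=float)[order]
    scores = np.asarray(scores, dtype=float)[order]
    used = np.zeros(len(boxes), dtype=bool)
    fused = []
    for i in range(len(boxes)):
        if used[i]:
            continue
        # Cluster: the current top box plus all unused boxes overlapping it.
        idx = [j for j in range(len(boxes))
               if not used[j] and iou(boxes[i], boxes[j]) > iou_thr]
        used[idx] = True
        c, b = scores[idx], boxes[idx]
        fb = (c[:, None] * b).sum(axis=0) / c.sum()  # score-weighted coordinates
        fc = (c ** 2).sum() / c.sum()                # fused box score
        if fc > score_thr:
            fused.append((fb, fc))
    return fused
```

Two overlapping predictions of the same region are merged into one box whose coordinates lean toward the higher-scoring prediction, while isolated boxes pass through unchanged.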

Ensemble classification

Defects lack a fixed color or shape, making it challenging to fully analyze their features with a single CNN model. To address this, we propose an ensemble CNN based on a feature-ensemble approach, which combines diverse high-dimensional features from top-performing CNN models to enhance classification performance. Additionally, we integrate this ensemble CNN with a Vision Transformer (ViT) leveraging a self-attention mechanism to further improve classification accuracy. As illustrated in Fig. 9, the ensemble CNN model is constructed from the three best-performing classifiers (i.e., ResNet101, EfficientNetB5 and ResNeXt101_32×8d) selected as described in the MLOps classification stage. The model extracts backbone features by removing the classifier layers, then applies adaptive average pooling (AAP) to convert the features into a 2D format. These high-level features are concatenated and fed into the ViT for further processing. The proposed Ensemble CNN + ViT model is designed for end-to-end training, optimizing all network components simultaneously. This integrated approach eliminates the need for separate feature extraction and training steps, allowing the CNN and ViT modules to operate as a unified network, which improves learning efficiency and facilitates more sophisticated and diverse representations of defect features. Input images are preprocessed using the defect region processing (DRP) algorithm, which applies consistent cropping and resizing rules across all defect classes to ensure unbiased processing. Specifically, it crops the bounding box region identified by Fusion YOLO, resizes it to fit 224 × 224 while preserving the aspect ratio, and applies zero-padding to preserve the defect's features. Example outputs of the DRP algorithm are shown in Fig. 10. This preprocessing preserves the integrity of defect characteristics while enabling an accurate and efficient feature-space ensemble classification model.
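The crop-resize-pad logic of the DRP step can be sketched as follows, assuming boxes in (x1, y1, x2, y2) pixel coordinates and a nearest-neighbor resize for brevity (the actual implementation may use a different interpolation).

```python
import numpy as np

def nn_resize(img, new_h, new_w):
    """Nearest-neighbor resize for a (H, W) or (H, W, C) array."""
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows][:, cols]

def drp(image, box, size=224):
    """Crop the fused bounding box, resize with preserved aspect ratio,
    and zero-pad to size x size (illustrative version of the DRP step)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    scale = size / max(h, w)  # fit the longer side to the target size
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    resized = nn_resize(crop, new_h, new_w)
    out = np.zeros((size, size) + crop.shape[2:], dtype=crop.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized  # centre on zero padding
    return out
```

Zero-padding, rather than stretching, keeps the defect's aspect ratio intact so the classifier sees undistorted shapes.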

Fig. 9

Comprehensive architectural of the proposed Ensemble CNN + ViT and Defect Region Processing (DRP) algorithm.

Fig. 10

Example of visualization results of applying defect region processing (DRP).

Training flow

This section outlines the training and finalization procedures for our proposed detection and classification models. Initially, we conducted automated hyperparameter tuning across all detection models to identify the optimal model and hyperparameter configurations for peak performance. This process led to the development of DCBS-YOLO, which underwent further hyperparameter optimization to determine the best settings. The top-performing DCBS-YOLO model and the other selected detection models were trained for binary detection, and the three leading models were integrated into Fusion YOLO. Subsequently, we focused on training the classification model. Utilizing images extracted from the ground-truth regions of the NEU-DET dataset and processed through the DRP algorithm, we ensured consistency by employing the same dataset as the detection model. Automated hyperparameter tuning was applied to all classification models using this dataset to select the top three performers. These selected models were then refined by removing their original classifier layers, freezing the parameters of all layers and integrating a pretrained Vision Transformer (ViT) with a new classifier. The comprehensive training workflow for our proposed models and methodology is depicted in Fig. 11. This approach ensures that both detection and classification models are meticulously optimized and harmonized for superior performance.

Fig. 11

Training flow diagram of the proposed models.

Explainable saliency feature maps

Deep learning models address complex societal challenges but are often criticized as opaque “black boxes”, making their decision-making processes difficult to understand. This lack of transparency hinders trust, especially in critical applications like defect detection. To overcome this, we integrate Explainable AI (XAI) techniques, which provide clarity by explaining model predictions and enhancing credibility. In our approach, we use Grad-CAM (Gradient-weighted Class Activation Mapping) to highlight the regions influencing CNN-based classifications38. Grad-CAM generates heatmaps showing the important areas in an image that contribute to the model’s decisions. After detecting and classifying defects in the proposed CAD framework, these heatmaps validate the results and assist users in analyzing defects. If defect detection fails in the initial stage, our Ensemble CNN + ViT model, enhanced with Grad-CAM, identifies and analyzes significant regions in the image. This visual interpretation not only reveals the key features recognized by the model but also provides insights to evaluate performance and guide future improvements.
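The core of Grad-CAM is a small computation: each feature-map channel is weighted by the global average of its gradient with respect to the target class score, and the weighted sum is passed through a ReLU. A framework-free sketch of that formula (with plain lists standing in for the tensors a deep learning framework would supply) looks like this:

```python
def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: K grids of HxW values (activations A^k and
    the gradients dY/dA^k for the target class). Returns the HxW heatmap
    ReLU(sum_k alpha_k * A^k), where alpha_k is the global average of the
    k-th gradient map -- the standard Grad-CAM weighting."""
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    alphas = [sum(sum(row) for row in g) / (H * W) for g in gradients]
    heat = [[0.0] * W for _ in range(H)]
    for a, fmap in zip(alphas, feature_maps):
        for i in range(H):
            for j in range(W):
                heat[i][j] += a * fmap[i][j]
    return [[max(0.0, v) for v in row] for row in heat]  # ReLU keeps positive evidence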

Execution environment

The code was implemented using Python 3.9 and the experiments were conducted on a server equipped with a CPU (Intel i5-13500) and a GPU (NVIDIA GeForce RTX 3090). The GPU environment utilized CUDA version 11.8 and cuDNN version 8.9.7.

Evaluation metrics

Detection stage

The experiments used Precision, Recall, AP (Average Precision) and F1-Score as performance metrics to evaluate the model’s results. These metrics are defined as follows:

$$Precision \left(Pre\right)=\frac{TP}{TP+FP}.$$
(4)
$$Recall \left(Re\right)=\frac{TP}{TP+FN}.$$
(5)
$$AP={\int }_{0}^{1}P\left(R\right)dR.$$
(6)
$$F1\text{-}score=2\times \frac{Pre\times Re}{Pre+Re}.$$
(7)

The performance metrics used for evaluation are calculated based on true positive (TP), true negative (TN), false positive (FP) and false negative (FN). TP refers to cases where an object exists and the model correctly detects it. TN refers to cases where no object exists and the model correctly predicts its absence. FP refers to cases where the model mistakenly predicts the existence of an object that is not present. FN refers to cases where an object exists, but the model fails to detect it. Based on these factors, the performance metrics are defined as follows. Precision evaluates the proportion of true positives out of all the instances the model predicted as positive. Recall measures the proportion of actual positive samples that the model correctly predicted as positive. F1-Score is the harmonic mean of Precision and Recall, taking both into account. AP (Average Precision) is calculated as the area under the Precision-Recall curve, which evaluates how well the model maintains high Precision across various levels of Recall.
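Equations (4)-(7) translate directly into code. The sketch below approximates the AP integral of Eq. (6) with a step-wise sum over recall increments, which is a simplification of the interpolated PR-curve integration that detection toolkits typically apply:

```python
def precision(tp, fp):
    """Eq. (4): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (5): fraction of actual positives that were found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Eq. (7): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def average_precision(recalls, precisions):
    """Eq. (6): area under the precision-recall curve, approximated by
    summing precision over each recall increment (recalls ascending)."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

For instance, 8 true positives with 2 false positives and 2 false negatives gives Precision = Recall = 0.8 and hence F1 = 0.8.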

Classification stage

For classification experiments, we use F1-Score and Accuracy as performance metrics to evaluate the model. The performance metrics are represented by the following formulas.

$$Accuracy \left(Acc\right)=\frac{TP+TN}{TP+TN+FP+FN}.$$
(8)

F1-score is the same as described in Detection Stage and Accuracy is calculated based on TP, FP, FN and TN. Accuracy is a metric that indicates the percentage of the total sample that the model correctly predicts and is used to simply evaluate the overall performance of the model.

Experimental results

Hyper-parameters selection (detection stage)

Figure 12 shows the results of applying auto hyperparameter tuning to all YOLO models: YOLOv5, YOLOv8 and YOLOv9. A total of 380 hyper-parameter combinations were explored across all YOLO models. Table 2 summarizes the optimal hyper-parameter combination for the best detection model (i.e., YOLOv9c) compared with our proposed DCBS-YOLO model. Figure 13 visualizes the results of auto hyperparameter tuning for the proposed DCBS-YOLO model. A total of 83 hyper-parameter combinations were explored, of which the selected optimal combination is described in Table 2.
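The shape of such an automated tuning loop can be sketched as a simple random search. The search space, the trial count and the evaluate() callback below are placeholder assumptions for illustration, not the paper's actual MLOps pipeline:

```python
import math
import random

# Illustrative search space; the real space covers more YOLO hyper-parameters.
SPACE = {
    "lr": (1e-4, 1e-2),          # sampled log-uniformly
    "momentum": (0.8, 0.98),
    "batch_size": [8, 16, 32],
}

def sample(rng):
    """Draw one hyper-parameter combination from the space."""
    lo, hi = SPACE["lr"]
    return {
        "lr": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "momentum": rng.uniform(*SPACE["momentum"]),
        "batch_size": rng.choice(SPACE["batch_size"]),
    }

def tune(evaluate, n_trials=83, seed=0):
    """Keep the best (score, params) pair over n_trials random samples;
    evaluate() would train briefly and return validation mAP@0.5."""
    rng = random.Random(seed)
    best = (-1.0, None)
    for _ in range(n_trials):
        params = sample(rng)
        score = evaluate(params)
        if score > best[0]:
            best = (score, params)
    return best
```

Each explored combination corresponds to one line in the parallel-coordinate plots of Figs. 12 and 13, with the best-scoring line highlighted in yellow.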

Fig. 12

Visualization of the Auto-hyperparameter tuning results for all versions of YOLOv5, YOLOv8, and YOLOv9, where the yellow curve indicates the hyper-parameter combination that achieved the best performance.

Table 2 The selected hyper-parameters for the detection stage.
Fig. 13

Visualization of auto hyperparameter tuning results for proposed DCBS-YOLO, where the yellow line indicates the hyper-parameter combination that achieved the best performance.

Hyper-parameters selection (classification stage)

Figure 14 visualizes the results of applying the auto hyperparameter tuning methodology to all classification models: Resnet34, Resnet50, Resnet101, Resnet152, EfficientnetB0, EfficientnetB1, EfficientnetB2, EfficientnetB3, EfficientnetB4, EfficientnetB5, Densenet121, Resnext50_32 × 4d, Resnext101_32 × 8d, SE Resnet50 and Wide Resnet. A total of 317 hyper-parameter combinations were explored and tuned. Based on this process, the top three classifiers were selected to be EfficientnetB5, Resnet101 and Resnext101_32 × 8d. Table 3 shows the best hyper-parameter combinations for these top three models.

Fig. 14

Visualization of auto hyperparameter tuning results for all classification models used in the experiment, where the yellow line indicates the hyper-parameter combination that achieved the best performance.

Table 3 The selected hyper-parameter for the classification stage.

Detection stage: YOLO-based simultaneous detection and classification

We trained and evaluated the detection models used in our experiments, including all versions (n, s, m, l, x) of YOLOv5 and YOLOv8, and YOLOv9 (t, s, m, c, e). As shown in Table 4, YOLOv9c outperformed the other models with 75% mAP@0.5. After applying auto hyperparameter tuning, YOLOv9c achieved 75.8% mAP@0.5, a 0.8% improvement, ranking it as the best baseline detection model. Building on YOLOv9c, we developed and trained the proposed DCBS-YOLO model, which achieved a mAP@0.5 of 75.7%, a 0.7% improvement over the original YOLOv9c. After tuning its hyperparameters, DCBS-YOLO demonstrated significant performance gains across defect classes: 36.6% AP for Crazing, 81.1% for Inclusion, 96.3% for Patches, 95.5% for Pitted Surface, 63.7% for Rolled-in Scale and 88.3% for Scratches. This resulted in a mAP@0.5 of 76.9%, a 1.2% increase over the untuned DCBS-YOLO and a 1.1% gain over the hyperparameter-tuned YOLOv9c. Table 5 provides detailed performance comparisons and Fig. 15 visualizes metrics such as precision, recall, F1-score and AP for each class using radar graphs. These results demonstrate that the proposed auto hyperparameter tuning methodology and DCBS-YOLO significantly enhance defect detection performance.

Table 4 Detection evaluation results (%) of each model against the proposed DCBS-YOLO.
Table 5 Performance comparison of detection models based on auto hyperparameter Tuning.
Fig. 15

Comprehensive performance for all versions of YOLOv5, YOLOv8, YOLOv9 and the proposed DCBS-YOLO trained under identical conditions. The diamond-shaped red line represents the proposed DCBS-YOLO.

Detection stage: YOLO-based binary detection without classification

We trained and evaluated all YOLO models and the proposed methods with binary detection; the results are presented in Table 6. Note that the DCBS-YOLO and YOLOv9c models were trained using hyperparameters optimized through auto hyperparameter tuning, while the other detection models were trained with predefined hyperparameters. Overall, binary detection outperformed simultaneous detection and classification. YOLOv9c achieved 79.4% Precision, 67.9% Recall and 79.8% AP. Meanwhile, the proposed DCBS-YOLO model attained 75.7% Precision, 78.6% Recall and an AP improvement of 0.7%, reaching 80.5%. Further experiments with Fusion YOLO, combining the three models DCBS-YOLO, YOLOv9c and YOLOv8s, demonstrated the highest precision among the binary detection models, with 82.8% precision and 76.3% recall. Additionally, Fusion YOLO achieved the highest AP of 83.8%, an improvement of 3.3% over the proposed DCBS-YOLO. These results highlight that combining multiple optimized binary detection models enhances defect area detection performance compared to using a single detection model.

Table 6 Binary detection evaluation Results (%) of all individual models against the proposed DCBS-YOLO and the Fusion YOLO.

Classification stage: single classification scenario

We trained and evaluated all 15 classification models used in our experiments, analyzing their performance individually as shown in Table 7. The experimental results highlight three top-performing models: Resnet101, EfficientnetB5 and Resnext101_32 × 8d. Resnet101 achieved 98.6% Accuracy and an average F1-Score of 98%. EfficientnetB5 followed closely, achieving 98.6% Accuracy and 98.6% F1-Score (avg), while Resnext101_32 × 8d recorded 98.5% Accuracy and an F1-Score (avg) of 98.5%. These three models achieved the highest average F1-Scores among all the classifiers. Additionally, training EfficientnetB5, Resnet101 and Resnext101_32 × 8d with hyperparameter configurations optimized using auto hyperparameter tuning resulted in slight performance improvements compared to their untuned counterparts. Detailed performance metrics for these tuned models are provided in Table 8.

Table 7 Classification evaluation results (%) of each classifier against the proposed Ensemble CNN + ViT.
Table 8 Performance evaluation results (%) of the top three classification models based on the auto hyperparameter tuning.

Classification stage: hybrid ensemble classification scenario (CNN + ViT)

As demonstrated by the experimental results in MLOPS for Classification Stage, the proposed Ensemble CNN + ViT model, which integrates feature-based ensembles of the three hyperparameter-tuned models and incorporates ViT, outperforms EfficientnetB5, Resnet101 and Resnext101_32 × 8d in terms of Accuracy and F1-Score. Additionally, it surpasses other classification models in F1-Score across all classes, except for the Rolled-in Scale class, highlighting the effectiveness of our approach for defect classification. Figure 16 illustrates the normalized confusion matrix for the hyperparameter-tuned EfficientnetB5, Resnet101, Resnext101_32 × 8d and the proposed Ensemble CNN + ViT. Notably, the Ensemble CNN + ViT model achieved perfect classification for defects in the Crazing, Patches and Scratches classes, demonstrating that combining ensemble methods with ViT enhances classification performance compared to relying on a single model.

Fig. 16

Confusion matrices for the best three classifiers (i.e., EfficientnetB5, Resnet101, Resnext101_32 × 8d) against the proposed Ensemble CNN + ViT trained with optimal hyper-parameter combinations.

The proposed CAD framework result and comparison: detection and classification

The experimental results demonstrate that the proposed Fusion YOLO achieves high effectiveness in the detection stage, while the Ensemble CNN + ViT delivers superior performance in classification. Leveraging these strengths, we integrated the proposed models from each stage into a unified hybrid CAD Framework. Table 9 compares the performance of our CAD Framework with the auto-tuned DCBS-YOLO, YOLOv9c (from Table 5) and other high-performing detection models, including YOLOv9s, YOLOv9m, YOLOv5m and YOLOv5x, as detailed in Table 4. The proposed CAD Framework achieves AP@0.5 values of 44.8%, 82.1%, 99%, 96.1%, 75.3% and 96.4% for the classes Crazing, Inclusion, Patches, Pitted Surface, Rolled-in Scale and Scratches, respectively. These results surpass the detection performance of all models evaluated in the experiments. Furthermore, the CAD framework achieves an 82.3% mAP, which represents a 5.4% improvement over the hyperparameter-tuned DCBS-YOLO, demonstrating its overall effectiveness in defect detection and classification.

Table 9 Performance comparison between the best-performing detection models and the proposed CAD Framework.

Performance comparison of the preprocessing module

Table 10 summarizes the performance comparison between the proposed models and the baseline methods conducted to assess the effectiveness of the preprocessing module. When the preprocessing module is employed, all detection models exhibit consistent improvements in mAP@0.5 at the Detection and Classification stage, and YOLOv9c, DCBS-YOLO, and Fusion YOLO also achieve higher AP@0.5 in the Binary Detection stage. At the Classification stage, the F1-score of each individual CNN increases slightly, while the proposed Ensemble CNN + ViT attains the highest F1-score when the preprocessing module is applied. These results suggest that the preprocessing module enhances the visibility of defect regions and suppresses background interference, thereby leading to a stable improvement in detection and classification performance across the proposed CAD framework.

Table 10 Impact of the preprocessing step using the NEU-DET dataset.

Performance comparison with recent research

This section compares the performance evaluation results of the proposed CAD framework and recent research based on YOLO in the field of steel surface defect detection. Table 11 shows the methodology of the recent research and the performance of mAP@0.5 evaluated based on the proposed CAD framework and the NEU-DET dataset. The proposed CAD framework outperforms the latest studies and incorporates various methodologies. These results confirm that the proposed CAD framework provides competitive performance compared to recent research.

Table 11 Performance comparison with recent research works using the NEU-DET dataset.

Discussion

Detection stage

The evaluation of the proposed DCBS-YOLO, a single detection model, demonstrated its enhanced performance in defect area detection through binary learning experiments. All detection models in the study were trained for binary detection, with the best-performing models combined to construct and assess Fusion YOLO. The proposed DCBS-YOLO achieved a 75.7% mAP, outperforming all other models in the experiment, and further improved to 76.9% mAP (a 1.2% increase) through auto hyperparameter tuning. These results validate the effectiveness of our methodology for defect detection. The hyper-parameter-tuned DCBS-YOLO (binary detection) showed superior defect area detection compared to other models, while Fusion YOLO, incorporating the top-performing models, achieved a 3.3% AP improvement over the tuned DCBS-YOLO (binary detection). This highlights the advantage of combining multiple high-performing models to boost detection accuracy. Despite the notable performance improvements of DCBS-YOLO, certain defects, such as Crazing and Rolled-in Scale, showed AP values below 70%. Examination of these cases revealed challenges, including defects that closely resemble the background, blurred boundaries that mismatch the ground truth box and the presence of multiple defect areas in a single image leading to incomplete or merged detections. These limitations, commonly encountered in industrial defect detection, underscore the difficulty of achieving accurate detection with a single model. This reinforces the significance of the proposed Fusion YOLO, which combines multiple models to address these challenges and improve overall performance.

Classification stage

We evaluated the defect classification performance of individual CNNs used in the experiment and subsequently assessed the proposed Ensemble CNN + ViT model. Pre-processing with the DRP algorithm, designed to preserve the integrity of the original image, enabled most single CNNs to achieve high defect classification accuracy. However, confusion matrix results revealed misclassifications among certain models, such as EfficientnetB5, Resnext101_32 × 8d and Resnet101. These models occasionally confused patches with inclusions and curved surfaces with rolled-in scales. In contrast, the proposed Ensemble CNN + ViT demonstrated superior classification performance compared to individual models. It effectively reduced the misclassification, particularly mitigating confusion between rolled-in scales and pitted surfaces, while maintaining consistent accuracy across other defect classes. This highlights the robustness and stability of the Ensemble CNN + ViT approach for defect classification.

Detection ablation study

An ablation study was conducted on the proposed DCBS-YOLO model to evaluate how individual components and their combinations contribute to defect detection performance. The study examined the effects of integrating the DCNv3, SimAM and BSM modules into YOLOv9c, the baseline model for DCBS-YOLO. Table 12 summarizes the results, with YOLOv9c trained using the DCBS-YOLO hyper-parameter configuration. The baseline YOLOv9c achieved 74.9% mAP, slightly below its original performance. However, incorporating DCNv3 improved the mAP by 1.6%, reaching 76.5%, highlighting DCNv3 as a significant contributor to performance enhancement. SimAM and BSM yielded smaller gains, with improvements of 0.1% and 0.4%, respectively, indicating a modest impact for SimAM and a relatively higher influence for BSM. Combinations of these modules were also evaluated. The DCNv3 & SimAM and DCNv3 & BSM pairings achieved mAPs of 76.5% and 76.7%, respectively, while SimAM & BSM resulted in a 75.5% mAP. Applying all three modules together resulted in the highest mAP of 76.9%, confirming that the integration of DCNv3, SimAM and BSM is effective in enhancing the performance of the DCBS-YOLO model. This analysis demonstrates the synergistic impact of these components on overall defect detection accuracy.

Table 12 Ablation study results of proposed DCBS-YOLO.

Classification ablation study

We conducted an ablation study to assess the impact of the EfficientNetB5, ResNet101 and ResNeXt101_32 × 8d models on the Ensemble CNN and evaluated the effectiveness of integrating the Ensemble CNN with ViT. Table 13 presents the results: EfficientNetB5, ResNet101 and ResNeXt101_32 × 8d, trained using the tuned hyper-parameter combinations, achieved average F1-Scores of 98.8%, 98.7% and 98.7%, respectively. Ensembling pairs of models (i.e., ResNeXt101_32 × 8d & EfficientNetB5 and ResNeXt101_32 × 8d & ResNet101) yielded an average F1-Score of 98.8%, showing no significant improvement. However, the combination of EfficientNetB5 & ResNet101 achieved a higher F1-Score of 99.1%. This analysis indicates that ResNeXt101_32 × 8d has the lowest impact on performance improvement, while EfficientNetB5 and ResNet101 contribute more significantly. An Ensemble CNN combining all three models achieved a 99.2% F1-Score, outperforming the individual models. This demonstrates that ensembling high-performing CNN models enhances classification performance compared to using a single model. Building on this result, we integrated the Ensemble CNN with ViT, achieving an F1-Score of 99.7%, a 0.5% improvement. These findings confirm the efficacy of combining the Ensemble CNN with ViT for superior defect classification performance.

Table 13 Ablation study results of proposed Ensemble classifier.

CAD framework ablation study

We demonstrated that ensembling high-performing AI models enhances performance. Building on this, we conducted an ablation study to evaluate the performance impact of the proposed Fusion YOLO when integrated into the proposed CAD framework, with the results summarized in Table 14. The evaluation criterion was the final defect detection performance, measured by inputting the detected defect areas from the proposed Fusion YOLO into the proposed Ensemble CNN + ViT. Individually, YOLOv8s, YOLOv9C and the Proposed DCBS-YOLO achieved mAPs of 77.3%, 79.1% and 80.4%, respectively, with the proposed DCBS-YOLO having the highest impact on improving the CAD framework’s performance. Combining models yielded further improvements, as YOLOv8s & YOLOv9C achieved 80.7% mAP, YOLOv8s & Proposed DCBS-YOLO achieved 81.9% mAP and YOLOv9C & Proposed DCBS-YOLO achieved 82.0% mAP. When all three models (YOLOv8s, YOLOv9C and Proposed DCBS-YOLO) were combined, the mAP reached 82.3%, demonstrating superior performance compared to any single model. These results confirm that the Proposed CAD Framework significantly benefits from ensembling multiple models, achieving enhanced defect detection and classification capabilities.

Table 14 Ablation study evaluation results of the proposed CAD Framework.

Computational efficiency and inference speed

Table 15 presents a comparative analysis of the computational cost and inference speed between the baseline model (YOLOv9c) and the proposed CAD Framework. Due to the complex nature of the 2-stage architecture designed for sophisticated defect detection and classification, the proposed CAD Framework requires higher computational resources, exhibiting 313.32 GFLOPs and an inference latency of 34.01 ms, which results in a throughput of 29.4 FPS. In contrast, the baseline YOLOv9c operates at 88.9 FPS with 99.8 GFLOPs. This increase in computational cost is directly attributed to the architectural design aimed at precise defect identification. Crucially, the proposed Framework yields a significant improvement in detection performance. It achieves an mAP@0.5 of 82.3%, outperforming the YOLOv9c (75.8%) by a substantial margin of 6.5%. In the context of steel surface defect detection, while achieving real-time processing (30 FPS) is desirable, minimizing missed detections (securing high mAP) is also paramount. The current throughput of 29.4 FPS is marginally below the 30 FPS real-time threshold, indicating that the framework is on the cusp of real-time capability. Therefore, although inference speed was compromised to secure a higher mAP in this study, there remains sufficient potential for enhancing speed and achieving full real-time operation through future architectural optimization and infrastructure adjustments.
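The latency and throughput figures quoted above are direct reciprocals of one another; a one-line helper makes the conversion explicit (the values below are taken from Table 15, the helper itself is only illustrative):

```python
def fps(latency_ms):
    """Convert per-image inference latency in milliseconds to throughput
    in frames per second (FPS = 1000 / latency)."""
    return 1000.0 / latency_ms

# 34.01 ms per image for the CAD Framework corresponds to roughly 29.4 FPS,
# just under the 30 FPS real-time threshold discussed in the text.
```

Conversely, the baseline's 88.9 FPS implies a per-image latency of about 11.25 ms.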

Table 15 Comparison of computational cost between YOLOv9c and the proposed CAD Framework.

Analysis of missed defects in detection: an XAI approach

To further detect and analyze defect images that Fusion YOLO fails to detect, the proposed CAD Framework incorporates Grad-CAM to generate heat maps, allowing users to visualize the areas that the AI model focuses on. Experimental results indicate that the Crazing class exhibits a notably lower detection rate than other defect classes, which negatively impacts overall defect detection performance. In a quantitative analysis of 10 undetected images (containing 17 ground truth defects), only 3 defects were successfully localized at an IoU threshold of 0.45. This result is attributed to the tendency of Grad-CAM heatmaps to encompass multiple defects or cover broader regions around small defects, which lowers the IoU metric. However, they remain qualitatively effective for helping users identify defect locations. Figure 17 illustrates the heat maps for test images where the proposed CAD Framework did not detect any defects. In Fig. 17a, the Grad-CAM result accurately highlights the ground truth region; in Fig. 17b, it covers most of the ground truth region; and in Fig. 17c, it highlights the right break correctly, although with some error, while weakly covering the bottom region of the ground truth. This XAI feature offers additional insight into defect areas missed during the Detection Stage, providing valuable information and helping users or domain experts unfamiliar with defect detection to analyze results more efficiently.

Fig. 17

(a) and (b) represent the crazing class, and (c) represents the inclusion class. The green bounding boxes drawn in the missing detection at the top indicate the ground truth, and the bottom images show heat maps visualized through Grad-CAM.

Analysis for Fusion YOLO

We provide a detailed analysis of the behavior of the proposed Fusion YOLO. The primary goal of Fusion YOLO is to enhance defect detection by combining the outputs of multiple models, thus enabling the detection of defect areas missed by any single model. However, an inherent challenge arises: while ideally only correctly detected defect areas would be fused, false positives can also be combined. To address this, we leverage the fact that false positives typically have low box scores. By applying the WBF algorithm, we fuse overlapping false positive bounding boxes into one, so that the low scores associated with these false positives are combined, which results in a lower overall score. During this process, boxes with low scores are removed using a box score threshold, effectively eliminating false positives. Figures 18 and 19 visualize the detection results of each model used in Fusion YOLO, along with the final output of the proposed CAD Framework. For the Crazing class, YOLO1 successfully detects one defect, YOLO2 detects nothing and YOLO3 detects a different ground truth. Fusion YOLO’s WBF algorithm combines these results, allowing each model to contribute defect areas that others missed. In the Inclusion class, YOLO1 detects all the ground truth regions correctly, while YOLO2 makes a wrong detection and YOLO3 detects two defects as one. False positives generally have lower scores and the WBF algorithm adjusts the bounding boxes by weighing the detection results of each model according to their scores, reducing the influence of low-scoring detections. For regions that are incorrectly detected and have no valid matches, the box score threshold removes low-scoring detections, minimizing the inclusion of false positives. In the Patches class, YOLO1 makes correct detections, while YOLO2 and YOLO3 produce false positives, which are combined and removed based on their low scores.
This approach is applied consistently across the Pitted Surface, Rolled-in Scale and Scratches classes, ensuring the minimization of false positives. Overall, the proposed CAD Framework enhances defect detection by reliably combining the results of effective models while minimizing false positives.
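The fusion behaviour described above can be sketched as a simplified variant of weighted boxes fusion: boxes from all models are greedily clustered by IoU, each cluster's coordinates are averaged with confidence weights, and clusters whose mean score falls below the box-score threshold are discarded. This is a pedagogical reduction under those assumptions, not the exact WBF implementation used in the paper.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def fuse(detections, iou_thr=0.3, score_thr=0.3):
    """detections: list of ((x1, y1, x2, y2), score) pairs pooled from all
    models. Returns fused boxes; agreeing detections reinforce each other,
    while isolated low-score false positives fall below score_thr."""
    clusters = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        for cluster in clusters:  # greedy IoU matching to a cluster seed
            if iou(box, cluster[0][0]) >= iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    fused = []
    for cluster in clusters:
        total = sum(s for _, s in cluster)
        # confidence-weighted average of each coordinate
        coords = [sum(b[i] * s for b, s in cluster) / total for i in range(4)]
        mean_score = total / len(cluster)
        if mean_score >= score_thr:  # box-score threshold removes weak fusions
            fused.append((coords, mean_score))
    return fused
```

With this scheme, two models agreeing on a defect yield one confident fused box, while a lone 0.1-score false positive from a single model is filtered out.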

Fig. 18

Visualization compares the detection results of DCBS-YOLO, YOLOv9c and YOLOv8s incorporated into proposed Fusion YOLO with the results of the proposed CAD Framework, which integrates all the methodologies we suggest. Correct detection refers to accurately identifying the ground truth, missed detection indicates failing to detect the ground truth, and wrong detection represents detecting an incorrect region or failing to meet the IoU threshold.

Fig. 19

This analysis follows the same approach as Fig. 18, but this figure focuses on comparisons for the pitted surface, rolled-in scale, and scratches classes.

The effectiveness of the WBF thresholds of IoU and box confidence score

To examine the influence of the IoU and box-confidence thresholds on CAD performance, we tested a range of threshold values from 0.2 to 0.8. The resulting mAP@0.5 scores are summarized in Fig. 20. Our findings show that the proposed framework is far more sensitive to the box-confidence threshold than to the IoU setting. The best performance is achieved when the box-confidence threshold is set between 0.2 and 0.4, reaching a peak mAP of 82.30% with an IoU of 0.3.
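A threshold sweep of this kind reduces to a small grid search; the sketch below mirrors the 0.2 to 0.8 range used here, with evaluate() standing in (as a placeholder assumption) for running the full CAD pipeline and measuring mAP@0.5:

```python
def sweep(evaluate, lo=0.2, hi=0.8, step=0.1):
    """Grid-search the WBF IoU and box-confidence thresholds and return
    the (score, iou_thr, conf_thr) triple with the highest mAP@0.5."""
    grid, t = [], lo
    while t <= hi + 1e-9:          # build [0.2, 0.3, ..., 0.8]
        grid.append(round(t, 2))
        t += step
    best = (-float("inf"), None, None)
    for iou_thr in grid:
        for conf_thr in grid:
            m = evaluate(iou_thr, conf_thr)
            if m > best[0]:
                best = (m, iou_thr, conf_thr)
    return best
```

Each (iou_thr, conf_thr) cell of this grid corresponds to one cell of the heatmap in Fig. 20.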

Fig. 20

Heatmap of mAP@0.5 of the CAD framework according to the IoU and box confidence score thresholds.

Generalization and robustness evaluation using new GC10-DET dataset

To verify the generalization capability and robustness of the proposed CAD framework, we additionally trained and evaluated it on the GC10-DET dataset. The experimental configuration followed the settings derived from the NEU-DET dataset experiments. As summarized in Table 16, the proposed CAD Framework achieves an mAP@0.5 of 71.5%, which is 4.7 percentage points higher than YOLOv9c (66.8%). This performance gain validates the effectiveness of the proposed 2-Stage structure in handling diverse defect types.

Table 16 Performance comparison between the detection models and the proposed CAD Framework.

More specifically, as shown in Table 17, in the binary detection stage the proposed Fusion YOLO achieves an AP@0.5 of 76.6%, outperforming YOLOv9c (73.9%) and DCBS-YOLO (74.8%). This result indicates that the proposed Fusion YOLO is effective at localizing defect regions without relying on explicit class differentiation.

Table 17 GC10-DET Preprocessing / Binary detection evaluation Results (%) of all individual models against the proposed DCBS-YOLO and the Fusion YOLO.

The results for the classification stage are reported in Table 18. The proposed Ensemble CNN + ViT model attains an average F1-score of 94.8% and an accuracy of 97.5%. By combining the local feature extraction capability of CNNs with the global context modeling ability of ViT, the proposed classifier achieves more accurate defect-type classification than single-backbone models such as ResNeXt101 (93.1%) and EfficientNetB5 (92.4%).

Table 18 GC10-DET Classification evaluation results (%) of each classifier against the proposed Ensemble CNN + ViT.

Table 19 presents a performance comparison between the proposed model and the baseline models with and without the preprocessing module. When the preprocessing module is applied, the CAD Framework achieves an mAP@0.5 of 75.8%, corresponding to a 2.8 percentage point improvement over the version without preprocessing. Furthermore, all tasks exhibit consistent and meaningful performance gains when the preprocessing module is used. These results demonstrate that the preprocessing module is effective for both defect localization and defect-type classification, and confirm that the proposed CAD Framework is not restricted to a single dataset but remains robust when applied to other datasets such as GC10-DET.

Table 19 Impact of the preprocessing step using the GC10-DET dataset.

Limitations and future work

Despite the promising performance and the comprehensive integration of multiple functions within the proposed CAD Framework, there are several limitations that need to be addressed to ensure its full applicability in real-world manufacturing settings. First, while the framework achieved high accuracy in defect detection and classification, the computational complexity resulted in an inference speed that is marginally below the ideal real-time threshold (e.g., 30 FPS). However, this performance level suggests that the framework is on the verge of real-time capability. Therefore, this slight latency can be effectively resolved in future research through further model optimization and by configuring an efficient inference infrastructure optimized for the manufacturing environment, ensuring robust real-time processing speeds. Second, the current evaluation was conducted using two publicly available open datasets. While these datasets are valuable for benchmarking, they may not fully capture the noise and variability characteristic of actual production environments. Future work must involve validating the framework on real-world industrial datasets to ensure its robustness and practical reliability in diverse operational conditions.

Conclusion

In this study, we proposed an explainable Hybrid AI CAD framework that decouples detection and classification tasks to maximize the accuracy and reliability of steel surface defect detection. The framework enhances defect visibility through a preprocessing module and secures optimal performance via MLOps-based auto hyperparameter Tuning. In the detection stage, we introduced DCBS-YOLO, incorporating DCNv3 and SimAM to address irregular defect shapes, and constructed Fusion YOLO to overcome single-model limitations, significantly improving binary defect localization. For classification, the Ensemble CNN and Vision Transformer (ViT) model was employed to jointly learn local features and global contexts, effectively reducing misclassification among visually similar defects. Experimental results demonstrated that the proposed framework achieved a mAP@0.5 of 82.3% and a classification F1-score of 99.7% on the NEU-DET dataset, outperforming existing state-of-the-art models. Furthermore, validation on the GC10-DET dataset yielded a mAP of 71.5% and an F1-score of 94.8%, proving the framework’s strong generalization capability and robustness across diverse manufacturing environments. Consequently, by integrating high-performance detection capabilities with Grad-CAM-based explainability, this study presents a practical industrial solution that enables non-experts to diagnose defects accurately. Future work will focus on configuring an efficient inference infrastructure and optimizing the model to ensure real-time applicability in high-speed production lines.