Introduction

Steel plates are an important raw material for manufacturing-oriented industries and are used across a wide range of sectors, including shipbuilding, automotive, semiconductor, aerospace, robotics and building materials. Because these industries demand high-quality steel plates, it is critical to accurately detect defects on the plate surface1. During steel sheet production, defects such as crazing, inclusions, patches, pitted surface, rolled-in scale and scratches arise from factors such as material quality, the manufacturing process and production equipment2,3,4. These defects degrade the quality of the steel plate, making surface defect detection all the more important. To improve steel plate quality and increase production efficiency, industry has traditionally relied on manual visual inspection and statistical modelling methods to detect surface defects. Manual visual inspection is highly subjective and cannot detect defects in real time. Statistical modelling, a machine vision-based approach that replaces manual inspection with industrial cameras, addresses subjective defect judgment and enables real-time detection. However, it requires manually designed algorithms to extract defect features, and a new algorithm must be designed for every slight change. In recent years, deep learning has developed rapidly, and CNN-based object detection has been adopted across industries5. In particular, many studies in the field of defect detection have achieved results well beyond traditional methods. 
Among these CNN-based defect detection models, YOLO is by far the most utilized in the field of defect detection, and it occupies a leading position by improving its technology through continuous updates such as YOLOv5, YOLOv8 and YOLOv96. YOLO7 is a one-stage network that performs class prediction and bounding box coordinate regression on feature maps to reduce network size and balance accuracy and inference speed. Although various studies utilizing YOLO-based object detection models have laid the foundation for defect detection, they still have the following limitations. First, they focus on improving the performance of a single YOLO model, which limits the learning of defect features from different perspectives. Second, the similarity between defects and the background, as well as small defect sizes, hinders defect detection. Third, they do not effectively address fundamental issues such as the irregular shape and size of defects and the similarity between different classes8. Furthermore, one-stage detectors like YOLO have fundamental limitations when applied to complex industrial defects. While efficient, these models force a trade-off by simultaneously optimizing two conflicting objectives: localization (focusing on object boundaries) and classification (focusing on fine-grained textures). In steel surface inspection, where defects of distinct categories often share similar visual patterns, this coupled approach can lead to suboptimal performance in both tasks. To overcome this, we adopt a decoupled two-stage architecture. By dedicating a binary Fusion YOLO solely to precise localization and a separate Ensemble CNN + ViT model to classification, our framework minimizes the interference between these tasks. This design prioritizes the reduction of false negatives (missed defects) over the simplicity of a single-stage architecture. 
Implementing this approach, we propose an explainable Hybrid AI CAD framework specifically designed to target the intrinsic characteristics of steel surface defects. Within this framework, we introduce DCBS-YOLO, an enhanced detection architecture. By integrating SimAM, a parameter-free attention module, we refine feature representations without adding computational complexity, while DCNv3 enables dynamic adjustment of the receptive fields for irregular defect shapes such as crazing. In addition, a Background Suppression Module (BSM) is incorporated to emphasize defects under low-contrast steel surface backgrounds. Through our experiments, we observed that a binary YOLO detector trained to localize defects without class differentiation detects a larger number of defect regions than a multi-class detector, and that combining multiple binary YOLO models enables more precise localization of defect areas. To further maximize the synergy among multiple detectors, we employ Weighted Boxes Fusion (WBF). Unlike traditional Non-Maximum Suppression (NMS), which simply discards overlapping predictions, WBF aggregates them into a single, more reliable bounding box, thereby providing more robust defect localization. Finally, accurate defect identification is crucial, particularly when visually similar defect types are involved. Conventional ensemble strategies such as soft or hard voting rely only on aggregating final prediction scores, and thus cannot fully exploit the complementary intermediate representations learned by different models. Therefore, we adopt a hybrid ensemble of CNNs and a Vision Transformer (ViT) that jointly learns locally focused texture descriptors from CNN backbones and globally aware self-attention–based representations from ViT. This hybrid model effectively captures fine-grained textures and long-range spatial dependencies, reducing misclassification among visually similar defect categories. 
The entire framework is optimized through an MLOps pipeline and provides explainability via Grad-CAM–based importance maps. Figure 1 illustrates the industrial steel surface inspection workflow and highlights the need for a CAD framework that integrates preprocessing, detection, classification, and analysis.

Fig. 1

Industrial inspection of steel surface defects.

The main contributions of this research are as follows:

(1) A novel explainable hybrid CAD framework is proposed for simultaneous steel surface defect detection and classification, improving prediction accuracy and reducing missed detections.

(2) DCBS-YOLO is presented as a defect-specific detector integrating DCNv3, SimAM and BSM to adaptively capture irregular defect shapes and suppress low-contrast background interference on steel surfaces.

(3) A YOLO-based late fusion strategy is applied to multiple class-agnostic detectors via weighted boxes fusion (WBF) to improve the robustness of defect region detection and reduce missed detections.

(4) A hybrid classifier, an ensemble of CNNs and a ViT, is proposed for feature-level fusion, leveraging the CNNs' local texture sensitivity and the ViT's global context modeling to mitigate misclassification among visually similar defect types.

(5) The proposed preprocessing module enhances defect visibility and contrast while preserving overall image quality, leading to consistent performance gains in both detection and classification stages.

The overall organization of this study is as follows. First, the related works are briefly reviewed, followed by a detailed explanation of the proposed methodology. Next, the experimental results are presented, and the findings are discussed along with potential directions for future research. The paper concludes with a summary of the overall contributions and closing remarks.

Related works

YOLO-based methods

For steel surface defect detection in the manufacturing industry, YOLO is the most popular detection model. Based on YOLOv5, Zhao et al.9 replaced the backbone component with the Res2Net block to expand the receptive field and extract features at different scales, learning defects of different shapes. Lu et al.10 designed a C2f-DSC module with dynamic snake convolution based on YOLOv8 to adaptively adjust the receptive field and improve detection of differently shaped defects. Huang et al.11 integrated a channel attention module into the backbone convolutional network of YOLOv5s to improve feature extraction for small defects and their detection, and fused the Swin Transformer module into the neck to detect defects of different sizes and multiple scales. Xie et al.12 designed LighterMSMC, a lightweight multiscale feature extraction module based on YOLOv8, which lightens the backbone network while effectively preserving long-range feature dependencies. Liu et al.13 integrated the CoTNet Transformer module into the backbone feature extraction of YOLOv5s and applied the adaptive spatial feature fusion (ASFF) algorithm in the prediction stage to improve accuracy. Li et al.14 applied a gradient-enriched CSPCrossLayer module and Shuffle Attention to YOLOX to mitigate the performance degradation caused by the similarity between defects and the background. Guo et al.15 improved defect detection performance by adding a Transformer-based TRANS module to the YOLOv5 backbone and detection head, combining features with global information. Zhang et al.16 applied the DsPAN module with an attention mechanism to YOLOv8 to improve the detection of small-sized defects. 
Li et al.17 proposed a More Efficient Channel Attention (MECA) module that simultaneously performs maximum pooling and average pooling operations and combines the results to learn the relationship between channels. With this module, the defect detection performance is improved by combining the details of the defects with contextual features. Su et al.18 proposed DAF-CA, a plug-and-play coordinate attention that combines average pooling and maximum pooling based on YOLOv3. It can effectively extract and emphasize the features of defects. Although various studies have been conducted on steel surface defect detection using YOLO, most of the recent works try to improve the defect detection performance of a single model by changing the backbone network of YOLO or adding new attention modules. In this work, we show that in addition to improving performance with a single model, better defect detection performance can be achieved by combining multiple optimized models.

Deep ensemble classification methods

Deep ensemble classification performs very well in cancer and disease classification in the medical field, and many studies have proven its usefulness. In the medical domain, distinguishing between normal and abnormal tissue is a very difficult task, and it resembles defect detection in that the targets lack a fixed shape. Defects are likewise hard to classify because they are difficult to distinguish from the background and have no fixed shape. We therefore review previous work in the medical field and apply these methodologies to steel plate surface defect detection. Qureshi et al.19 proposed an Ensemble CNN architecture that ensembles multiple CNN models and combines auxiliary data, in the form of metadata associated with input images, through a meta-learner to improve skin cancer prediction performance. Nakata et al.20 improved the performance of liver mass classification in ultrasound images with ensemble techniques such as soft voting, weighted-average voting, weighted hard voting and stacking of multiple CNNs. Patil et al.21 improved final classification performance by ensemble learning on the extracted features of two CNN models, a shallow CNN and VGG16, to classify three types of tumors. Zheng et al.22 achieved high accuracy by ensembling the four best-performing CNN models for binary classification in breast cancer diagnosis, combining the predictions of multiple models by converting each model's accuracy into weights. Ukwuoma et al.23 designed an ensemble CNN network based on feature fusion, combined with a transformer encoder, for accurate pneumonia identification from chest X-ray images. Loddo et al.24 applied an ensemble bagging methodology to three models, AlexNet, ResNet101 and Inception-ResNet-v2, for binary classification in Alzheimer's disease diagnosis, improving classification performance. 
Kang et al.25 combined features from three CNN models, DenseNet121, ResNeXt101 and MnasNet, for brain tumor classification and then classified them using SVM, a machine learning method. The fusion of these feature ensembles with machine learning techniques improved the performance of brain tumor classification. Aurna et al.26 improved brain tumor classification by extracting features from the three best-performing CNN models among several candidates, selecting the important features through PCA and feeding them to the classifier.

Methodology

The proposed explainable hybrid AI CAD framework

We introduce an explainable hybrid AI CAD framework that encompasses several stages, including the preprocessing of surface defect images, auto hyperparameter tuning, defect region detection, classification and Grad-CAM technology. Figure 2 illustrates the entire process of the proposed CAD framework, which begins with the preprocessing of input images, followed by defect region detection using Fusion YOLO and defect classification via an Ensemble CNN and ViT based on self-attention. Additionally, Explainable AI (XAI) techniques are employed to further detect and analyze defects missed by Fusion YOLO. The proposed CAD framework is designed to handle the entire process of defect detection and analysis, from image preprocessing to detection, classification and the identification of potentially missed defects. Specifically, the framework takes an RGB image of 640 × 640 pixels as input and utilizes the preprocessing module to enhance the visibility of defects, thereby facilitating more accurate defect detection. The preprocessed image is then passed into Fusion YOLO for defect region detection, where the detected regions are combined using the Weighted Boxes Fusion (WBF) algorithm27. If defect regions are successfully detected, the image of these regions is forwarded to Stage 1 of the classification stage, where the defect region processing (DRP) algorithm is used to crop and resize the detected defect areas. The processed defect regions are then input to the proposed Ensemble CNN + ViT for defect classification. The output of Stage 1 includes the bounding box coordinates of the defect region, the defect type, a confidence score and a heat map generated using Grad-CAM27. In cases where defect region detection fails, the entire image is input into Stage 2, where the proposed Ensemble CNN + ViT is used to classify the defect type, produce the heat map via Grad-CAM and generate bounding box coordinates based on the heat map. 
This two-stage approach ensures that the proposed CAD framework not only performs defect detection and classification but also improves robustness by detecting and analyzing any undetected defect areas through Stage 2.

Fig. 2

The proposed CAD framework: End-to-end abstract overview.

Dataset

For the experimental validation of this study, the NEU-DET and GC10-DET datasets were used to train and evaluate all models involved in the experiments. The NEU-DET dataset, collected by Northeastern University, is specifically designed for detecting defects in steel strips during the manufacturing process28. It contains six common surface defect types: Crazing (Cr), Inclusion (In), Patches (Pa), Pitted Surface (Ps), Rolled-in Scale (Rs) and Scratches (Sc). Figure 3 presents a visualization of defect images from the NEU-DET dataset. The dataset consists of 1,800 grayscale images of 640 × 640 pixels, with 300 images per defect type. For this study, the dataset was split into 70% training, 10% validation and 20% testing sets, resulting in 1,260 images for training, 180 for validation and 360 for testing.

Fig. 3

Samples of NEU-DET dataset defects: (a) Crazing, (b) Inclusion, (c) Patches, (d) Pitted Surface, (e) Rolled-in Scale, (f) Scratches. The red boxes represent the ground-truth (GT) area of the defect.

The GC10-DET dataset29 consists of ten surface defect types encountered during the manufacturing process: Crease (Cr), Crescent_gap (Cg), Inclusion (In), Oil_spot (Os), Punching_hole (Pu), Rolled_pit (Rp), Silk_spot (Ss), Waist-folding (Wf), Water_spot (Ws) and Welding_line (Wl). Figure 4 presents a visualization of defect images from the GC10-DET dataset. The dataset consists of 2,280 grayscale images of 2048 × 1000 pixels. For the verification of the proposed CAD framework, the dataset was split into 70% training, 10% validation and 20% testing sets, resulting in 1,594 images for training, 230 for validation and 456 for testing.

Fig. 4

Samples of GC10-DET dataset defects: (a) Crease, (b) Crescent_gap, (c) Inclusion, (d) Oil_spot, (e) Punching_hole, (f) Rolled_pit, (g) Silk_spot, (h) Waist_folding, (i) Water_spot, (j) Welding_line. The red boxes represent the ground-truth (GT) area of the defect.

Preprocessing

Real-world steel plate surface defect images, like those in the NEU-DET dataset, often contain diverse defect shapes and subtle features that are hard to distinguish from the background, leading to poor detection performance. To address this, we propose a novel preprocessing module that enhances defect visibility without significantly altering the original image as shown in Fig. 5. The process begins by converting the RGB image to LAB color space, isolating the L channel for brightness and removing the A and B channels. The L channel is then processed using a Gaussian Filter to reduce noise, followed by a Median Filter to eliminate extreme noise while preserving boundaries. Next, Contrast Limited Adaptive Histogram Equalization (CLAHE) and Gamma Correction are applied to improve contrast and brightness, making defects more visible. A sharpening filter is then used to accentuate defect boundaries and morphological operations further clarify the defect’s structure. Finally, the processed L channel is blended with the original to create a final RGB image. This method emphasizes defects without distorting key features. Figure 6 compares the original and preprocessed images, showing improvements in defect visibility while preserving image quality, as confirmed by SSIM, PSNR and SNR metrics as shown in Table 1.
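As a rough illustration of the contrast and sharpening steps, the following NumPy sketch applies gamma correction, a box-blur-based unsharp mask and blending to an L channel. The function names and parameter values are our own; the full module additionally uses Gaussian/median filtering, CLAHE and morphological operations (typically via OpenCV), which are omitted here.

```python
import numpy as np

def gamma_correction(l, gamma=0.8):
    """Adjust brightness of the L channel; values assumed in [0, 255]."""
    norm = l / 255.0
    return np.clip(255.0 * norm ** gamma, 0, 255)

def unsharp_mask(l, amount=1.0):
    """Sharpen by adding back the difference from a 3x3 box blur (edge-padded)."""
    padded = np.pad(l, 1, mode="edge")
    blur = sum(padded[i:i + l.shape[0], j:j + l.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    return np.clip(l + amount * (l - blur), 0, 255)

def enhance_l_channel(l, gamma=0.8, amount=1.0, blend=0.6):
    """Gamma-correct, sharpen, then blend the result with the original channel."""
    processed = unsharp_mask(gamma_correction(l, gamma), amount)
    return np.clip(blend * processed + (1 - blend) * l, 0, 255)
```

The final blend step mirrors the design goal stated above: defects are emphasized without fully discarding the original pixel values.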

Fig. 5

The Proposed preprocessing module.

Fig. 6

Visual comparison between the original and pre-processed images.

Table 1 The impact of image preprocessing on image quality is analyzed in terms of SSIM, PSNR, and SNR. The table displays the average improvements per class of the NEU-DET dataset.

MLOps to auto-select the best AI models for detection and classification

In AI and machine learning, experiment management and model optimization are essential for achieving optimal results. Auto hyperparameter tuning, a key MLOps methodology, plays a crucial role in improving model performance, as model outcomes are highly influenced by hyper-parameter settings. However, with numerous possible hyper-parameter combinations, manually identifying the optimal settings is both inefficient and time-consuming. To address this, we utilize the Weights & Biases (W&B) auto hyperparameter tuning framework to determine the best defect detection and classification models for our proposed steel surface defect detection CAD. Figure 7 illustrates the auto hyperparameter tuning process in this context, followed by a detailed explanation of the detection and classification stages.

Fig. 7

Comprehensive workflow of the auto-hyperparameter tuning (MLOps).

MLOps for detection stage: automatic hyper-parameter optimization

To identify the optimal defect detection model, we select various YOLO architectures, including YOLOv5, YOLOv8 and YOLOv9, for experimentation. Each model series contains different sizes and structures, such as versions n, s, m, l and x for YOLOv5 and YOLOv8 and t, s, m, c and e for YOLOv9. We experiment with combinations of hyper-parameters like batch size, epochs, lr0 (initial learning rate), momentum, optimizer and weight decay. These hyper-parameters are crucial for the training process, as batch size affects learning speed and memory usage, lr0 impacts convergence speed, momentum enhances stability, the optimizer defines the optimization method and weight decay helps prevent overfitting by improving generalization. To efficiently explore the hyper-parameter space, we use Bayesian optimization, which reduces the number of experiments needed compared to grid search (which exhaustively checks all combinations) and random search (which selects randomly within a range).
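A Bayesian sweep over these hyper-parameters can be expressed as a configuration dictionary of the kind consumed by the W&B sweep API. The metric name, value ranges and project name below are illustrative, not the exact settings used in our experiments.

```python
# Hypothetical W&B sweep configuration (names and ranges are illustrative).
sweep_config = {
    "method": "bayes",  # Bayesian optimization over the search space
    "metric": {"name": "mAP50", "goal": "maximize"},
    "parameters": {
        "batch_size":   {"values": [8, 16, 32]},
        "epochs":       {"values": [100, 200, 300]},
        "lr0":          {"min": 1e-4, "max": 1e-1},
        "momentum":     {"min": 0.8, "max": 0.98},
        "optimizer":    {"values": ["SGD", "Adam", "AdamW"]},
        "weight_decay": {"min": 0.0, "max": 1e-3},
    },
}

# In practice the sweep would be launched with the wandb client, e.g.:
# sweep_id = wandb.sweep(sweep_config, project="steel-defect-detection")
# wandb.agent(sweep_id, function=train_yolo)
```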

MLOps for classification stage: automatic hyper-parameter optimization

Since CNN models, like defect detection models, depend on a combination of hyper-parameters for their performance, careful tuning is necessary to achieve optimal results. We use the same framework as in the auto hyperparameter tuning of the detection stage and explore key hyper-parameter combinations such as batch size, epochs, lr and weight decay. A total of 15 CNN models are used in our experiments: Resnet3429, Resnet5029, Resnet10129, Resnet15229, EfficientnetB030, EfficientnetB130, EfficientnetB230, EfficientnetB330, EfficientnetB430, EfficientnetB530, Densenet12131, Resnext50_32×4d32, Resnext101_32×8d32, SE Resnet5033 and Wide Resnet34. The models were chosen from among the most popular CNN architectures.

Detection fusion

We propose Fusion YOLO, which combines several high-performing YOLO models focused on defect region detection to maximize detection performance within the YOLO-based detection stage. The proposed Fusion YOLO performs only binary detection, without classifying defect types, focusing on effectively identifying the presence and location of defects, the most important task in a defect inspection framework. Fusion YOLO combines the best-performing YOLO models trained for binary detection with the DCBS-YOLO model, which is designed around defect characteristics on top of the optimal YOLO model selected in the MLOps detection stage, to maximize defect region detection performance. A detailed description of DCBS-YOLO is given in the next section.

DCBS-YOLO

We propose a new defect detection model, DCBS-YOLO, for detecting defects of different shapes and sizes. Figure 8 shows the architecture of the proposed model. DCBS-YOLO is built on YOLOv9c, which was selected as the optimal model through the auto hyperparameter tuning process in the MLOps detection stage, and is equipped with RepNCSPELAN4_DCNv3, which replaces the convolution in the standard RepNCSPELAN4 module with DCNv3 (Deformable Convolution v3)35 to improve the detection of defects of various shapes and sizes. DCNv3 extends the standard convolution operation, which uses a fixed kernel size in traditional CNNs, and can effectively learn various types of defects by applying spatial deformation. RepNCSPELAN4_DCNv3 modules are placed at layers 2, 4, 6 and 8 of the backbone network, where they can most precisely extract the important features of defects, from fine textures to overall structural patterns. RepNCSPELAN4_DCNv3 is organized as Conv1 × 1, Split, Bottleneck*2, Concatenate, Conv1 × 1, where Bottleneck refers to RepBottleneck_DCNv3. RepBottleneck_DCNv3 is composed of a DCNv3 block and SimAM36, enhancing the features extracted by the DCNv3 block with SimAM. SimAM is an attention mechanism inspired by neurobiological research, designed to mimic neural inhibition in the visual cortex. It infers three-dimensional attention weights for the feature map without any additional learnable parameters, determining the importance of each neuron by comparing its activation value with that of its neighboring regions. Important features are thereby enhanced and unnecessary features suppressed without incurring any additional cost during training. The SimAM energy function is given in Eq. (1),

$${e}_{t}^{*}=\frac{4({\widehat{\sigma }}^{2}+\lambda )}{{\left(t-\widehat{\mu }\right)}^{2}+2{\widehat{\sigma }}^{2}+2\lambda },$$
(1)

where \(t\) is the target neuron's value, \(\widehat{\mu }\) and \({\widehat{\sigma }}^{2}\) are the mean and variance of the neurons within the channel, \(\lambda\) is a regularization parameter and \({\left(t-\widehat{\mu }\right)}^{2}\) quantifies the difference between the target neuron and the mean of its surrounding neurons. As this difference increases (i.e., as the target neuron becomes more distinct from its surroundings), \({e}_{t}^{*}\) decreases; a low \({e}_{t}^{*}\) value therefore indicates that the neuron carries important features. By using the inverse of \({e}_{t}^{*}\) as the attention weight, important features receive higher weights and are strengthened. To suppress background areas that interfere with accurate defect detection, a background suppression module (BSM)37 is introduced in the first layer of the proposed DCBS-YOLO. BSM is designed to address the similarity between defects and the background: it computes the difference between the input feature map and its global average pooling value and applies a kernel function to this difference to assign weights. Based on the arctangent function, the kernel assigns smaller weights as the feature difference increases, thereby effectively suppressing background areas. The BSM formulas are given in Eqs. (2) and (3),

Fig. 8

Model architecture of the proposed DCBS-YOLO.

$$Out=In+\left(In-Avg\right)*K\left(\left|In-Avg\right|\right).$$
(2)
$$K\left(x\right)=-\frac{1}{\pi }\arctan\left(\left|x\right|\right)+\frac{1}{2}.$$
(3)
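A minimal NumPy sketch of Eqs. (1)-(3), assuming channel-wise statistics over spatial positions, is shown below; in the actual network both modules operate on framework tensors inside the backbone, and the sigmoid squashing of the inverse energy follows the original SimAM implementation.

```python
import numpy as np

def simam_weights(x, lam=1e-4):
    """Per-neuron SimAM attention from the energy in Eq. (1).
    x: feature map of shape (C, H, W); lower energy => more important."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    e = 4.0 * (var + lam) / ((x - mu) ** 2 + 2.0 * var + 2.0 * lam)
    # Inverse energy as attention, squashed to (0, 1) with a sigmoid.
    return 1.0 / (1.0 + np.exp(-1.0 / e))

def bsm(x):
    """Background Suppression Module, Eqs. (2)-(3)."""
    avg = x.mean(axis=(1, 2), keepdims=True)     # global average pooling
    diff = x - avg
    k = -np.arctan(np.abs(diff)) / np.pi + 0.5   # large difference -> small weight
    return x + diff * k
```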

Fusion YOLO

In this study, we propose Fusion YOLO, which combines the top three YOLO models (i.e., the proposed DCBS-YOLO, YOLOv9c and YOLOv8s) to further improve detection performance. These detection models are fused using Weighted Boxes Fusion, as summarized in Algorithm 1. All three models perform binary detection without categorizing defect types, so they can detect defect regions that traditional multi-class YOLO models miss. The proposed DCBS-YOLO has the highest defect region detection performance among the three, as it is carefully designed around defect characteristics; YOLOv9c and YOLOv8s complement it by detecting defect regions that DCBS-YOLO misses. This improves overall defect detection performance compared to using a single detection model. Each selected model is trained separately and then fed into Fusion YOLO. When a defect image is input, Fusion YOLO outputs detection results comprising bounding box coordinates of defect regions and box scores from each of the three YOLO models. The bounding boxes from each model are then combined, weighted by box score, using the WBF algorithm. Unlike NMS or Soft-NMS, which discard all overlapping boxes except the one with the highest box score, WBF uses every bounding box in a set of overlapping boxes to perform the fusion, as shown in Algorithm 1. 
Given a set of detection results \(B={\left\{\left({b}_{i},{c}_{i}\right)\right\}}_{i=1}^{n}\) consisting of bounding boxes \({b}_{i}\) and their corresponding box scores \({c}_{i}\), WBF iteratively merges overlapping boxes using an IoU threshold \(\tau\) and a box score threshold \(\sigma\). At each iteration, the algorithm selects the bounding box \({b}^{*}\) with the highest box score and identifies the set of boxes overlapping it, \(O=\left\{\left(b,c\right)\in B \mid IoU\left(b,{b}^{*}\right)>\tau \right\}\). The fusion uses the box scores as weights: \(\overline{b} = \sum \left( {c_{i} b_{i} } \right)/\sum {c_{i} }\) and \(\overline{c} = \sum \left( {c_{i}^{2} } \right)/\sum {c_{i} }\), computed over \(\left( {b_{i} ,c_{i} } \right)\in O\), where \(\overline{b}\) denotes the coordinates of the fused bounding box and \(\overline{c}\) its box score. Only fused boxes whose scores exceed \(\sigma\) are included in the final prediction. This process continues until all bounding boxes are processed. Also, unlike traditional algorithms, our WBF does not use class information to perform the fusion.

Algorithm 1

Weighted Box Fusion (WBF).
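The fusion step of Algorithm 1 can be sketched in a simplified, class-agnostic form as follows; the thresholds and function names are illustrative.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def wbf(boxes, scores, iou_thr=0.55, score_thr=0.1):
    """Class-agnostic weighted boxes fusion over the pooled model outputs."""
    order = np.argsort(scores)[::-1]                 # highest score first
    boxes = np.asarray(boxes, dtype=float)[order]
    scores = np.asarray(scores, dtype=float)[order]
    used = np.zeros(len(boxes), dtype=bool)
    fused = []
    for i in range(len(boxes)):
        if used[i]:
            continue
        # Cluster: the current top box plus all unused boxes overlapping it.
        idx = [j for j in range(len(boxes))
               if not used[j] and iou(boxes[i], boxes[j]) > iou_thr]
        used[idx] = True
        c, b = scores[idx], boxes[idx]
        fb = (c[:, None] * b).sum(axis=0) / c.sum()  # score-weighted coordinates
        fc = (c ** 2).sum() / c.sum()                # fused box score
        if fc > score_thr:
            fused.append((fb, fc))
    return fused
```

Two overlapping predictions of the same region are merged into one box whose coordinates lean toward the higher-scoring prediction, while isolated boxes pass through unchanged.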

Ensemble classification

Defects lack a fixed color or shape, making it challenging to fully analyze their features with a single CNN model. To address this, we propose an ensemble CNN based on a feature-ensemble approach, which combines diverse high-dimensional features from top-performing CNN models to enhance classification performance. Additionally, we integrate this ensemble CNN with a Vision Transformer (ViT) leveraging a self-attention mechanism to further improve classification accuracy. As illustrated in Fig. 9, the ensemble CNN model is constructed from the three best-performing classifiers (i.e., ResNet101, EfficientNetB5 and ResNeXt101_32×8d) selected as described in the MLOps classification stage. The model extracts backbone features by removing the classifier layers, then applies adaptive average pooling (AAP) to convert the features into a 2D format. These high-level features are concatenated and fed into the ViT for further processing. The proposed Ensemble CNN + ViT model is designed for end-to-end training, optimizing all network components simultaneously. This integrated approach eliminates the need for separate feature extraction and training steps, allowing the CNN and ViT modules to operate as a unified network, which improves learning efficiency and facilitates more sophisticated and diverse representations of defect features. Input images are preprocessed using the defect region processing (DRP) algorithm, which applies consistent cropping and resizing rules across all defect classes to ensure unbiased processing. Specifically, it crops the bounding box region identified by Fusion YOLO, resizes it to fit 224 × 224 while preserving the aspect ratio, and applies zero-padding to preserve the defect's features. Example outputs of the DRP algorithm are shown in Fig. 10. This preprocessing preserves the integrity of defect characteristics while enabling an accurate and efficient feature-space ensemble classification model.
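The crop-resize-pad logic of the DRP step can be sketched as follows, assuming boxes in (x1, y1, x2, y2) pixel coordinates and a nearest-neighbor resize for brevity (the actual implementation may use a different interpolation).

```python
import numpy as np

def nn_resize(img, new_h, new_w):
    """Nearest-neighbor resize for a (H, W) or (H, W, C) array."""
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows][:, cols]

def drp(image, box, size=224):
    """Crop the fused bounding box, resize with preserved aspect ratio,
    and zero-pad to size x size (illustrative version of the DRP step)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    scale = size / max(h, w)  # fit the longer side to the target size
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    resized = nn_resize(crop, new_h, new_w)
    out = np.zeros((size, size) + crop.shape[2:], dtype=crop.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized  # centre on zero padding
    return out
```

Zero-padding, rather than stretching, keeps the defect's aspect ratio intact so the classifier sees undistorted shapes.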

Fig. 9

Comprehensive architectural of the proposed Ensemble CNN + ViT and Defect Region Processing (DRP) algorithm.

Fig. 10

Example of visualization results of applying defect region processing (DRP).

Training flow

This section outlines the training and finalization procedures for our proposed detection and classification models. Initially, we conducted automated hyperparameter tuning across all detection models to identify the optimal model and hyperparameter configurations for peak performance. This process led to the development of DCBS-YOLO, which underwent further hyperparameter optimization to determine the best settings. The top-performing DCBS-YOLO model and the other selected detection models were trained for binary detection, and the three leading models were integrated into Fusion YOLO. Subsequently, we focused on training the classification model. Utilizing images extracted from the ground-truth regions of the NEU-DET dataset and processed through the DRP algorithm, we ensured consistency by employing the same dataset as the detection model. Automated hyperparameter tuning was applied to all classification models using this dataset to select the top three performers. These selected models were then refined by removing their original classifier layers, freezing the parameters of all layers and integrating a pretrained Vision Transformer (ViT) with a new classifier. The comprehensive training workflow for our proposed models and methodology is depicted in Fig. 11. This approach ensures that both detection and classification models are meticulously optimized and harmonized for superior performance.

Fig. 11

Training flow diagram of the proposed models.

Explainable saliency feature maps

Deep learning models address complex societal challenges but are often criticized as opaque “black boxes”, making their decision-making processes difficult to understand. This lack of transparency hinders trust, especially in critical applications like defect detection. To overcome this, we integrate Explainable AI (XAI) techniques, which provide clarity by explaining model predictions and enhancing credibility. In our approach, we use Grad-CAM (Gradient-weighted Class Activation Mapping) to highlight the regions influencing CNN-based classifications38. Grad-CAM generates heatmaps showing the important areas in an image that contribute to the model’s decisions. After detecting and classifying defects in the proposed CAD framework, these heatmaps validate the results and assist users in analyzing defects. If defect detection fails in the initial stage, our Ensemble CNN + ViT model, enhanced with Grad-CAM, identifies and analyzes significant regions in the image. This visual interpretation not only reveals the key features recognized by the model but also provides insights to evaluate performance and guide future improvements.
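The core of Grad-CAM is a small computation: each feature-map channel is weighted by the global average of its gradient with respect to the target class score, and the weighted sum is passed through a ReLU. A framework-free sketch of that formula (with plain lists standing in for the tensors a deep learning framework would supply) looks like this:

```python
def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: K grids of HxW values (activations A^k and
    the gradients dY/dA^k for the target class). Returns the HxW heatmap
    ReLU(sum_k alpha_k * A^k), where alpha_k is the global average of the
    k-th gradient map -- the standard Grad-CAM weighting."""
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    alphas = [sum(sum(row) for row in g) / (H * W) for g in gradients]
    heat = [[0.0] * W for _ in range(H)]
    for a, fmap in zip(alphas, feature_maps):
        for i in range(H):
            for j in range(W):
                heat[i][j] += a * fmap[i][j]
    return [[max(0.0, v) for v in row] for row in heat]  # ReLU keeps positive evidence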

Execution environment

The code was implemented using Python 3.9 and the experiments were conducted on a server equipped with a CPU (Intel i5-13500) and a GPU (NVIDIA GeForce RTX 3090). The GPU environment utilized CUDA version 11.8 and cuDNN version 8.9.7.

Evaluation metrics

Detection stage

The experiments used Precision, Recall, AP (Average Precision) and F1-Score as performance metrics to evaluate the model’s results. These metrics are defined as follows:

$$Precision \left(Pre\right)=\frac{TP}{TP+FP}.$$
(4)
$$Recall \left(Re\right)=\frac{TP}{TP+FN}.$$
(5)
$$AP={\int }_{0}^{1}P\left(R\right)dR.$$
(6)
$$F1\text{-}score=2\times \frac{Pre\times Re}{Pre+Re}.$$
(7)

The performance metrics used for evaluation are calculated based on true positive (TP), true negative (TN), false positive (FP) and false negative (FN). TP refers to cases where an object exists and the model correctly detects it. TN refers to cases where no object exists and the model correctly predicts its absence. FP refers to cases where the model mistakenly predicts the existence of an object that is not present. FN refers to cases where an object exists, but the model fails to detect it. Based on these factors, the performance metrics are defined as follows. Precision evaluates the proportion of true positives out of all the instances the model predicted as positive. Recall measures the proportion of actual positive samples that the model correctly predicted as positive. F1-Score is the harmonic mean of Precision and Recall, taking both into account. AP (Average Precision) is calculated as the area under the Precision-Recall curve, which evaluates how well the model maintains high Precision across various levels of Recall.
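Equations (4)-(7) translate directly into code. The sketch below approximates the AP integral of Eq. (6) with a step-wise sum over recall increments, which is a simplification of the interpolated PR-curve integration that detection toolkits typically apply:

```python
def precision(tp, fp):
    """Eq. (4): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (5): fraction of actual positives that were found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Eq. (7): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def average_precision(recalls, precisions):
    """Eq. (6): area under the precision-recall curve, approximated by
    summing precision over each recall increment (recalls ascending)."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

For instance, 8 true positives with 2 false positives and 2 false negatives gives Precision = Recall = 0.8 and hence F1 = 0.8.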

Classification stage

For classification experiments, we use F1-Score and Accuracy as performance metrics to evaluate the model. The performance metrics are represented by the following formulas.

$$Accuracy \left(Acc\right)=\frac{TP+TN}{TP+TN+FP+FN}.$$
(8)

F1-score is the same as described in Detection Stage and Accuracy is calculated based on TP, FP, FN and TN. Accuracy is a metric that indicates the percentage of the total sample that the model correctly predicts and is used to simply evaluate the overall performance of the model.

Experimental results

Hyper-parameters selection (detection stage)

Figure 12 shows the results of applying auto hyperparameter tuning to all YOLO models: YOLOv5, YOLOv8 and YOLOv9. A total of 380 hyper-parameter combinations were explored across all YOLO models. Table 2 summarizes the optimal hyper-parameter combination for the best detection model (i.e., YOLOv9c) compared with our proposed DCBS-YOLO model. Figure 13 visualizes the results of auto hyperparameter tuning for the proposed DCBS-YOLO model. A total of 83 hyper-parameter combinations were explored, of which the selected optimal combination is described in Table 2.
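The shape of such an automated tuning loop can be sketched as a simple random search. The search space, the trial count and the evaluate() callback below are placeholder assumptions for illustration, not the paper's actual MLOps pipeline:

```python
import math
import random

# Illustrative search space; the real space covers more YOLO hyper-parameters.
SPACE = {
    "lr": (1e-4, 1e-2),          # sampled log-uniformly
    "momentum": (0.8, 0.98),
    "batch_size": [8, 16, 32],
}

def sample(rng):
    """Draw one hyper-parameter combination from the space."""
    lo, hi = SPACE["lr"]
    return {
        "lr": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "momentum": rng.uniform(*SPACE["momentum"]),
        "batch_size": rng.choice(SPACE["batch_size"]),
    }

def tune(evaluate, n_trials=83, seed=0):
    """Keep the best (score, params) pair over n_trials random samples;
    evaluate() would train briefly and return validation mAP@0.5."""
    rng = random.Random(seed)
    best = (-1.0, None)
    for _ in range(n_trials):
        params = sample(rng)
        score = evaluate(params)
        if score > best[0]:
            best = (score, params)
    return best
```

Each explored combination corresponds to one line in the parallel-coordinate plots of Figs. 12 and 13, with the best-scoring line highlighted in yellow.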

Fig. 12

Visualization of the Auto-hyperparameter tuning results for all versions of YOLOv5, YOLOv8, and YOLOv9, where the yellow curve indicates the hyper-parameter combination that achieved the best performance.

Table 2 The selected hyper-parameters for the detection stage.
Fig. 13

Visualization of auto hyperparameter tuning results for proposed DCBS-YOLO, where the yellow line indicates the hyper-parameter combination that achieved the best performance.

Hyper-parameters selection (classification stage)

Figure 14 visualizes the results of applying the auto hyperparameter tuning methodology to all classification models: Resnet34, Resnet50, Resnet101, Resnet152, EfficientnetB0, EfficientnetB1, EfficientnetB2, EfficientnetB3, EfficientnetB4, EfficientnetB5, Densenet121, Resnext50_32 × 4d, Resnext101_32 × 8d, SE Resnet50 and Wide Resnet. A total of 317 hyper-parameter combinations were explored and tuned. Based on this process, the top three classifiers were selected to be EfficientnetB5, Resnet101 and Resnext101_32 × 8d. Table 3 shows the best hyper-parameter combinations for these top three models.

Fig. 14

Visualization of auto hyperparameter tuning results for all classification models used in the experiment, where the yellow line indicates the hyper-parameter combination that achieved the best performance.

Table 3 The selected hyper-parameter for the classification stage.

Detection stage: YOLO-based simultaneous detection and classification

We trained and evaluated the detection models used in our experiments, including all versions (n, s, m, l, x) of YOLOv5 and YOLOv8, and YOLOv9 (t, s, m, c, e). As shown in Table 4, YOLOv9c outperformed the other models with 75% mAP@0.5. After applying auto hyperparameter tuning, YOLOv9c achieved 75.8% mAP@0.5, a 0.8% improvement, ranking it as the best baseline detection model. Building on YOLOv9c, we developed and trained the proposed DCBS-YOLO model, which achieved a mAP@0.5 of 75.7%, a 0.7% improvement over the original YOLOv9c. After tuning its hyperparameters, DCBS-YOLO demonstrated significant performance gains across defect classes: 36.6% AP for Crazing, 81.1% for Inclusion, 96.3% for Patches, 95.5% for Pitted Surface, 63.7% for Rolled-in Scale and 88.3% for Scratches. This resulted in a mAP@0.5 of 76.9%, a 1.2% increase over the untuned DCBS-YOLO and a 1.1% gain over the hyperparameter-tuned YOLOv9c. Table 5 provides detailed performance comparisons and Fig. 15 visualizes metrics such as precision, recall, F1-score and AP for each class using radar graphs. These results demonstrate that the proposed auto hyperparameter tuning methodology and DCBS-YOLO significantly enhance defect detection performance.

Table 4 Detection evaluation results (%) of each model against the proposed DCBS-YOLO.
Table 5 Performance comparison of detection models based on auto hyperparameter Tuning.
Fig. 15

Comprehensive performance for all versions of YOLOv5, YOLOv8, YOLOv9 and the proposed DCBS-YOLO trained under identical conditions. The diamond-shaped red line represents the proposed DCBS-YOLO.

Detection stage: YOLO-based binary detection without classification

We trained and evaluated all YOLO models and the proposed methods with binary detection; the results are presented in Table 6. Note that the DCBS-YOLO and YOLOv9c models were trained using hyperparameters optimized through auto hyperparameter tuning, while the other detection models were trained with predefined hyperparameters. Overall, binary detection outperformed simultaneous detection and classification. YOLOv9c achieved 79.4% Precision, 67.9% Recall and 79.8% AP. Meanwhile, the proposed DCBS-YOLO model attained 75.7% Precision, 78.6% Recall and an AP improvement of 0.7%, reaching 80.5%. Further experiments with Fusion YOLO, combining the three models DCBS-YOLO, YOLOv9c and YOLOv8s, demonstrated the highest precision among the binary detection models, with 82.8% precision and 76.3% recall. Additionally, Fusion YOLO achieved the highest AP of 83.8%, an improvement of 3.3% over the proposed DCBS-YOLO. These results highlight that combining multiple optimized binary detection models enhances defect area detection performance compared to using a single detection model.

Table 6 Binary detection evaluation Results (%) of all individual models against the proposed DCBS-YOLO and the Fusion YOLO.

Classification stage: single classification scenario

We trained and evaluated all 15 classification models used in our experiments, analyzing their performance individually as shown in Table 7. The experimental results highlight three top-performing models: Resnet101, EfficientnetB5 and Resnext101_32 × 8d. Resnet101 achieved 98.6% Accuracy and an average F1-Score of 98%. EfficientnetB5 followed closely, achieving 98.6% Accuracy and 98.6% F1-Score (avg), while Resnext101_32 × 8d recorded 98.5% Accuracy and an F1-Score (avg) of 98.5%. These three models achieved the highest average F1-Scores among all the classifiers. Additionally, training EfficientnetB5, Resnet101 and Resnext101_32 × 8d with hyperparameter configurations optimized using auto hyperparameter tuning resulted in slight performance improvements compared to their untuned counterparts. Detailed performance metrics for these tuned models are provided in Table 8.

Table 7 Classification evaluation results (%) of each classifier against the proposed Ensemble CNN + ViT.
Table 8 Performance evaluation results (%) of the top three classification models based on the auto hyperparameter tuning.

Classification stage: hybrid ensemble classification scenario (CNN + ViT)

As demonstrated by the experimental results in MLOPS for Classification Stage, the proposed Ensemble CNN + ViT model, which integrates feature-based ensembles of the three hyperparameter-tuned models and incorporates ViT, outperforms EfficientnetB5, Resnet101 and Resnext101_32 × 8d in terms of Accuracy and F1-Score. Additionally, it surpasses other classification models in F1-Score across all classes, except for the Rolled-in Scale class, highlighting the effectiveness of our approach for defect classification. Figure 16 illustrates the normalized confusion matrix for the hyperparameter-tuned EfficientnetB5, Resnet101, Resnext101_32 × 8d and the proposed Ensemble CNN + ViT. Notably, the Ensemble CNN + ViT model achieved perfect classification for defects in the Crazing, Patches and Scratches classes, demonstrating that combining ensemble methods with ViT enhances classification performance compared to relying on a single model.

Fig. 16

Confusion matrices for the best three classifiers (i.e., EfficientnetB5, Resnet101, Resnext101_32 × 8d) against the proposed Ensemble CNN + ViT trained with optimal hyper-parameter combinations.

The proposed CAD framework result and comparison: detection and classification

The experimental results demonstrate that the proposed Fusion YOLO achieves high effectiveness in the detection stage, while the Ensemble CNN + ViT delivers superior performance in classification. Leveraging these strengths, we integrated the proposed models from each stage into a unified hybrid CAD Framework. Table 9 compares the performance of our CAD Framework with the auto-tuned DCBS-YOLO, YOLOv9c (from Table 5) and other high-performing detection models, including YOLOv9s, YOLOv9m, YOLOv5m and YOLOv5x, as detailed in Table 4. The proposed CAD Framework achieves AP@0.5 values of 44.8%, 82.1%, 99%, 96.1%, 75.3% and 96.4% for the classes Crazing, Inclusion, Patches, Pitted Surface, Rolled-in Scale and Scratches, respectively. These results surpass the detection performance of all models evaluated in the experiments. Furthermore, the CAD framework achieves an 82.3% mAP, which represents a 5.4% improvement over the hyperparameter-tuned DCBS-YOLO, demonstrating its overall effectiveness in defect detection and classification.

Table 9 Performance comparison between the best-performing detection models and the proposed CAD Framework.

Performance comparison of the preprocessing module

Table 10 summarizes the performance comparison between the proposed models and the baseline methods conducted to assess the effectiveness of the preprocessing module. When the preprocessing module is employed, all detection models exhibit consistent improvements in mAP@0.5 at the Detection and Classification stage, and YOLOv9c, DCBS-YOLO, and Fusion YOLO also achieve higher AP@0.5 in the Binary Detection stage. At the Classification stage, the F1-score of each individual CNN increases slightly, while the proposed Ensemble CNN + ViT attains the highest F1-score when the preprocessing module is applied. These results suggest that the preprocessing module enhances the visibility of defect regions and suppresses background interference, thereby leading to a stable improvement in detection and classification performance across the proposed CAD framework.

Table 10 Impact of the preprocessing step using the NEU-DET dataset.

Performance comparison with recent research

This section compares the performance evaluation results of the proposed CAD framework and recent research based on YOLO in the field of steel surface defect detection. Table 11 shows the methodology of the recent research and the performance of mAP@0.5 evaluated based on the proposed CAD framework and the NEU-DET dataset. The proposed CAD framework outperforms the latest studies and incorporates various methodologies. These results confirm that the proposed CAD framework provides competitive performance compared to recent research.

Table 11 Performance comparison with recent research works using the NEU-DET dataset.

Discussion

Detection stage

The evaluation of the proposed DCBS-YOLO, a single detection model, demonstrated its enhanced performance in defect area detection through binary learning experiments. All detection models in the study were trained for binary detection, with the best-performing models combined to construct and assess Fusion YOLO. The proposed DCBS-YOLO achieved a 75.7% mAP, outperforming all other models in the experiment, and further improved to 76.9% mAP (a 1.2% increase) through auto hyperparameter tuning. These results validate the effectiveness of our methodology for defect detection. The hyper-parameter-tuned DCBS-YOLO (binary detection) showed superior defect area detection compared to other models, while Fusion YOLO, incorporating the top-performing models, achieved a 3.3% AP improvement over the tuned DCBS-YOLO (binary detection). This highlights the advantage of combining multiple high-performing models to boost detection accuracy. Despite the notable performance improvements of DCBS-YOLO, certain defects, such as Crazing and Rolled-in Scale, showed AP values below 70%. Examination of these cases revealed challenges, including defects that closely resemble the background, blurred boundaries that mismatch the ground truth box and the presence of multiple defect areas in a single image leading to incomplete or merged detections. These limitations, commonly encountered in industrial defect detection, underscore the difficulty of achieving accurate detection with a single model. This reinforces the significance of the proposed Fusion YOLO, which combines multiple models to address these challenges and improve overall performance.

Classification stage

We evaluated the defect classification performance of individual CNNs used in the experiment and subsequently assessed the proposed Ensemble CNN + ViT model. Pre-processing with the DRP algorithm, designed to preserve the integrity of the original image, enabled most single CNNs to achieve high defect classification accuracy. However, confusion matrix results revealed misclassifications among certain models, such as EfficientnetB5, Resnext101_32 × 8d and Resnet101. These models occasionally confused patches with inclusions and curved surfaces with rolled-in scales. In contrast, the proposed Ensemble CNN + ViT demonstrated superior classification performance compared to individual models. It effectively reduced the misclassification, particularly mitigating confusion between rolled-in scales and pitted surfaces, while maintaining consistent accuracy across other defect classes. This highlights the robustness and stability of the Ensemble CNN + ViT approach for defect classification.

Detection ablation study

An ablation study was conducted on the proposed DCBS-YOLO model to evaluate how individual components and their combinations contribute to defect detection performance. The study examined the effects of integrating the DCNv3, SimAM and BSM modules into YOLOv9c, the baseline model for DCBS-YOLO. Table 12 summarizes the results, with YOLOv9c trained using the DCBS-YOLO hyper-parameter configuration. The baseline YOLOv9c achieved 74.9% mAP, slightly below its original performance. However, incorporating DCNv3 improved the mAP by 1.6%, reaching 76.5%, highlighting DCNv3 as a significant contributor to performance enhancement. SimAM and BSM yielded smaller gains, with improvements of 0.1% and 0.4%, respectively, indicating a modest impact for SimAM and a relatively higher influence for BSM. Combinations of these modules were also evaluated. The DCNv3 & SimAM and DCNv3 & BSM pairings achieved mAPs of 76.5% and 76.7%, respectively, while SimAM & BSM resulted in a 75.5% mAP. Applying all three modules together resulted in the highest mAP of 76.9%, confirming that the integration of DCNv3, SimAM and BSM is effective in enhancing the performance of the DCBS-YOLO model. This analysis demonstrates the synergistic impact of these components on overall defect detection accuracy.

Table 12 Ablation study results of proposed DCBS-YOLO.

Classification ablation study

We conducted an ablation study to assess the impact of the EfficientNetB5, ResNet101 and ResNeXt101_32 × 8d models on the Ensemble CNN and evaluated the effectiveness of integrating the Ensemble CNN with ViT. Table 13 presents the results: EfficientNetB5, ResNet101 and ResNeXt101_32 × 8d, trained using the tuned hyper-parameter combinations, achieved average F1-Scores of 98.8%, 98.7% and 98.7%, respectively. Ensembling pairs of models (i.e., ResNeXt101_32 × 8d & EfficientNetB5 and ResNeXt101_32 × 8d & ResNet101) yielded an average F1-Score of 98.8%, showing no significant improvement. However, the combination of EfficientNetB5 & ResNet101 achieved a higher F1-Score of 99.1%. This analysis indicates that ResNeXt101_32 × 8d has the lowest impact on performance improvement, while EfficientNetB5 and ResNet101 contribute more significantly. An Ensemble CNN combining all three models achieved a 99.2% F1-Score, outperforming the individual models. This demonstrates that ensembling high-performing CNN models enhances classification performance compared to using a single model. Building on this result, we integrated the Ensemble CNN with ViT, achieving an F1-Score of 99.7%, a 0.5% improvement. These findings confirm the efficacy of combining the Ensemble CNN with ViT for superior defect classification performance.

Table 13 Ablation study results of proposed Ensemble classifier.

CAD framework ablation study

We demonstrated that ensembling high-performing AI models enhances performance. Building on this, we conducted an ablation study to evaluate the performance impact of the proposed Fusion YOLO when integrated into the proposed CAD framework, with the results summarized in Table 14. The evaluation criterion was the final defect detection performance, measured by inputting the detected defect areas from the proposed Fusion YOLO into the proposed Ensemble CNN + ViT. Individually, YOLOv8s, YOLOv9C and the Proposed DCBS-YOLO achieved mAPs of 77.3%, 79.1% and 80.4%, respectively, with the proposed DCBS-YOLO having the highest impact on improving the CAD framework’s performance. Combining models yielded further improvements, as YOLOv8s & YOLOv9C achieved 80.7% mAP, YOLOv8s & Proposed DCBS-YOLO achieved 81.9% mAP and YOLOv9C & Proposed DCBS-YOLO achieved 82.0% mAP. When all three models (YOLOv8s, YOLOv9C and Proposed DCBS-YOLO) were combined, the mAP reached 82.3%, demonstrating superior performance compared to any single model. These results confirm that the Proposed CAD Framework significantly benefits from ensembling multiple models, achieving enhanced defect detection and classification capabilities.

Table 14 Ablation study evaluation results of the proposed CAD Framework.

Computational efficiency and inference speed

Table 15 presents a comparative analysis of the computational cost and inference speed between the baseline model (YOLOv9c) and the proposed CAD Framework. Due to the complex nature of the 2-stage architecture designed for sophisticated defect detection and classification, the proposed CAD Framework requires higher computational resources, exhibiting 313.32 GFLOPs and an inference latency of 34.01 ms, which results in a throughput of 29.4 FPS. In contrast, the baseline YOLOv9c operates at 88.9 FPS with 99.8 GFLOPs. This increase in computational cost is directly attributed to the architectural design aimed at precise defect identification. Crucially, the proposed Framework yields a significant improvement in detection performance. It achieves an mAP@0.5 of 82.3%, outperforming the YOLOv9c (75.8%) by a substantial margin of 6.5%. In the context of steel surface defect detection, while achieving real-time processing (30 FPS) is desirable, minimizing missed detections (securing high mAP) is also paramount. The current throughput of 29.4 FPS is marginally below the 30 FPS real-time threshold, indicating that the framework is on the cusp of real-time capability. Therefore, although inference speed was compromised to secure a higher mAP in this study, there remains sufficient potential for enhancing speed and achieving full real-time operation through future architectural optimization and infrastructure adjustments.
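The latency and throughput figures quoted above are direct reciprocals of one another; a one-line helper makes the conversion explicit (the values below are taken from Table 15, the helper itself is only illustrative):

```python
def fps(latency_ms):
    """Convert per-image inference latency in milliseconds to throughput
    in frames per second (FPS = 1000 / latency)."""
    return 1000.0 / latency_ms

# 34.01 ms per image for the CAD Framework corresponds to roughly 29.4 FPS,
# just under the 30 FPS real-time threshold discussed in the text.
```

Conversely, the baseline's 88.9 FPS implies a per-image latency of about 11.25 ms.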

Table 15 Comparison of computational cost between YOLOv9c and the proposed CAD Framework.

Analysis of missed defects in detection: an XAI approach

To further detect and analyze defect images that Fusion YOLO fails to detect, the proposed CAD Framework incorporates Grad-CAM to generate heat maps, allowing users to visualize the areas that the AI model focuses on. Experimental results indicate that the Crazing class exhibits a notably lower detection rate than other defect classes, which negatively impacts overall defect detection performance. In a quantitative analysis of 10 undetected images (containing 17 ground truth defects), only 3 defects were successfully localized at an IoU threshold of 0.45. This result is attributed to the tendency of Grad-CAM heatmaps to encompass multiple defects or cover broader regions around small defects, which lowers the IoU metric. However, they remain qualitatively effective for helping users identify defect locations. Figure 17 illustrates the heat maps for test images where the proposed CAD Framework did not detect any defects. In Fig. 17a, the Grad-CAM result accurately highlights the ground truth region; in Fig. 17b, it covers most of the ground truth region; and in Fig. 17c, it highlights the right break correctly, although with some error, while weakly covering the bottom region of the ground truth. This XAI feature offers additional insight into defect areas missed during the Detection Stage, providing valuable information and helping users or domain experts unfamiliar with defect detection to analyze results more efficiently.

Fig. 17

(a) and (b) represent the crazing class, and (c) represents the inclusion class. The green bounding boxes drawn in the missing detection at the top indicate the ground truth, and the bottom images show heat maps visualized through Grad-CAM.

Analysis for Fusion YOLO

We provide a detailed analysis of the behavior of the proposed Fusion YOLO. The primary goal of Fusion YOLO is to enhance defect detection by combining the outputs of multiple models, thus enabling the detection of defect areas missed by any single model. However, an inherent challenge arises: while ideally only correctly detected defect areas would be fused, false positives can also be combined. To address this, we leverage the fact that false positives typically have low box scores. By applying the WBF algorithm, we fuse overlapping false positive bounding boxes into one, so that the low scores associated with these false positives are combined, which results in a lower overall score. During this process, boxes with low scores are removed using a box score threshold, effectively eliminating false positives. Figures 18 and 19 visualize the detection results of each model used in Fusion YOLO, along with the final output of the proposed CAD Framework. For the Crazing class, YOLO1 successfully detects one defect, YOLO2 detects nothing and YOLO3 detects a different ground truth. Fusion YOLO’s WBF algorithm combines these results, allowing each model to contribute defect areas that others missed. In the Inclusion class, YOLO1 detects all the ground truth regions correctly, while YOLO2 makes a wrong detection and YOLO3 detects two defects as one. False positives generally have lower scores and the WBF algorithm adjusts the bounding boxes by weighing the detection results of each model according to their scores, reducing the influence of low-scoring detections. For regions that are incorrectly detected and have no valid matches, the box score threshold removes low-scoring detections, minimizing the inclusion of false positives. In the Patches class, YOLO1 makes correct detections, while YOLO2 and YOLO3 produce false positives, which are combined and removed based on their low scores.
This approach is applied consistently across the Pitted Surface, Rolled-in Scale and Scratches classes, ensuring the minimization of false positives. Overall, the proposed CAD Framework enhances defect detection by reliably combining the results of effective models while minimizing false positives.
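The fusion behaviour described above can be sketched as a simplified variant of weighted boxes fusion: boxes from all models are greedily clustered by IoU, each cluster's coordinates are averaged with confidence weights, and clusters whose mean score falls below the box-score threshold are discarded. This is a pedagogical reduction under those assumptions, not the exact WBF implementation used in the paper.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def fuse(detections, iou_thr=0.3, score_thr=0.3):
    """detections: list of ((x1, y1, x2, y2), score) pairs pooled from all
    models. Returns fused boxes; agreeing detections reinforce each other,
    while isolated low-score false positives fall below score_thr."""
    clusters = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        for cluster in clusters:  # greedy IoU matching to a cluster seed
            if iou(box, cluster[0][0]) >= iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    fused = []
    for cluster in clusters:
        total = sum(s for _, s in cluster)
        # confidence-weighted average of each coordinate
        coords = [sum(b[i] * s for b, s in cluster) / total for i in range(4)]
        mean_score = total / len(cluster)
        if mean_score >= score_thr:  # box-score threshold removes weak fusions
            fused.append((coords, mean_score))
    return fused
```

With this scheme, two models agreeing on a defect yield one confident fused box, while a lone 0.1-score false positive from a single model is filtered out.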

Fig. 18

Visualization compares the detection results of DCBS-YOLO, YOLOv9c and YOLOv8s incorporated into proposed Fusion YOLO with the results of the proposed CAD Framework, which integrates all the methodologies we suggest. Correct detection refers to accurately identifying the ground truth, missed detection indicates failing to detect the ground truth, and wrong detection represents detecting an incorrect region or failing to meet the IoU threshold.

Fig. 19

This analysis follows the same approach as Fig. 18, but this figure focuses on comparisons for the pitted surface, rolled-in scale, and scratches classes.

The effectiveness of the WBF thresholds of IoU and box confidence score

To examine the influence of the IoU and box-confidence thresholds on CAD performance, we tested a range of threshold values from 0.2 to 0.8. The resulting mAP@0.5 scores are summarized in Fig. 20. Our findings show that the proposed framework is far more sensitive to the box-confidence threshold than to the IoU setting. The best performance is achieved when the box-confidence threshold is set between 0.2 and 0.4, reaching a peak mAP of 82.30% with an IoU of 0.3.
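A threshold sweep of this kind reduces to a small grid search; the sketch below mirrors the 0.2 to 0.8 range used here, with evaluate() standing in (as a placeholder assumption) for running the full CAD pipeline and measuring mAP@0.5:

```python
def sweep(evaluate, lo=0.2, hi=0.8, step=0.1):
    """Grid-search the WBF IoU and box-confidence thresholds and return
    the (score, iou_thr, conf_thr) triple with the highest mAP@0.5."""
    grid, t = [], lo
    while t <= hi + 1e-9:          # build [0.2, 0.3, ..., 0.8]
        grid.append(round(t, 2))
        t += step
    best = (-float("inf"), None, None)
    for iou_thr in grid:
        for conf_thr in grid:
            m = evaluate(iou_thr, conf_thr)
            if m > best[0]:
                best = (m, iou_thr, conf_thr)
    return best
```

Each (iou_thr, conf_thr) cell of this grid corresponds to one cell of the heatmap in Fig. 20.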

Fig. 20

Heatmap of mAP@0.5 of the CAD framework according to the IoU and box confidence score thresholds.

Generalization and robustness evaluation using new GC10-DET dataset

To verify the generalization capability and robustness of the proposed CAD framework, we additionally trained and evaluated it on the GC10-DET dataset. The experimental configuration followed the settings derived from the NEU-DET dataset experiments. As summarized in Table 16, the proposed CAD Framework achieves an mAP@0.5 of 71.5%, which is 4.7 percentage points higher than YOLOv9c (66.8%). This performance gain validates the effectiveness of the proposed 2-Stage structure in handling diverse defect types.

Table 16 Performance comparison between the detection models and the proposed CAD Framework.

More specifically, as shown in Table 17, in the binary detection stage the proposed Fusion YOLO achieves an AP@0.5 of 76.6%, outperforming YOLOv9c (73.9%) and DCBS-YOLO (74.8%). This result indicates that the proposed Fusion YOLO is effective at localizing defect regions without relying on explicit class differentiation.

Table 17 GC10-DET Preprocessing / Binary detection evaluation Results (%) of all individual models against the proposed DCBS-YOLO and the Fusion YOLO.

The results for the classification stage are reported in Table 18. The proposed Ensemble CNN + ViT model attains an average F1-score of 94.8% and an accuracy of 97.5%. By combining the local feature extraction capability of CNNs with the global context modeling ability of ViT, the proposed classifier achieves more accurate defect-type classification than single-backbone models such as ResNeXt101 (93.1%) and EfficientNetB5 (92.4%).

Table 18 GC10-DET Classification evaluation results (%) of each classifier against the proposed Ensemble CNN + ViT.

Table 19 presents a performance comparison between the proposed model and the baseline models with and without the preprocessing module. When the preprocessing module is applied, the CAD Framework achieves an mAP@0.5 of 75.8%, corresponding to a 2.8 percentage point improvement over the version without preprocessing. Furthermore, all tasks exhibit consistent and meaningful performance gains when the preprocessing module is used. These results demonstrate that the preprocessing module is effective for both defect localization and defect-type classification, and confirm that the proposed CAD Framework is not restricted to a single dataset but remains robust when applied to other datasets such as GC10-DET.

Table 19 Impact of the preprocessing step using the GC10-DET dataset.

Limitations and future work

Despite the promising performance and the comprehensive integration of multiple functions within the proposed CAD Framework, there are several limitations that need to be addressed to ensure its full applicability in real-world manufacturing settings. First, while the framework achieved high accuracy in defect detection and classification, the computational complexity resulted in an inference speed that is marginally below the ideal real-time threshold (e.g., 30 FPS). However, this performance level suggests that the framework is on the verge of real-time capability. Therefore, this slight latency can be effectively resolved in future research through further model optimization and by configuring an efficient inference infrastructure optimized for the manufacturing environment, ensuring robust real-time processing speeds. Second, the current evaluation was conducted using two publicly available open datasets. While these datasets are valuable for benchmarking, they may not fully capture the noise and variability characteristic of actual production environments. Future work must involve validating the framework on real-world industrial datasets to ensure its robustness and practical reliability in diverse operational conditions.

Conclusion

In this study, we proposed an explainable Hybrid AI CAD framework that decouples detection and classification tasks to maximize the accuracy and reliability of steel surface defect detection. The framework enhances defect visibility through a preprocessing module and secures optimal performance via MLOps-based auto hyperparameter Tuning. In the detection stage, we introduced DCBS-YOLO, incorporating DCNv3 and SimAM to address irregular defect shapes, and constructed Fusion YOLO to overcome single-model limitations, significantly improving binary defect localization. For classification, the Ensemble CNN and Vision Transformer (ViT) model was employed to jointly learn local features and global contexts, effectively reducing misclassification among visually similar defects. Experimental results demonstrated that the proposed framework achieved a mAP@0.5 of 82.3% and a classification F1-score of 99.7% on the NEU-DET dataset, outperforming existing state-of-the-art models. Furthermore, validation on the GC10-DET dataset yielded a mAP of 71.5% and an F1-score of 94.8%, proving the framework’s strong generalization capability and robustness across diverse manufacturing environments. Consequently, by integrating high-performance detection capabilities with Grad-CAM-based explainability, this study presents a practical industrial solution that enables non-experts to diagnose defects accurately. Future work will focus on configuring an efficient inference infrastructure and optimizing the model to ensure real-time applicability in high-speed production lines.