Introduction

The surface quality of industrial filters plays a critical role in determining their sealing performance, filtration efficiency, and service life in automotive engine systems. However, during the manufacturing process, variations in equipment and materials can introduce surface defects such as scratches, dents, and printing errors. Traditionally, these defects have been identified through manual inspection, a method that is subjective, inconsistent, and susceptible to operator fatigue and varying environmental factors. Consequently, manual inspection not only incurs high labor costs but also compromises the reliability of quality control. To address these limitations, machine vision-based automated defect detection has emerged as a promising alternative1,2.

Early machine learning methods improved the consistency of defect detection compared to manual inspection but struggled to handle the diversity of defect types and complexity of industrial backgrounds. Recently, deep learning has become the dominant approach, offering superior accuracy and robustness. Modern deep learning-based object detection algorithms can generally be classified into three main categories: two-stage R-CNN variants3, single-stage YOLO variants4, and Transformer-based DETR architectures5.

Single-stage detectors, particularly the You Only Look Once (YOLO) series, are well-suited for real-time applications owing to their high processing speed. For instance, Ji et al.6 improved YOLOv5 by adding an extra detection layer to improve recognition of small defects. Similarly, Pei et al.7 integrated the DSC-C2f module into YOLOv8 to better address complex defect morphologies. Despite their speed advantages, single-stage detectors face two key challenges. First, their reliance on non-maximum suppression (NMS) for post-processing introduces computational overhead, which can destabilize inference speed, particularly in scenarios with densely packed defects8. Second, the limited receptive field of standard convolutional kernels restricts the model’s ability to capture global contextual information. This limitation makes it difficult to accurately detect small or overlapping objects in complex backgrounds, ultimately reducing detection accuracy.

In contrast, Transformer-based models, such as DETR, use self-attention mechanisms to capture global dependencies, enabling more effective modeling of complex object relationships9. DETR combines a CNN backbone for feature extraction with a Transformer encoder–decoder architecture, eliminating the need for anchors and NMS, thereby enabling a streamlined end-to-end pipeline. Although this approach often yields higher accuracy, the self-attention mechanism’s quadratic complexity results in significant computational costs, limiting its suitability for real-time settings. To address this gap, Baidu’s PaddlePaddle team developed RT-DETR10, a hybrid model that improves both speed and accuracy through the integration of an efficient hybrid encoder and the removal of the NMS step. However, RT-DETR’s ResNet-based backbone11 and Transformer components still contain parameter redundancy, and its feature fusion network remains suboptimal. Efforts to develop lightweight variants, such as those by Wang et al.12 and Liu et al.13, have often resulted in significant accuracy degradation.

Hence, the core challenge remains achieving high-precision defect detection suitable for industrial inspection while maintaining computational efficiency. This study addresses that challenge by proposing a surface defect detection method for filters based on a significantly improved RT-DETR architecture. The proposed model improves detection precision while simultaneously reducing both the parameter count and computational complexity compared to the original RT-DETR.

Our key contributions are as follows:

  1. Backbone Network Optimization: We integrate the large selective kernel attention (LSKAttention) mechanism into the backbone to enhance multi-scale feature extraction and global context modeling, both of which are critical for accurately detecting small defects.

  2. Efficient Feature Fusion: We replace the standard RepC3 blocks in the neck with GELAN modules, which use a more advanced layer aggregation strategy to improve feature representation and fusion efficiency, while also reducing the overall parameter count.

  3. Optimized Down-Sampling: We substitute traditional strided convolutions with the Adown module, which uses a dual-path design to minimize information loss during down-sampling and preserve fine-grained defect details.

The remainder of this paper is structured as follows: Section 2 outlines the proposed methodology, Section 3 presents the results, and Section 4 discusses the findings.

Methods

This study aims to enhance the accuracy of surface defect detection in industrial filters while reducing the model’s parameter count. By balancing detection precision with computational efficiency, the proposed approach is well-suited for industrial deployment requirements. To address the challenges of detecting small-target defects, we introduce an improved RT-DETR-LGA model, which builds upon the original RT-DETR architecture.

RT-DETR

The structure of RT-DETR is shown in Fig. 1. The RT-DETR architecture comprises three primary components: (1) a backbone network for hierarchical feature extraction, (2) a hybrid encoder for multi-scale information fusion, and (3) a Transformer decoder equipped with auxiliary prediction heads. The backbone leverages CNN architectures such as the ResNet series or Baidu’s proprietary HGNet. These CNN-based backbones exhibit superior real-time performance compared to Vision Transformer (ViT) frameworks, making RT-DETR particularly well-suited for industrial production environments. The backbone generates three multi-scale feature maps, denoted as S3, S4, and S5, which are passed to the hybrid encoder for further processing. The encoder processes the multi-scale features using two complementary mechanisms: the attention-based intra-scale feature interaction (AIFI) module for intra-scale interaction, and the cross-scale feature fusion module (CCFM) for cross-scale fusion. Together, these modules transform the input into sequential image representations suitable for object detection. An IoU-aware query selection mechanism is then employed to extract the most informative candidates from the encoder’s output. These candidates serve as the initial object queries for the decoder. Within the decoder, the queries are refined over multiple stages, facilitated by auxiliary prediction heads, to generate the final bounding boxes along with their associated confidence scores.

Figure 1

RT-DETR model structure.

The hybrid encoder comprises two key components: the AIFI and CCFM modules. The AIFI module utilizes 2D sine-cosine positional embeddings to generate spatial encodings, thereby enhancing contextual correlations between image regions. This enhances feature extraction and improves the model’s semantic learning capabilities. The CCFM module, in turn, incorporates multiple convolutional fusion blocks along cross-scale pathways, enabling effective information exchange across different resolution levels.
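As a concrete illustration, the 2D sine-cosine scheme used by AIFI can be sketched in a few lines of NumPy. This is a minimal sketch of the common sincos construction, not code from the RT-DETR implementation; the function name and the frequency base of 10000 are assumptions:

```python
import numpy as np

def sincos_pos_embed_2d(h, w, dim):
    """Build a 2D sine-cosine positional embedding of shape (h*w, dim).

    Half of the channels encode the x coordinate, half the y coordinate,
    each as sine/cosine pairs at geometrically spaced frequencies, so every
    flattened token carries a unique, smooth spatial code.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    quarter = dim // 4
    # frequencies: omega_i = 1 / 10000^(i / quarter)
    omega = 1.0 / (10000 ** (np.arange(quarter) / quarter))
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    xs = xs.reshape(-1, 1) * omega          # (h*w, quarter)
    ys = ys.reshape(-1, 1) * omega
    return np.concatenate(
        [np.sin(xs), np.cos(xs), np.sin(ys), np.cos(ys)], axis=1
    )  # (h*w, dim)

# e.g. a 20x20 feature map (S5 of a 640x640 input) with 256 channels
pe = sincos_pos_embed_2d(20, 20, 256)
```

The embedding is added to the flattened S5 tokens before self-attention, giving the attention layers access to spatial position.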

The IoU-aware query selection mechanism identifies the top-K feature candidates from the encoder’s outputs based on classification scores; these candidates are then used to initialize the object queries (or only the positional queries). To address the common misalignment between classification confidence and localization accuracy, the training procedure integrates IoU scores into the objective function of the classification branch. This dual-constraint approach promotes both accurate classification of positive samples and precise bounding box regression.
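At inference time the selection itself reduces to a top-K over per-token confidence scores; the IoU awareness enters only through the training loss and is omitted here. A minimal sketch, with illustrative names and shapes:

```python
import numpy as np

def select_topk_queries(class_logits, features, k):
    """Pick the k encoder tokens with the highest classification confidence.

    class_logits: (num_tokens, num_classes) raw scores from the encoder head
    features:     (num_tokens, dim) encoder output features
    Returns the k selected feature vectors (initial object queries) and
    their token indices.
    """
    scores = 1.0 / (1.0 + np.exp(-class_logits))   # sigmoid per class
    token_scores = scores.max(axis=1)              # best class per token
    top_idx = np.argsort(-token_scores)[:k]        # k most confident tokens
    return features[top_idx], top_idx

rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 4))    # 100 tokens, 4 defect classes
feats = rng.normal(size=(100, 256))
queries, idx = select_topk_queries(logits, feats, k=10)
```

During training, the classification target for each token is weighted by the predicted box’s IoU with its matched ground truth, which pushes `token_scores` to reflect localization quality as well.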

RT-DETR-LGA

The architecture of the proposed RT-DETR-LGA model is illustrated in Fig. 2. The BasicBlock-LSKA module integrates LSKAttention into the backbone network, enhancing multi-scale feature extraction while mitigating parameter growth. Furthermore, the original RepC3 module is replaced with RepNCSPELAN4, which promotes more effective multi-scale data integration. This substitution improves both target localization and recognition accuracy without increasing model complexity. The Adown down-sampling module utilizes a multi-path processing strategy, offering significant advantages over conventional three-layer convolutional structures by preserving critical feature details while maintaining a low parameter count and computational cost.

Figure 2

Improved RT-DETR model architecture (RT-DETR-LGA).

Efficient feature fusion with RepNCSPELAN4

In the baseline RT-DETR architecture, the CCFM employs RepC3 blocks for feature integration. The RepC3 module14, built upon RepConv units, leverages structural re-parameterization to improve inference-time efficiency. During training, its dual-branch structure enables the extraction of diverse features, which are later merged into a single convolutional layer for faster inference. Although RepC3 is effective for general feature extraction, its simple stacking of RepConv units limits its ability to model complex feature dependencies, a critical capability for detecting small industrial defects in noisy backgrounds. To overcome this limitation, we replace the RepC3 blocks with the RepNCSPELAN4 module, a more advanced architecture derived from the GELAN framework used in YOLOv9.
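The structural re-parameterization that RepConv relies on can be illustrated with a minimal sketch: because convolution is linear, a parallel 3×3 + 1×1 branch pair collapses into a single 3×3 kernel (batch-norm folding is omitted for brevity):

```python
import numpy as np

def merge_rep_branches(w3x3, w1x1, b3x3, b1x1):
    """Fold a parallel 3x3 + 1x1 convolution pair into one 3x3 conv.

    conv(x, w3x3) + conv(x, w1x1) == conv(x, w3x3 + pad(w1x1)), where the
    1x1 kernel is zero-padded into the center of a 3x3 kernel.
    Weight shapes are (out_c, in_c, k, k); biases are (out_c,).
    """
    padded = np.zeros_like(w3x3)
    padded[:, :, 1:2, 1:2] = w1x1          # place 1x1 kernel at the center
    return w3x3 + padded, b3x3 + b1x1

w3 = np.random.rand(8, 4, 3, 3)
w1 = np.random.rand(8, 4, 1, 1)
b3, b1 = np.random.rand(8), np.random.rand(8)
w_merged, b_merged = merge_rep_branches(w3, w1, b3, b1)
```

This is why the dual-branch training structure costs nothing at inference: the merged single conv computes exactly the same function.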

RepNCSPELAN48 is an efficient layer aggregation network that integrates design principles from both CSPNet15 and ELAN16. Its architecture is specifically engineered to optimize gradient paths and enhance feature representation (Fig. 3), all while maintaining low computational overhead. The architecture of RepNCSPELAN4 can be summarized as follows:

  • Input Processing: The module begins by applying a 1 \(\times\) 1 convolution to the input feature map. This operation adjusts the channel dimensions and acts as a bottleneck layer, efficiently preparing the features for subsequent, more complex processing stages.

  • Cross-Stage Partial (CSP) Structure: At the core of the module is the RepNCSP block8, which splits the input features into two parallel paths. The first path passes through a stack of convolution layers, representing the ELAN component, to allow rich hierarchical feature extraction. The second path bypasses these layers, functioning as a residual connection that preserves the original information and facilitates gradient flow.

  • Feature Aggregation and Output: The features from both paths are concatenated and processed through a final 1 \(\times\) 1 convolution. This CSP-inspired design facilitates gradient flow across network stages, reduces feature degradation, and enables the network to learn more robust and discriminative representations.

By replacing RepC3 with RepNCSPELAN4, we are not simply substituting one module for another, but enhancing the feature fusion mechanism. The multi-path, cross-stage design of RepNCSPELAN4 ensures more effective integration of fine-grained details and high-level semantic features. This architectural upgrade significantly improves the model’s ability to accurately localize and classify challenging small-scale defects, particularly in complex industrial environments, while its efficient design reduces computational and memory overhead when processing large-scale data.
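The aggregation pattern described above can be sketched in PyTorch as follows. This is a simplified illustration rather than the exact YOLOv9 implementation: channel counts are arbitrary, and plain 3×3 convolutions stand in for the RepNCSP sub-blocks:

```python
import torch
import torch.nn as nn

class GELANBlockSketch(nn.Module):
    """Simplified sketch of the RepNCSPELAN4 aggregation pattern:
    1x1 bottleneck -> CSP channel split -> ELAN-style conv stack on one
    half -> concatenation of all paths -> 1x1 fusion."""

    def __init__(self, c_in, c_out, c_mid=64):
        super().__init__()
        self.stem = nn.Conv2d(c_in, c_mid, 1)          # bottleneck 1x1
        half = c_mid // 2
        self.branch = nn.Sequential(                   # hierarchical path
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
        )
        # concat of: bypass half, processed half's input, processed output
        self.out = nn.Conv2d(c_mid + half, c_out, 1)   # final 1x1 fusion

    def forward(self, x):
        x = self.stem(x)
        x1, x2 = x.chunk(2, dim=1)                     # CSP split
        y = self.branch(x2)                            # ELAN conv stack
        return self.out(torch.cat([x1, x2, y], dim=1)) # aggregate paths

block = GELANBlockSketch(128, 128)
out = block(torch.randn(1, 128, 40, 40))
```

The bypass path (`x1`) preserves the original features and shortens the gradient path, while the stacked path (`y`) adds depth only to half the channels, which is where the parameter savings come from.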

Figure 3

RepNCSPELAN4 module structure.

Information-preserving downsampling with adown

In CNN-based architectures, down-sampling is essential for building feature pyramids and expanding the receptive field. The standard RT-DETR model utilizes traditional strided convolutions for this purpose. Although computationally efficient, this approach aggressively discards spatial information, which adversely affects the detection of small objects. Subtle yet critical defect features such as fine scratches and pinholes may be lost irreversibly. To address this information loss, we replace strided convolutions with the Adown down-sampling module from YOLOv917.

The Adown module, shown in Fig. 4, is specifically designed to preserve critical feature details during downsampling through a multi-path processing strategy. Its operation can be summarized as follows:

  1. The input feature map X is split into two halves, \(X_1\) and \(X_2\), along the channel dimension.

  2. The first path processes \(X_1\) using a conventional strided convolution (e.g., stride = 2). This path efficiently learns down-sampled feature representations, capturing the contextual gist of the region.

  3. The second path processes \(X_2\) differently. It first applies a max-pooling layer, which explicitly preserves the most salient feature activations within each local window. This is followed by a non-strided (stride = 1) convolution to further transform these high-activation features.

  4. Finally, the feature maps from both paths are concatenated along the channel dimension to form the final output.

This dual-path design offers a key advantage. The strided convolution path performs learned, smoothed-out down-sampling, while the max-pooling path preserves high-frequency details and sharp edges—features often essential for defect detection. By combining these complementary representations, the Adown module achieves more robust down-sampling, enriching the feature information passed to subsequent layers, minimizing detail loss, and maintaining parameter efficiency.
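A minimal PyTorch sketch of this dual-path scheme is given below; kernel sizes and the pooling configuration are assumptions, and the official Adown implementation differs in minor details:

```python
import torch
import torch.nn as nn

class ADownSketch(nn.Module):
    """Sketch of the Adown dual-path downsampling idea.

    Half of the channels are downsampled by a strided conv (learned,
    smooth), the other half by max-pooling plus a stride-1 conv
    (preserves peak activations such as sharp defect edges).
    """

    def __init__(self, c_in, c_out):
        super().__init__()
        half_in, half_out = c_in // 2, c_out // 2
        self.conv_path = nn.Conv2d(half_in, half_out, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.pool_conv = nn.Conv2d(half_in, half_out, 1, stride=1)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)             # split channels in half
        y1 = self.conv_path(x1)                # learned strided conv
        y2 = self.pool_conv(self.pool(x2))     # max-pool keeps peaks
        return torch.cat([y1, y2], dim=1)      # merge both views

down = ADownSketch(64, 128)
out = down(torch.randn(1, 64, 80, 80))         # 80x80 -> 40x40
```

Both paths halve the spatial resolution, so their outputs concatenate cleanly, and the module can drop into any place a strided convolution was used.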

Figure 4

Adown module structure.

LSKAttention large-kernel attention mechanism

The baseline RT-DETR model utilizes a ResNet backbone for feature extraction. Although effective, deep residual networks often suffer from parameter redundancy and, more importantly, a limited effective receptive field due to the locality constraints of standard small-kernel convolutions. This limitation hinders the model’s ability to capture long-range dependencies, an essential capability for distinguishing small defects from similarly textured but non-defective background patterns. To address this issue, we enhance the backbone by integrating the LSKAttention mechanism into the structure18.

LSKAttention is a highly efficient visual attention mechanism that emulates the effect of very large convolution kernels (e.g., 27x27 or larger) without incurring excessive computational cost. It achieves this by decomposing large-kernel operations into a sequence of more manageable components as follows:

  • Depthwise Convolution: A depthwise convolution with a medium-sized kernel (e.g., 5x5) is first applied to capture local spatial information within each channel.

  • Depthwise Dilated Convolution: A depthwise convolution with a large kernel and high dilation rate is applied, expanding the receptive field. This enables the model to capture long-range spatial dependencies and gather contextual information from distant regions in the image.

  • Pointwise Convolution (1x1 Conv): Finally, a 1x1 convolution fuses features across channels, enabling channel-wise interaction and recalibration of feature responses.

The sequence generates a dynamic attention map, which is applied element-wise to the input feature map. This operation selectively enhances important features while suppressing irrelevant ones. The process can be formally expressed as Equation 1:

$$\begin{aligned} \begin{aligned}&A = C_{1\times 1}(DW\text {-}Conv_{\text {dilated}}(D\text {-}Conv(F))) \\&O=A\otimes F \end{aligned} \end{aligned}$$
(1)

where F represents the input feature map, A denotes the attention map, and O is the output. The LSKAttention block comprises this visual attention submodule followed by a feedforward network, with normalization layers applied before each component (Fig. 5). By embedding this mechanism into the ResNet backbone, we significantly enhance its ability to capture global contextual information. This improvement enables the network to make more informed decisions when identifying small or ambiguous defects that require a broader understanding of the surrounding image area.
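A minimal PyTorch sketch of the decomposition in Equation (1) is shown below; the kernel and dilation sizes follow common large-kernel-attention settings and are assumptions rather than the exact configuration used in this work:

```python
import torch
import torch.nn as nn

class LargeKernelAttnSketch(nn.Module):
    """Sketch of the decomposed large-kernel attention of Eq. (1).

    A 5x5 depthwise conv (local context), a 7x7 depthwise conv with
    dilation 3 (~19x19 effective receptive field), and a 1x1 conv
    (channel mixing) produce an attention map A that gates the input:
    O = A * F, element-wise.
    """

    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9,
                                    dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, f):
        a = self.pw(self.dw_dilated(self.dw(f)))   # attention map A
        return a * f                               # O = A (element-wise) F

attn = LargeKernelAttnSketch(64)
out = attn(torch.randn(1, 64, 32, 32))
```

Because all three convolutions are depthwise or 1×1, the cost grows linearly with the number of pixels, in contrast to the quadratic cost of token-to-token self-attention.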

The choice of LSKAttention is deliberate and justified when compared to other common attention mechanisms for this specific industrial task. While channel-wise attention modules like Squeeze-and-Excitation (SE-Net) can recalibrate channel importance, they do not address the spatial limitation of small receptive fields, which is critical for contextualizing small defects. Hybrid attention mechanisms like the Convolutional Block Attention Module (CBAM) incorporate spatial attention but typically do so with small kernels, offering only limited expansion of the receptive field. In contrast, LSKAttention’s use of a large-kernel dilated convolution provides a significantly broader spatial context, which is more effective for distinguishing ambiguous small defects from background noise. Furthermore, compared to the quadratic complexity of standard self-attention found in Vision Transformers, LSKAttention’s decomposed structure maintains linear complexity, striking an optimal balance between the global context modeling of Transformers and the computational efficiency required for real-time industrial deployment.

Figure 5

LSKAttention structure diagram.

Results

Experimental setup and evaluation metrics

Experiments were conducted on a Windows 11 system equipped with an AMD Ryzen 7 6800H with Radeon Graphics (3.20 GHz) as the central processing unit (CPU), paired with an NVIDIA GeForce RTX 4090D Laptop GPU running CUDA version 12.1. Python 3.9.1 was used as the programming language. The model was trained for 400 epochs with a batch size of 4. A progressive learning rate schedule was adopted, starting at 0.0001 and gradually increasing to a maximum of 0.1 to stabilize training.

Filter surface defect dataset

The dataset, compiled from authentic filter components on an automotive production line, includes four defect categories: scratches, unclear printing, ink leakage, and missing rubber rings. The distribution of classes is shown in Fig. 7.

Figure 6

Example of filter sample data.

Figure 7

Number of images containing each defect class.

The dataset was constructed using a sophisticated data generation method based on Generative Adversarial Networks (GANs) to ensure both realism and diversity. Starting with 200 defect-free filter images and a seed set of 100 authentic defect samples, a GAN was trained to learn the distribution of real defect patterns. Its trained generator was then used to produce a large volume of novel, high-fidelity defect patches. These synthetic defects were realistically superimposed onto the defect-free base images, followed by geometric and photometric augmentations to simulate real-world variations. This process yielded a comprehensive dataset of 6,000 images, which was partitioned into training (4,800 images), validation (600 images), and test sets (600 images). Representative samples are illustrated in Fig. 6. For a detailed description of the GAN-based synthesis pipeline and annotation standards, please refer to Appendix A.

NEU-DET dataset

The NEU-DET dataset, provided by Northeastern University19, comprises six defect categories: Crazing, Inclusion, Patches, Pitted Surface, Rolled-in Scale, and Scratches. Each category contains 300 samples, totaling 1800 images. Example images for each category are shown in Fig. 8.

In industrial applications, the NEU-DET dataset is widely utilized in automated production lines for real-time detection and identification of steel surface defects, playing a crucial role in quality assurance. Furthermore, it serves as a benchmark to validate the performance of advanced deep learning algorithms. For instance, it has been used to validate YOLOv7-BA, which incorporates dynamic sparse sampling and adaptive spatial feature fusion20, as well as SDDF-Net, which integrates spatial deformable convolution with a dense feature pyramid.

Figure 8

Sample images of various categories of the NEU-DET dataset.

GC10-DET dataset

To further challenge our model and rigorously evaluate its robustness and generalization capabilities, we incorporate the GC10-DET public dataset20. Unlike simpler benchmarks, GC10-DET is specifically chosen to simulate complex, real-world industrial inspection scenarios. It comprises ten categories of steel surface defects: Crescent gap, Welding line, Water spot, Silk spot, Inclusion, Oil spot, Crease, Punching, Waist folding, and Rolled pit. Representative samples illustrating the diversity and complexity of these defects are shown in Fig. 9.

Figure 9

Representative samples from the GC10-DET dataset, showcasing the ten distinct defect categories evaluated in this study.

The primary value of GC10-DET as a benchmark lies in its inherent difficulty. The dataset presents significant challenges that are often absent in more controlled collections. These include: 1) low-contrast defects that blend into the background (e.g., Oil spot, Rolled pit); 2) subtle textural anomalies that require sensitivity to fine-grained patterns (e.g., Silk spot); and 3) defects captured under variable lighting conditions with surface noise.

Consequently, GC10-DET is widely regarded in the field as a more demanding benchmark for validating the practical utility of detection algorithms. Success on this dataset provides strong evidence of a model’s ability to perform reliably in challenging industrial environments, making it an essential component of our evaluation methodology.

Evaluation indicators

The evaluation metrics used for target detection are recall (\(R_c\)), precision (P), and mean average precision (\(m_{AP}\)), calculated using Equation 2:

$$\begin{aligned} \begin{aligned} R_c=\frac{P_T}{(P_T+N_F)} \\ P=\frac{P_T}{(P_T+P_F)} \\ m_{AP}=\frac{1}{c}\sum _{i=1}^cAP_i \end{aligned} \end{aligned}$$
(2)

where \(P_T\) denotes the number of correctly predicted defects, \(P_F\) represents the number of incorrectly predicted defects, and \(N_F\) is the number of undetected defects. The variable c indicates the number of defect categories, and \(AP_i\) refers to the average precision of the i-th category. \(AP_i\) is calculated as the area under the \(P-R_c\) curve. The mean average precision \(m_{AP}\) is then obtained by averaging the \(AP_i\) values across all categories. In this study, a higher \(m_{AP}\) indicates better network performance.
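These definitions translate directly into code. The counts and AP values below are illustrative, not results from our experiments:

```python
def detection_metrics(tp, fp, fn):
    """Recall and precision from true-positive (P_T), false-positive (P_F),
    and false-negative (N_F) counts, as in Eq. (2)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision

def mean_ap(per_class_ap):
    """mAP is the plain average of the per-class APs (Eq. 2)."""
    return sum(per_class_ap) / len(per_class_ap)

# illustrative numbers only
r, p = detection_metrics(tp=95, fp=5, fn=10)
m = mean_ap([0.98, 0.95, 0.96, 0.94])
```

In practice each per-class AP is obtained by integrating the precision-recall curve over confidence thresholds; the averaging step shown here is the final reduction across categories.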

To specifically evaluate the model’s capability on small-scale defects, a critical task in industrial inspection, we also report performance on a small object subset. This subset is created from our test set by isolating all ground-truth objects with a bounding box area less than \(32 \times 32\) pixels, following the standard COCO evaluation criteria. The metrics for this focused analysis include mAP@0.5 and Recall, enabling a direct comparison of each model’s effectiveness on small targets versus its overall performance.
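The subset construction amounts to a filter on ground-truth box area. A sketch, assuming COCO-style annotations with (x, y, w, h) boxes; the record layout is illustrative:

```python
def small_object_subset(annotations, max_area=32 * 32):
    """Keep only ground-truth boxes whose area falls below the COCO
    small-object threshold (32x32 pixels).

    Each annotation is assumed to be a dict with a 'bbox' entry of the
    form (x, y, w, h) in pixels.
    """
    return [a for a in annotations
            if a["bbox"][2] * a["bbox"][3] < max_area]

anns = [
    {"bbox": (10, 10, 20, 20)},   # 400 px^2  -> small
    {"bbox": (0, 0, 40, 40)},     # 1600 px^2 -> large
    {"bbox": (5, 5, 31, 31)},     # 961 px^2  -> small
]
small = small_object_subset(anns)
```

Evaluating mAP@0.5 and recall on only these boxes isolates small-target performance from the headline numbers.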

Ablation experiment

To rigorously evaluate the individual and synergistic contributions of each proposed component, we conducted a comprehensive ablation study. We systematically analyzed the effects of integrating the LSKAttention (LSKA), RepNCSPELAN4 (Gan), and Adown modules, using the original RT-DETR as our baseline. The results, which include not only accuracy metrics but also computational efficiency indicators like FLOPs and FPS, are detailed in Table 1.

The baseline RT-DETR model achieved a solid mAP@0.5 of 90.32% with 20.1M parameters and a computational load of 58.29G FLOPs. Our analysis of individual module additions reveals critical trade-offs between accuracy and efficiency. Adding only the LSKA module (‘Baseline + LSKA’) increased mAP@0.5 by 5.23 percentage points to 95.55% (Table 1). This substantial gain, however, came at the cost of a slight increase in parameters and a decrease in inference speed (from 134.1 to 119.8 FPS), highlighting the computational overhead of the attention mechanism.

In contrast, replacing RepC3 with RepNCSPELAN4 (‘Baseline + Gan‘) demonstrated a remarkable dual benefit: it improved mAP@0.5 to 93.87% while simultaneously reducing both parameters (by 1.2M) and FLOPs (by 10.9%), resulting in the highest inference speed of 171.9 FPS among all configurations. This underscores the architectural efficiency of the GELAN-based design.

The Adown module’s primary contribution is model lightweighting. As a standalone addition (‘Baseline + Adown‘), it reduced parameters and FLOPs to the lowest levels (17.2M and 50.89G, respectively) with a negligible impact on FPS. This highlights its value in optimizing computational resources.

The synergy between the modules is evident in the combined configurations. The ‘Baseline + LSKA + Gan’ model pushed the mAP@0.5 to an impressive 97.86%, demonstrating that the powerful feature representation from LSKA is effectively processed by the efficient Gan module. Our final model, which incorporates all three modules, strikes the most favorable balance: it achieves a near-peak mAP@0.5 of 97.67%, the highest mAP@0.5:0.95 of 75.72%, and an FPS of 118.1. Crucially, the introduction of the Adown module, despite causing a marginal 0.19% drop in mAP@0.5 compared to the previous step, reduces the computational load (FLOPs) by 13.1% and the parameter count by 6.9% relative to the baseline.

Table 1 Ablation study of the proposed modules, including computational efficiency metrics. “Gan” is an abbreviation for the RepNCSPELAN4 module.

Visualization comparison

To visually demonstrate the algorithm’s improvements, this study compares the detection results of the baseline RT-DETR-R18 model with those of the enhanced RT-DETR-LGA model on the filter defect dataset (Fig. 10). Each image pair displays the original model’s output (left) and the improved model’s results (right), clearly illustrating the enhanced detection efficacy. In Fig. 10(a) and (c), RT-DETR-LGA demonstrates superior accuracy in detecting blurred markings and ink leakage defects. Fig. 10(b) and (f) further highlight the model’s enhanced capability in accurately identifying scratch defects. Notably, Fig. 10(d) and (e) show RT-DETR-LGA’s ability to detect micro-defects that were missed by the baseline RT-DETR-R18 model. In industrial production environments, micro-scratches and ink residue are common issues. The baseline model exhibited a 12.7% false negative (FN) rate and an 8.3% false positive (FP) rate, issues that critically affect product quality and customer satisfaction. In contrast, the RT-DETR-LGA model achieved 97.2% precision and 95.6% recall, effectively addressing these operational challenges with significantly enhanced detection accuracy.

Figure 10

Comparison of the detection results of the baseline RT-DETR-R18 model with those of the proposed enhanced RT-DETR-LGA model on the filter defect dataset.

Figure 11 presents a comparison of the confusion matrices, where diagonal elements indicate the proportion of correctly predicted defect samples relative to total defect instances, while off-diagonal elements represent misclassified or undetected proportions. The left matrix corresponds to the RT-DETR model, whereas the right illustrates the RT-DETR-LGA model’s performance. The labels defect0 through defect3 respectively denote four defect types in the filter dataset: blurred markings (defect0), ink leakage (defect1), surface scratches (defect2), and seal ring absence (defect3).

In Fig. 11(a), the RT-DETR model demonstrates effective detection of seal ring absence defects (defect3 accuracy: 0.94) but exhibits elevated missed detection rates for surface scratches (defect2 recall: 0.86) and blurred markings (defect0 recall: 0.89). Specifically addressing the potential confusion between “blurred markings” (defect0) and “ink leakage” (defect1), the baseline model misclassifies 5% of true ink leakage cases as blurred markings, and 2% of true blurred markings as ink leakage. This confirms a degree of feature overlap that challenges the original model.

Figure 11(b) reveals that the improved RT-DETR-LGA model significantly reduces this specific confusion. The misclassification of ink leakage as blurred markings drops to just 2%, and the misclassification of blurred markings as ink leakage is reduced to 1%. This demonstrates the enhanced discriminative power of our proposed architecture. Overall, significant performance gains are observed in previously underperforming categories: the detection accuracy for blurred markings improves from 0.89 to 0.97, while surface scratch recognition increases from 0.86 to 0.95.

Figure 11

Confusion matrices for the four classes of detected objects before and after model improvement. (a) Confusion matrix of RT-DETR model; (b) Confusion matrix of RT-DETR-LGA model.

Comparative tests of mainstream algorithms

To contextualize the performance of our proposed RT-DETR-LGA model, we conducted a comprehensive comparative analysis against a range of mainstream and state-of-the-art object detection algorithms. The selection includes classic two-stage (Faster R-CNN) and single-stage (SSD) detectors, various YOLO family members (YOLOv5m, YOLOv7-s AF, YOLOv8m), and recent cutting-edge models such as YOLOv9, YOLOv10, and an updated RT-DETRv2. The results, encompassing accuracy (mAP), model size (Params), computational load (FLOPs), and inference speed (FPS), are presented in Table 2.

As shown in Table 2, the proposed RT-DETR-LGA model achieves the highest detection accuracy among all tested algorithms, with a mAP@0.5 of 97.67% and a mAP@0.5:0.95 of 75.72%. This performance surpasses all other tested models, including the powerful YOLOv9 (92.4% mAP@0.5) and its own predecessors, RT-DETR and RT-DETRv2.

Crucially, this top-tier accuracy is achieved without compromising efficiency. A deeper look at the trade-offs reveals the strength of our approach. For instance, while YOLOv7-s AF is the most lightweight model in terms of parameters (11.0M) and FLOPs (28.1G), its mAP is over 10 percentage points lower, making it unsuitable for high-precision requirements. Conversely, high-performance models like YOLOv9 and YOLOv10, while achieving strong accuracy, do so at a much higher computational cost. YOLOv9, for example, requires 83% more FLOPs (92.87G vs 50.67G) and has a 36% slower inference speed (74.85 FPS vs 118.1 FPS) than our RT-DETR-LGA.

Compared to its direct lineage, our model demonstrates clear advancements. It outperforms the baseline RT-DETR by 7.3 percentage points in mAP@0.5 while simultaneously being more efficient, with a 6.9% reduction in parameters and a 13.1% reduction in FLOPs. This indicates a favorable balance between performance and computational cost.

Table 2 Comparison of detection performance with mainstream and state-of-the-art algorithms.

In addition to inference efficiency, we also briefly evaluated the training overhead. On our hardware (NVIDIA RTX 4090D), the proposed RT-DETR-LGA model required approximately 14.2 hours to complete the 400-epoch training schedule. This is comparable to the baseline RT-DETR, which took around 12.8 hours, indicating that our architectural modifications do not introduce significant training complexity. For reference, training a model like YOLOv8m under similar conditions took approximately 14.6 hours. This demonstrates that our model maintains a practical training efficiency profile suitable for industrial research and development cycles.

Generalization and robustness evaluation

Generalization verification on the NEU-DET dataset

To further validate the performance and generalization ability of our proposed model, we conducted extensive experiments on the public NEU-DET dataset. We compared our improved model, RT-DETR-LGA, against the original RT-DETR baseline and several recently improved algorithms for steel surface defect detection.

For a robust evaluation, both the baseline and our proposed model were run five times, with the mean and standard deviation (\(\mu \pm \sigma\)) of key metrics reported. The comprehensive results are presented in Table 3.

Table 3 Performance comparison on the NEU-DET dataset. Results are reported as mean ± standard deviation over 5 independent runs to ensure statistical robustness.

The results in Table 3 demonstrate a significant performance enhancement. Our RT-DETR-LGA model achieves an average mAP@0.5 of 80.28%, representing a substantial increase of 5.74 percentage points over the baseline. The low standard deviation of \(\pm 0.53\%\) across five runs highlights the stability and reliability of this improvement.

Although precision experienced a modest decrease from 79.28% to 75.76%, recall improved significantly from 68.14% to 72.12%. This trend suggests that our proposed LGA module enhances the model’s sensitivity to defect features, reducing the number of missed defects. In industrial quality control, minimizing missed defects is often paramount, as undetected flaws can lead to significant downstream costs or safety issues. Therefore, the substantial gain in recall, despite a slight increase in false positives, results in a more effective and practical detection system. The overall benefit of this trade-off is further confirmed by the significant increase in the primary metric, mAP@0.5, which balances both precision and recall. Furthermore, this performance improvement is accompanied by a reduction in model parameters from 20.1M to 18.7M and FLOPs from 58.29G to 50.67G.
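The precision-for-recall trade-off can be made concrete with hypothetical true-positive/false-positive/false-negative counts (illustrative numbers only, chosen to roughly mirror the reported operating points, not derived from our experiments):

```python
# Illustration (hypothetical counts) of why trading a little precision for
# recall can still raise overall detection quality.
def pr_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Baseline-like operating point: precise, but misses more defects.
p0, r0, f0 = pr_f1(tp=79, fp=21, fn=37)
# Recall-oriented point: a few more false alarms, fewer missed defects.
p1, r1, f1 = pr_f1(tp=90, fp=29, fn=26)

print(f"baseline  P={p0:.2f} R={r0:.2f} F1={f0:.2f}")
print(f"improved  P={p1:.2f} R={r1:.2f} F1={f1:.2f}")
```

Even though precision drops, the combined F1 score rises, matching the paper's observation that the balanced metric (here F1, in the paper mAP@0.5) improves.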

The confusion matrices in Fig. 12 provide a qualitative view of this improvement. The baseline RT-DETR model (a) struggles significantly with certain defect types, particularly “Crazing,” which it detects with only 0.37 accuracy. Our improved model (b) dramatically boosts performance on this challenging class, increasing accuracy to 0.87, and shows overall reduced confusion between classes.

Figure 12

Confusion matrices demonstrating model generalization on the public NEU-DET dataset, which contains six defect categories. (a) The baseline RT-DETR model struggles with certain classes like Crazing. (b) The proposed RT-DETR-LGA model shows substantially improved accuracy and reduced confusion between classes, confirming its robust generalization capabilities.

To further benchmark our model, we compared it against other state-of-the-art methods on the NEU-DET dataset, as shown in Table 4. Our RT-DETR-LGA model achieved an average mAP@0.5 of 80.28%, surpassing the YOLOX-based method by Li et al.21 (77.0%) and the YOLOv5n-based method by Tian et al.22 (75.3%).

Table 4 Comparison with other state-of-the-art methods on the NEU-DET dataset.

In conclusion, extensive experiments on the NEU-DET dataset, supported by multiple-run analysis, confirm that the proposed RT-DETR-LGA model achieves superior performance with strong generalization capability and robustness.

Robustness verification on the GC10-DET dataset

To address concerns about performance in more complex industrial scenarios, we conducted a rigorous evaluation on the public GC10-DET dataset. This benchmark is particularly challenging as it includes ten distinct defect types captured under real-world production conditions, featuring noise such as uneven illumination and oil stain interference. Our primary goal was to precisely quantify the improvements of RT-DETR-LGA over its baseline and to contextualize its performance against other specialized algorithms.

A detailed, per-class comparison of detection accuracy is presented in Table 5. The results unequivocally demonstrate the superior overall performance of our proposed model. RT-DETR-LGA achieves an overall mAP of 73.40%, which is a substantial improvement of 7.92 percentage points over the baseline RT-DETR and also surpasses the 71.00% mAP of the specialized EC-YOLO23.

Table 5 Detailed per-class Average Precision (AP, in %) and overall mAP@0.5 on the GC10-DET dataset. The best performance for each metric is highlighted in bold.

A deeper, per-class analysis reveals a nuanced performance landscape. The EC-YOLO model shows exceptional strength in certain categories, such as Inclusion (In), Oil spot (Os), and Rolled-in pit (Rp). In contrast, our RT-DETR-LGA model demonstrates a commanding advantage in other challenging classes. For example, its performance on Welding line (Wl) is 93.10%, a staggering 55.1 percentage points higher than EC-YOLO. Similarly, it achieves significantly better results for Waist folding (Wf) and Crease (Cr).

This indicates that while EC-YOLO is highly tuned for specific defect types, our proposed RT-DETR-LGA exhibits a more balanced and robust detection capability across a wider variety of defects. The superior performance on classes like “Wl” and “Wf” suggests that our architectural enhancements, particularly the LSKAttention’s ability to capture global context and the Adown module’s preservation of fine-grained features, contribute to a better-rounded model. The higher overall mAP confirms that our model strikes a more effective balance, making it a more reliable and general-purpose solution for complex industrial inspection tasks.

Validation on authentic defect samples

A critical aspect of evaluating any model trained primarily on synthetic data is its ability to generalize to real-world conditions. To this end, we conducted a crucial validation experiment on a hold-out set of 75 authentic defect images, which were strictly excluded from both the GAN training and the final dataset construction. This set represents the true, un-synthesized data distribution, serving as the ultimate test of practical applicability.

The quantitative results of this validation are presented in Table 6. Our proposed RT-DETR-LGA model achieved a mAP@0.5 of 98.7%, outperforming the baseline RT-DETR by a significant margin of 6.05 percentage points on this challenging real-world data. Notably, the improvements in both precision (+4.1%) and recall (+5.9%) demonstrate that our architectural enhancements effectively reduce both false positives and missed detections in practical scenarios.

Table 6 Performance comparison of the baseline and our proposed model on the hold-out set of 75 authentic defect images.

To provide a qualitative illustration of this performance gap, Fig. 13 showcases the detection results on several challenging authentic samples. In the first example (top row), the baseline model fails to detect all instances of fine scratches, resulting in a missed detection (Fig. 13a). In contrast, our RT-DETR-LGA model correctly identifies all defects (Fig. 13b). The second example (bottom row) highlights the issue of false positives; the baseline model incorrectly flags non-defective areas, likely due to surface texture or lighting variations (Fig. 13c), while our model exhibits higher precision and avoids such errors (Fig. 13d).

This strong performance, both quantitative and qualitative, provides compelling evidence that our GAN-based synthesis process successfully captured the essential features of real defects. It confirms that the proposed RT-DETR-LGA is not merely overfitted to the synthetic data but has learned robust and generalizable representations, making it highly suitable for practical deployment in industrial inspection.

Figure 13

Qualitative comparison on challenging authentic defect samples. The top row shows a case of missed detection by the baseline, while the bottom row illustrates a false positive. (a) and (c) are results from the baseline RT-DETR. (b) and (d) are results from our proposed RT-DETR-LGA, which demonstrates superior accuracy and robustness.

Analysis of small object detection performance

To address the critical challenge of detecting small-scale industrial defects, we conducted a specific analysis evaluating our model’s performance on small objects. Following the standard COCO evaluation criteria, we define “small objects” as those with a bounding box area less than \(32 \times 32\) pixels. We created a subset from our test set containing only these small defects and recalculated key performance metrics for both the baseline RT-DETR and our proposed RT-DETR-LGA model.
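The COCO small-object criterion used to build this subset is a simple area threshold on the ground-truth boxes; a minimal sketch (with illustrative box coordinates) of the filtering step:

```python
# Split ground-truth boxes into "small" vs. "other" using the COCO criterion
# (area < 32*32 pixels). Boxes are (x, y, w, h) in pixels; values are illustrative.
SMALL_AREA = 32 * 32

boxes = [(10, 12, 20, 25), (50, 40, 100, 80), (5, 5, 30, 30)]

small = [b for b in boxes if b[2] * b[3] < SMALL_AREA]
other = [b for b in boxes if b[2] * b[3] >= SMALL_AREA]

print(len(small), len(other))  # two boxes (500 and 900 px^2) fall below 1024 px^2
```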

The results, presented in Table 7, clearly demonstrate the superior capability of our model in this challenging scenario.

Table 7 Performance comparison on the small object subset. Metrics for the overall test set are included for reference.

As shown in Table 7, while the baseline RT-DETR’s performance degrades when faced with small targets (mAP drops from 90.32% to 87.14%), our RT-DETR-LGA model maintains exceptionally high accuracy (95.93%). The performance gain of our model over the baseline is more pronounced on the small object subset (+8.79% in mAP) than on the overall dataset (+7.35%). This enhanced capability is also reflected in the recall metric, which shows a greater improvement for small targets (+9.5%) compared to the overall recall gain (+6.53%). This indicates that our architectural improvements specifically target and resolve the key challenges of small object detection.

The enhanced performance can be attributed to the synergistic effect of our three core contributions. First, the LSKAttention mechanism provides a significantly expanded receptive field, allowing the model to capture broader contextual information around a tiny defect. This helps distinguish it from noise or complex background textures where standard small-kernel convolutions in the baseline often fail. Second, the information-preserving Adown module is critical for small objects; its dual-path design, particularly the max-pooling path, explicitly preserves high-frequency details and salient feature activations, ensuring that signals from small defects are retained throughout the network’s down-sampling stages. Finally, the efficient feature fusion with GELAN, through the RepNCSPELAN4 module, provides a more efficient gradient path and a richer aggregation structure. This allows for the effective fusion of fine-grained spatial details from shallow layers with strong semantic context from deeper layers, culminating in more precise localization and classification of small-scale targets.
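The dual-path behavior attributed to the Adown module above can be sketched in a few lines of NumPy: half of the channels pass through an averaging (smoothing) branch and half through a max branch that keeps salient activations. This is a simplified illustration of the idea only; the actual module also applies convolutions on each branch, which are omitted here:

```python
import numpy as np

def dual_path_downsample(x):
    """2x downsampling in the spirit of the Adown module (simplified sketch).

    x: feature map of shape (C, H, W) with even C, H, W. The first half of the
    channels goes through 2x2 average pooling (smooth, low-frequency path); the
    second half through 2x2 max pooling, which preserves high-frequency details
    and salient activations. Branch convolutions of the real module are omitted.
    """
    c, h, w = x.shape
    a, b = x[: c // 2], x[c // 2 :]
    # Reshape each branch into 2x2 blocks, then reduce each block.
    blocks_a = a.reshape(c // 2, h // 2, 2, w // 2, 2)
    blocks_b = b.reshape(c - c // 2, h // 2, 2, w // 2, 2)
    avg = blocks_a.mean(axis=(2, 4))
    mx = blocks_b.max(axis=(2, 4))
    return np.concatenate([avg, mx], axis=0)  # shape (C, H/2, W/2)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = dual_path_downsample(x)
print(y.shape)  # (2, 2, 2)
```

The max branch is what the text calls "information-preserving": a single strong activation inside a 2x2 block survives downsampling intact instead of being diluted by averaging.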

In summary, this focused analysis confirms that the RT-DETR-LGA model not only achieves superior overall performance but also demonstrates a remarkable and targeted improvement in the detection of small industrial defects, directly addressing a primary limitation of existing models.

Model interpretability analysis

To provide a qualitative assessment of our model’s capabilities and to gain deeper insights into its decision-making process, we employed Gradient-weighted Class Activation Mapping (Grad-CAM)24. This technique generates visual explanations by producing a heatmap that highlights the image regions most influential to the model’s prediction. By visualizing what the model “sees,” we can intuitively compare the feature extraction and attention mechanisms of our proposed method against the baseline.
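The Grad-CAM heatmaps shown in this section follow the standard recipe: channel weights are the spatially averaged gradients of the target score, and the map is the ReLU of the weighted sum of activations. A minimal NumPy sketch of that computation, with random arrays standing in for a real network's activations and gradients:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap for one conv layer (sketch).

    activations: (C, H, W) feature maps from the chosen layer.
    gradients:   (C, H, W) gradients of the target class score w.r.t. those maps.
    """
    weights = gradients.mean(axis=(1, 2))             # (C,) channel importance
    cam = np.tensordot(weights, activations, axes=1)  # (H, W) weighted sum
    cam = np.maximum(cam, 0)                          # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # rescale to [0, 1] for display
    return cam

# Stand-in tensors; in practice these come from a forward/backward pass.
rng = np.random.default_rng(0)
acts, grads = rng.random((8, 7, 7)), rng.random((8, 7, 7))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (7, 7)
```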

We selected two representative scenarios to conduct this analysis: a challenging case with strong reflective interference and a case requiring fine-grained localization. The comparative results are visualized in Fig. 14.

Figure 14

Grad-CAM visualization comparing the attention mechanisms of the baseline RT-DETR and our proposed RT-DETR-LGA. (a-c) A challenging case with strong surface reflections. The baseline’s attention is scattered by interference, while RT-DETR-LGA robustly focuses on the semantically relevant text region. (d-f) A fine-grained localization case. The baseline’s activation is coarse, whereas our method pinpoints a single character with high precision.

Robustness against visual interference

Figure 14(a-c) illustrates a common industrial scenario where robustness to visual noise is paramount. The baseline RT-DETR model (b), when confronted with a glossy surface, exhibits a diffuse and misguided activation map. Its attention is heavily distracted by high-contrast yet semantically irrelevant features, such as the strong reflections and the object’s physical contours. This indicates a reliance on superficial cues, which can lead to unreliable performance in visually complex environments.

In stark contrast, our proposed RT-DETR-LGA (c) produces a highly concentrated activation map precisely centered on the text block, effectively treating the strong reflections and background as noise. A closer inspection reveals a sophisticated, hierarchical attention strategy: the primary, high-intensity focus is on the text itself, while a much fainter, secondary activation lightly delineates the object’s boundary. This suggests that our model has learned to differentiate between core evidence (the text) and its surrounding context, a capability the baseline fundamentally lacks. This enhanced focus is attributable to the LSKAttention module, which leverages a large receptive field to capture global context and prioritize salient semantic features over low-level noise.

Precision in fine-grained localization

The second case, shown in Fig. 14(d-f), evaluates the model’s ability to localize fine-grained features. The baseline RT-DETR (e), while correctly identifying the general region of interest, generates a coarse and oversized heatmap that indiscriminately covers the entire text block. This level of precision is often insufficient for industrial inspection, where the goal is to pinpoint specific, small-scale defects.

Our RT-DETR-LGA (f), however, generates a more tightly localized activation map, focusing on a single character (‘G’). This demonstrates an exceptional ability to focus on micro-level details, which is critical for identifying subtle flaws like printing errors or minor surface scratches. This pin-point localization capability stems from the synergistic effect of our architectural enhancements, which not only provide a global understanding of the scene but also preserve the high-resolution details necessary for such precise analysis. This confirms that our model’s improvements are not just in general detection but also in the fine-grained perception essential for high-standard quality control.

Discussion

Our experiments revealed a notable and insightful performance gradient across the three datasets. The model achieved its highest accuracy on our proprietary filter dataset (97.67% mAP@0.5), followed by the public NEU-DET dataset (80.28% mAP@0.5), and then the more complex GC10-DET dataset (73.40% mAP@0.5). This discrepancy is not indicative of a model failure but rather offers valuable insights into its generalization capabilities when faced with progressively challenging conditions. The primary driver of this performance difference lies in the inherent characteristics of the datasets. Our filter dataset was captured in a highly controlled industrial environment with uniform illumination. In contrast, NEU-DET introduces greater complexity with low-contrast, texture-like defects, and GC10-DET further elevates the challenge with real-world noise, including oil stains, lighting variations, and a wider range of defect types.

It is precisely in these more difficult scenarios that our architectural enhancements prove their value. The progressive, yet graceful, degradation in performance, while still outperforming specialized state-of-the-art methods on both public benchmarks, confirms that our model has learned robust, generalizable features, not merely overfitted to the clean conditions of our primary dataset. The LSKAttention module’s ability to leverage broader spatial context is critical for differentiating ambiguous defects from background noise, a task where baseline models typically struggle. This robust generalization ability, validated on multiple public datasets including the challenging GC10-DET, strongly supports the model’s suitability for deployment in real, unpredictable industrial environments.

These technical advancements have direct implications for industrial deployment and cost-effectiveness. The model’s enhanced accuracy, particularly its ability to drastically reduce both false negatives and false positives, translates into significant operational savings by minimizing warranty claims, rework, and material waste. For example, based on our results, a 77% reduction in critical quality failures can be projected. Concurrently, the lightweight nature of RT-DETR-LGA (18.7M parameters, 50.67 GFLOPs) reduces capital expenditure, as it can be deployed on more affordable edge computing hardware compared to heavier alternatives like Faster R-CNN or the original DETR. The high inference speed of 118.1 FPS ensures it does not become a production bottleneck. In essence, the proposed model offers a compelling business case by making high-precision automated inspection both more effective and economically accessible.

Conclusion

This paper introduced an improved version of the RT-DETR model, named RT-DETR-LGA, designed to meet the critical industrial demand for a defect detection algorithm that combines high accuracy with computational efficiency. Experimental results demonstrate the effectiveness of the proposed approach. RT-DETR-LGA outperformed the baseline RT-DETR by a significant margin, improving mAP@0.5 by 7.3 percentage points on the filter defect dataset, while also reducing the parameter count by 6.9%.

The success of our model stems from the synergistic combination of three key architectural improvements. First, integrating the LSKAttention mechanism proved essential, providing the backbone with long-range spatial awareness that improved detection of small, low-contrast defects, such as fine scratches, which the original model often missed. Second, replacing the RepC3 blocks with the GELAN-based module streamlined feature fusion in the neck, enhancing localization precision without increasing computational cost. Lastly, the Adown module minimized information loss during down-sampling, preserving fine details critical for identifying micro-defects. This improvement was visually confirmed in our analysis, where RT-DETR-LGA consistently reduced false negatives and false positives compared to the baseline model.

Compared to other mainstream algorithms, including various YOLO variants and the original DETR, our model established a superior balance between performance and efficiency. It achieved the highest mAP while maintaining a parameter count comparable to lighter yet less accurate models like YOLOv7-s AF. Furthermore, extensive experiments on public benchmarks, including the NEU-DET and the highly challenging GC10-DET steel defect datasets, confirmed the model’s strong generalization capability and robustness in complex industrial scenarios. On NEU-DET, with an mAP@0.5 of 80.28%, it outperformed both the baseline and other published methods, validating that the architectural enhancements are effective across domains and not overfitted to a single dataset.

In conclusion, the proposed RT-DETR-LGA model effectively resolves the trade-off between accuracy and deployment cost in industrial surface defect detection. It provides a robust, lightweight, and highly precise solution, well-suited for real-world manufacturing environments. Future work could explore advanced model quantization and pruning techniques to further optimize performance for deployment on highly resource-constrained edge devices.