Abstract
Steel defect detection is a crucial aspect of steel production and quality control. However, detecting small-scale defects in complex production environments remains a critical challenge. To address this issue, we propose an innovative perception-efficient network designed for the fast and accurate detection of multi-scale surface defects. First, we introduce the Defect Capture Path Aggregation Network (DcPAN), which enhances the feature fusion network’s ability to learn multi-scale representations. Second, we design a Perception-Efficient Head (PEHead) to effectively mitigate local aliasing issues, thereby reducing missed detections. Finally, we propose the Receptive Field Extension Module (RFEM) to strengthen the backbone network’s ability to capture global features and handle extreme aspect ratio variations. These three modules can be seamlessly integrated into the YOLO framework. The proposed method is evaluated on three public steel defect datasets: NEU-DET, GC10-DET, and Severstal. Compared to the original YOLOv8n model, PEYOLO achieves mAP50 improvements of 3.5%, 9.1%, and 3.3% on these datasets, respectively. While delivering these accuracy gains, PEYOLO retains a high inference speed, making it suitable for real-time applications. Experimental results demonstrate that the proposed PEYOLO can be effectively applied to real-time steel defect detection.
Introduction
Steel materials are widely used in various fields, including construction, bridges, automobiles, aerospace, and energy. Their quality directly impacts production efficiency. During the manufacturing process, various surface defects may occur in steel products, such as cracks, inclusions, scratches, and corrosion1. These defects not only affect the performance of the products but may also pose potential threats to the quality and safety of downstream products. As an indispensable part of steel production, defect detection plays a crucial role in improving product quality, enhancing production efficiency, and ensuring product safety.
Fast and accurate defect detection faces multiple challenges. First, as shown in Fig. 1, steel surface defects are diverse and complex, and different types of defects may coexist on the same product, increasing the difficulty of detection. Second, the steel production environment often involves uneven lighting, background noise interference, and steel plate vibrations, all of which can affect the performance of defect detection equipment. If defects on steel surfaces are not detected and addressed in a timely manner, they may lead to reprocessing or even scrapping of entire batches of products as the manufacturing process progresses.
Steel surface defect detection can be broadly categorized into three approaches: manual inspection, machine vision-based detection, and deep learning-based detection. The accuracy of manual visual inspection is highly dependent on the experience and eyesight of the inspectors, making it inefficient. Although traditional machine vision-based detection methods improve inspection efficiency, manually extracting defect features and designing corresponding algorithms significantly increase the workload2,3. Additionally, non-end-to-end learning approaches may lead to error accumulation.
Recently, deep learning has achieved remarkable success in various fields. Deep learning-based object detection methods have demonstrated advantages in real-time performance and accuracy for surface defect detection across different industrial products. Traditional CNN-based detectors, such as Faster R-CNN, Single Shot MultiBox Detector (SSD), Fully Convolutional One-Stage Object Detection (FCOS), EfficientDet, and You Only Look Once (YOLO)4,5,6,7,8,9,10,11,12,13, play a crucial role in capturing fine-grained features. However, the inductive bias of convolution restricts non-local feature extraction14, limiting the detector’s ability to obtain global semantic information of defects. Transformer-based detectors, such as DEtection TRansformer (DETR) and Real-Time Detection Transformer (RT-DETR), are able to model long-range dependencies in local features, thus addressing the issue of insufficient visual features for defects15,16. However, the dot-product operations in Vision Transformer (ViT) lead to high memory consumption17, restricting its widespread deployment on resource-limited platforms. Mamba-based detectors, such as Mamba-YOLO, significantly enhance the detector’s capability to capture both local and global features through their unique architectural design, while being more computationally efficient18.
Existing studies have shown that large-kernel attention can significantly increase the receptive field of detectors, and detectors with larger receptive fields can more effectively model the global features of foreground objects19. We therefore ask whether it is possible to retain the detailed feature perception and low memory consumption of YOLO models while expanding their receptive field to achieve global modeling capabilities similar to ViT or Mamba architectures. Multi-branch components can aggregate diverse gradient flows, facilitating the model’s learning of multi-scale features20. Based on these findings, we design a receptive field extension module to enhance the backbone network’s ability to perceive global features, enabling the detector to accurately focus on steel defects amid complex background noise. The feature fusion network can combine and correlate global features and fine-grained features from different levels21,22. However, compared to objects in natural scenes, the visual features of steel defects are limited and small in scale. During forward propagation, the detailed features of defects gradually fade, widening the semantic gap between features at different levels. Specifically, feature maps from the high levels of the backbone network must undergo several upsampling and downsampling operations before being passed to the high levels of the Path Aggregation Network (PAN), which amplifies the impact of information bottlenecks23.
Therefore, we design an Advanced Semantic Fusion Module (ASFM) and embed it in the upper layers of the PAN, forming the Defect Capture Path Aggregation Network (DcPAN). DcPAN can fully aggregate multi-scale features, reducing the semantic gap between different levels and addressing the multi-scale defect problem. In practical production, due to lighting and manufacturing process factors, defects on the steel surface are often obscured by shadows or other defects, resulting in incomplete defect features. To tackle this problem, we propose a Perception-Efficient Head (PEHead) to solve problems related to local aliasing and shadow occlusion. Additionally, the three multi-scale detection heads of PEHead enhance the detector’s robustness to multi-scale defects.
In conclusion, the main contributions of this paper are summarized as follows:
-
(1)
A perception-efficient detection head is proposed, which boosts the detector’s robustness to multi-scale objects and addresses the issue of local aliasing between defects.
-
(2)
A receptive field extension module is proposed, which boosts the backbone’s capability to extract global features and improves the detector’s accuracy in detecting defects with extreme aspect ratios.
-
(3)
An advanced semantic fusion module is proposed to optimize the existing PAN, leading to the introduction of a defect capture path aggregation network. This method reduces the semantic gap between different levels and enhances the detector’s sensitivity to multi-scale information.
The remainder of this paper is organized as follows: Section II outlines related work on tiny defect detection, multi-scale feature extraction, and background noise removal for industrial defect images. Section III describes the proposed PEYOLO and its associated improvement strategies. Section IV details the implementation process and presents an analysis of the experimental results. Section V concludes with a summary and suggests possible avenues for future research.
Related work
Defect detection is a key research focus in the field of object detection. In this paper, we review related work from three perspectives: tiny defect detection, multi-scale feature extraction, and background noise removal.
Tiny defect detection
The small scale of defects is one of the challenges in defect detection. Existing research has proposed several effective methods to improve the detection performance of small defects, but certain limitations still exist. The first approach is to expand the receptive field of the detector to extract global defect features more comprehensively. For example, Zhou et al.24 employed the ASPP module to extend the receptive field of the feature extraction network, enabling the model to extract defect information more effectively within an appropriate receptive field. However, this method may introduce irrelevant information in complex backgrounds, reducing detection accuracy. The second approach is to minimize the loss of detailed features, which helps to more accurately retain information about small defects. Yu et al.25 proposed a multi-dimensional data feature fusion method to adaptively focus on local features. Liu et al.26 introduced the MPFF module, which preserves more small defect features as the network deepens. Yuan et al.27 adaptively optimized the detection head to enhance its ability to recognize tiny defects. Wu et al.28 optimized the upsampling module by introducing the CARAFE component, which enhanced the detector’s ability to retain small defect features. However, these methods still rely on feature fusion strategies, which may lead to information mismatches under drastic multi-scale variations. The third approach is to incorporate attention mechanisms to enhance the detector’s focus on small defects. For example, Zhu et al.29 introduced an attention mechanism at each feature extraction stage to better retain small object information. However, excessive reliance on attention mechanisms may increase computational overhead, affecting real-time performance.
In summary, while existing methods have improved small defect detection to some extent, they still face challenges such as multi-scale feature loss, background noise interference, and increased computational complexity. To address these issues, this paper proposes an optimized detection head—PEHead. PEHead enhances the model’s perception of small defects while mitigating the impact of shadow occlusion. Experimental results demonstrate that PEHead achieves higher detection accuracy and greater robustness in complex steel defect detection tasks. Therefore, optimizing the feature extraction capability of detectors and enhancing their focus on small targets are crucial directions for improving defect detection performance.
Multi-scale feature extraction
Traditional feature fusion networks achieve the interaction of multi-scale features by aggregating gradient flows from different levels. However, due to multiple downsampling operations during the feed-forward process, the semantic gap between features at different levels gradually increases, which affects the effective fusion of features. Zhang et al.30 introduced the DsPAN component to narrow the semantic gap between the Feature Pyramid Network (FPN) and PAN, thus improving the fusion of multi-scale information. However, this method still does not fully address the issue of insufficient visual features in steel defect detection tasks. Su et al.31 reduced the impact of defects at different scales by combining important information with background information in the feature map, but they did not completely solve the problem of background noise interference. The adaptive feature fusion method proposed by Yeung et al.32 enhanced the detector’s ability to recognize defects at different scales, but it still has limitations when dealing with defects with extreme aspect ratios. Peng et al.33 added skip connections between the encoder and decoder, enabling the model to effectively capture multi-scale features of images. However, this method mainly relies on direct feature transfer and does not fully consider the hierarchical differences between features at different scales, which may lead to insufficient fusion of global information and local details. Song et al.34 improved the detector’s performance in detecting complex-shaped defects by combining deformable convolution and region-of-interest (ROI) alignment techniques, but their feature alignment mechanism still faces the risk of information loss in multi-scale scenarios. Yu et al.35 combined the advantages of CNN and Transformer to achieve the fusion of global and local feature extraction, enhancing the model’s ability to represent multi-scale features. However, this method has a high computational complexity, making it difficult to deploy in resource-constrained industrial environments.
To address these issues, this paper proposes an optimized multi-scale feature fusion method. DcPAN improves the detector’s sensitivity to small targets by merging information flows from different receptive fields through a parallel branch structure. Experimental results show that the introduction of DcPAN significantly improves the detection performance of multi-scale defects. In particular, when dealing with small-scale defects, complex backgrounds, and extreme aspect ratios, DcPAN can effectively reduce the false detection rate and improve the overall detection accuracy of the model. Therefore, optimizing feature fusion strategies is an effective approach to solving the multi-scale variation problem in steel defect detection.
Background noise removal
Due to poor lighting conditions and the high similarity between steel defects and the background, the detector is easily affected by background noise interference. Existing research has proposed various methods to reduce the impact of background noise, but there are still certain limitations. Liu et al.36 treated channels and their relationships as a fully connected graph and enhanced the model’s understanding of global information through graph-based channel reasoning, enabling the detector to better focus on the foreground target. However, this method primarily focuses on feature relationship modeling and does not fully address the dynamic interference from background noise. Dong et al.37 adopted attention mechanisms and dilated convolutions to effectively improve the model’s ability to extract key features from complex backgrounds, but this approach has high computational complexity, which may affect the real-time performance of detection. Chen et al.38 selected MobileNet as the backbone extractor and replaced traditional convolutions with multi-scale depthwise separable convolutions to expand the receptive field and enhance the model’s robustness against complex background noise. However, the lightweight nature of the MobileNet structure may limit the ability to extract high-level semantic information. Zhou et al.39 introduced a bidirectional fusion strategy to enhance the integration of semantic information, thereby improving the contrast between defects and background, but detection accuracy may still decrease under complex lighting conditions. Tie et al.40 optimized feature representation of the detection head through attention components, enabling the detector to accurately learn the relationship between steel defects and the background. However, this method primarily focuses on feature representation optimization and does not provide stable detection performance in noisy and complex environments.
In conclusion, although existing methods have reduced the impact of background noise to some extent, they still face challenges such as insufficient global information modeling, high computational overhead, and decreased detection accuracy in complex environments. To address these challenges, this paper proposes an optimized receptive field expansion module. RFEM utilizes depthwise separable convolutions and dilated convolutions to expand the receptive field, thereby enhancing the detector’s adaptability to complex background noise. Experimental results show that RFEM can effectively focus on foreground targets in complex industrial environments, improving detection accuracy and reducing the false detection rate. Therefore, optimizing the receptive field of the detector can effectively eliminate noise interference in the steel defect detection process.
Methods
This paper proposes the PEYOLO model based on YOLOv8n to address the challenges of 2D detection of steel surface defects. The innovations of PEYOLO mainly include three aspects: DcPAN, PEHead, and RFEM. The primary function of DcPAN is to capture multi-scale defects. PEHead further represents multi-scale defects and effectively addresses occlusion issues in the defect detection process. RFEM enhances the detector’s receptive field while maintaining sensitivity to small defect targets and detailed information.
Apart from these improvements, the rest of PEYOLO retains the network structure of YOLOv8n. The C2f module fully utilizes information from different levels, enabling the detector to acquire both rich non-local and fine-grained information simultaneously, thereby improving its adaptability to complex scenes. The CBS module transforms features to better represent defects. The detailed network architecture of PEYOLO is shown in Fig. 2.
Defect capture PAN
The size, shape, and aspect ratio of steel defects vary significantly. Multi-scale feature fusion helps the detector capture subtle features at different resolutions, which is crucial for improving the detection capability of small defects. The effectiveness of feature fusion networks in multi-scale feature extraction has been widely validated. FPN propagates global information from deep feature maps layer by layer through upsampling and fuses it with shallow feature maps of corresponding scales, which enriches the details of the entire feature layer. However, since FPN primarily employs a unidirectional information flow, small defect information may be lost during feature transmission. PAN builds upon FPN by introducing a bottom-up path, enabling multi-level multi-scale information interaction and improving the detector’s response to small objects.
Compared to natural scenes, steel defect detection involves more complex multi-scale objects and background noise interference. Therefore, we optimized the PAN structure by replacing its top-layer C2f module with ASFM and proposed DcPAN. Unlike other PAN variants, such as the Bidirectional Feature Pyramid Network (BiFPN), which relies on denser hierarchical connections to capture scale information41, DcPAN effectively integrates information flows from different receptive fields through the ASFM module, which improves the detector’s sensitivity to small defects.
The motivation behind ASFM’s design is to enhance the distinguishability of small defects. Small defects may appear differently at various feature scales, and without establishing effective connections between low-level and high-level features, their edge details may be lost or blurred during transmission. ASFM employs multiple parallel branches to extract different types of features and process images from various perspectives. This design strengthens the detector’s ability to capture image details, allowing the detector to learn multi-scale and multi-view information. By fusing these features, the detector can extract more comprehensive representations, improving its recognition ability for complex patterns, textures, and small defects.
As the core encoder module of ASFM, the Multi-Scale Block (MSBlock) allows adaptive selection of branch numbers and convolution kernel sizes based on task requirements, which reduces the semantic gap between different feature levels. This enables better integration of low-level and high-level features, preventing excessive compression of small defect details42. Specifically, MSBlock replaces standard 3 × 3 convolutions with inverted bottleneck layers, significantly improving parameter efficiency while retaining detailed information within a larger receptive field. Experimental results show that incorporating the ASFM module increases detection accuracy for small defects such as crazing and inclusion by 5.3% and 4%, respectively. Figure 3 illustrates the detailed structure of the ASFM module.
Let \({X}_{a}\in {\mathbb{R}}^{H\times W\times C}\) be the input feature map of the ASFM module, where \(H\) and \(W\) represent the height and width of the feature map, respectively, and \(C\) denotes the number of input channels. First, \({X}_{a}\) is fed into the CBS module, transforming the feature map size to \(H\times W\times {C}_{out}\). Next, the feature map is evenly split into two parts, which are fed into the main branch and the identity mapping branch separately. The main branch is further divided into an identity mapping branch and an MSBlock branch, where the feature maps in both branches have the same size of \(H\times W\times {C}_{out}/2\). The MSBlock branch consists of three sub-branches. The first sub-branch is an identity mapping branch; the remaining two are composed of a cascaded 1 × 1 convolution, 9 × 9 convolution, and another 1 × 1 convolution. At this stage, the feature maps in all three sub-branches of the MSBlock branch are of size \(H\times W\times {C}_{out}/6\). Then, the feature maps from the main branch and the identity mapping branch are concatenated, resulting in a feature map of size \(H\times W\times 3({C}_{out}/2)\). Finally, the concatenated feature map is passed through the CBS module to obtain the output feature map of the ASFM module, denoted as \({X}_{a}{\prime}\in {\mathbb{R}}^{H\times W\times {C}_{out}}\). Notably, each CBS module consists of a cascaded 3 × 3 convolution, a batch normalization layer, and a SiLU activation function.
The primary function of the 3 × 3 convolution is feature transformation and channel reduction, thereby reducing computational cost:
$$X{\prime}={W}_{conv}*X$$
Here, \({W}_{conv}\) is a 3 × 3 convolution kernel used to extract local features from the input feature map \(X\), and \(*\) denotes the convolution operation.
Batch normalization helps improve training stability and reduces the problem of internal covariate shift:
$$BN(X)=\gamma \frac{X-{\mu }_{X}}{{\sigma }_{X}}+\beta$$
Here, \(BN(\cdot )\) represents normalization applied to each channel, where \({\mu }_{X}\) and \({\sigma }_{X}\) are the per-channel mean and standard deviation, and \(\gamma\) and \(\beta\) are the learned scaling and bias parameters.
The SiLU activation function provides a smoother nonlinear transformation than ReLU, ensuring more stable gradient propagation:
$$\sigma (X)=X\cdot Sigmoid(X)$$
Here, \(\sigma (\cdot )\) represents the SiLU activation function, which enhances the network’s expressive capability by introducing nonlinear operations.
Channel-wise division splits the feature map into multiple sub-feature maps, allowing the model to process different channel groups in parallel, thereby increasing the network’s diversity:
$$[{X}_{1},{X}_{2},\dots ,{X}_{n}]=Split(X)$$
Here, \(Split(\cdot )\) represents dividing the feature map \(X\) into \(n\) equal parts along the channel dimension.
With different convolution kernels, feature maps capture varying local or global semantics. By concatenating different feature maps, the network can learn more diverse feature representations, thereby enhancing the model’s expressiveness:
$$X{\prime}=Concat({X}_{1},{X}_{2},\dots ,{X}_{n})$$
Here, \(Concat(\cdot )\) represents the concatenation of feature maps along the channel dimension after different pooling and convolution operations, enabling the fusion of multi-scale and multi-channel information.
The parallel branches in MSBlock enhance the network’s sensitivity to defects of different scales, and its working principle is as follows:
$${Y}_{n}=\left\{\begin{array}{ll}{X}_{n},& n=1\\ I{B}_{k\times k}({X}_{n}),& n>1\end{array}\right.$$
Here, \({Y}_{n}\) represents the output feature map of each branch in the MSBlock module, \({X}_{n}\) is the input of the \(n\)-th sub-branch, and \(I{B}_{k\times k}(\cdot )\) denotes the cascaded 1 × 1 convolution, \(k\times k\) convolution, and 1 × 1 convolution, with \(k=9\) and three sub-branches (\(n\in \{1,2,3\}\)).
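To make this data flow concrete, the following is a minimal PyTorch sketch of ASFM as described above. The class and branch layout follow the text; the inverted-bottleneck expansion ratio and the use of a depthwise layer for the large 9 × 9 kernel are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn


class CBS(nn.Module):
    """3x3 convolution + batch normalization + SiLU, as used throughout ASFM."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class InvertedBottleneck(nn.Module):
    """IB_{kxk}: 1x1 conv -> k x k conv -> 1x1 conv, with k = 9.
    The large kernel is depthwise here (an assumption) for parameter efficiency."""
    def __init__(self, c, k=9, expansion=2):
        super().__init__()
        c_mid = c * expansion  # assumed expansion ratio
        self.block = nn.Sequential(
            nn.Conv2d(c, c_mid, 1),
            nn.Conv2d(c_mid, c_mid, k, padding=k // 2, groups=c_mid),
            nn.Conv2d(c_mid, c, 1),
        )

    def forward(self, x):
        return self.block(x)


class ASFM(nn.Module):
    """CBS -> split -> {identity, main (identity + 3-branch MSBlock)} -> concat -> CBS."""
    def __init__(self, c_in, c_out):
        super().__init__()
        assert c_out % 6 == 0, "c_out must be divisible by 6 for the MSBlock split"
        self.cbs_in = CBS(c_in, c_out)
        self.ib = nn.ModuleList(InvertedBottleneck(c_out // 6) for _ in range(2))
        self.cbs_out = CBS(3 * (c_out // 2), c_out)

    def forward(self, x):
        x = self.cbs_in(x)                   # H x W x c_out
        skip, main = x.chunk(2, dim=1)       # identity and main branches, c_out/2 each
        m1, m2, m3 = main.chunk(3, dim=1)    # three MSBlock sub-branches, c_out/6 each
        msblock = torch.cat([m1, self.ib[0](m2), self.ib[1](m3)], dim=1)  # c_out/2
        # identity branch + main-branch identity + MSBlock output -> 3*(c_out/2) channels
        return self.cbs_out(torch.cat([skip, main, msblock], dim=1))


# Usage: a 64-channel feature map mapped to 96 output channels
asfm = ASFM(64, 96)
out = asfm(torch.randn(1, 64, 40, 40))  # -> (1, 96, 40, 40)
```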
Perception-efficient head
Steel surface defects often exhibit multi-scale variations, local aliasing, and shadow occlusion, making it difficult for detectors to accurately focus on them. As a key component of the detector, the detection head is directly responsible for predicting defect categories and locations. Therefore, the detection head must be robust to multi-scale defects and sensitive to occluded defect features. Based on this, we designed the PEHead.
PEHead consists of three heads with different resolutions: 80 × 80, 40 × 40, and 20 × 20. The low-resolution head can quickly locate large defects, the medium-resolution head further refines medium-sized defects, and the high-resolution head focuses on accurately detecting small defects. By combining heads of different sizes, PEHead can effectively perceive variations in defect scale, enabling it to handle multi-scale defect detection tasks efficiently.
To address local aliasing and shadow occlusion issues, PEHead incorporates specific designs to enhance its perception of occluded areas. Local aliasing refers to the visual overlap or similarity between defects, which may cause the detector to misidentify them. Shadow occlusion occurs when part of a defect is covered by shadows or other objects, leading to the loss of critical feature information. To mitigate these challenges, PEHead introduces the Semantic Enhancement Attention Module (SEAM)43. Similar to the approach used to address occlusion in face recognition, the SEAM module learns the relationship between occluded and non-occluded defects, allowing the model to supplement key feature information of the defect in the occluded areas. Specifically, SEAM utilizes depthwise separable convolution to efficiently extract local features from each channel, ensuring that even under partial occlusion or feature loss, sufficient information can still be captured. Following the depthwise separable convolution, the SEAM module applies a 1 × 1 convolution to fuse information across channels. The 1 × 1 convolution learns the importance of each channel, weighting and integrating different feature channels, allowing SEAM to restore occluded defect features using contextual information. Figure 4 illustrates the detailed structure of PEHead.
PEHead consists of a cascade of CBS modules, the SEAM module, and 1 × 1 convolutional modules. The SEAM module is composed of two residual branches and a main branch. The output feature map from the first residual branch is element-wise added to the output feature map of the CSMM module. The output feature map from the second residual branch is multiplied with the output feature map from the main branch to obtain the output feature map of the SEAM module. The main branch of the SEAM module consists of the CSMM module, an average pooling layer, and linear layers. The CSMM module is divided into two parts. The first part consists of the main branch and the identity mapping branch. The main branch consists of 3 × 3 depthwise separable convolutions, GELU activation functions, and batch normalization layers. The second part consists of a cascade of 1 × 1 convolutions, GELU activation functions, and batch normalization layers.
Let \({H}_{p}\in {\mathbb{R}}^{H\times W\times {C}_{in}}\) represent the input feature map of the SEAM module, where \(H\) and \(W\) are the height and width of the feature map, and \({C}_{in}\) is the number of input channels. \({H}_{p}\) is fed into the first part of the CSMM module to obtain the feature map \({H}_{0}\in {\mathbb{R}}^{H\times W\times {C}_{in}}\). \({H}_{0}\) is then input into the second part of the CSMM module to obtain the output feature map \({H}_{1}\in {\mathbb{R}}^{H\times W\times {C}_{in}}\). The feature map then undergoes an adaptive average pooling operation, which dynamically adjusts the pooled size according to the size of the input feature map, extracting global information and avoiding size mismatches caused by input size variations. At this point, the feature map size is \(1\times 1\times {C}_{in}\). The feature map is then passed through a cascade of a linear layer, a ReLU activation function, a linear layer, and a Sigmoid activation function to obtain \({H}_{2}\in {\mathbb{R}}^{1\times 1\times {C}_{in}}\). Finally, the exponential operation with base \(e\) is applied to \({H}_{2}\), and the result is element-wise multiplied with \({H}_{p}\) to obtain the output feature map of the SEAM module, \({H}_{3}\in {\mathbb{R}}^{H\times W\times {C}_{in}}\).
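A minimal PyTorch sketch of this SEAM data flow is given below. The CSMM layout and the exponential channel re-weighting follow the description above, while the hidden width of the linear layers is an assumption.

```python
import torch
import torch.nn as nn


class CSMM(nn.Module):
    """Sketch of CSMM: residual 3x3 depthwise stage, then a 1x1 pointwise stage."""
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Sequential(                  # part 1: depthwise conv + GELU + BN
            nn.Conv2d(c, c, 3, padding=1, groups=c),
            nn.GELU(),
            nn.BatchNorm2d(c),
        )
        self.pw = nn.Sequential(                  # part 2: 1x1 conv + GELU + BN
            nn.Conv2d(c, c, 1),
            nn.GELU(),
            nn.BatchNorm2d(c),
        )

    def forward(self, x):
        h0 = x + self.dw(x)   # first residual branch added to the depthwise output
        return self.pw(h0)


class SEAM(nn.Module):
    """Sketch of SEAM: CSMM -> adaptive average pool -> MLP -> exp -> re-weight input."""
    def __init__(self, c):
        super().__init__()
        self.csmm = CSMM(c)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(nn.Linear(c, c), nn.ReLU(),
                                 nn.Linear(c, c), nn.Sigmoid())

    def forward(self, x):
        h1 = self.csmm(x)                        # H x W x C
        h2 = self.mlp(self.pool(h1).flatten(1))  # 1 x 1 x C global channel descriptor
        w = torch.exp(h2)                        # exponential operation with base e
        # second residual branch: the input is re-weighted channel-wise
        return x * w.view(w.size(0), -1, 1, 1)


seam = SEAM(128)
out = seam(torch.randn(1, 128, 20, 20))  # -> (1, 128, 20, 20)
```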
Receptive field extension module
The challenge in steel defect detection lies in the limited visual features of defects and interference from background noise. Increasing the receptive field allows the detector to extract global features of the defects, which helps the detector learn the relationship between the background and defects, thereby accurately identifying and locating defects. Detectors based on ViT architectures like DETR expand the global receptive field by calculating the similarity between every pair of pixels, but this approach has two main problems: high computational overhead and neglecting small defects. Large Kernel Attention (LKA) combined with vision detectors achieves satisfactory receptive fields with lower computational cost, but its memory usage still grows steeply with kernel size44. Large Separable Kernel Attention (LSKA) decomposes the rectangular convolution kernel into cascaded horizontal and vertical strip convolution kernels, which further reduces the detector’s memory usage compared to LKA45.
Steel surfaces often have many elongated scratch defects, which have simple features and extreme aspect ratios. The receptive field of traditional 2D convolution components is small and unable to capture complete feature information for such defects. The strip convolution kernels of LSKA can effectively match the features of such strip-like defects. At the same time, LSKA achieves a larger receptive field by cascading deep convolution and dilated convolution, allowing the detector to capture the complete features of strip-like defects. However, a larger receptive field may smooth out subtle spatial changes, sacrificing the detector’s sensitivity to small defects and local details.
To address this issue, we introduce a multi-scale feature fusion strategy based on the LSKA component and propose the RFEM component. RFEM effectively improves detection accuracy for defects of different sizes by combining features at different scales. Specifically, the 1D convolution module in RFEM performs convolution along the elongated direction of the defect, which better captures the vertical features of strip-like defects while avoiding the local information loss problem that traditional 2D kernels may produce. Compared to traditional 2D convolution, 1D kernels retain better vertical texture information when processing such strip-like structures, while reducing interference from irrelevant horizontal information. Therefore, RFEM effectively captures the features of strip-like defects through 1D convolution and enhances sensitivity to small defects. Figure 5 shows the detailed structure of the RFEM component.
Assume \({X}_{r}\in {\mathbb{R}}^{H\times W\times C}\) is the input feature map of the RFEM component, where \(H\) and \(W\) are the height and width of the feature map, respectively, and \(C\) is the number of channels. The first part of the RFEM component is a 1 × 1 convolution module, which is used to adjust the number of channels in the feature map while effectively reducing computational complexity through weight sharing.
$${X}_{1}=Conv({X}_{r})=W*{X}_{r}$$
Here, \(Conv(\cdot )\) represents the 1 × 1 convolution, and \(W\) is the 1 × 1 convolution kernel applied to the input feature map \({X}_{r}\) to extract local features. \({X}_{1}\in {\mathbb{R}}^{H\times W\times {C}_{out}/2}\) is the output feature map; its channel count is halved so that the subsequent concatenation yields \(4({C}_{out}/2)\) channels.
The pooling window reduces computational complexity and extracts salient spatial information by retaining the maximum response within each local region. The cascaded pooling layers progressively enlarge the receptive field, enhancing the network’s ability to perceive global context:
$${X}_{i+1}={MaxPool}_{5\times 5}({X}_{i}),\quad i=1,2,3$$
$${X}_{m}=Concat({X}_{1},{X}_{2},{X}_{3},{X}_{4})$$
Here, \({MaxPool}_{5\times 5}(\cdot )\) represents a 5 × 5 max pooling layer, \({X}_{2},{X}_{3},{X}_{4}\in {\mathbb{R}}^{H\times W\times {C}_{out}/2}\), and \({X}_{m}\in {\mathbb{R}}^{H\times W\times 4({C}_{out}/2)}\).
Compared to traditional convolutions, depthwise separable convolutions significantly improve efficiency. Dilated depthwise separable convolutions expand the receptive field by increasing the spacing between kernel elements, which helps capture features at greater distances within the image. Cascading depthwise separable convolutions with dilated depthwise separable convolutions reduces background noise interference while preserving the detector’s sensitivity to small defects:
$${X}_{L}={D}_{5\times 1}({D}_{1\times 5}({DW}_{3\times 1}({DW}_{1\times 3}({X}_{m}))))$$
$${X}_{o}=Conv({X}_{L})$$
Here, \({DW}_{1\times 3}(\cdot )\) and \({DW}_{3\times 1}(\cdot )\) represent depthwise separable convolutions of size 1 × 3 and 3 × 1, respectively; \({D}_{1\times 5}(\cdot )\) and \({D}_{5\times 1}(\cdot )\) represent dilated depthwise separable convolutions of size 1 × 5 and 5 × 1, respectively, with dilation = 2; and \(Conv(\cdot )\) is a 1 × 1 convolution that restores the channel dimension. \({X}_{L}\in {\mathbb{R}}^{H\times W\times 4({C}_{out}/2)}\) and \({X}_{o}\in {\mathbb{R}}^{H\times W\times {C}_{out}}\) represent the intermediate and final output feature maps of the RFEM module.
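The following is a minimal PyTorch sketch of the RFEM forward pass assembled from the equations above. The channel halving in the initial 1 × 1 convolution and the final 1 × 1 projection are assumptions made so that the tensor shapes match the stated dimensions.

```python
import torch
import torch.nn as nn


class RFEM(nn.Module):
    """Sketch of RFEM: 1x1 conv -> cascaded 5x5 max pooling -> concat -> strip convs."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_out // 2
        self.reduce = nn.Conv2d(c_in, c_half, 1)          # channel reduction (assumed c_out/2)
        self.pool = nn.MaxPool2d(5, stride=1, padding=2)  # cascaded SPPF-style pooling
        c_cat = 4 * c_half
        # LSKA-style separable strip kernels: depthwise 1x3 / 3x1, then dilated 1x5 / 5x1
        self.lska = nn.Sequential(
            nn.Conv2d(c_cat, c_cat, (1, 3), padding=(0, 1), groups=c_cat),
            nn.Conv2d(c_cat, c_cat, (3, 1), padding=(1, 0), groups=c_cat),
            nn.Conv2d(c_cat, c_cat, (1, 5), padding=(0, 4), dilation=2, groups=c_cat),
            nn.Conv2d(c_cat, c_cat, (5, 1), padding=(4, 0), dilation=2, groups=c_cat),
        )
        self.proj = nn.Conv2d(c_cat, c_out, 1)            # assumed 1x1 projection to c_out

    def forward(self, x):
        x1 = self.reduce(x)                      # X_1
        x2 = self.pool(x1)                       # X_2
        x3 = self.pool(x2)                       # X_3
        x4 = self.pool(x3)                       # X_4
        xm = torch.cat([x1, x2, x3, x4], dim=1)  # X_m, 4 * (c_out / 2) channels
        return self.proj(self.lska(xm))          # X_L -> X_o


rfem = RFEM(128, 128)
out = rfem(torch.randn(1, 128, 40, 40))  # -> (1, 128, 40, 40)
```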
Experiments and analysis
This section introduces three publicly available steel surface defect detection datasets, NEU-DET, GC10-DET, and Severstal, as well as the publicly available PCB-DET dataset; relevant information is given in Table 1. It then explains the experimental metrics and environment. Next, it selects the baseline model of PEYOLO and compares its detection results with other methods on the NEU-DET, GC10-DET, Severstal, and PCB-DET datasets, demonstrating the performance advantages of PEYOLO. In addition, ablation experiments validate the effectiveness of the proposed improvement strategies, and the generalization of the proposed DcPAN, PEHead, and RFEM structures is verified using the YOLOv5 framework. Finally, it replaces the Backbone, Neck, and Head of the baseline model separately and compares the modified detectors with PEYOLO. The experimental results further confirm the superiority of PEYOLO.
Dataset description
To evaluate the effectiveness of the proposed method in steel surface defect detection, this study conducted extensive experiments using three publicly available steel surface defect benchmark datasets: NEU-DET, GC10-DET, and Severstal. Furthermore, the PCB-DET dataset is utilized in this study to conduct cross-domain experiments. The following provides relevant information about NEU-DET, GC10-DET, and Severstal.
-
(1)
NEU-DET: The NEU-DET dataset was released by Northeastern University (NEU) to provide academia and industry with a tool for studying and analyzing steel surface defects. It includes six common surface defects found in hot-rolled steel strips: pitted surfaces, inclusion, patches, crazing, scratches, and rolled-in scale.
-
(2)
GC10-DET: The GC10-DET dataset is a surface defect dataset collected in real industrial environments, designed to help researchers and engineers develop more accurate industrial inspection models for targeted inspection tasks. It includes ten types of defects: water spots, punched hole, weld lines, oil stains, inclusions, crescent gaps, silk spots, rolled pits, waist folds, and creases.
-
(3)
Severstal: This is a publicly available dataset provided by the Russian steel manufacturer Severstal. It includes four types of defects: pitting, scratch, inclusion, and other.
-
(4)
PCB-DET: This is a printed circuit board defect detection dataset released by Peking University. It includes six types of defects: mouse bite (Mb), short (Sh), spur (Sp), spurious copper (Spc), missing hole (Mh), and open circuit (Oc). The dataset contains a total of 693 images. In our experiments, we split the dataset into 560 training images, 63 validation images, and 70 test images.
The surface defects in these four datasets exhibit characteristics such as background noise interference, significant variations in lighting, a limited number of training samples, and diverse scale variations. As shown in Figs. 6, 7, 8 and 9, these factors pose significant challenges for real-time and accurate defect detection.
Evaluation metrics
To comprehensively evaluate the accuracy and real-time performance of the model, the following key performance metrics are adopted: Precision (P), Recall (R), Average Precision (AP), Mean Average Precision (mAP), F1 score, and Frames Per Second (FPS). These metrics assess the overall performance of the detector in object detection tasks, including detection accuracy, recognition capability, overall detection performance, and inference speed.
-
(1)
Precision and Recall: Precision represents the proportion of predicted positive samples that are actually positive, measuring the accuracy of the detector in positive class detection. Recall indicates the proportion of actual positive samples correctly predicted by the model, reflecting the detector’s ability to identify targets. The definitions of precision and recall are as follows:
$$Precision=\frac{TP}{TP+FP}$$ (15)
$$Recall=\frac{TP}{TP+FN}$$ (16)
where TP (True Positive) represents correctly detected targets, FP (False Positive) denotes background regions mistakenly identified as targets, and FN (False Negative) refers to targets that the detector failed to identify, incorrectly treating them as background.
-
(2)
Average Precision: In object detection tasks, AP is a key metric for evaluating model performance. It quantifies the balance between precision and recall for a specific class by computing the area under the precision-recall curve. The formula for AP is defined as follows:
$${AP}_{n}={\int }_{0}^{1}P(r)dr$$ (17)
where \(n\) denotes the class index, \(P(r)\) denotes the precision-recall curve, and the computed integral represents the AP value for the corresponding class.
-
(3)
Mean Average Precision: To comprehensively assess the model’s performance across all classes, the mAP is utilized. It represents the average of the AP values for each class in the dataset, providing an overall evaluation of the detector’s effectiveness. Specifically, mAP50 refers to the mAP value calculated when the Intersection over Union (IoU) threshold between the predicted and ground truth bounding boxes is set to 0.5. The formula for calculating mAP is as follows:
$$mAP=\frac{1}{N}\sum_{n=1}^{N}A{P}_{n}$$ (18)
where \(N\) is the total number of classes, and \(A{P}_{n}\) is the Average Precision for class \(n\).
-
(4)
F1 Score: To balance both precision and recall, the F1 score is introduced as the harmonic mean of these two metrics. The F1 score provides a more comprehensive assessment of the detection model’s performance. The formula for calculating the F1 score is:
$$F1\ score=2\times \frac{Precision\times Recall}{Precision+Recall}$$ (19)
-
(5)
FPS: To evaluate the real-time performance of the detector, FPS is used as a key metric. FPS indicates the number of images the detector can process per second, providing an assessment of its inference speed. A higher FPS value signifies improved real-time performance of the detector.
$$FPS=\frac{1}{{T}_{m}}$$ (20)
where \({T}_{m}\) denotes the time taken by the detector to process a single defect image. A minimal computation of these metrics is sketched below.
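The following Python sketch shows the arithmetic of Eqs. (15)-(19). It assumes detections have already been matched to ground truth at IoU ≥ 0.5 to obtain the TP/FP/FN counts and sampled precision-recall points; production evaluators additionally interpolate the precision-recall curve before integration.

```python
import numpy as np


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from matched detection counts (Eqs. 15, 16, 19)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve (Eq. 17),
    approximated here by trapezoidal integration over sampled (r, p) points."""
    order = np.argsort(recalls)
    return np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order])


def mean_average_precision(ap_per_class):
    """mAP@0.5 as the mean of per-class AP values (Eq. 18)."""
    return float(np.mean(ap_per_class))
```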
Training strategies and implementation details
To ensure the reproducibility of the experimental results, all experiments were conducted on the same high-performance deep learning server. The server configuration includes an Intel i9-13900K 3.00 GHz CPU, an NVIDIA GeForce RTX 3090 GPU, and the Windows 10 operating system. The software environment consists of Python 3.10.14, CUDA 11.8, and PyTorch 2.2.2. During model training, the image size was set to 640 × 640. This resolution achieved a good balance between image detail and computational load while ensuring high detection accuracy and efficiency across multiple experiments.
The initial learning rate was determined through a series of tuning experiments, where different values were tested to evaluate their impact on model convergence speed and stability. Experimental results indicated that a learning rate of 0.001 achieved the optimal balance between convergence speed and stability. A smaller learning rate resulted in slower training, whereas a larger one led to rapid gradient updates, potentially causing instability. Therefore, setting the learning rate to 0.001 effectively mitigated both extremes, ensuring a smooth and efficient training process.
Multiple experiments and tuning processes were conducted to determine the optimal weight decay. After evaluating different values, 5e−4 was identified as the most suitable choice. As a regularization term in the loss function, weight decay helps suppress overfitting. An excessively high weight decay value may oversimplify model parameters, compromising expressive capability, while an excessively low value may fail to effectively prevent overfitting, reducing generalization performance on the test set. Experimental results demonstrated that a weight decay of 5e−4 effectively mitigates overfitting while preserving the model’s expressive power, thereby enhancing generalization during training.
The choice of the learning rate decay strategy directly affects model training performance. The cosine annealing strategy gradually decreases the learning rate in a smooth manner, preventing it from dropping too quickly in the early stages of training. This allows the model to continue fine-tuning at a lower learning rate during later training stages, improving convergence and generalization. Experiments confirmed that cosine annealing not only maintained training stability but also helped the model fine-tune parameters when approaching the optimal solution, thereby enhancing overall performance.
The close mosaic parameter controls mosaic data augmentation during training, which stitches multiple images together to increase data diversity; setting it to 10 disables mosaic for the final 10 epochs. Experiments showed that this setting provided a good balance between data diversity and training stability, thereby improving the model’s generalization ability while avoiding the negative effects of excessive augmentation.
The performance of SGD and Adam optimizers was compared, with experimental results indicating that Adam outperformed SGD in both convergence speed and final model accuracy, particularly in avoiding local minima and accelerating convergence. Consequently, Adam was selected as the optimizer. To further enhance its performance, a momentum mechanism was incorporated, utilizing a weighted average of past gradients to accelerate updates and reduce oscillations. After tuning, the momentum value was set to 0.937, which effectively improved convergence speed while maintaining training stability.
Considering hardware limitations and efficiency requirements, a batch size of 8 was selected to enable efficient training without excessive memory consumption. This setting ensured stable training without memory overflow while maintaining a reasonable training speed.
Additionally, FPS measurements were conducted on an NVIDIA GeForce RTX 4050, with input image resolution set to 640 × 640 pixels to balance detection accuracy and inference speed. While different resolutions affect FPS, this resolution aligns with the standard input size used in YOLO models, ensuring fair comparisons with existing methods and meeting real-world requirements for speed and accuracy. All FPS measurements were performed with a batch size of 1, meaning that each inference processed a single image at a time. This setup is particularly relevant for real-time applications such as video surveillance and autonomous driving, where each frame must be processed individually and efficiently.
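A minimal sketch of this measurement protocol is shown below: batch size 1, 640 × 640 inputs, GPU warm-up before timing, and CUDA synchronization around the timed loop. The warm-up and iteration counts are illustrative assumptions.

```python
import time
import torch


@torch.no_grad()
def measure_fps(model, n_warmup=50, n_iters=200, size=640, device="cuda"):
    """Single-image (batch size 1) inference FPS, i.e. 1 / T_m (Eq. 20)."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(n_warmup):          # warm-up so CUDA kernels are compiled and cached
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()           # wait for all asynchronous kernels to finish
    t_m = (time.perf_counter() - start) / n_iters  # seconds per image, T_m
    return 1.0 / t_m
```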
The hyperparameters used in the proposed model are listed in Table 2.
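For reference, a training call reproducing these settings might look as follows under the Ultralytics API. The dataset configuration path is hypothetical, and the stock yolov8n architecture file stands in for the modified PEYOLO variant.

```python
from ultralytics import YOLO

# Hedged reproduction sketch: "neu-det.yaml" is a hypothetical dataset config,
# and "yolov8n.yaml" stands in for the modified PEYOLO architecture file.
model = YOLO("yolov8n.yaml")
model.train(
    data="neu-det.yaml",  # hypothetical path to the NEU-DET dataset config
    imgsz=640,            # input resolution
    batch=8,              # batch size
    optimizer="Adam",     # selected over SGD in the tuning experiments
    lr0=0.001,            # initial learning rate
    momentum=0.937,       # momentum
    weight_decay=5e-4,    # weight decay
    cos_lr=True,          # cosine annealing learning-rate schedule
    close_mosaic=10,      # disable mosaic augmentation for the final 10 epochs
)
```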
Results on NEU-DET and GC10-DET dataset
Baseline model selection
To find a suitable baseline model, the proposed components were combined with three representative lightweight detectors: YOLOv8n, YOLOv9t, and YOLOv10n. Systematic experiments were conducted on the NEU-DET dataset. Table 3 summarizes the performance variations after incorporating each module into the different baseline models.
Experimental results show that although YOLOv9t exhibits strong original performance, its performance generally degrades after integrating the proposed modules. For example, with the introduction of RFEM, the mAP50 drops from 77.0% to 73.4%, accompanied by significant declines in inference speed and precision. This may be attributed to the high sensitivity of YOLOv9t’s architecture to lightweight design; structural incompatibilities or redundant features between YOLOv9t and the proposed modules likely led to suboptimal fusion results.
In contrast, YOLOv10n demonstrates relatively better adaptability to the proposed modules. Notably, incorporating PEHead leads to a substantial increase in mAP50 from 69.4% to 75.7%. However, the original performance of YOLOv10n is generally lower than that of YOLOv8n, and even after module integration, its final performance does not surpass the optimal configuration of YOLOv8n. Therefore, it is not considered a suitable baseline.
YOLOv8n exhibits a good balance between accuracy, speed, and parameter efficiency after module integration. For instance, with DcPAN, the mAP50 improves to 75.8% while the parameter count decreases to 3.06 M, indicating strong compatibility and enhancement potential. In addition, YOLOv8n achieves the highest inference speed, satisfying real-time requirements for industrial deployment. As a result, we selected YOLOv8n as the baseline model for our improved detection framework, PEYOLO.
Comparison of PEYOLO with YOLOv8, YOLOv9, and YOLOv10 series
To validate the performance advantage of PEYOLO in the steel defect image object detection task, a quantitative analysis was conducted on the NEU-DET validation set. The results are shown in Table 4. On the NEU-DET dataset, PEYOLO achieved a precision of 70.9%, a recall of 74.5%, an mAP50 of 78.1%, and an F1 score of 72.0%. Compared to YOLOv8n, PEYOLO’s FPS decreased by 72.7. Preliminary analysis suggests that the complex parallel structure in the ASFM component and the fully connected layers in PEHead are the main factors contributing to the decrease in inference speed. These complex modules increase the computational load and memory overhead, improving detection accuracy at the cost of inference speed. However, compared to YOLOv8l, which has a detection accuracy similar to that of PEYOLO, its FPS is 82.7 lower than that of PEYOLO. This indicates that despite the added complexity in PEYOLO, a good balance between detection accuracy and inference speed was still achieved.
To further optimize the inference speed of PEYOLO, several optimization techniques can be applied to alleviate the decline in FPS. First, the parallel computation structure of the ASFM component can be optimized. For instance, replacing some regular convolutions with more efficient depthwise separable convolutions can reduce the computational burden. In addition, the fully connected layers in PEHead can be replaced with more efficient global pooling layers, further reducing the computational overhead of the model. Additionally, hardware acceleration can significantly improve inference speed, especially when running on hardware platforms that support higher parallelism, which can effectively enhance inference efficiency. In practical applications, there is an undeniable trade-off between model accuracy and inference speed. If the application scenario requires high inference speed, optimizations such as simplifying the model structure and using efficient hardware to improve FPS can be considered. In contrast, in scenarios with higher accuracy requirements, sacrificing some inference speed to ensure better detection accuracy may be preferred. For different application needs, an appropriate configuration should be selected based on the requirements for accuracy and speed to ensure that the model performance is maintained while meeting real-time demands.
In addition, PEYOLO was compared with the YOLOv9 and YOLOv10 series, with the experimental results presented in Tables 5 and 6. These results further validate the effectiveness of the proposed improvement strategy. Figure 10 shows the performance comparison between PEYOLO, the YOLOv8 series, YOLOv9 series, and YOLOv10 series on the NEU-DET dataset, clearly highlighting the performance advantages of PEYOLO.
Comparison with the latest techniques
To comprehensively evaluate the superiority of PEYOLO in the steel defect detection task, a comparison was made with 15 state-of-the-art object detection methods on the NEU-DET and GC10-DET benchmark datasets. The methods include Faster R-CNN, EfficientDet-d1, SSD, FCOS, RT-DETR, YOLOv8n-ghost, YOLOv8n-world, YOLOv8n-worldv2, YOLOv3-Tiny, YOLOv4-csp, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv9c, and YOLOv10n. Tables 7 and 8 present the performance of these methods on the NEU-DET and GC10-DET validation datasets, while Table 9 records the detection results of the RT-DETR series on the NEU-DET dataset.
As a classic two-stage object detection model, Faster R-CNN serves as a valuable reference for detection tasks involving complex scenes. EfficientDet is a lightweight and efficient object detection model that offers high computational efficiency. As an anchor-free detection method, FCOS achieves high detection accuracy with low computational resources. The YOLO model’s high accuracy and real-time performance provide significant advantages in object detection tasks. As a Transformer-based object detection model, RT-DETR provides real-time detection capabilities. By introducing RT-DETR, the performance differences between the Transformer architecture and the YOLO architecture in steel surface defect detection tasks can be analyzed.
The experimental results on the NEU-DET dataset show that, compared to other advanced object detectors, PEYOLO leads in recall, mAP50, and F1 score, achieving 74.5%, 78.1%, and 72%, respectively. PEYOLO’s mAP50 of 78.1% significantly outperforms other methods. For instance, YOLOv8-world and YOLOv8-worldv2 achieved mAP50 values of 76.8% and 75.2%, while RT-DETR only achieved an mAP50 of 69.5%, much lower than that of PEYOLO. This result indicates that many detectors designed for natural scenes are not suitable for steel surface defect detection tasks. A deeper analysis reveals that although the ViT structure in RT-DETR enhances global feature modeling, the increased model complexity raises training difficulty, leading to poor performance in small defect detection. The experimental results on the GC10-DET dataset are similar to those on NEU-DET. PEYOLO achieved precision, mAP50, and F1 scores of 64.3%, 62.3%, and 61%, respectively, significantly outperforming other methods. This result further demonstrates that PEYOLO maintains good detection capability and robustness even in complex backgrounds.
As shown in Table 7, the FPS of YOLOv3-Tiny and YOLOv6n are almost the same, but YOLOv6n has a higher mAP50. This is because YOLOv3-Tiny uses a lightweight backbone network and only two scales of prediction heads for object detection, ensuring higher inference speed. However, this design has certain limitations in feature extraction, leading to lower detection accuracy. In contrast, YOLOv6n uses EfficientRep as the backbone network and applies structural reparameterization techniques to convert the multi-branch structure into a single-branch 3 × 3 convolution during the inference stage, improving detection accuracy while maintaining efficiency. YOLOv8-world has an mAP50 of 76.8%, slightly higher than YOLOv8-ghost’s 76.4%, but with a significant drop in FPS. This indicates that although YOLOv8-world shows improvement in accuracy, its increased computational complexity results in a decrease in inference speed.
The improvement of PEYOLO in mAP50 indicates its better feature extraction capability for small object detection. Compared to YOLOv8-world, PEYOLO shows a significant improvement in recall, demonstrating its robustness in detecting minute defects. PEYOLO outperforms traditional YOLO models in detecting large-area defects (such as Pitted_surface) and elongated defects (such as Scratches), thanks to its enhanced modeling capability for features at different scales. Additionally, as shown in Table 4, PEYOLO’s mAP50 is similar to that of YOLOv8l, but PEYOLO’s parameter size is only 7% of YOLOv8l. At the same time, PEYOLO’s FPS is 82.7 higher than that of YOLOv8l, indicating that PEYOLO achieves a good balance between parameter size and inference speed. On the GC10-DET dataset, PEYOLO still maintains high mAP50 and F1 scores, further demonstrating its strong detection robustness in complex industrial scenarios.
The FPS of PEYOLO is significantly lower than that of YOLOv6n, indicating that YOLOv6n may be more suitable for applications with ultra-high real-time requirements. Although the parameter size of PEYOLO is comparable to that of YOLOv8n, its FPS is considerably lower, suggesting that PEYOLO has not achieved an optimal balance between lightweight design and speed.
To provide a more intuitive presentation of the quantitative analysis results for each method, scatter plots are used for visualization, as shown in Fig. 11. Furthermore, Figs. 12 and 13 illustrate the inference results of PEYOLO compared to SOTA models on the NEU-DET and GC10-DET datasets. In conclusion, PEYOLO outperforms all other detectors on both the NEU-DET and GC10-DET datasets. The advantages of PEYOLO are primarily demonstrated in small object detection, robustness in complex backgrounds, and adaptability to multi-scale defects, making it an ideal choice for steel defect detection tasks.
Inference experiment
In the actual environment of steel production, various defects may occur simultaneously, which puts higher demands on the detector’s detection performance. Consequently, we concatenated images of different defect types and fed them into the model to simulate potential scenarios in real-world detection processes. As shown in Fig. 14, PEYOLO is able to accurately predict the types and locations of various defects, demonstrating that PEYOLO possesses the capability for real-world detection.
Results on severstal dataset
To evaluate the robustness and generalization ability of PEYOLO in real-world steel production environments, comparative experiments were conducted on the Severstal dataset. The models included in the comparison are YOLOv5n, YOLOv7-tiny, YOLOv8n, GELAN-t, YOLOv10n, YOLO11n, and Hyper-YOLO. As shown in Table 10, PEYOLO demonstrates outstanding performance across multiple key metrics, particularly showcasing significant advantages in balancing detection accuracy and inference speed.
In terms of mAP50, PEYOLO achieves 49.7%, which is on par with GELAN-t and Hyper-YOLO, and only slightly lower than YOLOv7-tiny. However, PEYOLO significantly outperforms YOLOv5n, YOLOv8n, and YOLO11n. This result indicates that PEYOLO has stronger object recognition capabilities in industrial defect detection tasks, especially in complex steel surface environments where it can still effectively detect defect targets. Additionally, PEYOLO reaches 23.1% on the mAP50-95 metric, matching Hyper-YOLO, and significantly outperforming YOLOv5n, YOLOv8n, and YOLOv7-tiny. This further demonstrates the stability and robustness of PEYOLO in detecting defects at different scales, enabling better adaptation to the diverse defect patterns on steel surfaces. Specifically, PEYOLO performs exceptionally well in detecting defects such as Pitting and Scratch. As shown in Fig. 15a, PEYOLO accurately locates the position of the Scratch defect. As shown in Fig. 15b, PEYOLO accurately frames the location of elongated defects. In contrast, YOLOv8n fails to frame the entire shape of the defect, and YOLOv7-tiny even misses the detection. This is because such defects often have smaller object areas or fuzzy boundaries, while PEYOLO, through feature fusion enhancement via DcPAN, enables the detector to more effectively capture defects at various scales. Furthermore, the introduction of RFEM enhances PEYOLO’s detection ability for elongated defects, surpassing traditional YOLO models. This global receptive field expansion capability is crucial for steel defect detection, as many defects tend to exhibit elongated structures that may be difficult for conventional convolutional neural networks to capture in their entirety.
In terms of precision and recall, PEYOLO achieves 57.8% precision and 45.1% recall, significantly outperforming YOLOv8n and YOLO11n. This indicates that PEYOLO maintains high detection accuracy while also demonstrating good recall ability, helping to reduce missed detections. This is mainly due to the optimization of the PEHead, which enhances both object classification and localization capabilities, enabling the model to maintain high recognition accuracy even in high-noise industrial environments. This can also be observed in Fig. 15c, where, apart from PEYOLO, other detectors all suffer from false detections.
However, compared to Hyper-YOLO, PEYOLO still has room for improvement in recall. This may be related to PEYOLO's receptive field design and feature extraction approach: because PEYOLO focuses on balancing local and global information during feature extraction, it may fall short in detecting extremely small defects, which lowers recall. Future improvements could therefore optimize the receptive field design to better handle very small or low-contrast defects.
In terms of inference speed, PEYOLO achieves a good balance among lightweight models, with an inference speed of 117.8 FPS. Its FPS is significantly higher than that of GELAN-t and Hyper-YOLO, indicating strong real-time detection capabilities alongside high detection accuracy, and meeting the real-time requirements of defect detection on steel production lines. It is worth noting, however, that PEYOLO's FPS remains lower than that of YOLOv5n and YOLOv7-tiny, so further optimization is needed for extremely demanding real-time applications. This reduction in inference speed mainly stems from the added computational complexity of the PEHead and RFEM modules; more efficient attention mechanisms or lightweight convolution strategies could be explored in the future to improve inference efficiency. In terms of computational complexity, PEYOLO requires only 7.2 GFLOPs, comparable to YOLOv8n and YOLOv10n, while achieving superior detection accuracy, demonstrating an efficient network design. In contrast, although YOLOv5n has a faster inference speed, its simplified architecture results in significantly lower detection accuracy than PEYOLO.
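For context, FPS figures of this kind are typically obtained by averaging the forward-pass latency over many runs after a warm-up phase. The sketch below is a generic PyTorch measurement routine under those assumptions, not the authors' benchmarking code; the input size and run counts are illustrative.

```python
import time
import torch

def measure_fps(model, input_size=(1, 3, 640, 640), warmup=50, runs=200):
    """Average forward-pass throughput in frames per second."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(input_size, device=device)

    with torch.no_grad():
        for _ in range(warmup):          # warm-up stabilizes clocks and caches
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()     # ensure queued kernels have finished
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    return runs * input_size[0] / elapsed  # images per second
```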
Overall, PEYOLO, by incorporating the DcPAN, PEHead, and RFEM modules, outperforms existing lightweight object detectors in detection accuracy, recall, and multi-scale object detection capabilities. It also achieves a good balance between computational complexity and inference speed. Compared to traditional YOLO models, PEYOLO excels in small object detection, elongated defect detection, and detection in high-noise environments, making it particularly suitable for industrial steel defect detection tasks.
Cross-domain experiments on PCB-DET dataset
To further validate the generalization capability and cross-domain robustness of PEYOLO, cross-domain experiments were conducted on the PCB-DET dataset. This dataset differs significantly from NEU-DET in terms of visual style and represents defect types from different industrial scenarios. The objective of this experiment is to evaluate PEYOLO’s detection performance across domains and conduct a comprehensive comparison with mainstream lightweight detectors in terms of detection accuracy, inference speed, model parameters, and computational efficiency.
As shown in Table 11, PEYOLO outperforms all comparison methods, achieving an mAP50 of 88.1%, an F1 score of 88.0%, a precision of 94.3%, and a recall of 82.8%. Compared to the widely used YOLOv5n, PEYOLO improves mAP50 by 1.9% while maintaining a comparable parameter size. Although the inference speed of PEYOLO is slightly lower, it still meets the real-time requirements for industrial applications.
Notably, in comparison with the high-accuracy model YOLOv9t, PEYOLO achieves superior precision and F1 score while offering a 2.5× increase in inference speed. This demonstrates PEYOLO's excellent balance between detection performance and deployment efficiency. Furthermore, even when compared to ultra-lightweight models such as YOLOv8n-ghost, PEYOLO maintains significant advantages in detection accuracy and robustness, further verifying the effectiveness of the proposed architecture in lightweight scenarios.
In terms of the precision-recall trade-off, PEYOLO achieves the highest precision, which is crucial for reducing false positives in real-world industrial inspection. At the same time, the recall of 82.8% indicates strong sensitivity and coverage in cross-domain environments.
In summary, PEYOLO not only delivers state-of-the-art detection performance in steel defect detection but also demonstrates superior adaptability and generalization on the PCB-DET dataset, a novel target domain. The DcPAN, PEHead, and RFEM modules consistently exhibit stable performance under varying defect characteristics and image conditions, further validating the cross-domain robustness and practical value of the proposed approach. Detection results of PEYOLO on the PCB-DET dataset are illustrated in Fig. 16.
Ablation analysis
To verify the effectiveness of the improvement strategies in PEYOLO, ablation experiments were conducted. Table 12 presents the detection performance of the baseline model with different components on the NEU-DET validation set. Each component contributes to detection accuracy. Specifically, introducing DcPAN increases both recall and mAP50 by 1.2% over the baseline. Adding PEHead improves precision by 2.5% and mAP50 by 0.7%. RFEM further raises recall and mAP50 by 3.4% and 1.6%, respectively. Overall, compared to the baseline, PEYOLO improves recall by 3.5%, mAP50 by 3.5%, and precision by 4.1%. The following sections provide a detailed experimental analysis of each improvement strategy: DcPAN, PEHead, and RFEM.
Effectiveness of DcPAN
DcPAN utilizes a multi-branch structure to extract multi-scale features with different receptive fields, effectively reducing the semantic gap between different levels. This enables efficient fusion of high-level semantic information in both the backbone and neck networks, allowing the model to focus on foreground objects. To verify the effectiveness of DcPAN, this paper analyzes its impact on various types of defects. The results are shown in Table 13.
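The exact layer configuration of DcPAN is not reproduced here, but the described idea, parallel branches with different receptive fields whose outputs are fused and passed through a residual connection, can be sketched as follows. The kernel sizes, channel split, and activation are illustrative assumptions, not the published design.

```python
import torch
import torch.nn as nn

class MultiBranchFusion(nn.Module):
    """Illustrative multi-branch block in the spirit of DcPAN:
    parallel convolutions with different receptive fields are
    concatenated so small and large defects are represented together.
    Assumes the channel count is divisible by 4."""

    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 4
        self.reduce = nn.Conv2d(channels, mid, 1)
        # Branches with increasing receptive field (assumed kernel sizes).
        self.b3 = nn.Conv2d(mid, mid, 3, padding=1)
        self.b5 = nn.Conv2d(mid, mid, 5, padding=2)
        self.b7 = nn.Conv2d(mid, mid, 7, padding=3)
        self.fuse = nn.Conv2d(4 * mid, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        y = self.act(self.reduce(x))
        out = torch.cat([y, self.b3(y), self.b5(y), self.b7(y)], dim=1)
        return self.act(self.fuse(out)) + x  # residual keeps gradient flow
```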
Crazing and Inclusion defects often have colors and textures similar to the background, making them susceptible to background noise interference. DcPAN suppresses this noise through multi-scale feature fusion, increasing the mAP50 for Crazing and Inclusion by 5.3% and 4%, respectively, which indicates that DcPAN effectively focuses on defect regions.
For Scratches defects, which exhibit significant scale variations, traditional methods struggle to handle them effectively. Through its multi-branch parallel structure, DcPAN enhances the model's sensitivity to scale variations, improving the detection performance for Scratches by 5.3% in mAP50. This highlights the advantage of DcPAN in multi-scale feature extraction, particularly for defects with large scale variations.
However, for Patches and Pitted_surface defects, the detection performance declines. These defect types have relatively larger scales and may require a broader receptive field for effective detection. Analysis suggests that DcPAN has certain limitations in handling these defects, and further optimization of the receptive field or the incorporation of more powerful background modeling techniques may be needed to enhance its detection capability for large-scale defects.
To further validate the advantages of DcPAN, we compared it with other feature fusion networks. As shown in Table 14, compared to the original FPN, DcPAN significantly improves precision, recall, mAP50, and F1-score by 0.9%, 8.8%, 10.8%, and 11.3%, respectively. Furthermore, DcPAN demonstrates clear advantages over advanced architectures such as BiFPN, BiFPN-SDI46, and MAFPN47. As illustrated in Fig. 17, the baseline model with DcPAN significantly outperforms other FPN architectures in addressing challenges related to defect scale variations and complex background interference.
Effectiveness of PEHead
PEHead is designed to address the challenges of multi-scale variations and local occlusions in steel defect detection. To verify its effectiveness, experiments were conducted on the NEU-DET dataset, and the impact of PEHead on different defect types was analyzed, as shown in Table 13.
Specifically, PEHead exhibits varying detection performance across defect types. For Pitted_surface and Scratches defects, which have significant size differences, traditional detectors often struggle to extract features across scales. By adopting a three-head configuration, PEHead demonstrates superior multi-scale feature extraction, improving mAP50 by 6.5% for Pitted_surface and 2.1% for Scratches. These results indicate that PEHead effectively enhances the detector's adaptability to scale variations.
In real-world industrial environments, Inclusion-type defects are often occluded by shadows or other large-scale defects, making traditional detection methods prone to missed detections. By introducing the SEAM module, PEHead learns the relationships between occluded and non-occluded regions, achieving accurate classification and localization of occluded defects. For Inclusion defects, the mAP50 increased by 2.7%, significantly reducing occlusion-induced missed detections and enhancing the detector's robustness in complex environments.
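The SEAM module originates from YOLO-FaceV2 (cited in the reference list). The sketch below is a simplified, illustrative version of that style of occlusion-aware attention: depthwise convolutions model local structure per channel, and an exponential channel gate re-weights responses. The layer widths and the 3×3 depthwise layout are assumptions, not the exact PEHead configuration.

```python
import torch
import torch.nn as nn

class SEAMLike(nn.Module):
    """Simplified SEAM-style attention: depthwise convolution models
    per-channel local structure, and an exponential channel gate
    re-weights responses so occluded regions can be compensated."""

    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, 1),  # pointwise channel mixing
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(),
            nn.Linear(channels // 4, channels),
        )

    def forward(self, x):
        y = self.local(x) + x                       # residual local branch
        w = self.gate(self.pool(y).flatten(1))      # (B, C) channel logits
        w = torch.exp(w).view(x.size(0), -1, 1, 1)  # exponential gate
        return y * w
```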
However, for Patches and Rolled-in_scale defects, detection performance did not exhibit a significant improvement and even showed a slight decline. This phenomenon may be attributed to the unique shapes and blurred boundaries of these defect types, which pose challenges for the existing feature extraction mechanisms in achieving precise classification. Future improvements could involve optimizing feature selection strategies or incorporating stronger contextual information to enhance detection capabilities.
To comprehensively evaluate the performance of PEHead, comparisons were made with other mainstream detection heads on the NEU-DET validation set. As shown in Table 15, compared to the detection head of YOLOv8, the detection head of RT-DETR, and the Efficient Head of YOLOv6, PEHead demonstrates superior performance in both mAP50 and recall. Specifically, with the detection head of the baseline model (YOLOv8n), the mAP50 was 74.6%, recall was 71%, and the parameter count was 3.15 M. After integrating PEHead, the mAP50 increased by 1.1%, recall improved by 2.3%, and the parameter count dropped to 2.91 M, achieving a balanced trade-off between accuracy and computational efficiency.
Additionally, to visually demonstrate the effectiveness of PEHead, visual analyses were performed. As shown in Fig. 18, PEHead demonstrates significant advantages in addressing false positives and missed detections, providing an efficient and robust solution for practical industrial inspection applications.
Effectiveness of RFEM
RFEM is designed to enhance the receptive field to mitigate the issue of insufficient visual information in steel defect detection. To verify the effectiveness of RFEM, experiments were conducted on the NEU-DET dataset, and the impact of RFEM on different defect types was analyzed, as shown in Table 13.
Specifically, RFEM exhibits varying degrees of influence on different defect types. For Scratches, which are characterized by extreme aspect ratios, traditional 2D convolutions have limitations in feature extraction for elongated defects. The introduction of 1D convolution in RFEM improves detection performance for such objects, resulting in a 2.8% increase in mAP50. This demonstrates that RFEM effectively captures the edge information of elongated defects, thereby enhancing detection accuracy.
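The underlying idea, replacing a dense large 2D kernel with orthogonal 1D depthwise convolutions (in the spirit of the large separable kernel attention cited in the reference list), can be sketched as follows. The kernel size of 11, the residual wiring, and the pointwise mixing layer are illustrative assumptions rather than the published RFEM definition.

```python
import torch.nn as nn

class SeparableLargeKernel(nn.Module):
    """Decompose a large k x k receptive field into 1 x k and k x 1
    depthwise convolutions. The horizontal/vertical strips match
    elongated defects such as scratches at a fraction of the cost
    of a dense k x k kernel. Kernel size is an assumed example."""

    def __init__(self, channels: int, k: int = 11):
        super().__init__()
        p = k // 2
        self.h = nn.Conv2d(channels, channels, (1, k),
                           padding=(0, p), groups=channels)
        self.v = nn.Conv2d(channels, channels, (k, 1),
                           padding=(p, 0), groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)  # pointwise mixing

    def forward(self, x):
        return self.pw(self.v(self.h(x))) + x  # residual connection
```

A dense 11×11 convolution costs 121 weights per channel; the two 1D strips cost 22, which is why such decompositions expand the receptive field with little overhead.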
Pitted_surface defects typically cover a larger area, making it difficult for traditional methods to capture all of their features. RFEM effectively expands the receptive field, leading to a 4.3% increase in mAP50 for this defect type and confirming the effectiveness of RFEM in enhancing global feature extraction.
However, for Rolled-in_scale defects, the mAP50 decreased by 3.4%. Analysis suggests that although a larger receptive field helps capture macroscopic features, it may introduce unnecessary contextual information when detecting small-scale defects, thereby reducing attention to fine details. Future improvements could involve incorporating more refined feature selection mechanisms to achieve better detection performance across different scales.
To further validate the superiority of the RFEM, backbone replacement experiments were conducted, as shown in Table 16. The models involved in the experiment include MobileNetV448, EfficientViT49, Fasternet50, StarNet51, and LSKNet52. For fairness, each model uses the same neck and head. Experimental results indicate that the backbone with RFEM outperforms other backbone networks in improving model performance, further demonstrating the effectiveness of RFEM.
Additionally, visual analyses were conducted. As shown in Fig. 19, green regions represent the receptive field coverage of the backbone network. The backbone with RFEM has a significantly larger receptive field than other backbone networks, indicating that RFEM effectively expands the perception range and enhances global defect detection capabilities.
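The paper does not specify how the maps in Fig. 19 are generated; a common way to produce such visualizations is the effective receptive field, i.e., the gradient of a central output activation with respect to the input. The sketch below is a generic implementation of that technique under the assumption that the backbone returns a single (B, C, H, W) feature map; it is not the authors' plotting code.

```python
import torch

def effective_receptive_field(backbone, input_size=(1, 3, 640, 640)):
    """Gradient of the center activation of the deepest feature map
    w.r.t. the input; bright regions mark the effective receptive field."""
    x = torch.randn(input_size, requires_grad=True)
    feat = backbone(x)                 # assumed single (B, C, H, W) output
    h, w = feat.shape[-2:]
    feat[..., h // 2, w // 2].sum().backward()
    erf = x.grad.abs().sum(dim=1)[0]   # aggregate over input channels
    return erf / erf.max()             # normalized map for plotting
```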
In summary, RFEM exhibits significant advantages in receptive field and detection accuracy, particularly for elongated and large-area defects. Although detection performance for certain small-scale defects declined, further optimization of feature extraction strategies is expected to achieve a better balance in multi-scale defect detection.
Generalization verification
To test the robustness and generalization ability of the proposed modules, they were embedded into YOLOv5. As shown in Table 17, compared to the original YOLOv5 model, introducing RFEM resulted in a 0.4% increase in precision and a 0.8% increase in mAP50. After incorporating PEHead, precision increased by 6.1% and mAP50 improved by 1.8%. With the introduction of DcPAN, precision increased by 6.3% and mAP50 improved by 3.3%. These results demonstrate that the proposed modules exhibit strong compatibility with the YOLO framework.
Conclusion
This study proposes PEYOLO, an efficient perceptual deep learning framework for multi-scale surface defect detection. To address the challenges of small-scale defects, background noise interference, and limited visual feature representation, three innovative modules were designed.
Advantage
DcPAN enhances the cross-scale feature fusion capability of the detector, effectively bridging the semantic gap between features at different levels. By optimizing the gradient information flow between the backbone network and the neck, DcPAN captures fine-grained defect features with high precision while preserving global context information, significantly improving the detection accuracy of small-scale and complex defects. This provides a superior feature fusion strategy for small object detection in industrial vision tasks.
PEHead effectively alleviates the issues of local occlusion and feature overlap, ensuring accurate localization of densely distributed defects. With a multi-scale feature expression strategy, PEHead enhances the detector's adaptability to variations in defect size and shape, offering new insights for handling overlapping defects in high-noise environments.
The RFEM adopts a novel large-kernel separable convolution strategy, which expands the detector's receptive field without adding excessive computational overhead. This design is particularly suitable for detecting elongated defects with extreme aspect ratios while maintaining high sensitivity to small-scale defects, laying the foundation for future receptive field optimization research.
Experimental results on three benchmark datasets demonstrate that PEYOLO achieves state-of-the-art detection performance, excelling in key metrics such as mAP50 and F1 score, while maintaining high inference speed. Notably, PEYOLO’s generalization ability across different defect types and industrial application scenarios showcases its potential as a universal solution for steel defect detection.
Limitation
Although PEYOLO demonstrates outstanding detection performance and strong generalization across multiple benchmark datasets and various defect types, it still exhibits certain limitations under specific conditions that warrant further optimization:
First, DcPAN shows some weakness when handling large-scale defects. This may be attributed to its multi-branch structure, which primarily focuses on fusing multi-scale features with small receptive fields, limiting its ability to model contextual information in large-scale regions. Future work may integrate larger receptive fields or global context modeling mechanisms to improve its adaptability to large-scale defects.
Second, PEHead offers limited improvements for morphologically blurred or poorly defined defects, and even exhibits slight performance degradation in certain categories. This suggests that the current multi-head structure and attention mechanism still face bottlenecks in feature representation. Future research could explore enhanced feature extraction or more effective attention-guided mechanisms to improve the recognition accuracy of complex-shaped defects.
Lastly, while RFEM significantly expands the receptive field and improves detection of elongated and large-area defects, it tends to introduce redundant information when processing fine-grained defects, potentially diminishing attention to local details. This highlights a trade-off between receptive field design and feature selection. Future directions may involve dynamic receptive field regulation strategies or multi-scale feature reweighting mechanisms to achieve better synergy between global perception and detail preservation.
In summary, although PEYOLO has achieved excellent performance in surface defect detection for steel, there remains room for further improvement—particularly in extreme scenarios characterized by scale imbalance, complex morphology, and cluttered backgrounds. Future work will focus on architectural optimization, dynamic feature selection, and adaptive receptive field adjustment to promote broader application of PEYOLO in intelligent industrial visual inspection.
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding authors upon reasonable request.
References
Lu, J. et al. Steel strip surface defect detection method based on improved YOLOv5s. Biomimetics 9, 28 (2024).
Ghorai, S. et al. Automatic defect detection on hot-rolled flat steel products. IEEE Trans. Instrum. Meas. 62, 612–621 (2012).
Pernkopf, F. Detection of surface defects on raw steel blocks using Bayesian network classifiers. Pattern Anal. Appl. 7, 333–342 (2004).
Ren, S. et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2016).
Liu, W., Anguelov, D., Erhan, D. et al. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, October 11–14 (2016).
Tian, Z. et al. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1922–1933 (2020).
Tan, M., Pang, R., Le, Q. V. Efficientdet: Scalable and efficient object detection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10781–10790 (2020).
Redmon, J. & Farhadi, A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
Li, C., Li, L., & Jiang, H. et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).
Wang, C. Y., Bochkovskiy, A., Liao, H. Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7464–7475 (2023).
Wang, C. Y., Yeh, I. H., & Liao, H. Y. M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616 (2024).
Wang, A., Chen, H., & Liu, L. et al. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458 (2024).
LeCun, Y., Bengio, Y. Convolutional networks for images, speech, and time series. In The handbook of brain theory and neural networks, 3361 (1995).
Zhu, X., Su, W., & Lu, L. et al. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020).
Zhao, Y., Lv, W., & Xu, S. et al. Detrs beat yolos on real-time object detection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16965–16974 (2024).
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Wang, Z., Li, C., & Xu, H. et al. Mamba YOLO: SSMs-Based YOLO for object detection. arXiv preprint arXiv:2406.05835 (2024).
Ding, X., Zhang, X., & Han, J. et al. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11963–11975 (2022).
Wang, C. Y., Liao, H. Y. M. & Wu, Y. H. et al. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 390–391 (2020).
Lin, T. Y., Dollár, P. & Girshick, R. et al. Feature pyramid networks for object detection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2117–2125 (2017).
Liu, S., Qi, L. & Qin, H. et al. Path aggregation network for instance segmentation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8759–8768 (2018).
Tishby, N., & Zaslavsky, N. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), 1–5 (2015).
Zhou, S. et al. Surface defect detection of steel plate based on SKS-YOLO. IEEE Access 12, 91499 (2024).
Yu, B. et al. YOLO-MPAM: Efficient real-time neural networks based on multi-channel feature fusion. Expert Syst. Appl. 252, 124282 (2024).
Liu, Q. et al. A real-time anchor-free defect detector with global and local feature enhancement for surface defect detection. Expert Syst. Appl. 246, 123199 (2024).
Yuan, M. et al. YOLO-HMC: An improved method for PCB surface defect detection. IEEE Trans. Instrum. Meas. 73, 1 (2024).
Wu, Y. et al. SDD-YOLO: A lightweight, high-generalization methodology for real-time detection of strip surface defects. Metals 14, 650 (2024).
Zhu, X. et al. High-speed and accurate cascade detection method for chip surface defects. IEEE Trans. Instrum. Meas. 73, 1 (2024).
Zhang, Y. et al. DsP-YOLO: An anchor-free network with DsPAN for small object detection of multiscale defects. Expert Syst. Appl. 241, 122669 (2024).
Su, P. et al. MOD-YOLO: Rethinking the YOLO architecture at the level of feature information and applying it to crack detection. Expert Syst. Appl. 237, 121346 (2024).
Yeung, C. C. & Lam, K. M. Efficient fused-attention model for steel surface defect detection. IEEE Trans. Instrum. Meas. 71, 1–11 (2022).
Peng, J. et al. Industrial surface defect detection and localization using multi-scale information focusing and enhancement GANomaly. Expert Syst. Appl. 238, 122361 (2024).
Song, C. et al. Steel surface defect detection via deformable convolution and background suppression. IEEE Trans. Instrum. Meas. 72, 1–9 (2023).
Yu, T. et al. CRGF-YOLO: An optimized multi-scale feature fusion model based on YOLOv5 for detection of steel surface defects. Int. J. Comput. Intell. Syst. 17, 154 (2024).
Liu, Q. et al. A real-time anchor-free defect detector with global and local feature enhancement for surface defect detection. Expert Syst. Appl. 246, 123199 (2024).
Dong, C. et al. EL-Net: An efficient and lightweight optimized network for object detection in remote sensing images. Expert Syst. Appl. 255, 124661 (2024).
Chen, J. et al. Multiscale attention networks for pavement defect detection. IEEE Trans. Instrum. Meas. 72, 1–12 (2023).
Zhou, Q. & Wang, H. CABF-YOLO: A precise and efficient deep learning method for defect detection on strip steel surface. Pattern Anal. Appl. 27, 36 (2024).
Tie, J. et al. LSKA-YOLOv8: A lightweight steel surface defect detection algorithm based on YOLOv8 improvement. Alex. Eng. J. 109, 201–212 (2024).
Tan, M., Pang, R. & Le, Q. V. Efficientdet: Scalable and efficient object detection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10781–10790 (2020).
Chen, Y., Yuan, X. & Wu, R. et al. Yolo-ms: Rethinking multi-scale representation learning for real-time object detection. arXiv preprint arXiv:2308.05480 (2023).
Yu, Z., Huang, H. & Chen, W. et al. YOLO-FaceV2: A scale and occlusion aware face detector. arXiv preprint arXiv:2208.02019 (2022).
Guo, M. H. et al. Visual attention network. Comput. Vis. Med. 9, 733 (2023).
Lau, K. W., Po, L. M. & Rehman, Y. A. U. Large separable kernel attention: Rethinking the large kernel attention design in CNN. Expert Syst. Appl. 236, 121352 (2024).
Peng, Y., Sonka, M., Chen, D. Z. U-Net v2: Rethinking the skip connections of U-Net for medical image segmentation. arXiv preprint arXiv:2311.17791 (2023).
Xue, Y. et al. MAF-YOLO: Multi-modal attention fusion based YOLO for pedestrian detection. Infrared Phys. Technol. 118, 103906 (2021).
Qin, D., Leichner, C. & Delakis, M. et al. MobileNetV4-universal models for the mobile ecosystem. arXiv preprint arXiv:2404.10518 (2024).
Liu, X., Peng, H. & Zheng, N. et al. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14420–14430 (2023).
Chen, J., Kao, S. & He, H. et al. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12021–12031 (2023).
Ma, X., Dai, X. & Bai, Y. et al. Rewrite the stars. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5694–5703 (2024).
Li, Y., Hou, Q. & Zheng, Z. et al. Large selective kernel network for remote sensing object detection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16794–16805 (2023).
Funding
This research was funded by the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2024JC-YBMS-342), the Natural Science Foundation of Shaanxi Province (2021JM-537), the Key Program of the National Social Science Foundation of China (NSSFC, 23AGL039), and the Shaanxi Provincial Science and Technology Plan Project (2024GX-YBXM-114).
Author information
Contributions
X.L. and Y.Z. wrote the main manuscript text; X.J. performed data analysis; Q.M. and Z.G. conducted the experiments; R.Y. and Y.Y. processed the figures; B.Y. supervised the experimental design and revised the manuscript. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Li, X., Zhao, Y., Jiao, X. et al. PEYOLO: a perception efficient network for multiscale surface defects detection. Sci. Rep. 15, 28804 (2025). https://doi.org/10.1038/s41598-025-05574-0