Introduction

Insulator strings serve as critical components of overhead transmission lines, fulfilling the dual roles of electrical insulation and mechanical support [1]. Prolonged exposure to harsh environments including intense ultraviolet radiation, heavy rain, lightning strikes, and significant temperature fluctuations can lead to material aging and performance degradation. This deterioration may induce potential hazards such as flashovers and physical damage, consequently triggering safety incidents [2]. Therefore, achieving accurate identification of defects in insulator strings is paramount for ensuring the safe and stable operation of the power grid.

Existing Unmanned Aerial Vehicle (UAV) imagery is often subject to interference from complex background terrain, variable lighting conditions, and the diversity in target scale and orientation, posing significant challenges for the precise detection of insulator string defects [3]. Early research on insulator detection primarily relied on handcrafted features or prior rules for target discrimination and localization. Depending on the feature extraction approach, these methods can be broadly categorized into several classes: threshold segmentation, texture/shape analysis, and feature point matching. For instance, in threshold segmentation, researchers separated insulator targets by setting thresholds in the RGB color space [4], Lab color space [5], or grayscale images [6]. In texture and shape analysis, descriptors such as the Gray-Level Co-occurrence Matrix (GLCM) [7], edge histograms [8], or shape invariant moments [9] were employed to characterize the target’s appearance patterns. For feature point matching, algorithms like SIFT and SURF [10] were utilized to extract and match stable local keypoints. Although these methods achieved certain effectiveness in specific scenarios, their performance was fundamentally constrained by the limitations of the manually designed features themselves. Static threshold rules struggle to adapt to the dynamic variations of illumination and weather in field environments [11]; handcrafted texture and shape features lack robustness against target occlusion, deformation, and viewpoint changes; while methods based on local feature point matching face a trade-off between computational efficiency and matching stability, and are prone to ambiguous matches in cluttered backgrounds. Consequently, traditional approaches find it difficult to bridge the “semantic gap” introduced by complex scenes and fail to meet the requirements for highly robust and adaptive automated inspection [12].

To overcome the fundamental limitations of traditional methods, deep learning-based object detection approaches utilizing Convolutional Neural Networks (CNNs) have emerged and rapidly become mainstream in this field. Single-stage detectors, represented by the YOLO series [13], have been widely adopted for insulator defect detection tasks due to their favorable balance between detection speed and accuracy. The evolution of related research clearly focuses on enhancing the model’s ability to represent insulators and their multi-scale, small-sized defects, primarily deepening along three directions: first, strengthening the feature extraction capability of the backbone network, for instance, by incorporating attention mechanisms or designing more efficient network architectures [14] to capture more discriminative foundational features; second, optimizing the feature fusion capability of the neck network through designing more complex variants of Feature Pyramid Networks (FPN) [15,16] or context-aware modules [17] to better fuse shallow spatial details with deep semantic information; third, implementing specialized designs for small targets, such as adding dedicated detection layers for small objects [18] or introducing deformable convolutions [19] to enhance sensitivity to subtle defect features like cracks. These improvements have, to some extent, elevated detection performance metrics.

However, with deeper investigation, existing deep learning-based detection models, when confronting extremely complex scenarios, are gradually encountering performance bottlenecks inherent to their network architectures. This is mainly reflected in two profound contradictions that have not yet been systematically resolved:

  1. The contradiction of “irreversible attenuation” of feature representation during long-path propagation. CNNs acquire high-level semantic features through cascaded down-sampling operations, but this process inevitably loses high-frequency details (e.g., edges, texture) and low-frequency structural information that are crucial for precise localization and fine-grained recognition. Although attention mechanisms [20] or multi-branch structures [21] can enrich features to some extent within the backbone network, they cannot effectively compensate for the global information loss that continuously occurs along the lengthy forward propagation path from the “backbone network → neck network → detection head(s)”. This attenuation directly leads to a decline in the representational capacity of the features fed into the final detection head, becoming a key internal factor constraining model performance.

  2. The contradiction of “static and insufficient discriminative power” in multi-scale feature fusion. Most existing neck networks rely on linear operations with fixed weights, such as addition or concatenation, for multi-scale feature fusion. In the face of complex situations where insulator defects share visual similarities with background clutter (e.g., bird nests, tower structures, vegetation) or where features of different defect categories overlap, this static, non-adaptive fusion strategy lacks the ability to dynamically reinforce key discriminative features and suppress irrelevant background noise. Consequently, it struggles to establish clear and robust decision boundaries within the feature space, leading to missed detections and false alarms by the model in complex backgrounds [22,23].

In summary, the forefront challenge in current insulator defect detection research has evolved from designing localized improvement modules to addressing, at the holistic network architecture level, the systematic issues of information attenuation and discriminative degradation during feature propagation.

To fundamentally tackle this problem, this paper proposes an integrated detection framework that combines robust region proposal with deep feature compensation, with the primary contributions as follows:

  1. A data-driven adaptive region proposal method: By integrating RGB color space statistical priors with local texture consistency metrics, this method achieves fast and robust initial localization of insulator strings within complex backgrounds, providing high-quality candidate regions for subsequent precise detection.

  2. A heterogeneous parallel feature extraction network: It introduces a frequency-domain decoupling and parallel enhancement mechanism at the input stage, explicitly separating and specifically enhancing high-frequency detail-texture information and low-frequency structural-contour information, thereby mitigating feature aliasing and loss at the source.

  3. A hierarchical dynamic feature compensation module: This module designs a content-aware gated fusion mechanism that dynamically reorganizes frequency-domain information based on the semantics of feature maps at each level. It further incorporates a feature reconstruction unit to enhance discriminative power, achieving synergy between “compensation” and “enhancement.”

Insulator defect detection method based on image color guidance and multi-scale feature compensation

Robust insulator string segmentation preprocessing based on RGB and texture analysis

To overcome interference from complex environmental backgrounds and provide high-purity Regions of Interest (ROI) for subsequent deep networks, this stage designs a segmentation preprocessing algorithm jointly driven by data and prior knowledge. Its core lies in comprehensively utilizing the distinctive color distribution prior of insulators and local texture consistency to perform hierarchical processing from coarse screening to fine localization.

RGB threshold segmentation based on adaptive illumination normalization

To ensure the stability of subsequent threshold segmentation and overcome the impact of complex environmental lighting conditions (such as strong light, backlighting, and shadows) on color constancy, this paper applies an adaptive illumination normalization preprocessing to the input aerial images. This aims to suppress illumination non-uniformity and enhance the color contrast of the insulator regions.

Given an input image I, it is first converted from the RGB color space to the Lab color space, where the separation between luminance and color information is more thorough. In the Lab space, the luminance channel L is approximately decoupled from the color channels a and b, facilitating independent adjustment of illumination. The normalization process is as follows:

  1. Adaptive Correction of the Luminance Channel: Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to the L channel. By limiting the contrast gain in local regions, this significantly enhances the overall visible details of the image while avoiding the introduction of excessive noise in uniform sky areas.

  2. Collaborative Enhancement of Color Channels: To increase the color distinction between insulators and the background, a linear transformation based on standard deviation stretching is performed on the a and b channels. Let the original channel mean be \(\mu\) and the standard deviation be \(\sigma\). The transformed pixel value \(p^{\prime}\) is given by:

$$p^{\prime}=\mu +S \cdot (p - \mu )$$
(1)

where S is a stretching factor adaptively selected based on the overall color contrast of the image. This operation moderately saturates the colors, causing the typical colors of insulators, such as blue and red, to become more clustered in the color space.

The processed L, a, and b channels are then recombined and converted back to the RGB space, yielding the normalized image \({I_{normal}}\). This process maps the color values of insulators captured under varying lighting conditions to a more stable and consistent range, greatly alleviating issues of color distortion or insufficient contrast in the original images caused by overexposure or underexposure.
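As a concrete illustration, the standard-deviation stretch of Eq. (1) can be sketched in NumPy as below. The rule for choosing S (matching a target standard deviation, with clipping) is an illustrative assumption, not the paper's exact selection criterion; the CLAHE step on the L channel is assumed to be handled separately by an image-processing library.

```python
import numpy as np

def stretch_channel(p, S):
    """Eq. (1): p' = mu + S * (p - mu), applied to one Lab color channel.

    S > 1 stretches the channel about its mean (more saturation),
    S < 1 compresses it. The mean of the channel is preserved exactly.
    """
    p = p.astype(np.float64)
    mu = p.mean()
    return mu + S * (p - mu)

def adaptive_S(p, target_sigma=20.0):
    # Assumption: pick S so the stretched channel reaches a target
    # standard deviation, clipped to avoid extreme saturation.
    sigma = p.astype(np.float64).std()
    if sigma < 1e-6:
        return 1.0
    return float(np.clip(target_sigma / sigma, 1.0, 3.0))

# Example: a flat, low-contrast 'a' channel becomes more spread out
a = np.array([[118., 120., 122.], [119., 121., 123.]])
S = adaptive_S(a)
a2 = stretch_channel(a, S)
```

The mean-preserving form of Eq. (1) is what keeps the overall hue of the image stable while only the spread (and hence the color contrast) is increased.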

Based on historical inspection data, an adaptive RGB threshold segmentation model is constructed to dynamically set the interval ranges for each color component and the relational constraints between them. This model matches the inherent color characteristics of insulators on specific power lines under local environmental lighting, thereby maximizing the filtration of obvious backgrounds at the initial stage. For a designated tower, the historical extremum values of each RGB component within the insulator regions are calculated from N previously collected images. These values define the threshold basis for segmenting the current image, as detailed in Eqs. (2) to (4). Concurrently, difference constraints between the components are introduced, as shown in Eqs. (5) to (7), to enhance the ability to exclude pixels with anomalous colors. This adaptive strategy effectively mitigates the sensitivity of fixed thresholds to illumination changes and accomplishes preliminary coarse screening of the background.

$$\frac{1}{N}\sum\limits_{i=1}^{N} R_{\min }(i)<r<\frac{1}{N}\sum\limits_{i=1}^{N} R_{\max }(i)$$
(2)
$$\frac{1}{N}\sum\limits_{i=1}^{N} G_{\min }(i)<g<\frac{1}{N}\sum\limits_{i=1}^{N} G_{\max }(i)$$
(3)
$$\frac{1}{N}\sum\limits_{i=1}^{N} B_{\min }(i)<b<\frac{1}{N}\sum\limits_{i=1}^{N} B_{\max }(i)$$
(4)
$$|r - g| \leqslant T_{rg}$$
(5)
$$|r - b| \leqslant T_{rb}$$
(6)
$$|g - b| \leqslant T_{gb}$$
(7)

In Eqs. (2) to (7), r, g, and b represent the red, green, and blue color components of the current aerial image. \(R_{\min}(i)\)/\(R_{\max}(i)\), \(G_{\min}(i)\)/\(G_{\max}(i)\), and \(B_{\min}(i)\)/\(B_{\max}(i)\) denote the minimum and maximum values of the three components extracted from the insulator strings on the specified tower in the i-th historically collected UAV image. N is the total number of historical collection instances. \(T_{rg}\), \(T_{rb}\), and \(T_{gb}\) represent the thresholds on the absolute differences between the red-green, red-blue, and green-blue components, respectively.

When inspecting the insulator strings on a designated tower, the threshold settings for the r, g and b components in a newly acquired aerial image can be referenced against historical inspection records from previous UAV collections. This allows for filtering out a portion of the background and annotating the remaining areas. This approach effectively mitigates the sensitivity of fixed thresholds to illumination variations and achieves preliminary, efficient elimination of non-insulator backgrounds.
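Under the notation of Eqs. (2)-(7), the coarse screening can be sketched as a vectorised mask computation. The array layouts and parameter names below are illustrative assumptions.

```python
import numpy as np

def rgb_threshold_mask(img, hist_mins, hist_maxs, Trg, Trb, Tgb):
    """Coarse insulator mask from Eqs. (2)-(7).

    img       : H x W x 3 array in (R, G, B) order.
    hist_mins : N x 3 array of per-image minima (Rmin, Gmin, Bmin)
                taken from insulator regions in N historical images.
    hist_maxs : N x 3 array of the corresponding per-image maxima.
    Trg/Trb/Tgb bound the channel differences (Eqs. 5-7).
    """
    lo = hist_mins.mean(axis=0)   # averaged historical minima (left sides of Eqs. 2-4)
    hi = hist_maxs.mean(axis=0)   # averaged historical maxima (right sides)
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    in_range = ((r > lo[0]) & (r < hi[0]) &
                (g > lo[1]) & (g < hi[1]) &
                (b > lo[2]) & (b < hi[2]))
    coherent = ((np.abs(r - g) <= Trg) &
                (np.abs(r - b) <= Trb) &
                (np.abs(g - b) <= Tgb))
    return in_range & coherent

# Two-pixel example: one insulator-like pixel, one dark background pixel
mask = rgb_threshold_mask(
    np.array([[[150, 150, 150], [50, 50, 50]]]),
    hist_mins=np.array([[100, 100, 100], [110, 110, 110]]),
    hist_maxs=np.array([[200, 200, 200], [190, 190, 190]]),
    Trg=30, Trb=30, Tgb=30)
```

The mask is then passed to morphological processing before connected-region analysis, as described in the following subsection.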

Texture feature-guided fine localization

Through the adaptive RGB threshold segmentation model and mathematical morphological processing, image regions exhibiting similar color statistical features can be initially extracted. However, this coarse localization result still faces two fundamental issues that constrain the performance of subsequent deep network-based detection. Firstly, segmentation boundary uncertainty exists. In the boundary regions between the target and the background, due to mixed-pixel effects, localized abrupt illumination changes, and slight motion blur, single-color thresholds struggle to achieve pixel-level precise localization. This leads to extracted region contours exhibiting phenomena of “over-erosion” or “under-segmentation,” directly impacting the geometric accuracy of subsequent defect localization. Secondly, inter-class ambiguity of color features presents a challenge. Inspection scenes contain interfering objects such as corroded tower materials and shadow-covered conductors. Their surface reflection characteristics may coincidentally resemble the historical color statistical model of insulators under specific lighting conditions, causing them to be retained as pseudo-targets. If such candidate regions containing geometric noise and semantic interference are directly fed into a deep neural network, the model is forced to couple the dual tasks of “target identification” and “defect recognition.” This not only unnecessarily increases the learning complexity of the model but may also cause the network to learn spurious correlation features. Consequently, this can weaken its discriminative sensitivity to subtle defects (such as millimeter-level cracks and local flashover carbon traces) in insulator strings and compromise the purity of its feature representation.

Further analysis reveals that both the periodic layered structure of sheds in disc insulator strings and the groove patterns on the surface of composite long-rod insulators form a distinct texture pattern within local image regions, characterized by significant directionality, spatial periodicity, and statistical consistency [24,25]. This pattern originates from the regularity of their physical structure and the uniformity of their manufacturing process, which fundamentally distinguishes it from the random, aperiodic textures formed by complex natural backgrounds such as the gradual uniformity of the sky, the disorderly clutter of vegetation, or the irregular mottling on building surfaces. Therefore, employing texture feature consistency verification as a post-color-segmentation refinement and purification step is crucial for filtering out color-similar interferences, precisely delineating the genuine targets, and optimizing their boundary representation.

This paper adopts the Gray-Level Co-occurrence Matrix (GLCM) for the quantitative characterization and consistency measurement of the aforementioned texture pattern. By statistically analyzing the joint probability distribution of paired gray-level values occurring at a fixed spatial relationship defined by distance d and direction θ, the GLCM precisely captures the fine structure, directional properties, and contrast information of the texture. Its mathematical definition, as shown in Eq. (8), aims to construct a mapping from the pixel gray-level space to the texture feature space.

$$G(K,Q)=\sum\limits_{x=1}^{v} \sum\limits_{y=1}^{v} P\left\{ A=K \;\&\; B=Q \right\},\quad A={I_{(i,j)}}(x,y),\quad B={I_{(i,j)}}(x - d\sin \theta ,\; y+d\cos \theta )$$
(8)

In the equation, i and j index the window position; x and y denote pixel coordinates within the window; K and Q are a specific pair of gray levels; v indicates the size of the window; I is the original image; and G(K,Q) is the co-occurrence count of that gray-level pair within the given window. Here, a “window” refers to one of the non-overlapping rectangular blocks into which the original image is partitioned.
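A minimal sketch of the per-window co-occurrence count in Eq. (8), assuming gray levels have already been quantised to 0..levels-1; the offset convention follows the (x - d sin θ, y + d cos θ) form above, and pairs whose offset falls outside the window are simply skipped.

```python
import numpy as np

def glcm(window, levels, d=1, theta=0.0):
    """Gray-level co-occurrence matrix of one sub-window (Eq. 8).

    Counts pairs (K, Q) with gray level K at (x, y) and Q at the
    offset position (x - d*sin(theta), y + d*cos(theta)).
    """
    dx = int(round(-d * np.sin(theta)))   # row offset
    dy = int(round(d * np.cos(theta)))    # column offset
    G = np.zeros((levels, levels), dtype=np.int64)
    rows, cols = window.shape
    for x in range(rows):
        for y in range(cols):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < rows and 0 <= y2 < cols:
                G[window[x, y], window[x2, y2]] += 1
    return G

# 0-degree direction: each pixel is paired with its right neighbour
w = np.array([[0, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
G0 = glcm(w, levels=2, d=1, theta=0.0)
```

In practice the matrix is normalised by its total count to obtain the joint probability table P(i, j, d, θ) used in the feature formulas below.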

Balancing feature discriminative power with computational complexity, and adhering to the principle of low inter-feature correlation, this paper selects four Haralick features from the multidimensional feature set derived from the GLCM. These features have clear physical meanings and have been proven to possess strong discriminative power in texture classification, forming the discriminant vector.

Energy (ASM) reflects the uniformity of the image’s gray-level distribution and the coarseness of the texture. A larger value indicates a coarser texture or a more concentrated gray-level distribution.

$$ASM=\sum\limits_{i=0}^{G - 1} \sum\limits_{j=0}^{G - 1} \left[ P(i,j,d,\theta ) \right]^{2}$$
(9)

Contrast (CON) measures the clarity of texture or the intensity of local gray-level variations, correlating positively with the visual depth of texture “valleys.”

$$CON=\sum\limits_{{i=0}}^{{G - 1}} {\sum\limits_{{j=0}}^{{G - 1}} {{{(i - j)}^2}P(i,j,d,\theta )} }$$
(10)

Correlation (COR) quantifies the degree of linear dependency between row and column elements in the GLCM, effectively characterizing the directionality and linear structure of the texture.

$$COR=\frac{\sum\limits_{i=0}^{G - 1} \sum\limits_{j=0}^{G - 1} ij\,P(i,j,d,\theta ) - \mu_i \mu_j}{s_i s_j}$$
(11)
$$\mu_k=\sum\limits_{i=0}^{G - 1} \sum\limits_{j=0}^{G - 1} k\,P(i,j,d,\theta ),\quad k=i,j$$
(12)
$$s_k=\sqrt{\sum\limits_{i=0}^{G - 1} \sum\limits_{j=0}^{G - 1} (k - \mu_k)^{2}\,P(i,j,d,\theta )},\quad k=i,j$$
(13)

Inverse Difference Moment (IDM), also known as Homogeneity, reflects the local uniformity of the texture. A larger value indicates gentler gray-level variations within a local region and a more homogeneous structure.

$$IDM=\sum\limits_{{i=0}}^{{G - 1}} {\sum\limits_{{j=0}}^{{G - 1}} {\frac{{P(i,j,d,\theta )}}{{1+{{(i - j)}^2}}}} }$$
(14)

In Eqs. (9)–(14), i and j are coordinate positions in the Gray-Level Co-occurrence Matrix, and G is the number of gray levels in the image. \(\mu_i\) and \(\mu_j\) are the row and column means of \(P(i,j)\), respectively, and \(s_i\) and \(s_j\) are the corresponding row and column standard deviations.
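The four Haralick descriptors of Eqs. (9)-(14) can be computed from a normalised co-occurrence matrix as sketched below; the small constant added to the denominator of COR is a numerical-stability assumption.

```python
import numpy as np

def haralick4(G):
    """ASM, CON, COR, IDM of Eqs. (9)-(14) from a co-occurrence matrix G.

    G is first normalised into a joint probability table P(i, j).
    """
    P = G.astype(np.float64) / G.sum()
    n = P.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    asm = np.sum(P ** 2)                          # Eq. (9)  energy
    con = np.sum((i - j) ** 2 * P)                # Eq. (10) contrast
    mu_i, mu_j = np.sum(i * P), np.sum(j * P)     # Eq. (12) row/col means
    s_i = np.sqrt(np.sum((i - mu_i) ** 2 * P))    # Eq. (13) row/col stds
    s_j = np.sqrt(np.sum((j - mu_j) ** 2 * P))
    cor = (np.sum(i * j * P) - mu_i * mu_j) / (s_i * s_j + 1e-12)  # Eq. (11)
    idm = np.sum(P / (1.0 + (i - j) ** 2))        # Eq. (14) homogeneity
    return asm, con, cor, idm

# A perfectly diagonal matrix is maximally homogeneous and correlated
asm, con, cor, idm = haralick4(np.eye(4))
```

On a diagonal co-occurrence matrix the contrast vanishes and homogeneity and correlation reach their maxima, matching the physical interpretations given above.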

To obtain a rotation-invariant and robust texture description while comprehensively capturing the structural characteristics of insulators in different orientations, this paper computes the aforementioned features in four directions: 0°, 45°, 90°, and 135°. This generates a 16-dimensional texture feature vector \(\mathbf{F}=[ASM_{\theta}, CON_{\theta}, COR_{\theta}, IDM_{\theta}]\), \(\theta \in \{0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}\}\), for each image sub-region.

The fine localization based on texture consistency first involves calculating the minimum enclosing horizontal rectangle for each binary connected region obtained after coarse segmentation and morphological refinement. The engineering rationale for this step lies in the fact that the actual installation posture of transmission line insulator strings is predominantly approximately horizontal or vertical, as shown in Fig. 1. Employing an axially aligned horizontal bounding box, on one hand, can tightly encompass the main target body, maximizing region utilization. On the other hand, its regular geometric shape greatly facilitates subsequent image cropping and size normalization, ensuring the standardization and consistency of data input to the deep learning model in the spatial dimension.

Fig. 1 Illustration of horizontal bounding box.

Subsequently, to perform fine-grained texture analysis, the aforementioned rectangular region is spatially divided into a grid of non-overlapping sub-windows. In this work, a sub-window size of 20 × 20 pixels is adopted to ensure each window contains sufficient texture structural information, as illustrated in Fig. 2. The 16-dimensional texture feature vector F_patch is computed for each sub-window. Then, F_patch is compared dimension-by-dimension with a benchmark feature vector F_std, which is obtained through offline learning from a large-scale database of pure insulator samples. This study defines a set of adaptive thresholds and establishes a strict texture consistency decision criterion: a candidate region is confirmed as a genuine insulator string and its optimized minimum enclosing horizontal rectangle is output only if a sufficiently high proportion (e.g., with a set threshold η = 85%) of its sub-windows satisfy \(\left| {\mathbf{F}}_{{\text{patch}}}^{(k)} - {\mathbf{F}}_{{\text{std}}}^{(k)} \right| < T_k\) for \(k=1,2,\ldots,16\) (where \({T_k}\) is the adaptive threshold vector). Otherwise, the region is classified as background interference with an inconsistent texture pattern and is discarded.
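The consistency vote described above can be sketched as follows, with F_patches, F_std, and T as illustrative names for the per-window feature vectors, the benchmark vector, and the adaptive thresholds.

```python
def texture_consistent(F_patches, F_std, T, eta=0.85):
    """Texture-consistency decision for one candidate region.

    F_patches : list of 16-dim feature vectors, one per sub-window.
    F_std     : 16-dim benchmark vector from pure insulator samples.
    T         : 16-dim adaptive threshold vector T_k.
    The region is accepted only if at least a fraction eta of its
    sub-windows satisfy |F_patch[k] - F_std[k]| < T[k] for every k.
    """
    ok = 0
    for F in F_patches:
        if all(abs(F[k] - F_std[k]) < T[k] for k in range(len(F_std))):
            ok += 1
    return ok / max(len(F_patches), 1) >= eta

# Example: 9 of 10 sub-windows match the benchmark (90% >= 85%)
F_std = [0.0] * 16
T = [1.0] * 16
accepted = texture_consistent([[0.0] * 16] * 9 + [[5.0] * 16], F_std, T)
```

Note the decision is an all-or-nothing test per sub-window, so a single strongly deviating feature dimension disqualifies that window from the vote.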

Fig. 2 Illustration of non-overlapping windows for texture analysis.

Insulator defect detection network based on multi-scale feature compensation

After obtaining precise insulator region proposals, the core of the detection task shifts to the accurate identification and classification of subtle defects within them. Although single-stage detectors, represented by the YOLO series, are widely adopted for their balance between speed and accuracy, their inherent network architecture faces two core internal contradictions when tackling the highly specific task of insulator defect detection. Firstly, the irreversible attenuation of feature representation during long-path propagation. Convolutional Neural Networks abstract high-level semantics through layer-wise down-sampling, but this process inevitably leads to the continuous loss of high-frequency details (e.g., edges, crack textures) and low-frequency structural information crucial for localizing and identifying fine defects. Although attention mechanisms [26] or multi-branch designs [27] can locally enrich features within the backbone network, they fail to effectively compensate for the global information loss that occurs along the complete forward propagation chain of “backbone network - neck network - detection head(s)”. Secondly, the static nature and insufficient discriminative power of multi-scale feature fusion. Existing neck networks like the Feature Pyramid Network (FPN) mostly rely on linear fusion strategies with fixed weights, such as addition or concatenation. When confronted with visually variable and irregular defects like flashover or subtle damage, set against complex background clutter such as vegetation, bird nests, or tower structures, this static strategy lacks the ability to dynamically reinforce key discriminative features and suppress irrelevant noise, making it difficult to construct clear decision boundaries in the feature space [28,29]. These contradictions collectively constrain the model’s ultimate performance in complex real-world scenarios.

To fundamentally and systematically address the aforementioned problems, this paper deviates from the conventional approach of making localized modifications to existing architectures and proposes a novel Multi-scale Feature Compensation Detection Network (MFCD-Net). The design of this network starts from a global perspective of feature representation learning, constructing an overall architecture centered on the core paradigm of “frequency-domain decoupling, parallel extraction, and dynamic gated compensation”. Through the synergy of its core components—the Heterogeneous Parallel Feature Extraction Network (HPFEN) and the Hierarchical Dynamic Compensation Module (HDCM)—MFCD-Net aims to proactively intervene in the information attenuation process and achieve adaptive and discriminatively enhanced feature fusion. The overall architecture is illustrated in Fig. 3, designed to achieve a complete logical loop from differentiated extraction at the feature source to targeted compensation along the propagation path.

Fig. 3 Overall architecture of the deep learning-based multi-scale defect detection for insulator strings.

Frequency domain decoupling for defect diagnosis

Traditional single-path backbone networks pursue high-level semantic abstraction through successive down-sampling. This process inherently involves the aliasing and lossy compression of information from different frequency domains. For the insulator defect detection task, high-frequency details such as edges and textures correspond to subtle cracks, while low-frequency structures like overall contours aid in locking onto targets within cluttered backgrounds. Aliasing leads to the weakening of this task-specific prior information in the early stages of the network. Therefore, MFCD-Net introduces the Heterogeneous Parallel Feature Extraction Network (HPFEN) at the input stage. HPFEN demystifies the “black-box” process of feature extraction. Based on the physical prior knowledge of defect recognition, it performs explicit decoupling and parallelized enhancement processing on different frequency-domain information at the initial stage, laying the groundwork for subsequent targeted compensation. Its design comprises two key steps: (1) Frequency domain decoupling based on the Laplacian pyramid; (2) Parallel specialized enhancement for high and low-frequency information.

  1. Frequency domain information decoupling: feature decomposition based on the Laplacian pyramid.

To achieve explicit separation of different frequency domain information in the input image, we introduce a lightweight Frequency Domain Decomposition Module (FDD). This module is constructed based on the theory of the Laplacian pyramid, with the objective of decomposing the input image I losslessly into a low-frequency component L, representing the overall contour and smooth regions, and a high-frequency component H, containing detail information such as edges and textures. The specific procedure is as follows:

First, Gaussian low-pass filtering and down-sampling are applied to the input image I to obtain its low-frequency approximation. Let \({G_\sigma }\) be a Gaussian kernel with variance \(\sigma\), and \({ \downarrow _s}\) denote the down-sampling operation with stride s (s = 2 in this work). Then, the low-frequency image \({L_l}\) at the l-th layer can be generated iteratively:

$${L_l}={ \downarrow _2}\left( {{G_\sigma }*{L_{l - 1}}} \right),\quad {L_0}=I$$
(15)

where * represents the convolution operation. To obtain the high-frequency details at a given level (i.e., the corresponding Laplacian pyramid layer), we up-sample the next, coarser level’s low-frequency image to the current level’s size and subtract it from the current level’s low-frequency image:

$${H_l}={L_l} - { \uparrow _2}\left( {{L_{l+1}}} \right)$$
(16)

Here, \({ \uparrow _2}\) denotes the nearest-neighbor up-sampling operation. In the HPFEN of this paper, we primarily utilize the decomposition result from the first layer, i.e., taking L = L1 as the low-frequency component and H = H1 as the high-frequency component for input to the subsequent parallel branches. This process can be formally represented as:

$$H,L={\text{FDD}}(I)$$
(17)

Through Eqs. (15) to (17), the FDD module achieves the preliminary physical separation of the original image information, providing clear input sources for subsequent specialized feature extraction.
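A minimal NumPy sketch of the FDD decomposition of Eqs. (15)-(17): the 3×3 binomial kernel standing in for G_σ and the reflection padding at borders are assumptions, while the stride-2 down-sampling and nearest-neighbour up-sampling follow the text.

```python
import numpy as np

def blur(img):
    # Separable 3x3 binomial kernel as a stand-in for G_sigma (Eq. 15);
    # image borders are handled by reflection padding.
    k = np.array([0.25, 0.5, 0.25])
    p = np.pad(img, 1, mode="reflect")
    h = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]
    return k[0] * h[:-2, :] + k[1] * h[1:-1, :] + k[2] * h[2:, :]

def down2(img):
    # Stride-2 down-sampling (the "down-arrow 2" operator)
    return img[::2, ::2]

def up2(img, shape):
    # Nearest-neighbour up-sampling to a target shape (Eq. 16)
    out = img.repeat(2, axis=0).repeat(2, axis=1)
    return out[:shape[0], :shape[1]]

def fdd(I):
    """Eq. (17): H, L = FDD(I), taking L = L1 and H = H1 (Eqs. 15-16)."""
    L1 = down2(blur(I))             # Eq. (15), l = 1
    L2 = down2(blur(L1))            # Eq. (15), l = 2
    H1 = L1 - up2(L2, L1.shape)     # Eq. (16)
    return H1, L1

I = np.arange(64, dtype=float).reshape(8, 8)
H1, L1 = fdd(I)
```

By construction, the level-1 low-frequency image is exactly recoverable as H1 plus the up-sampled level-2 image, which is the "lossless" property the Laplacian pyramid provides at each level.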

  2. Heterogeneous parallel feature enhancement: specialized path design for diagnostic needs.

After obtaining the decoupled high- and low-frequency components, HPFEN performs independent and targeted feature enhancement through three functionally specialized parallel branches.

Main Backbone Branch (M-Backbone): Taking the original image I as input, it employs CSPDarknet53 as the backbone network. This branch serves as the primary pathway of the network, responsible for extracting robust, general-purpose mid-to-high-level semantic features \({F_m}\), providing the foundation for target classification and coarse localization. Its output features incorporate global information without separation, preserving the integrity of the feature representation.

High-Frequency Enhancement Branch (H-Backbone): Taking the high-frequency component H as input, it is specifically designed to capture the fine edges and texture patterns of defects. To strengthen the response to detailed features, this branch adopts a dual-path pooling enhancement strategy. Specifically, both average pooling (AvgPool2) and max pooling (MaxPool2) with a kernel size of 2 are simultaneously applied to the input high-frequency features:

$${Z_a}={\text{AvgPoo}}{{\text{l}}_2}(H),\quad {Z_m}={\text{MaxPoo}}{{\text{l}}_2}(H)$$
(18)

Average pooling can smooth noise and highlight stable texture patterns, while max pooling retains the most significant edge responses. The results from the two paths are summed and then undergo feature integration:

$${Z_h}={\text{Con}}{{\text{v}}_{1 \times 1}}({Z_a}+{Z_m}){\text{ }}$$
(19)

where \({\text{Con}}{{\text{v}}_{1 \times 1}}\) is a 1 × 1 convolution used to fuse the pooled features and adjust the channels. To further focus on spatially critical details, we introduce a lightweight Channel-Space Cooperative Attention (CSCA) module to process \({Z_h}\):

$${F_h}={\text{CSCA}}({Z_h})$$
(20)

The CSCA module sequentially computes channel attention weights and spatial attention weights, achieving cooperative modulation through element-wise multiplication. This enables the network to adaptively emphasize high-frequency detail regions relevant to defects. Finally, \({F_h}\) represents the enhanced high-frequency features rich in edge and micro-texture information.
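Eqs. (18)-(20) can be sketched as below. Since the internals of the CSCA module are specific to this paper, a simple per-channel sigmoid gate is used as a stand-in, and the 1 × 1 convolution is represented by a channel-mixing matrix.

```python
import numpy as np

def pool2(x, mode):
    """2x2 average or max pooling with stride 2 on a C x H x W tensor."""
    C, H, W = x.shape
    b = x[:, :H - H % 2, :W - W % 2].reshape(C, H // 2, 2, W // 2, 2)
    return b.mean(axis=(2, 4)) if mode == "avg" else b.max(axis=(2, 4))

def simple_channel_gate(Z):
    # Stand-in for CSCA: sigmoid of the per-channel global average,
    # broadcast back over the spatial dimensions.
    g = 1.0 / (1.0 + np.exp(-Z.mean(axis=(1, 2))))
    return Z * g[:, None, None]

def h_branch(Hf, W1x1, attn):
    """Eqs. (18)-(20): dual-path pooling, 1x1 fusion, then attention.

    Hf   : C x H x W high-frequency feature map.
    W1x1 : Cout x C matrix standing in for the 1x1 convolution.
    attn : callable attention module (here the simple gate above).
    """
    Za = pool2(Hf, "avg")                       # Eq. (18)
    Zm = pool2(Hf, "max")
    Zh = np.einsum("oc,chw->ohw", W1x1, Za + Zm)  # Eq. (19)
    return attn(Zh)                             # Eq. (20)

Fh = h_branch(np.ones((2, 4, 4)), np.eye(2), simple_channel_gate)
```

The summation of the two pooling paths before the 1 × 1 convolution is what lets smoothed texture responses and peak edge responses contribute jointly to the fused feature.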

Low-Frequency Enhancement Branch (L-Backbone): Taking the low-frequency component L as input, it focuses on extracting features related to the target’s structure and global context. Its network structure is symmetric to the high-frequency branch, also starting with the dual-path (average and max) pooling to capture structural information under different receptive fields. The difference lies in the subsequent feature enhancement module, which places greater emphasis on modeling inter-channel dependencies, employing the Efficient Channel Attention (ECA) mechanism:

$${F_l}={\text{ECA}}\left( {{\text{Con}}{{\text{v}}_{1 \times 1}}\left( {{\text{AvgPoo}}{{\text{l}}_2}(L)+{\text{MaxPoo}}{{\text{l}}_2}(L)} \right)} \right)$$
(21)

The ECA module efficiently models the importance between channels via 1D convolution, thereby strengthening the encoding capability for low-frequency structural information such as the overall shape and arrangement of insulators.
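A sketch of the ECA-style gating in Eq. (21), applied here to an already pooled-and-fused feature map; the uniform weights of the length-k 1D convolution across channels are an illustrative assumption (in the actual ECA module they are learned).

```python
import numpy as np

def eca(x, k=3):
    """Efficient-Channel-Attention-style gate on a C x H x W feature map.

    A channel descriptor is obtained by global average pooling, mixed
    with its k-neighbourhood of channels by a 1D convolution (uniform
    weights here, for illustration), passed through a sigmoid, and
    used to rescale the input channels.
    """
    C = x.shape[0]
    y = x.mean(axis=(1, 2))                    # squeeze: one value per channel
    pad = k // 2
    yp = np.pad(y, pad, mode="edge")
    conv = np.array([yp[i:i + k].mean() for i in range(C)])  # 1D conv
    g = 1.0 / (1.0 + np.exp(-conv))            # sigmoid gate
    return x * g[:, None, None]

out = eca(np.ones((4, 2, 2)))
```

Because the 1D convolution only looks at neighbouring channels, the gate models local cross-channel interaction at negligible cost, which is the design rationale cited for ECA over full channel-attention blocks.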

Through the design of HPFEN, we implement a “decoupling - specialized enhancement” processing flow at the front end of the network. From an information theory perspective, it proactively intervenes in the inevitable issues of frequency-domain information aliasing and loss inherent in the traditional single-path approach. The specialized design of the high-frequency and low-frequency branches allows the network to retain and enhance, in parallel and to the maximum extent, the two types of complementary information crucial for defect diagnosis: detail texture and global structure. This explicit, physics-prior-based feature engineering provides a superior, structured initial feature representation at the network’s entry point, laying a more solid foundation for the representation learning in subsequent deep networks. This directly addresses the core contradiction of “irreversible attenuation of features during long-path propagation” outlined in the introduction. The output features from the three branches, \({F_m}\), \({F_h}\), and \({F_l}\), will be adaptively fused in the subsequent Hierarchical Dynamic Compensation Module to serve detection heads at different scales.

Parallel enhancement for defect diagnosis

The heterogeneous features extracted by HPFEN—namely, the backbone features \({F_m}\), the enhanced high-frequency features \({F_h}\), and the enhanced low-frequency features \({F_l}\)—are information-rich. However, feeding them directly into detection heads still faces two major challenges. First, detection heads at different levels exhibit varying sensitivities to defects of different scales, requiring differentiated feature support. Second, within complex ROIs, features between defective regions, normal regions, and different defect categories still need further discrimination to enhance discriminative power. The static fusion strategy of traditional FPN cannot meet these demands.

To address this, this paper proposes a Hierarchical Dynamic Compensation Module (HDCM), whose core lies in constructing a content-aware gated fusion unit. This unit achieves a profound enhancement of feature representational capability by synergistically optimizing two core functions: (1) Targeted frequency-domain information compensation: dynamically fusing the \({F_h}\) and \({F_l}\) features based on the content semantics of the current feature map to precisely compensate for the key frequency-domain information missing for the detection head. (2) Discriminative feature reconstruction: adaptively recalibrating the fused features using a gating mechanism, thereby strengthening discriminative feature patterns related to defects while effectively suppressing irrelevant background interference and redundant feature responses.

(1) Content-aware gated fusion mechanism.

Let \({N_i} \in {{\mathbb{R}}^{C \times H \times W}}\) denote the feature received by the i-th level detection head (\(i=3,4,5\) corresponds to multi-scale features) from the neck network. Simultaneously, HPFEN provides frequency-domain features at the corresponding scale: high-frequency feature \({H_i}\) (from down-sampling \({F_h}\)) and low-frequency feature \({L_i}\) (from down-sampling \({F_l}\)).

First, a gated weight generator \(\mathcal{G}( \cdot )\) is designed. It takes the contextual feature \({N_i}\) of the current layer as input and learns a pair of spatially adaptive weight maps, \({\alpha _i}\) and \({\beta _i} \in {{\mathbb{R}}^{1 \times H \times W}}\), for modulating the high-frequency and low-frequency components, respectively. This achieves “content-awareness”: the focus of compensation should be determined by the content of the current feature map, not fixed.

$$[{\alpha _i},{\beta _i}]=\mathcal{G}({N_i})={\text{Softmax}}\left( {{\text{Con}}{{\text{v}}_{1 \times 1}}\left( {{\text{CA}}({N_i})} \right)} \right)$$
(22)

Here, \({\text{CA}}(\cdot )\) represents a lightweight channel attention module used to first aggregate global information across the channel dimension. \({\text{Con}}{{\text{v}}_{1 \times 1}}\) reduces the channel number to 2, followed by applying the Softmax function along the channel dimension to ensure \({\alpha _i}+{\beta _i}=1\). This enables the network to decide, based on the semantics of \({N_i}\) at each spatial location, whether to draw more information from the high-frequency or low-frequency features.

(2) Dynamic compensation and reconstruction of frequency-domain features.

Using the generated gating weights, we dynamically fuse the frequency-domain features to obtain the compensation feature \({Q_i}\).

$${Q_i}={\alpha _i} \odot {H_i}+{\beta _i} \odot {L_i}$$
(23)

Here, \(\odot\) denotes element-wise multiplication after broadcasting along the channel dimension. \({Q_i}\) is an adaptively reorganized frequency-domain representation that dynamically balances detail-texture and structural information across different spatial positions.

However, simple weighted fusion may still contain noise. To further enhance its discriminative power, we introduce a feature reconstruction unit \(\mathcal{R}( \cdot )\). It consists of a lightweight residual block (containing a 3 × 3 convolution, batch normalization, and SiLU activation function) designed to learn a mapping from the fused feature \({Q_i}\) to a purer, more discriminative compensation quantity \({\tilde {Q}_i}\).

$${\tilde {Q}_i}=\mathcal{R}({Q_i})={\text{ResBloc}}{{\text{k}}_{3 \times 3}}({Q_i})$$
(24)

(3) Compensation injection and final output.

Finally, the reconstructed compensation feature \({\tilde {Q}_i}\) is injected into the original neck feature \({N_i}\) in the form of a residual connection, forming the final enhanced feature \({T_i}\) fed into the i-th level detection head.

$${T_i}={N_i}+\lambda \cdot {\tilde {Q}_i}$$
(25)

Here, \(\lambda\) is a learnable scalar parameter used to control the strength of the compensation, allowing the network to adaptively adjust the compensation amount.

Training scheme and optimization

Building upon the aforementioned network architecture design, this section details the training strategy and optimization objectives for MFCD-Net. Our training goal extends beyond achieving accurate defect localization and classification; more critically, it aims to ensure that the network can effectively learn the frequency-domain decoupling capability within HPFEN and the adaptive gating mechanism in HDCM. To this end, we design a composite loss function and employ specific training strategies to facilitate stable convergence and enhance the network’s generalization capability.

Composite loss function design

The insulator defect detection task requires simultaneous optimization of target localization, classification, and confidence of existence. Therefore, we adopt a composite loss function \({L_{{\text{total}}}}\), which consists of three components: bounding box regression loss \({L_{{\text{box}}}}\), classification loss \({L_{{\text{cls}}}}\), and confidence loss \({L_{{\text{obj}}}}\), weighted by hyperparameters:

$${L_{{\text{total}}}}={\lambda _1}{L_{{\text{box}}}}+{\lambda _2}{L_{{\text{cls}}}}+{\lambda _3}{L_{{\text{obj}}}}$$
(26)

where \({\lambda _1}\),\({\lambda _2}\) and \({\lambda _3}\) are weighting coefficients that balance the contribution of each loss term.

(1) Bounding box regression loss \({L_{{\text{box}}}}\).

For bounding box regression, we employ the Complete IoU (CIoU) loss. It not only considers the overlap area between bounding boxes but also introduces consistency measures for center point distance and aspect ratio, providing more accurate gradient directions. This is particularly suitable for targets such as insulator strings, which possess a relatively fixed aspect ratio:

$${L_{{\text{box}}}}=1 - {\text{IoU}}+\frac{{{\rho ^2}\left( {b,{b^{gt}}} \right)}}{{{c^2}}}+\alpha v$$
(27)

Here, IoU is the Intersection over Union between the predicted and ground-truth boxes. \({\rho ^2}\left( {b,{b^{gt}}} \right)\) is the squared Euclidean distance between the center point b of the predicted box and the center point \({b^{gt}}\) of the ground-truth box. c is the diagonal length of the smallest enclosing rectangle covering both boxes. The penalty term v measures the consistency of aspect ratios.

$$v=\frac{4}{{{\pi ^2}}}{\left( {\arctan \frac{{{w^{gt}}}}{{{h^{gt}}}} - \arctan \frac{w}{h}} \right)^2}$$
(28)

The weighting coefficient \(\alpha\) is defined as:

$$\alpha =\frac{v}{{(1 - {\text{IoU}})+v}}$$
(29)

Here, w and h represent the width and height of the predicted box, while \({w^{gt}}\) and \({h^{gt}}\) represent the width and height of the ground-truth box, respectively.
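Under these definitions, the CIoU loss of Eqs. (27)-(29) can be written as a standalone function; this is a generic sketch (the `eps` stabilizers and corner-format boxes are implementation choices, not the paper's code):

```python
import math
import torch

def ciou_loss(pred, gt):
    """CIoU loss for boxes in (x1, y1, x2, y2) format, Eqs. (27)-(29)."""
    eps = 1e-7
    # IoU: intersection over union
    xi1 = torch.max(pred[..., 0], gt[..., 0]); yi1 = torch.max(pred[..., 1], gt[..., 1])
    xi2 = torch.min(pred[..., 2], gt[..., 2]); yi2 = torch.min(pred[..., 3], gt[..., 3])
    inter = (xi2 - xi1).clamp(0) * (yi2 - yi1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + eps)
    # rho^2: squared center distance; c^2: enclosing-box diagonal squared
    cxp = (pred[..., 0] + pred[..., 2]) / 2; cyp = (pred[..., 1] + pred[..., 3]) / 2
    cxg = (gt[..., 0] + gt[..., 2]) / 2;     cyg = (gt[..., 1] + gt[..., 3]) / 2
    rho2 = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    cw = torch.max(pred[..., 2], gt[..., 2]) - torch.min(pred[..., 0], gt[..., 0])
    ch = torch.max(pred[..., 3], gt[..., 3]) - torch.min(pred[..., 1], gt[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # v: aspect-ratio consistency (Eq. 28); alpha: trade-off weight (Eq. 29)
    wp = pred[..., 2] - pred[..., 0]; hp = pred[..., 3] - pred[..., 1]
    wg = gt[..., 2] - gt[..., 0];     hg = gt[..., 3] - gt[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v    # Eq. (27)
```

For identical boxes every term vanishes, and the loss approaches zero; disjoint boxes are still penalized through the center-distance term even when IoU is exactly zero.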

(2) Classification loss \({L_{{\text{cls}}}}\).

For the classification of defect categories, we adopt the Binary Cross Entropy (BCE) loss. It supervises each prediction box with multi-label classification, allowing for the simultaneous existence of multiple defect categories (e.g., an insulator string may have both damage and flashover):

$${L_{{\text{cls}}}}= - \sum\limits_{{i=1}}^{{{N_{{\text{box}}}}}} {\sum\limits_{{j=1}}^{{{N_{{\text{class}}}}}} {\left[ {{y_{ij}}\log ({p_{ij}})+(1 - {y_{ij}})\log (1 - {p_{ij}})} \right]} }$$
(30)

Here, \({N_{{\text{box}}}}\) is the number of prediction boxes, \({N_{{\text{class}}}}\) is the number of defect categories excluding the background, \({y_{ij}} \in \{ 0,1\}\) indicates the ground-truth label for whether the i-th prediction box contains the j-th class defect, and \({p_{ij}}\) is the model-predicted probability that the i-th prediction box contains the j-th class defect.

(3) Confidence loss \({L_{{\text{obj}}}}\).

The confidence loss also uses Binary Cross Entropy, supervising whether a prediction box contains any target object (insulator or defect):

$${L_{{\text{obj}}}}= - \sum\limits_{{i=1}}^{{{N_{{\text{box}}}}}} {\left[ {{{\hat {y}}_i}\log ({{\hat {p}}_i})+(1 - {{\hat {y}}_i})\log (1 - {{\hat {p}}_i})} \right]}$$
(31)

Here, \({\hat {y}_i} \in \{ 0,1\}\) is the true indicator for whether the i-th prediction box contains any target, and \({\hat {p}_i}\) is the model-predicted confidence that this box contains a target.

Loss function weight design: In our task, accurately identifying minor defects (e.g., fine cracks) is crucial. Therefore, we assign a higher weight (\({\lambda _2}=0.6\)) to the classification loss \({L_{{\text{cls}}}}\). Simultaneously, to ensure the network effectively learns the fine-grained feature representations within HPFEN and HDCM, we appropriately reduce the weight (\({\lambda _1}=0.06\)) for the bounding box regression loss \({L_{{\text{box}}}}\), preventing the localization task from prematurely dominating the training process and impeding feature learning. The confidence loss weight \({\lambda _3}=1.2\) is used to reinforce the judgment of target presence.
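A minimal sketch of the weighted composite loss of Eq. (26), using the weights above; `BCEWithLogitsLoss` is assumed here as a numerically stable form of the BCE terms in Eqs. (30)-(31), operating on raw logits rather than probabilities:

```python
import torch
import torch.nn as nn

# Eq. (26) with the paper's weights: lambda1 = 0.06 (box),
# lambda2 = 0.6 (cls), lambda3 = 1.2 (obj).
bce = nn.BCEWithLogitsLoss()   # stable BCE over logits, Eqs. (30)-(31)

def total_loss(l_box, cls_logits, cls_targets, obj_logits, obj_targets,
               w=(0.06, 0.6, 1.2)):
    """l_box: already-averaged CIoU term; classification: multi-label BCE
    (several defect classes may co-occur); confidence: per-box binary BCE."""
    l_cls = bce(cls_logits, cls_targets)
    l_obj = bce(obj_logits, obj_targets)
    return w[0] * l_box + w[1] * l_cls + w[2] * l_obj
```

Because the classification term is multi-label, a single box can be positive for both "damage" and "flashover" without the two targets competing in a softmax.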

Training strategy and parameter settings

To ensure the network converges stably and fully learns the proposed complex architecture, we adopt the following training strategies:

Optimizer: The Stochastic Gradient Descent (SGD) optimizer is used, with the momentum set to 0.937 and weight decay to 0.0005. Compared to adaptive optimizers (e.g., Adam), SGD generally yields better generalization performance in object detection tasks.

Learning rate scheduling: A Cosine Annealing learning rate decay strategy is employed, with an initial learning rate set to 0.01. This strategy smoothly decreases the learning rate, avoiding performance oscillations in the later stages of training caused by abrupt learning rate drops.

Data augmentation: Tailored to the characteristics of aerial insulator images, we apply a series of data augmentation techniques, including random rotation (± 30°), random scaling (0.5–1.5x), Mosaic augmentation, and HSV color space perturbation. These augmentation strategies effectively enhance the model’s robustness to variations in lighting, scale, and spatial rotation.

Batch processing and input size: The batch size is set to 8, and input images are uniformly resized to 640 × 640 pixels. The relatively small batch size, combined with a gradient accumulation strategy, helps maintain training stability under limited GPU memory constraints.

Training epochs: The total number of training epochs is set to 300. A warm-up phase is conducted for the first 3 epochs, during which the learning rate linearly increases from 0.001 to 0.01 to stabilize the initial training stage.
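The schedule above (SGD with momentum 0.937 and weight decay 0.0005, a 3-epoch linear warm-up from 0.001 to 0.01, then cosine annealing over the remaining epochs) might be sketched as follows; annealing to a final learning rate of 0 and the per-epoch update granularity are assumptions:

```python
import math
import torch

def lr_at(epoch: int, total: int = 300, warmup: int = 3,
          lr0: float = 0.001, lr_max: float = 0.01) -> float:
    """Warm-up + cosine-annealing schedule for a given epoch."""
    if epoch < warmup:                        # linear warm-up phase
        return lr0 + (lr_max - lr0) * epoch / warmup
    t = (epoch - warmup) / (total - warmup)   # progress in [0, 1]
    return 0.5 * lr_max * (1 + math.cos(math.pi * t))

model = torch.nn.Conv2d(3, 8, 3)              # stand-in for MFCD-Net
opt = torch.optim.SGD(model.parameters(), lr=lr_at(0),
                      momentum=0.937, weight_decay=5e-4)

for epoch in range(5):                        # set the LR once per epoch
    for g in opt.param_groups:
        g["lr"] = lr_at(epoch)
```

In practice the same effect can be obtained by chaining PyTorch's built-in `LinearLR` and `CosineAnnealingLR` schedulers; the explicit function is shown only to make the shape of the schedule visible.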

Through the aforementioned composite loss function and training strategies, MFCD-Net is enabled to effectively and synergistically optimize the frequency-domain feature extraction capability of HPFEN, the adaptive compensation mechanism of HDCM, and the final defect detection performance, thereby achieving high-precision insulator defect identification in complex scenarios.

Experiments and analysis on insulator string segmentation and local region cropping

To systematically evaluate the accuracy of the robust insulator string segmentation method based on RGB and texture analysis, we first validated the correctness of the insulator recognition and segmentation algorithm on the complete dataset comprising 5793 aerial insulator images. Subsequently, a comparative analysis was conducted under challenging conditions such as complex terrestrial backgrounds and strong-light sky backgrounds to delve deeper into the method’s performance under extreme environments.

Segmentation accuracy analysis on the full dataset

To objectively assess the generalization capability and robustness of the method, thereby avoiding the randomness of single-sample analysis, segmentation was performed on all 5793 images. The dataset collectively contains 9813 annotated insulator strings. String-level recall rate and pixel-level completeness were adopted as the core evaluation metrics. The string-level recall rate is the proportion of correctly segmented insulator strings to the total number, while pixel-level completeness is the proportion of pixels belonging to insulators within the segmented region to the total number of pixels.

Statistical results show that the proposed method achieved a string-level recall rate of 98.7% and a pixel-level completeness of 97.2% on the full dataset. This indicates that the strategy integrating adaptive illumination normalization, RGB thresholding, and texture verification can stably and completely locate and segment insulator targets in the vast majority of complex scenarios, providing a high-quality input foundation for subsequent defect detection. The images that failed segmentation were primarily concentrated among those with severe occlusion, motion blur, or extreme adverse weather conditions like heavy fog.

Case study on extremely complex backgrounds and extreme lighting conditions

To thoroughly analyze the adaptability of the proposed method under extremely complex backgrounds and extreme lighting conditions, two representative and challenging cases were selected for qualitative and quantitative comparative analysis: a complex earth/vegetation background and a strong-light sky background, as shown in Fig. 4.

Fig. 4 Cropped local regions of insulator strings under an extremely complex background and an extreme lighting condition. (a) Image 1 (extremely complex background); (b) Image 2 (extreme lighting condition).

Under both extreme conditions, the proposed method successfully extracted the complete insulator strings. Quantitative results, presented in Tables 1 and 2, demonstrate that compared to the methods from references11,12, our method achieves thorough background removal with zero missed insulator discs. Both the string count error rate and the disc omission rate are zero, significantly outperforming the comparative methods.

Table 1 Comparison of results for aerial image 1 between the algorithms in the literature and this paper.
Table 2 Comparison of results for aerial image 2 between the algorithms in the literature and this paper.

While the algorithms from literature11,12 can achieve satisfactory recognition results for specific images, it is evident from Tables 1 and 2 that their performance degrades significantly when processing the two aerial images with extreme backgrounds and lighting conditions.

Furthermore, since the proposed insulator recognition and segmentation method yields superior extraction results, a finer evaluation criterion is needed. We first construct ground-truth data and then perform quantitative comparisons with the computational results from the algorithm. The physical quantities involved include the vertical width, horizontal width, and center coordinates of the horizontal bounding boxes, which are compared against the ground-truth data to obtain error rates. For aerial image 2, calculation shows that the first and second bounding boxes are at the same vertical position; therefore, they can be combined and compared with the standard position of the first string. The results are shown in Tables 3 and 4.

Table 3 Bounding-box measurements obtained by the proposed algorithm for aerial image 1.
Table 4 Bounding-box measurements obtained by the proposed algorithm for aerial image 2.

As can be seen from Tables 3 and 4, the vertical width of the horizontal bounding boxes obtained by processing the sample images with the proposed algorithm is relatively accurate, and the distance between the calculated bounding box center and the ground-truth center is close. However, the horizontal width error of the bounding boxes is relatively large (around 5%).

The reasons for the larger error are: During RGB model threshold segmentation, the side edges of the insulator string lie between the target and the background. Due to influences from lighting or weather, the R, G, B tri-components of pixels in these regions may fail to meet the threshold conditions, leading to their erroneous removal as background. Prior to marking potential target regions, mathematical morphology denoising is applied to the image. The effectiveness of denoising depends on the choice of structuring element type and size. An overly aggressive denoising process may remove part of the insulator string’s edge width, resulting in the horizontal width of the bounding box being smaller than the standard.

Experiments and analysis on insulator defect recognition

To systematically evaluate the overall performance of the proposed insulator defect detection method that integrates image color analysis and multi-scale feature compensation, the validation in this chapter encompasses the complete pipeline from the input of raw aerial images to the final output of defect detection results. By comparing against advanced models covering different technical paradigms and designing rigorous ablation experiments, we analyze the effectiveness of the proposed method in addressing the challenges of insulator defect detection in complex environments.

Dataset and evaluation criteria

To verify the complete algorithmic pipeline, the experimental input consists of raw UAV inspection aerial images. This dataset contains 5793 images, divided into a training set (3476), a validation set (1738), and a test set (579). The specific distribution is shown in Table 5.

Table 5 Insulator and defect detection dataset.
Precision (P), recall (R), average precision (AP), and mean average precision (mAP) are adopted as the evaluation metrics:

$$P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN},\quad AP=\int_{0}^{1} P(R)\,dR,\quad mAP=\frac{1}{N}\sum\limits_{i=1}^{N} AP_i$$
(32)

where TP, FP, and FN represent the number of true positive, false positive, and false negative detection boxes, respectively. The mAP is the arithmetic mean of the AP values for all target categories (including the insulator body and various defects). Here, N represents the total number of target categories. To comprehensively measure the model’s balanced capability between precision and recall, this chapter uniformly reports results at an Intersection over Union (IoU) threshold of 0.5, denoted as AP(0.5) and mAP(0.5).
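The per-class AP in Eq. (32) is usually computed by sorting detections by confidence, accumulating TP/FP counts, and integrating precision over recall. The sketch below uses the standard all-point interpolation (monotone precision envelope); it is a generic illustration, not the paper's evaluation code, and the sample detections at the end are hypothetical:

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP at a fixed IoU threshold from per-detection confidence scores,
    TP/FP flags, and the number of ground-truth objects n_gt."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / n_gt                      # R = TP / (TP + FN)
    precision = tp_cum / (tp_cum + fp_cum)      # P = TP / (TP + FP)
    r = np.concatenate(([0.0], recall))
    p = np.concatenate(([1.0], precision))
    p = np.maximum.accumulate(p[::-1])[::-1]    # make P(R) non-increasing
    return float(np.sum(np.diff(r) * p[1:]))   # AP = integral of P(R) dR

# mAP(0.5) is then the mean of per-class APs; hypothetical 3-detection case:
ap_damage = average_precision([0.9, 0.8, 0.3], [1, 1, 0], n_gt=2)
```

In this toy case the lone false positive ranks below full recall, so the monotone envelope keeps AP at 1.0; a false positive ranked above a true positive would pull AP down.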

For a fair comparison, the baseline models need to directly process the raw aerial images with complex backgrounds, whereas the proposed method first locates and crops the target regions through its unique preprocessing stage. This comparison aims to simulate real application scenarios, comprehensively evaluating the performance of the entire process from raw image input to final defect detection, thereby demonstrating the overall advantage and engineering practicality of our complete scheme integrating prior knowledge with deep learning when facing complex background interference.

Fig. 5 Visual comparison of insulator defect recognition between the proposed method and 8 other advanced target recognition algorithms.

Table 6 Quantitative comparison of performance with 8 currently advanced methods based on the dataset.

Overall performance comparison and analysis

To establish a fair performance benchmark, this section selects eight advanced models covering different technical paradigms for comparison with our complete algorithmic pipeline (i.e., Preprocessing + MFCD-Net). The compared models include single-stage detectors (YOLOv730, YOLOX31, RTMDet32, RetinaNet33), two-stage detectors (Mask R-CNN-Swin34, Sparse R-CNN35), a Transformer-based detector (DETR36), and the task-specific improved model YOLOv7 + MCI-GLA. To ensure fairness, all comparative models were reproduced and trained under identical experimental conditions (PyTorch framework, NVIDIA RTX 4090 GPU), using the same preprocessed image dataset described in “Dataset and evaluation criteria”, the same training strategy (SGD optimizer, initial learning rate 0.01, batch size 8, cosine annealing scheduler), and the same data augmentation scheme.

A deep dive into the data in Table 6 reveals that the performance advantage of MFCD-Net exhibits distinct category heterogeneity. For the “Damage” and “Flashover” defect classes, its AP(0.5) reaches 79.7% and 80.5% respectively, the most prominent lead over the comparison models. In contrast, for the insulator body (“Insulator”) and “Shed Missing” categories, which have distinct and fixed morphological features, the performance gain is relatively modest. This phenomenon preliminarily indicates that the performance improvement of MFCD-Net is not a uniform enhancement across all target categories. Considering its network architecture design, this differentiated performance likely stems from the targeted enhancement mechanisms specifically designed for subtle damage features and easily confused flashover features.

Ablation experiments

To dissect the contribution of each innovative component at a principled level and validate the effectiveness of the MFCD-Net design philosophy, this section designs a systematic set of ablation experiments. The experiment takes the widely adopted CSPDarknet53 backbone with a Path Aggregation Network as the performance baseline (denoted as Base). Upon this framework, we progressively integrate the proposed core modules to construct an incremental improvement sequence:

(1) Base + Preprocessing: Adds the insulator string segmentation and cropping module based on RGB color space and texture analysis in front of the Base model. This configuration aims to evaluate the foundational impact of input-space purification on the overall detection task.

(2) Base + Preprocessing + HPFEN: On top of the previous configuration, introduces the Heterogeneous Parallel Feature Extraction Network (HPFEN) at the entrance of the backbone network to assess the improvement in feature representation brought by front-end frequency-domain decoupling and enhancement.

(3) Base + Preprocessing + HPFEN + HDCM (MFCD-Net): Integrates all components to form the complete MFCD-Net, used to verify the synergistic effects of the Hierarchical Dynamic Compensation Module with the aforementioned modules.

All models were evaluated under identical training and testing conditions, with the results shown in Table 7.

Table 7 MFCD-Net ablation experimental results and mechanism analysis (AP(0.5)/%, mAP(0.5)/%).

First, consider the performance change from the Base model to the “Base + Preprocessing” model. As seen in Table 7, introducing the preprocessing module increased the model’s mAP(0.5) on the test set from 79.40% to 82.63%, a significant gain of 3.23 percentage points. This improvement holds across all defect categories, and is particularly prominent for “Damage” and “Flashover”.

The core reason lies in the decoupling of task complexity and the fundamental enhancement of the input signal-to-noise ratio. The original Base model must simultaneously handle two highly coupled subtasks: locating insulator strings within complex aerial backgrounds, and identifying and classifying defects within them. The vast amount of irrelevant information in the background constitutes powerful interference noise, occupying part of the model’s capacity that should have been allocated to learning subtle defect features. The preprocessing module in this paper stably separates insulator strings from the background by leveraging the strong visual cue of color priors and texture consistency. This effectively strips the “localization” task from the deep learning model and completes it with an efficient, robust deterministic algorithm. Consequently, the subsequent deep network receives “purified” inputs that are cropped images containing only the target region, and its task is simplified to focusing solely on “defect recognition”. This purification of the input space significantly reduces the learning difficulty for the model, providing a higher starting point for the detection performance of all categories. This proves that in complex industrial inspection scenarios, injecting domain prior knowledge into the system in the form of preprocessing is an effective strategy for enhancing both the efficacy and efficiency of deep learning models.

Building upon the acquisition of high-quality input, the “Base + Preprocessing + HPFEN” configuration further elevates the mAP(0.5) to 85.40%, a growth of 2.77 percentage points over the previous stage. The non-uniform class-specific gains are worth examining in depth: the AP(0.5) for the “Damage” defect increased by 5.2 percentage points, while the gain for the “Flashover” defect was only 1.3 percentage points.

This differential improvement effect validates the targeted design philosophy of HPFEN. As discussed in the introduction, Convolutional Neural Networks pursue high-level semantic abstraction by stacking down-sampling layers, but this process inevitably leads to the progressive attenuation of high-frequency details from the input image, such as edges and fine textures. For defects like “Damage”, their key discriminative features—cracks and fractured boundaries—are predominantly distributed in the high-frequency domain. Traditional single-path backbone networks may alias these weak yet critical signals with low-frequency information in the early layers, which are subsequently smoothed by pooling operations. The core innovation of HPFEN lies in implementing a front-end frequency-domain decoupling strategy. Through Laplacian pyramid decomposition, it physically separates high-frequency and low-frequency components at the very beginning of feature extraction, followed by targeted enhancement and preservation via parallel specialized subnetworks. This architectural design ensures that the high-frequency features, which are crucial for identifying “Damage”, have a protected “dedicated pathway” that propagates through the network’s forward pass, effectively mitigating their fading issue in deep layers. Consequently, the significant boost in the detection accuracy of the “Damage” category is a direct result of the HPFEN module explicitly countering the fundamental problem of high-frequency information attenuation in the feature pipeline. In contrast, the features of “Flashover” defects manifest more as mid-to-low-frequency texture and color variations, which are relatively less affected by high-frequency attenuation, hence the improvement is not as pronounced as for “Damage”.
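The frequency split described here can be illustrated with a one-level decomposition: a Gaussian low-pass gives the low-frequency component and its residual retains the edges and fine textures. The kernel size and sigma below are illustrative choices, and HPFEN's actual Laplacian pyramid may use multiple levels:

```python
import torch
import torch.nn.functional as F

def laplacian_split(x, kernel_size=5, sigma=1.0):
    """One-level frequency split: Gaussian blur -> low-frequency L;
    the residual keeps edges/fine texture as high-frequency H, and
    H + L reconstructs the input exactly (lossless decoupling)."""
    coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    k2d = torch.outer(g, g)[None, None]              # (1, 1, k, k) kernel
    c = x.shape[1]
    k2d = k2d.repeat(c, 1, 1, 1)                     # depthwise per channel
    low = F.conv2d(x, k2d, padding=kernel_size // 2, groups=c)
    return x - low, low                              # (high, low)
```

Because the split is exactly invertible, routing the two components through separate branches loses no information at the decomposition step; any loss can only occur downstream, which is precisely what the protected high-frequency pathway is meant to prevent.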

Finally, the complete MFCD-Net model integrating the HDCM module achieves an mAP(0.5) of 89.90%, a gain of 4.50 percentage points over the “Base + Preprocessing + HPFEN” configuration.

The core reason lies in the HDCM module addressing the limitations of static multi-scale feature fusion in Feature Pyramid Networks (FPN). Traditional FPN employs fixed strategies, such as element-wise addition or channel concatenation, to fuse features from different levels. This strategy lacks adaptability to specific tasks and local contexts. When confronting defects with variable appearances and features similar to background clutter (e.g., “Flashover” vs. shadows, “Damage” vs. stains), static fusion struggles to dynamically reinforce the most discriminative feature patterns and suppress irrelevant background responses. The hierarchical dynamic compensation mechanism introduced by HDCM centers on a content-aware gating unit. This unit takes contextual features from the neck network as input and dynamically generates a set of spatially adaptive weight maps to modulate the fusion ratio of the already decoupled heterogeneous feature streams (high-frequency details and low-frequency structure) provided by HPFEN. This process can be viewed as a dynamic, defect-recognition-oriented feature reorganization and reconstruction. For different spatial locations in the input feature map, the network can autonomously decide: whether the current region requires the injection of more high-frequency details to discern tiny cracks, or it should rely more on low-frequency structure to confirm the overall presence of the target. This dynamic, non-linear fusion approach greatly enhances the model’s ability to construct clear decision boundaries within the complex feature space.

More importantly, there exists a profound synergistic effect between HDCM and HPFEN. HPFEN provides information-complete and decoupled heterogeneous features; HDCM dynamically allocates them according to the current task demands. This synergy results in the performance improvement of the complete MFCD-Net surpassing the linear superposition of the independent contributions of each module.

Conclusion

(1) The robust pre-segmentation method based on adaptive color and texture priors achieves accurate and stable localization of insulator strings against complex backgrounds by utilizing history-data-guided dynamic RGB threshold segmentation and GLCM texture consistency verification. This effectively strips away background interference and provides subsequent deep networks with high signal-to-noise ratio input regions.

(2) The Multi-scale Feature Compensation Detection Network (MFCD-Net) realizes frequency-domain information decoupling and enhancement at the source through its Heterogeneous Parallel Feature Extraction Network (HPFEN). Subsequently, via the Hierarchical Dynamic Compensation Module (HDCM), it employs a content-aware gating mechanism to adaptively compensate for missing high- and low-frequency information for detection heads at different scales. This systematically mitigates the irreversible attenuation of features along the propagation path from the backbone network to the detection heads.

(3) The proposed method demonstrates significant advantages in the insulator defect detection task. Evaluations on the same dataset show that our method achieves an mAP(0.5) of 89.90%, outperforming the eight advanced methods compared. Ablation experiments further verify that the proposed preprocessing, HPFEN, and HDCM modules are all critical contributors to the performance improvement.

However, the proposed method still has certain limitations: the pre-segmentation stage may still produce misjudgments for backgrounds with colors highly similar to insulators (e.g., specific rust or shadows). Moreover, the multi-branch parallel architecture introduced to enhance performance also increases the model complexity and computational burden to some extent. Therefore, future research will focus on: (1) exploring more robust segmentation strategies incorporating multi-spectral or contextual information; (2) designing more efficient and lightweight network architectures to promote the real-time deployment and application of the algorithm on mobile or embedded devices.