Introduction

Insulators are key components of transmission lines, providing both electrical insulation and mechanical support. They are exposed to the outdoor environment for extended periods, which makes them susceptible to aging, pollution, and even self-explosion. Insulator failure directly affects the normal operation of the power system. Statistics1 show that accidents caused by insulators account for more than 81.3% of all accidents in the power system. To improve inspection efficiency and reduce labor cost, many studies have turned to insulator detection based on artificial intelligence (AI). Consequently, rapid and accurate AI detection methods have become a major research focus in recent years2,3,4,5.

Over the past fifteen years, traditional image processing methods have been widely explored for insulator detection. Representative approaches rely on color cues, edges, and morphological operations. Ref. 6 used spatial morphological features with simple color models to detect insulator faults; Ref. 7 extracted wavelet edges and applied template matching with a geometric model to estimate icing on insulators; and Ref. 8 designed dual-parity morphological gradients to delineate porcelain insulator strings in infrared images. In addition, Ref. 9 detected insulators by image binarization followed by morphological operations. These methods require few parameters and are easy to implement, but they are sensitive to clutter, illumination, and occlusion, so false detections remain common in complex scenes.

To enhance the accuracy of insulator detection in complex scenes, some studies have combined image processing with machine learning. In10, candidate regions with insulator-like characteristics were extracted from the inspection images according to their grayscale and color characteristics, and a support vector machine (SVM) was then used to identify and locate the insulators, with accuracy, recall, and average accuracy all above 90%. Similarly, in11, a histogram feature based on local directional patterns was proposed for insulator detection, with an SVM acting as the classifier. To address the problem that traditional insulator features are easily submerged in the background, a depth horizontal histogram (DHH) feature12 was proposed and used to position insulator sheds accurately and quickly. In addition, to improve inspection quality, k-nearest neighbors (k-NN) was used to classify insulator contamination levels13, achieving an average accuracy above 82% and outperforming the compared models. A highly efficient method based on the multiple linear regression (MLR) algorithm and the random forest (RF) algorithm was proposed in14 for classifying the pollution degree of insulators. To describe insulator characteristics more fully and improve robustness against complex backgrounds in aerial images, Ref. 15 proposed an insulator detection technique that combines the k-nearest neighbor algorithm with multi-type features of the Internet of Things. In short, these approaches, which integrate image processing with machine learning, have achieved high detection accuracy.

With the development of neural network theory, the application of deep learning to insulator detection has increased rapidly, and both detection accuracy and robustness in complex scenarios have improved significantly16. For example, Ref. 17 proposed YOLOv5s-KE, which achieved an average accuracy of 92.3% on publicly available insulator detection data with an inference speed of approximately 94.3 FPS, but the model size was 254.5 MB. Similarly, the improved YOLOv5 of18 achieved an average accuracy of 95.60% in insulator detection, but the parameter count was still 18.36M and the computational load was 30.10G. Although detection performance was further enhanced, this places higher demands on the computing power and storage of the equipment. In addition, Ref. 19 proposed ML-YOLOv5, which achieved an average accuracy of 97.0% on the CPLID dataset with only 3.73M parameters and 9.0 GFLOPs, at an inference speed of approximately 63.6 FPS. However, deployment pressure remains. Overall, deep learning methods have more comprehensive recognition capabilities and higher accuracy in complex contexts, but compared with traditional machine learning their computing time and model scale are often larger, which is not conducive to low-cost, lightweight deployment. To retain the training and deployment efficiency of machine learning while approaching the detection accuracy of deep learning as closely as possible, we turn to AdaBoost in ensemble learning: it forms a strong classifier by combining simple weak learners, and the resulting compact model runs efficiently on general-purpose CPUs. Recent studies have also shown that AdaBoost can achieve stable results with relatively low complexity in scenarios such as intelligent diagnosis of power equipment, which provides a basis for constructing an AdaBoost framework for insulator detection in this work20.

AdaBoost, also known as Adaptive Boosting, is the most prominent example of Boosting. The classical AdaBoost algorithm was introduced by Freund and Schapire21 and has been used effectively in insulator detection, as demonstrated by22. Real AdaBoost23, which extends the classical AdaBoost, uses a confidence-rated weak classifier instead of a discrete one. Furthermore, Friedman24 proposed another modified version called Gentle AdaBoost, which is obtained by using Newton–Raphson stepping instead of exact optimization at each step. Gentle AdaBoost was employed for insulator detection in25, achieving an accuracy of 94.2%. These results show that AdaBoost-type algorithms reach high accuracy in insulator detection and have continued to advance. Therefore, in response to the practical demand for a fast, accurate, and lightweight insulator detection model, this paper studies the AdaBoost algorithm.

In this paper, we develop an ensemble learning method for insulator detection in complex environments. First, we propose a modified method called Log AdaBoost, in which a Polylog loss function is adopted instead of the Exponential loss, because the Polylog loss is more tolerant of outliers. Second, unlike classical AdaBoost, which optimizes each weak classifier by coordinate descent, our method optimizes the classifier by gradient descent. Gradient descent and coordinate descent are two common optimization methods in boosting. Coordinate descent, also called forward stagewise, updates one coefficient at a time and usually uses small steps. Gradient descent fits the negative gradient of the loss at each iteration and then selects the step size by line search. We adopt gradient descent because it directly minimizes our Polylog loss through residual fitting, ensures that the empirical risk decreases with line search, and is more stable under noise in our datasets26. This view is standard in gradient boosting and has well-studied convergence27. In addition, we adopt a relatively mild weight updating strategy under gradient descent optimization and prove the convergence of Log AdaBoost. Finally, classical Haar-like features compute intensity differences in fixed rectangular patterns. They are sensitive to pose and illumination, which limits their ability to represent the curved, repetitive shed–cap geometry of insulator strings28. In aerial scenes, simple edge or line cues respond strongly to wires, towers, and vegetation, leading to unstable edges and false line detections under complex backgrounds29. Therefore, guided by insulator characteristics, we retain the edge and linear features of the Haar-like feature set but remove the central feature. A new neighborhood feature is designed to describe the difference between insulator cap and background pixels.

The novelty of this work can be summarized as follows:

·An AdaBoost algorithm with the Polylog loss function, called Log AdaBoost, is proposed. Compared with the Exponential loss, our method has higher tolerance to outliers and is more robust under complex backgrounds.

·A more moderate weight update strategy is adopted. Each weak classifier is optimized by gradient descent, and the mathematical proof is presented.

·A new neighborhood feature is proposed. This Haar-like feature emphasizes the pixel difference between the insulator cap and the background. In addition, we remove the central and diagonal features to reduce the redundancy of the feature set.

Polylog loss function

As a greedy optimization of an Exponential loss function, the classical AdaBoost pays too much attention to difficult instances, such as noisy samples, during boosting30. To improve the generalization ability of the AdaBoost algorithm, an asymmetric loss function was introduced in31, and the misclassification of instances with small margins was penalized in32. In this paper, the loss function of AdaBoost is redefined with a polylogarithm to improve generalization. The polylogarithm of order \(s>0\) is defined for \(u\in (0,1)\) by

$$Li_{s} (u) = \sum\nolimits_{k = 1}^{\infty } {\frac{{u^{k} }}{{k^{s} }}}$$
(1)

Given a sample \(({x}_{i},{y}_{i})\) with \({x}_{i}\in {\mathbb{R}}^{n}\) and \({y}_{i}\in \{-1,+1\}\), let the margin be \({z}_{i}={y}_{i}F({x}_{i})\). In this work we instantiate the loss as the dilogarithm: \({\ell}({z}_{i})=-{\text{Li}}_{2}(-{e}^{-{z}_{i}})\). The empirical risk used below is given by (2), and its derivative by (3). The loss curve is shown in Fig. 1. When \({y}_{i}F({x}_{i})<0\) the sample is misclassified; when \({y}_{i}F({x}_{i})>0\) it is correctly classified.

Fig. 1 The difference of loss functions.

We position the Polylog loss against two standard robust losses in terms of outlier tolerance. Huber loss has a bounded slope in the tails (controlled by \(\delta\)), so its influence is bounded but does not redescend to zero; performance depends on \(\delta\). Tukey’s biweight is redescending (its score decreases to zero beyond a cutoff \(c\)) but is non-convex and requires choosing \(c\). In contrast, the Polylog loss used here is smooth, convex, and parameter-free: the magnitude of its derivative, \(\ln (1+{e}^{-z})\), grows only linearly as the margin \(z\) becomes strongly negative, whereas that of the Exponential loss grows exponentially. Its influence therefore does not redescend, but extreme negative margins are weighted far more mildly than in classical AdaBoost, which matches the boosting view of minimizing empirical risk while improving tolerance to outliers.

All the loss functions in Fig. 1 can be regarded as approximations of the 0–1 loss function. Because they are continuous and convex, they can replace the 0–1 loss function for optimization. When \({y}_{i}F({x}_{i})\) is negative, the instance weights are amplified: exponentially in the classical AdaBoost, but only slightly in our method. From the perspective of model training, the classical AdaBoost algorithm tends to focus on misclassified instances and assigns them higher weights. When noisy instances are present, their weights are therefore amplified exponentially. Excessive attention to noisy instances may lead to poor predictions on normal instances and thus reduce accuracy.
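To make this contrast concrete, the following minimal sketch compares the instance weight implied by the Exponential loss, exp(−z), with the weight ln(1 + exp(−z)) implied by the Polylog loss, for a few margins z = y F(x). The function names and the sample margins are illustrative assumptions and are not taken from the original implementation.

```python
import math

def exp_weight(z):
    """Instance weight under the Exponential loss: exp(-z)."""
    return math.exp(-z)

def log_weight(z):
    """Instance weight under the Polylog loss: ln(1 + exp(-z)), computed stably."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Illustrative margins, from strongly misclassified to confidently correct.
for z in (-8.0, -4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+5.1f}  exp weight={exp_weight(z):10.3f}  log weight={log_weight(z):7.3f}")
```

For a strongly misclassified instance the exponential weight grows by orders of magnitude, while the logarithmic weight grows roughly in proportion to |z|.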

Our Polylog loss function is then optimized by gradient descent. The training instances are (x1, y1), …, (xN, yN), where N is the number of instances, and xi and yi belong to the instance space X and the label space Y, respectively.

Definition and derivative (for completeness). We use the dilogarithm instantiation of the Polylog loss: \({\ell}(z)=-{\text{Li}}_{2}(-{e}^{-z})\) with \(z={y}_{i}F({x}_{i})\), where \({\text{Li}}_{s}(u)={\sum }_{k=1}^{\infty }{u}^{k}/{k}^{s}\). Using \(\frac{d}{dt}{\text{Li}}_{s}(t)={\text{Li}}_{s-1}(t)/t\) and \(t=-{e}^{-z}\) with \(dt/dz={e}^{-z}=-t\), we obtain \(\frac{d{\ell}(z)}{dz}=-\frac{{\text{Li}}_{1}(t)}{t}\cdot (-t)={\text{Li}}_{1}(-{e}^{-z})=-\text{ln}(1+{e}^{-z})\), where the last step uses \({\text{Li}}_{1}(u)=-\text{ln}(1-u)\). Hence the gradient used in (3) is \(\frac{\partial L(F)}{\partial F({x}_{i})}=-{y}_{i}\,\text{ln}(1+{e}^{-{y}_{i}F({x}_{i})})\), which keeps the notation consistent with (2)–(3).
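The derivative above can be checked numerically. The sketch below evaluates the loss from the series in (1) and compares a central finite difference with the closed form −ln(1 + exp(−z)). Two points are assumptions of this sketch rather than part of the paper: for z < 0 the series argument exceeds 1, so the standard dilogarithm inversion identity is used, and all function names are hypothetical.

```python
import math

def dilog_neg(x, max_terms=5000):
    """Li_2(-x) for 0 <= x <= 1, summed from the defining series in (1)."""
    total, power = 0.0, 1.0
    for k in range(1, max_terms + 1):
        power *= -x
        term = power / (k * k)
        total += term
        if abs(term) < 1e-17:
            break
    return total

def polylog_loss(z):
    """l(z) = -Li_2(-exp(-z))."""
    if z >= 0.0:
        return -dilog_neg(math.exp(-z))
    # Assumption: for z < 0 use the inversion identity
    # Li_2(-x) + Li_2(-1/x) = -pi^2/6 - ln(x)^2 / 2  (x > 0).
    return math.pi ** 2 / 6 + 0.5 * z * z + dilog_neg(math.exp(z))

def polylog_loss_grad(z):
    """dl/dz = -ln(1 + exp(-z)), evaluated without overflow."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def central_difference(f, z, h=1e-5):
    """Second-order finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-3.0, -1.2, 0.5, 2.0):
    print(z, central_difference(polylog_loss, z), polylog_loss_grad(z))
```

The two printed derivative columns should agree closely at each margin, which supports the sign of the derivative used in (3).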

In each iteration, we choose the basis function which is closest in direction to the negative gradient of the loss function. Our loss function is

$$L(F) = \sum\limits_{i = 1}^{N} {( - {\text{Li}}_{2} ( - \exp ( - y_{i} F(x_{i} ))))}$$
(2)

Its partial derivatives with respect to F(xi) are

$$\frac{\partial L(F)}{{\partial F(x_{i} )}} = - y_{i} \ln (1 + \exp ( - y_{i} F(x_{i} )))$$
(3)

Iteratively updating F(xi) requires finding an appropriate basis function h(xi) such that

$$F\left( {x_{i} } \right)\, \leftarrow \,F\left( {x_{i} } \right)\, + \,h\left( {x_{i} } \right)$$
(4)

We need to maximize the inner product of the basis function with the negative gradient of the loss function:

$$- \nabla L(F) \cdot h(x_{i} )$$
(5)

equivalently

$$\sum\limits_{i = 1}^{N} {[y_{i} h(x_{i} )} \ln (1 + {\text{e}}^{{ - y_{i} F(x_{i} )}} )]$$
(6)

Then, the instance i is weighted by

$$D(i) = \ln (1 + {\text{e}}^{{ - y_{i} F(x_{i} )}} )$$
(7)

where D represents the instance weight distribution. The basis function h(xi) that is most strongly correlated with the instance label yi is thus found under the current weight distribution D(i). Since the inner product measures how well two vectors are aligned, the key to the optimization is to choose the basis function that maximizes the inner product in Eq. (6). In this sense, the Polylog loss function is optimized by gradient descent.
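The selection rule in (6) and (7) can be illustrated with a small sketch that searches decision stumps for the one maximizing the weighted correlation between h(xi) and yi. The exhaustive stump search, the toy data, and the names `log_weights` and `best_stump` are assumptions made for illustration; the paper's own weak learner is a CART.

```python
import math

def log_weights(y, F):
    """D(i) = ln(1 + exp(-y_i F(x_i))) as in (7), computed stably."""
    return [max(-yi * fi, 0.0) + math.log1p(math.exp(-abs(yi * fi)))
            for yi, fi in zip(y, F)]

def best_stump(X, y, D):
    """Return (score, feature, threshold, polarity) maximizing sum_i D(i) y_i h(x_i)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (+1, -1):
                # h(x) = polarity if x[j] > thr, otherwise -polarity
                score = sum(d * yi * (polarity if row[j] > thr else -polarity)
                            for row, yi, d in zip(X, y, D))
                if best is None or score > best[0]:
                    best = (score, j, thr, polarity)
    return best

# Toy data (illustrative only): feature 0 separates the two classes.
X = [[0.1, 2.0], [0.4, 1.5], [0.6, 0.3], [0.9, 0.1]]
y = [-1, -1, +1, +1]
F = [0.0, 0.0, 0.0, 0.0]          # additive model before the first iteration
D = log_weights(y, F)
score, feature, threshold, polarity = best_stump(X, y, D)
print(score, feature, threshold, polarity)
```

With F = 0 every instance has the same weight ln 2, so the first stump simply maximizes the unweighted agreement with the labels; the weights only start to differ after the first update.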

This can also be understood from another angle. Minimizing our loss function is equivalent to requiring that the value of the loss function decrease each time a basis function is added. That is,

$$\Delta L\, = \,L\left( {F\, + \,h} \right) - L\left( F \right)\, < \,0$$
(8)

In implementation we update the additive model by \({F}_{t+1}(x)={F}_{t}(x)+{\eta }_{t}{h}_{t}(x)\), where \({\eta }_{t}\) is computed from the learning rate in (12). The backtracking rule in Sect. “How does Log AdaBoost work?” chooses \({\eta }_{t}\) such that \(L({F}_{t+1})\le L({F}_{t})\). For clarity, we sketch the main steps that lead to the upper bound of the loss change \(\Delta L\). Substituting \({F}_{t+1}({x}_{i})={F}_{t}({x}_{i})+{\eta }_{t}{h}_{t}({x}_{i})\) into (2) gives \(\Delta L=L({F}_{t+1})-L({F}_{t})=\sum_{i=1}^{N}[{\ell}({y}_{i}({F}_{t}({x}_{i})+{\eta }_{t}{h}_{t}({x}_{i})))-{\ell}({y}_{i}{F}_{t}({x}_{i}))].\) By the mean value theorem, for each \(i\) there exists \({\xi }_{i}\in (0,1)\) such that \({\ell}({y}_{i}({F}_{t}({x}_{i})+{\eta }_{t}{h}_{t}({x}_{i})))-{\ell}({y}_{i}{F}_{t}({x}_{i}))={\eta }_{t}\,{\ell}^{\prime}({y}_{i}({F}_{t}({x}_{i})+{\xi }_{i}{\eta }_{t}{h}_{t}({x}_{i})))\,{y}_{i}{h}_{t}({x}_{i}).\) Using the derivative in (3) and the definition of the weight distribution \({D}_{t}(i)\) in (7), the first-order term in \(\Delta L\) can be written as a weighted inner product between the negative gradient and \({h}_{t}({x}_{i})\). In addition, the boundedness of \({h}_{t}({x}_{i})\) and of the second derivative of \({\ell}(\cdot )\) allows us to bound the remainder term. Combining these bounds yields the upper bound for the loss change reported in (9).

$$\begin{array}{c}\Delta L=L(F+h)-L(F)\\ =\sum_{i=1}^{N} \{{-\text{Li}}_{2}(-{\text{e}}^{-{y}_{i}(F({x}_{i})+h({x}_{i}))})\}-\sum_{i=1}^{N} \{{-\text{Li}}_{2}(-{\text{e}}^{-{y}_{i}F({x}_{i})})\}\\ =\sum_{i=1}^{N} \{\sum_{k=1}^{\infty } \frac{(-1{)}^{k}{\text{e}}^{-k{y}_{i}F({x}_{i})}}{{k}^{2}}-\sum_{k=1}^{\infty } \frac{(-1{)}^{k}{\text{e}}^{-k{y}_{i}F({x}_{i})}\cdot {\text{e}}^{-k{y}_{i}h({x}_{i})}}{{k}^{2}}\}\\ =\sum_{i=1}^{N} \{\sum_{k=1}^{\infty } \frac{(-1{)}^{k}{\text{e}}^{-k{y}_{i}F({x}_{i})}}{{k}^{2}}(1-{\text{e}}^{-k{y}_{i}h({x}_{i})})\}\\ \simeq \sum_{i=1}^{N} \{\sum_{k=1}^{\infty } \frac{{\left(-1\right)}^{k}{\text{e}}^{-k{y}_{i}F({x}_{i})}}{{k}^{2}}(1-1+k{y}_{i}h({x}_{i})-\frac{1}{2}{k}^{2}{y}_{i}^{2}{h}^{2}({x}_{i}))\}\\ \le \sum_{i=1}^{N} \{\sum_{k=1}^{\infty } \frac{(-1{)}^{k}{\text{e}}^{-k{y}_{i}F({x}_{i})}}{k}\cdot {y}_{i}h({x}_{i})\}\\ =-\sum_{i=1}^{N} \{ln(1+{\text{e}}^{-{y}_{i}F({x}_{i})})\cdot {y}_{i}h({x}_{i})\}\\ =-\sum_{i=1}^{N} \{D(i)\cdot {y}_{i}h({x}_{i})\}\end{array}$$
(9)

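The right-hand side of (9) is strictly negative whenever the chosen basis function is positively correlated with the labels under the current weights, that is, whenever \(\sum_{i=1}^{N}D(i){y}_{i}h({x}_{i})>0\). Together with the backtracking rule, which rejects any step that does not reduce the loss, this means that the empirical risk \(L({F}_{t})\) does not increase with the iteration number until convergence.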

Log AdaBoost algorithm

In this paper, binary classification problems are considered, and a classification and regression tree (CART) is used as the weak learner. The training set is S = {(x1, y1), (x2, y2), …, (xN, yN)} with instance \({x}_{i}\in {\mathbb{R}}^{n}\) and label yi ∈ {−1, +1}, where N is the number of instances.

How does Log AdaBoost work?

In Log AdaBoost, the strong classifier is a linear combination of weak classifiers, as follows:

$${F}_{t}\left({x}_{i}\right)={\sum }_{\tau =1}^{t}{f}_{\tau }({x}_{i})$$
(10)

where Ft(xi) is the strong classifier and t is the current iteration number. The binarized strong classifier is

$$H(x_{i} ) = {\text{sign}} [\sum\limits_{t = 1}^{T} {f_{t} (x_{i} )} - b]$$
(11)

where T represents the maximum iteration number, b is a threshold with the default value zero.
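As a small illustration of (10) and (11), the sketch below sums the confidences of the weak classifiers for one instance and applies the threshold b. Mapping a score of exactly zero to +1 is an assumption of this sketch, since the text does not specify how ties are resolved, and the function names are hypothetical.

```python
def strong_score(confidences):
    """F_T(x) = sum_t f_t(x), Eq. (10)."""
    return sum(confidences)

def strong_classify(confidences, b=0.0):
    """H(x) = sign(F_T(x) - b), Eq. (11); a tie at zero maps to +1 (assumption)."""
    return 1 if strong_score(confidences) - b >= 0 else -1

# Confidences f_1(x), f_2(x), f_3(x) of three weak classifiers for one instance.
confidences = [0.4, -0.1, 0.3]
print(strong_classify(confidences))          # default threshold b = 0
print(strong_classify(confidences, b=1.0))   # a stricter threshold
```

Raising b trades recall for a lower false positive rate, which is why the threshold is kept as an explicit parameter.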

The confidence of the instance \({x}_{i}\) is defined as

$${f}_{t}({x}_{i})=(1+\mu )\frac{{W}_{t+}-{W}_{t-}}{{W}_{t+}+{W}_{t-}}$$
(12)

In (12), \(\mu \in [0,1]\) serves as the learning rate that scales the gradient-descent step. Unless otherwise noted, we set \(\mu =0.5\) in all experiments. At each iteration we accept the update only if the empirical risk decreases, i.e., \(L({F}_{t}+{f}_{t})<L({F}_{t})\); otherwise we back off by halving \(\mu\) and recomputing \({f}_{t}\), up to three times. This simple backtracking ensures monotone decrease without changing the update rule. \({W}_{t+}\) and \({W}_{t-}\) denote the sums of the weights of the positive and negative instances at the current iteration, respectively, and are calculated by

$${W}_{t+}=\sum_{i:{x}_{i}\in {S}_{j}}{D}_{t}(i)({y}_{i}=+1\cap {h}_{t}({x}_{i}))$$
(13)
$${W}_{t-}=\sum_{i:{x}_{i}\in {S}_{j}}{D}_{t}(i)({y}_{i}=-1\cap {h}_{t}({x}_{i}))$$
(14)

where Dt(i) = wi and the weight distribution Dt = {w1, w2, …, wi, …, wN}, ht(xi) represents a weak classifier for the instance \({x}_{i}\) at the iteration t, \({S}_{j}\) represents the \({j}^{th}\) partition in training a CART.
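The confidence in (12) to (14) can be sketched for a single partition as follows. This sketch assumes that \({W}_{t+}\) and \({W}_{t-}\) sum the weights of the positive and negative instances that fall into the partition, and it adds a small constant `eps` to the denominator to avoid division by zero; both the helper name and `eps` are hypothetical and not part of the paper.

```python
def partition_confidence(weights, labels, in_partition, mu=0.5, eps=1e-12):
    """Confidence (12) of one partition S_j from the weight sums (13) and (14)."""
    w_pos = sum(w for w, yl, m in zip(weights, labels, in_partition) if m and yl == +1)
    w_neg = sum(w for w, yl, m in zip(weights, labels, in_partition) if m and yl == -1)
    return (1.0 + mu) * (w_pos - w_neg) / (w_pos + w_neg + eps)

# Illustrative weights D_t(i); the last instance lies outside the partition.
weights = [0.1, 0.3, 0.2, 0.4]
labels = [+1, +1, -1, +1]
in_partition = [True, True, True, False]
print(partition_confidence(weights, labels, in_partition))
```

Under these assumptions the confidence lies in [−(1 + μ), 1 + μ]; its sign gives the predicted class of the partition and its magnitude reflects how pure the partition is under the current weights.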

For the weight updating in our algorithm, a relatively mild strategy is adopted as follows:

$$D_{t + 1} (i) = \frac{1}{{Z_{t} }}\ln (1 + {\text{e}}^{{ - y_{i} F_{t} (x_{i} )}} )$$
(15)

where \({Z}_{t}\) is a normalization factor. Let \({\widetilde{w}}_{i}^{(t+1)}=\ln (1+{e}^{-{y}_{i}{F}_{t}({x}_{i})})\) denote the unnormalized weight, i.e., the right-hand side of (15) without the factor \(1/{Z}_{t}\). We set the normalization factor as

$${Z}_{t}=\sum_{j=1}^{N}{\widetilde{w}}_{j}^{(t+1)},\quad {w}_{i}^{(t+1)}=\frac{{\widetilde{w}}_{i}^{(t+1)}}{{Z}_{t}}$$
(16)

By construction,

$$\sum_{i=1}^{N}{w}_{i}^{(t+1)}=\frac{1}{{Z}_{t}}\sum_{i=1}^{N}{\widetilde{w}}_{i}^{(t+1)}=1$$
(17)

Since \(\ln (1+{e}^{-{y}_{i}{F}_{t}({x}_{i})})>0\) for every finite margin, all unnormalized weights in (15) are strictly positive, so \({Z}_{t}>0\) and the normalization is well defined.

Compared with the exponential update in classic AdaBoost, the weight in (15) grows only asymptotically linearly in the negative margin \(-{y}_{i}{F}_{t}({x}_{i})\). As a result, the weights of misclassified instances grow only mildly instead of exponentially, and the distribution \({D}_{t}\) is less dominated by a few very hard or noisy samples. Intuitively, this moderated update is expected to stabilize the optimization trajectory and improve generalization. Sec. “Evaluation on UCI Database” and Sec. “Evaluation on the dataset of anti-breast cancer drug candidates (ACDC)” further quantify this effect in terms of convergence speed and test error. The pseudocode of Log AdaBoost is given in Table 1.
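The update in (15) to (17) can be sketched in a few lines. The function name and the example scores are assumptions made for illustration only.

```python
import math

def update_weights(y, F):
    """Return the normalized weights (15)-(17) and the normalization factor Z_t."""
    # Unnormalized weights ln(1 + exp(-y_i F_t(x_i))), computed without overflow.
    raw = [max(-yi * fi, 0.0) + math.log1p(math.exp(-abs(yi * fi)))
           for yi, fi in zip(y, F)]
    Z = sum(raw)
    return [r / Z for r in raw], Z

# Illustrative labels and current scores F_t(x_i); the third instance is badly misclassified.
y = [+1, -1, +1, -1]
F = [2.0, -1.5, -3.0, 0.5]
weights, Z = update_weights(y, F)
print(weights, Z)
```

The misclassified instances receive the largest weights, but the ratio between the largest and smallest weight stays moderate compared with an exponential update on the same margins.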

Table 1 The Log AdaBoost algorithm.
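For orientation, the steps described in this section can be assembled into a minimal runnable sketch. It is not the authors' MATLAB implementation. The assumptions of this sketch are: two-leaf stumps stand in for CART partitions, the split is chosen by the alignment score in (6), the inversion identity of the dilogarithm is used for negative margins, training stops when no backed-off step reduces the risk, and the toy data and all names are hypothetical.

```python
import math

def dilog_neg(x, terms=2000):
    """Li_2(-x) for 0 <= x <= 1 from the series in (1)."""
    total, power = 0.0, 1.0
    for k in range(1, terms + 1):
        power *= -x
        total += power / (k * k)
    return total

def loss(z):
    """Polylog loss -Li_2(-exp(-z)); the inversion identity handles z < 0."""
    if z >= 0.0:
        return -dilog_neg(math.exp(-z))
    return math.pi ** 2 / 6 + 0.5 * z * z + dilog_neg(math.exp(z))

def risk(y, F):
    """Empirical risk (2)."""
    return sum(loss(yi * fi) for yi, fi in zip(y, F))

def weight(z):
    """Unnormalized weight ln(1 + exp(-z)) from (15)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def fit_stump(X, y, D, mu):
    """Pick the split whose leaf confidences (12) align best with the labels (6)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            leaves, score = [], 0.0
            for side in (False, True):          # False: x[j] <= thr, True: x[j] > thr
                wp = sum(d for r, yi, d in zip(X, y, D) if (r[j] > thr) == side and yi > 0)
                wn = sum(d for r, yi, d in zip(X, y, D) if (r[j] > thr) == side and yi < 0)
                f = (1.0 + mu) * (wp - wn) / (wp + wn + 1e-12)
                leaves.append(f)
                score += f * (wp - wn)
            if best is None or score > best[0]:
                best = (score, j, thr, leaves[0], leaves[1])
    return best[1:]

def predict_stump(stump, row):
    j, thr, f_left, f_right = stump
    return f_right if row[j] > thr else f_left

def log_adaboost(X, y, rounds=20, mu0=0.5, max_backoff=3):
    F, model = [0.0] * len(X), []
    for _ in range(rounds):
        raw = [weight(yi * fi) for yi, fi in zip(y, F)]
        Z = sum(raw)
        D = [r / Z for r in raw]                 # Eqs. (15)-(17)
        mu, accepted = mu0, False
        for _ in range(max_backoff + 1):
            stump = fit_stump(X, y, D, mu)
            F_new = [fi + predict_stump(stump, r) for fi, r in zip(F, X)]
            if risk(y, F_new) < risk(y, F):      # accept only if the risk decreases
                F, accepted = F_new, True
                model.append(stump)
                break
            mu *= 0.5                            # back off by halving mu
        if not accepted:
            break                                # assumption: stop when no step helps
    return model

def classify(model, row, b=0.0):
    return 1 if sum(predict_stump(s, row) for s in model) - b >= 0 else -1

# Toy data (illustrative only): feature 0 is informative, feature 1 is not.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
     [0.6, 3.0], [0.7, 1.5], [0.8, 4.5], [0.9, 2.5]]
y = [-1, -1, -1, -1, +1, +1, +1, +1]
model = log_adaboost(X, y, rounds=5)
print([classify(model, row) for row in X])
```

The acceptance test inside the loop is what enforces the monotone decrease of the empirical risk; the rest of the loop is an ordinary confidence-rated boosting round with the logarithmic weights.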

Margin in Log AdaBoost

Gradient descent optimization is one of the advantages that distinguish Log AdaBoost from classic AdaBoost. In Log AdaBoost, the weights of misclassified instances are amplified linearly rather than exponentially, which keeps the model from being too sensitive to noise and benefits generalization. In addition, the generalization performance of the model is closely related to whether the model is over-fitted. At the beginning of training, the training error decreases sharply as the number of iterations increases, as does the testing error. Once the training error has converged, the behavior of the testing error depends on whether the model is over-fitted. As shown in Fig. 2 (a) and (b), although the training error converges to zero, the testing error continues to decrease with further iterations when the model is not over-fitted. When the model is over-fitted, however, the testing error increases instead.

Fig. 2 Changing curves of the training error and the test error.

The ability of the AdaBoost algorithm to avoid overfitting is related to the margin, which quantifies the confidence of a prediction. Larger margins on the training data set indicate better generalization performance of the model. In Log AdaBoost, the classification margin is defined as the difference between the prediction confidence of the weak classifiers that classify the instance correctly and that of the weak classifiers that misclassify it32. Given a training instance xi, its margin is calculated as follows:

$$Margin(x_{i} ) = \frac{{\sum\limits_{t = 1}^{T} {y_{i} f_{t} (x_{i} )} }}{{\sum\limits_{t = 1}^{T} {\left| {f_{t} (x_{i} )} \right|} }}$$
(18)

The classification margin is restricted to [−1, 1], and the margin is positive if and only if the strong classifier correctly classifies the instance. [Mar] implied that the larger the margins of the instances, the stronger the generalization ability of the classification model. It has also been proved that, with probability at least \(1-\delta\), the generalization error \({\varepsilon }_{g}\) satisfies the following upper bound:

$${\varepsilon }_{g}\le {P}_{x\in S}\left\{yH\left(x\right)\le \text{m}\right\}+O(\sqrt{\frac{d}{N{m}^{2}}+ln\frac{1}{\delta }})$$
(19)

where m is a positive margin threshold over the training data S, δ is linked to the confidence of the bound, and d represents the complexity of the training model. The bound shows that the generalization error is negatively correlated with the margin.

In Log AdaBoost, the margin value will increase with the increase of training iteration. Therefore, the upper bound of generalization error will decrease and the generalization performance of the model will be improved. To make this more specific, the cumulative distributions of margins after 20, 80 and 320 iterations for the training data of Diabetes have been plotted in Fig. 3. It can be seen that when the iteration reaches 80, the training error converges to zero, meanwhile, the minimum margin changes from negative to positive. We will further verify this view from the following derivation.

Fig. 3 Cumulative distributions of margins for different iterations.
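A cumulative margin distribution of this kind can be computed directly from (18). The sketch below uses made-up confidences for four instances; these values, the helper names, and the convention that the margin is 0 when every confidence is 0 are assumptions for illustration and do not come from the Diabetes experiment.

```python
def margin(y_i, confidences):
    """Normalized margin (18): sum_t y_i f_t(x_i) / sum_t |f_t(x_i)|."""
    denom = sum(abs(f) for f in confidences)
    if denom == 0.0:
        return 0.0  # assumption: define the margin as 0 when every f_t(x_i) is 0
    return sum(y_i * f for f in confidences) / denom

def margin_cdf(margins, thresholds):
    """Fraction of training instances whose margin is <= each threshold."""
    n = len(margins)
    return [sum(m <= thr for m in margins) / n for thr in thresholds]

# Illustrative labels and per-instance confidences f_1, f_2, f_3.
labels = [+1, -1, +1, -1]
per_instance_confidences = [
    [0.8, 0.5, -0.1],
    [-0.6, 0.2, -0.4],
    [0.3, -0.5, -0.2],
    [-0.7, -0.7, -0.1],
]
margins = [margin(yi, fs) for yi, fs in zip(labels, per_instance_confidences)]
print(margins)
print(margin_cdf(margins, [-0.5, 0.0, 0.5, 1.0]))
```

The value of the cumulative distribution at a threshold of zero equals the fraction of training instances that are not yet classified with a positive margin.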

We now verify the relationship between the convergence of the training error to zero and the change of the margin from negative to positive.

Let the training error εt be defined as

$$\varepsilon_{t} = \frac{1}{N}\sum\limits_{i = 1}^{N} {I(H(x_{i} ) \ne y_{i} )} = \frac{1}{N}\sum\limits_{i = 1}^{N} {I({\text{sign}}[\sum\limits_{t = 1}^{T} {f_{t} (x_{i} )} ] \ne y_{i} )}$$
(20)

where I(·) is the indicator function, i.e., the 0–1 loss.

As the training iteration increases, the training error εt converges to 0, then for any instance xi:

$$H(x_{i} ) = y_{i}$$
(21)

Consequently, the following inequality holds:

$$y_{i} \cdot F(x_{i} ) = y_{i} \sum\limits_{t = 1}^{T} {f_{t} (x_{i} )}> 0$$
(22)

Thus, according to the definition of margin, the margin for any instance xi must be positive and vice versa. Therefore, we can deduce that the training error of Log AdaBoost converges to 0 if and only if the margins of all training instances are increased to be positive.
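The equivalence between (20) and (22) can be checked on a small example. The scores below are made up for illustration, and the function names are hypothetical. The sketch also assumes sign(0) = +1, so an instance with a score of exactly zero is a tie case that falls outside the strict inequality in (22).

```python
def sign(v):
    """Sign with the tie convention sign(0) = +1 (assumption)."""
    return 1 if v >= 0 else -1

def training_error(labels, scores):
    """Eq. (20): fraction of instances with sign(F(x_i)) != y_i."""
    return sum(sign(s) != yi for yi, s in zip(labels, scores)) / len(labels)

def all_margins_positive(labels, scores):
    """Eq. (22): y_i F(x_i) > 0 for every training instance."""
    return all(yi * s > 0 for yi, s in zip(labels, scores))

labels = [+1, -1, +1]
early_scores = [0.4, 0.3, -0.2]    # illustrative F(x_i) after a few iterations
late_scores = [1.2, -0.9, 0.5]     # illustrative F(x_i) after more iterations
print(training_error(labels, early_scores), all_margins_positive(labels, early_scores))
print(training_error(labels, late_scores), all_margins_positive(labels, late_scores))
```

In the early state two instances have negative margins and the error is non-zero; in the late state every margin is positive and the error is zero.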

The above convergence analysis is carried out under the binary classification setting with labels \({y}_{i}\in \{-1,+1\}\) and a real-valued additive model \(F(x)\), which is also the setting used in our experiments on the UCI, ACDC and CPLID datasets. For multi-class problems that are decomposed into several binary sub-tasks (for example, one-vs-rest coding), each sub-task still satisfies the same conditions on the Polylog loss and on the step size, so the monotone decrease of the empirical risk holds for each binary classifier. Extending the convergence proof to direct multi-class formulations with vector-valued margins is in principle possible because the Polylog loss is defined in terms of the margin, but this extension is beyond the scope of the present paper.

Experiment

In this section, we evaluate the performance of the Log AdaBoost algorithm on two different databases, UCI and the anti-breast cancer drug candidates (ACDC) dataset, and analyze the experimental results. Log AdaBoost was implemented in MATLAB 2021b and executed on an AMD 3500U CPU with 8 GB of RAM.

Purpose and relevance note. The UCI experiments are used as an algorithmic sanity check to examine optimization stability (Convergence-Cost) and generalization on heterogeneous binary tasks. They are not intended to replace insulator imagery. Task-relevant validation is provided on ACDC (Sect. “Evaluation on the dataset of anti-breast cancer drug candidates (ACDC)”) and the CPLID insulator images (Sect. “Insulator Detection”). Among the UCI sets, several datasets contain image-/signal-derived features (e.g., Banknote, Raisin, WDBC, Ionosphere, Sonar), which are closer to visual recognition. For Banknote and WDBC, the features are simple statistical descriptors (such as variance, skewness, texture and perimeter) computed from gray-scale images; Raisin consists of basic shape descriptors (area, major/minor axis length, eccentricity, etc.) extracted from raisin images; Ionosphere and Sonar encode the response of radar or sonar signals over time or frequency bands. The CPLID insulator detector in (Sect. “Insulator Detection”) also relies on low-level descriptors, namely Haar-like contrast features and local edge responses computed on fixed windows. Therefore, these image/signal-derived UCI sets and the CPLID features share a similar construction pipeline: raw images or waveforms are passed through simple filters and then summarized into low-dimensional descriptors that describe local intensity variation and boundary structure. The remaining purely tabular UCI sets serve to test robustness across non-visual domains.

Evaluation on UCI database

In this subsection, we evaluate the performance of Log AdaBoost (L-AB) on 20 binary classification datasets from UCI. Three-fold cross-validation is used, which means that two thirds of the data are used for training and one third for testing. Modest AdaBoost (M-AB)33 and Gentle AdaBoost (G-AB) are selected for comparison. For all three methods, tree stumps are used as weak classifiers. The training and test errors of the different methods on Diabetes, which contains 500 positive instances and 268 negative instances, are shown in Fig. 4. Curves of the training and test errors on the other 19 binary classification datasets are presented in the appendix. Log AdaBoost outperforms the other two methods in terms of both the training and test errors.
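The three-fold protocol can be sketched as a simple index generator. Shuffling with a fixed seed and the absence of class stratification are assumptions of this sketch, since the text does not state how the folds were drawn; the function name is hypothetical.

```python
import random

def three_fold_splits(n_samples, seed=0):
    """Yield (train_indices, test_indices) for three-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumption: random, unstratified folds
    folds = [indices[k::3] for k in range(3)]
    for k in range(3):
        test = folds[k]
        train = [i for m in range(3) if m != k for i in folds[m]]
        yield train, test

# Diabetes has 500 + 268 = 768 instances.
splits = list(three_fold_splits(768))
for train, test in splits:
    print(len(train), len(test))
```

Each instance appears in exactly one test fold, so every fold trains on two thirds of the data and tests on the remaining third.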

Fig. 4 Training error and test error on Diabetes.

The “Convergence-Cost” and “Test-Error” of the three algorithms on the 20 binary classification datasets are listed in Table 2. “Convergence-Cost” refers to the number of weak classifiers required for the training error to converge to zero or to a constant, while “Test-Error” refers to the generalization error of the converged strong classifier. The numbers shown in bold represent the best performance among the three algorithms. In terms of Convergence-Cost, L-AB ranks first among the three methods on 11 databases, second on 8 databases, and third only on Ionosphere. G-AB ranks first on 6 databases, second on 10 databases and third on 4 databases. M-AB ranks first on 4 databases, second on 4 databases and third on 12 databases. Log AdaBoost therefore trains considerably faster than the other two methods. In terms of Test-Error, L-AB ranks first on 11 databases and second on the other 9 databases. G-AB ranks first on 6 databases, ranks second on 7 databases and ranks third on 4 databases. M-AB ranks first on 3 databases, ranks second on 4 databases and ranks third on 16 databases. The proposed Log AdaBoost (L-AB) thus has better generalization performance than the other two methods.
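One way to read Convergence-Cost off a training error curve is sketched below. The rule used here, the smallest number of weak classifiers after which the training error no longer changes within a tolerance, is an assumption of this sketch; the paper does not spell out the exact criterion, and the error curves are made up for illustration.

```python
def convergence_cost(train_errors, tol=1e-12):
    """Smallest number of weak classifiers after which the training error stays constant."""
    final = train_errors[-1]
    cost = len(train_errors)
    for t in range(len(train_errors) - 1, -1, -1):
        if abs(train_errors[t] - final) > tol:
            break
        cost = t + 1   # entry t holds the error after t + 1 weak classifiers
    return cost

# Illustrative curves: one converging to zero, one converging to a constant.
to_zero = [0.30, 0.22, 0.15, 0.15, 0.08, 0.0, 0.0, 0.0]
to_constant = [0.30, 0.20, 0.10, 0.10, 0.10]
print(convergence_cost(to_zero), convergence_cost(to_constant))
```

A temporary plateau, such as the repeated 0.15 in the first curve, is not counted as convergence because the error changes again afterwards.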

Table 2 Results on 20 datasets.

These results also provide an intuitive quantitative view of the proposed weight update strategy. On the one hand, Log AdaBoost achieves the lowest or second-lowest Convergence-Cost on 19 out of 20 datasets, which shows that the milder weight growth on misclassified instances does not slow down optimization; instead, it often reaches the zero-training-error regime with fewer weak classifiers. On the other hand, the consistently lower Test-Error indicates that this moderated update avoids over-emphasizing noisy samples and yields better generalization than Gentle AdaBoost and Modest AdaBoost across heterogeneous binary tasks.

In addition, on the image/signal-derived subset (Banknote, Raisin, WDBC, Ionosphere, Sonar), L-AB achieves the lowest test-error on 4 out of 5 datasets (Table 2: Banknote, Raisin, WDBC, Ionosphere), and is comparable on Sonar. Although L-AB does not always give the smallest Convergence-Cost on these sets, the test-error advantage is consistent. This is in line with the feature similarities described above: in all these datasets and in CPLID, the input variables are low-level descriptors extracted from images or signals, and a small portion of samples may contain very large feature values caused by noise, segmentation errors or background structures. The Polylog-based Log AdaBoost downweights such outlying samples without ignoring informative hard examples, so the same type of robustness that improves performance on Banknote, Raisin, WDBC, Ionosphere and Sonar also appears in the CPLID insulator detection task. We therefore use the image/signal-derived UCI datasets as a cross-dataset feature reference to support the transfer of the observed advantages from general image/signal descriptors to the insulator features in (Sect. “Insulator Detection”), while the remaining tabular UCI sets still provide a broad generalization and stability probe.

Evaluation on the dataset of anti-breast cancer drug candidates (ACDC)

To further evaluate the proposed Log AdaBoost, this section reports experiments on the dataset of anti-breast cancer drug candidates. This dataset was provided by the 18th Huawei Cup China Postgraduate Mathematical Modeling Competition and targets the breast cancer therapeutic target ERα, for which the biological activity data of 1974 compounds are provided. Evaluating algorithm performance on this dataset is well accepted. Unlike the UCI datasets, which have only one set of labels each, the ACDC dataset has five labels (Caco-2, CYP3A4, hERG, HOB and MN), four of which (Caco-2, CYP3A4, HOB and MN) are used in this experiment.

The remaining label, hERG, corresponds to cardiotoxicity and shows a markedly more imbalanced and noisy distribution than the ADMET properties of the other four labels: the number of positive compounds is much smaller, and several molecules have uncertain or conflicting annotations in the original competition materials. In our preliminary inspection, all boosting variants produced highly unstable test errors across different splits on hERG, so this label would require additional imbalance-handling techniques and careful label cleaning, which are beyond the scope of this work. In this section, we therefore restrict the detailed comparison to the four labels with more stable behavior (Caco-2, CYP3A4, HOB and MN), which already cover the main characteristics of the dataset (absorption, metabolism, oral bioavailability and genotoxicity) and are sufficient to evaluate the convergence speed and margin distribution of Log AdaBoost. We regard hERG as a more challenging, extreme-imbalance case to be investigated in future work.

We compare the proposed Log AdaBoost with three other AdaBoost variants: Gentle AdaBoost, Modest AdaBoost and Real AdaBoost. To avoid degraded classification, the weak classifiers of all four algorithms are set to a depth of three and the strong classifiers are set with a single node. Three-fold cross-validation is used. The testing-error curves for Caco-2, CYP3A4, HOB and MN are plotted in Fig. 5. Log AdaBoost achieves a lower testing error and converges faster than the other three variants, which indicates that its generalization performance on the ACDC dataset is better than that of the other three methods.
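The three-fold protocol can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `k_fold_indices`, the shuffling and the seed are assumptions, and the text does not state whether the folds were stratified by label.

```python
import numpy as np

def k_fold_indices(n_samples, k=3, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split.

    Hypothetical helper: shuffling and the seed are assumptions,
    not details taken from the paper.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 1974 compounds, as reported for the ACDC dataset.
splits = list(k_fold_indices(1974, k=3))
fold_sizes = [len(test_idx) for _, test_idx in splits]
```

Each compound appears in exactly one test fold, so the test error averaged over the three folds uses every instance once.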

Fig. 5 Generalization performance of four different AdaBoost variants for four labels.

A low false positive rate is an important indicator for a target detection model. The receiver operating characteristic (ROC) curves of the four classifiers are plotted in Fig. 6. Log AdaBoost is clearly better than the other three methods when the false positive rate is below 0.2.

Fig. 6 ROC curves for four different AdaBoost variants.

Sect. “Margin in Log AdaBoost” proved that when the margins of all training instances change from negative to positive, the training error of the AdaBoost variant converges to zero. In this experiment, we trained the model on all 1974 instances of the dataset and stopped training once the margins of all instances became positive. Tables 3, 4, 5 and 6 record three indices for the four AdaBoost variants: training duration, number of weak classifiers and margin variance. In terms of training duration, Log AdaBoost is the fastest of the four methods for Caco-2, CYP3A4 and HOB. For MN, Log AdaBoost needs 14.24 s of training time, slightly more than the 14.09 s required by Real AdaBoost. In terms of the number of weak classifiers, Log AdaBoost ranks first for all labels. Gentle AdaBoost matches Log AdaBoost for MN and ranks second for the other three labels. Real AdaBoost performs the worst of the four methods on this index.

Generalization ability is closely related to the overall distribution of margins: the more concentrated the distribution, the better the generalization performance of the algorithm. Margin variance is therefore adopted to describe the margin distribution. Log AdaBoost has the smallest margin variance and shows the best generalization performance of the four methods. The cumulative distribution of margins over all training instances is plotted in Fig. 7. Log AdaBoost enlarges the margins more than the other three variants, and the enlarged parts in Fig. 3 show that it has fewer training instances with negative margins at the same iteration. Thus, compared with the other three methods, our algorithm converges faster in training error.

Table 3 Performance of four different AdaBoost variants for Caco-2.
Table 4 Performance of four different AdaBoost variants for CYP3A4.
Table 5 Performance of four different AdaBoost variants for HOB.
Table 6 Performance of four different AdaBoost variants for MN.
Fig. 7 Cumulative margin distribution for four labels.

From the viewpoint of weight updating, Tables 3, 4, 5 and 6 show that Log AdaBoost combines fast convergence with a compact ensemble and a more concentrated margin distribution. For all four labels, it requires the smallest number of weak classifiers to drive the training error to zero, and it also attains the smallest margin variance. This means that the proposed mild update reaches the same zero-training-error regime with fewer boosting rounds while keeping most training instances away from the decision boundary, which is known to be beneficial for generalization. Together with the UCI results in Sect. “Evaluation on UCI Database”, these quantitative indicators support the claim that the moderated weight update strategy improves both convergence behavior and generalization error.
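The stopping rule and the margin-variance index can be illustrated with a short sketch. It assumes the standard normalized voting margin, y·Σ α_t h_t(x) / Σ |α_t|, which may differ in detail from the definition in Sect. “Margin in Log AdaBoost”. The names `normalized_margins` and `all_margins_positive` and the toy votes are hypothetical.

```python
import numpy as np

def normalized_margins(y, weak_outputs, alphas):
    """Normalized voting margin of each training instance.

    y            : (n,) labels in {-1, +1}
    weak_outputs : (T, n) outputs of T weak classifiers in [-1, 1]
    alphas       : (T,) combination weights
    Assumption: margin = y * sum_t(alpha_t * h_t(x)) / sum_t(|alpha_t|).
    """
    y = np.asarray(y, dtype=float)
    h = np.asarray(weak_outputs, dtype=float)
    a = np.asarray(alphas, dtype=float)
    score = a @ h  # weighted vote per instance
    return y * score / np.abs(a).sum()

def all_margins_positive(margins):
    """Stopping rule used in the experiment: every margin is positive."""
    return bool(np.all(margins > 0))

# Toy example: 3 weak classifiers, 4 instances (values are illustrative).
y = np.array([1, -1, 1, -1])
weak_outputs = np.array([[1, -1, -1, -1],
                         [1, -1,  1,  1],
                         [1,  1,  1, -1]])
alphas = np.array([1.0, 1.0, 1.0])

margins = normalized_margins(y, weak_outputs, alphas)
margin_variance = float(margins.var())
stop = all_margins_positive(margins)
```

A smaller `margin_variance` corresponds to the more concentrated margin distribution that the tables report for Log AdaBoost.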

Insulator detection

Insulators are important devices that provide insulation and mechanical support in transmission lines. According to ref. 1, accidents caused by insulator defects account for more than 81.3% of all accidents in power systems. Fast and accurate detection of insulators against complex backgrounds has therefore become a research hotspot. In this paper, we develop an ensemble learning method named Log AdaBoost for insulator detection in complex aerial environments. To verify its detection ability, the experiment was carried out on the China Power Line Insulator Dataset (CPLID)3. The CPLID images were captured by UAVs during transmission line inspection and include 248 defective insulator images and 600 normal insulator images.

To enhance the robustness of the model, we applied a fixed 7× augmentation to each labeled insulator image. This factor was chosen as a compromise between enlarging the dataset and avoiding excessive redundancy or distribution drift under the limited size of CPLID. Concretely, for each original image we generated four small-angle rotations (±15°, ±30°), one horizontal flip (geometric transformation), one HSV histogram equalization, and one salt-and-pepper noise version, as shown in Fig. 8. These operations mimic the typical variations in UAV inspection, including viewpoint changes, left–right symmetry, illumination differences and sensor noise. Together with the originals, each image therefore yields eight samples, giving 4,800 normal and 1,984 defective images (6,784 in total) after augmentation. We kept the original class balance and did not oversample the minority class, to avoid changing the data distribution. Finally, the data were divided into training, test and validation sets in an approximately 8:1:1 ratio: 3,840 normal/1,600 defective images for training (5,440 in total), 480/192 for testing (672), and 480/192 for validation (672). These settings make the process reproducible and show that the augmented dataset keeps the original class balance while increasing diversity in a controlled way.
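The counts above can be checked with a few lines of arithmetic. The sketch assumes that the original images are kept alongside their seven augmented copies, which is the reading that reproduces the reported totals; all variable names are illustrative.

```python
# Image counts reported for CPLID in the text.
ORIGINAL = {"normal": 600, "defective": 248}
AUGMENTED_COPIES = 7  # 4 rotations + 1 flip + 1 HSV equalization + 1 noisy copy

# Assumption: originals are kept, so each image contributes 1 + 7 samples.
totals = {name: n * (1 + AUGMENTED_COPIES) for name, n in ORIGINAL.items()}
grand_total = sum(totals.values())

# Split sizes as reported in the text.
SPLIT = {
    "train": {"normal": 3840, "defective": 1600},
    "test": {"normal": 480, "defective": 192},
    "validation": {"normal": 480, "defective": 192},
}

split_sizes = {part: sum(c.values()) for part, c in SPLIT.items()}
split_fractions = {part: size / grand_total for part, size in split_sizes.items()}

# Share of defective images, before augmentation and in each split.
original_defective_share = ORIGINAL["defective"] / sum(ORIGINAL.values())
defective_share = {part: c["defective"] / sum(c.values()) for part, c in SPLIT.items()}
```

The training fraction evaluates to about 0.802, so the 8:1:1 ratio is approximate, and the defective share stays within one percentage point of the original 248/848 in every split.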

Fig. 8 Examples of image processing: (a) geometric transformation, (b) histogram equalization in HSV space, (c) adding salt-and-pepper noise.

Haar-like features and the improvement

Haar-like features are widely used digital image features. In this paper, we use three types of features: edge features, line features and neighborhood features, as shown in Fig. 9. Edge and line features are adopted directly from the standard Haar-like features. Diagonal and center features are discarded because they do not conform to the shape of insulators. To better identify insulators, we designed a new Haar-like feature template named the neighborhood feature, which is a 3 × 3 center operator. For a 3 × 3 window centered at pixel \((x,y)\), we denote the four corner pixels by \({p}_{1}, {p}_{2}, {p}_{3}, {p}_{4}\in {C}_{4}\), and the four central pixels by \({q}_{1}, {q}_{2}, {q}_{3},{q}_{4}\in {M}_{4}\). Here, the “central” pixels refer to the four axis-adjacent pixels directly above, below, left and right of the center, that is, the pixels at positions \((x-1,y)\), \((x+1,y)\), \((x,y-1)\) and \((x,y+1)\), as illustrated by the rightmost template in Fig. 9. The response of the neighborhood feature is defined as

$$N\left(x,y\right)=\frac{1}{|{C}_{4}|}\sum_{p\in {C}_{4}}I\left(p\right)-\frac{1}{|{M}_{4}|}\sum_{q\in {M}_{4}}I\left(q\right)$$
(21)

where \(I(\cdot )\) denotes the gray intensity. In practice, this keeps the efficiency of Haar-like evaluation while directly enhancing the separability between insulator-cap pixels and their immediate background.
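Eq. (21) can be evaluated for every interior pixel at once with array slicing. The following is a minimal NumPy sketch under the definitions above; the function name `neighborhood_response` is hypothetical, and border pixels are simply skipped, which is an assumption about boundary handling.

```python
import numpy as np

def neighborhood_response(image):
    """Return N(x, y) of Eq. (21) for every interior pixel of a gray image.

    The output has shape (H - 2, W - 2): border pixels, whose 3 x 3 window
    would leave the image, are skipped (an assumption of this sketch).
    """
    I = np.asarray(image, dtype=float)
    # Mean of the four corner pixels C4 of each 3 x 3 window.
    corners = (I[:-2, :-2] + I[:-2, 2:] + I[2:, :-2] + I[2:, 2:]) / 4.0
    # Mean of the four axis-adjacent pixels M4 (up, down, left, right).
    axis_adjacent = (I[:-2, 1:-1] + I[2:, 1:-1] + I[1:-1, :-2] + I[1:-1, 2:]) / 4.0
    return corners - axis_adjacent

# A plus-shaped bright pattern: axis-adjacent pixels are 1, corners are 0.
plus_patch = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]])
plus_response = neighborhood_response(plus_patch)

# A uniform region gives a zero response everywhere.
flat_response = neighborhood_response(np.full((5, 5), 7.0))
```

On the plus-shaped patch the corner mean is 0 and the axis-adjacent mean is 1, so the response is -1, while the uniform patch yields 0, matching the center–surround behavior the feature is meant to capture.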

Fig. 9 Haar-like features and the improvement.

Design rationale. The neighborhood feature is designed for the local structure of insulator caps and sheds. Near a cap–background boundary, the four axis-adjacent pixels \({M}_{4}\) usually lie on the cap, while the four corners \({C}_{4}\) fall partly on the surrounding background (or the other way around). This makes the difference between the two means large, so \(\mid N(x,y)\mid\) is large. In a uniform region—inside the cap or in the background—the two means are similar and \(N(x,y)\approx 0\). Thus the feature acts as a simple center–surround contrast that matches the repeated ring-like sheds of insulators34. The sign of \(N(x,y)\) tells whether the cap is brighter or darker than its immediate surround, and the classifier uses both cases through learned thresholds.

Why not diagonal or center feature? The diagonal feature mainly responds to 45° edges and thin wire-like structures that are common in transmission scenes, so its response is unstable in cluttered backgrounds35. A center feature by itself has no surrounding reference and cannot measure the contrast between the cap and the background; its response changes little when the window crosses a shed boundary. In contrast, the neighborhood feature compares the center with nearby pixels while keeping the fast integral-image computation of Haar-like features. It gives a more stable and less rotation-sensitive response on cap–background transitions36. See Fig. 9 and Fig. 11.
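For readers unfamiliar with the integral-image trick mentioned above, the sketch below shows how a two-rectangle edge feature is evaluated with a handful of table lookups. It is a generic illustration of Haar-like evaluation, not the authors' implementation; the function names and the sign convention (left half minus right half) are assumptions.

```python
import numpy as np

def integral_image(image):
    """Summed-area table with a zero first row and column."""
    I = np.asarray(image, dtype=float)
    ii = np.zeros((I.shape[0] + 1, I.shape[1] + 1))
    ii[1:, 1:] = I.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def vertical_edge_feature(ii, top, left, height, width):
    """Two-rectangle edge feature: left-half sum minus right-half sum.

    `width` is assumed to be even; the sign convention is illustrative.
    """
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum

# 4 x 4 patch with a bright left half and a dark right half.
patch = np.array([[1, 1, 0, 0]] * 4)
ii = integral_image(patch)
edge_value = vertical_edge_feature(ii, top=0, left=0, height=4, width=4)
```

The cost of each rectangle sum is independent of the rectangle size, which is why edge and line templates stay cheap at every scale.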

Results of insulator detection

In the training procedure, we set the minimum precision rate of the strong classifier at each stage to 99.5% and the maximum false alarm rate to 10%. As shown in Fig. 10, the per-stage precision rate of our detector converges quickly.
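The effect of per-stage targets on a cascade can be sketched with the usual product rule. The example assumes that the 99.5% figure plays the role of a per-stage detection (hit) rate and that stages behave independently, as in the classic cascade argument; both are assumptions, and the stage count of 5 is purely illustrative since the text does not state the number of stages.

```python
def cascade_bounds(stage_rate, stage_false_alarm, n_stages):
    """Overall rates of a cascade whose stages all meet the same targets.

    Assumes independent stages: a window must pass every stage, so the
    overall rates are products of the per-stage rates.
    """
    overall_rate = stage_rate ** n_stages
    overall_false_alarm = stage_false_alarm ** n_stages
    return overall_rate, overall_false_alarm

# Per-stage targets from the text; n_stages=5 is a hypothetical value.
overall_rate, overall_false_alarm = cascade_bounds(0.995, 0.10, n_stages=5)
```

Under these assumptions, five stages would retain about 97.5% of true insulator windows while passing roughly one background window in 100,000.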

Fig. 10 Precision rate at all levels.

Some detection results of the proposed Log AdaBoost algorithm are shown in Fig. 11. Our method achieves good detection results in different scenarios, such as insulators of different sizes, dense insulators and insulators that blend into the background. Table 7 lists the detection rates of the two AdaBoost variants (Log AdaBoost and AdaBoost.M1) for various numbers of false detections.

Fig. 11 Some detection results of CPLID dataset.

Table 7 Detection Rate of Log AdaBoost for various numbers of false positives.

When the number of false detections is 20, the detection rate of Log AdaBoost is 48.91%. As the threshold b of the strong classifier H is adjusted to admit more false detections, the detection rate of Log AdaBoost increases. At 110 false detections, the detection rate of AdaBoost.M1 is slightly higher; overall, however, the proposed method is clearly better than AdaBoost.M1. The ROC curve of our method is shown in Fig. 12. The AUC of the proposed insulator detector is 0.82, while that of AdaBoost.M1 is only 0.79. More importantly, the model complexity of the cascade Log AdaBoost classifier is very small, with only 21 K parameters.

Fig. 12 ROC curves of Log AdaBoost and AdaBoost.M1 on CPLID dataset.

It is worth noting that Table 7 reports sensitivity at fixed false-positive (FP) counts. For each method, we set the decision threshold to reach the same FP count, which is equivalent to reading the detection curve at given FP levels. At FP = 110, AdaBoost.M1 (99.42%) is higher than Log AdaBoost (98.57%). This is expected because the exponential loss in M1 gives more weight to hard or misclassified samples, which can raise sensitivity when a large FP budget is allowed. Log AdaBoost reduces the impact of difficult samples, so when allowing more false positives, it does not blindly pursue higher sensitivity and performs more conservatively (with slightly lower sensitivity).
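Reading sensitivity at a fixed FP count amounts to choosing, for each method, the loosest threshold that still admits no more than the allowed number of false positives. A minimal sketch follows; the function name `detection_rate_at_fp`, the score convention (higher means more insulator-like) and the toy data are assumptions, not the evaluation code behind Table 7.

```python
import numpy as np

def detection_rate_at_fp(scores, labels, max_fp):
    """Detection rate at the loosest threshold with at most `max_fp` FPs.

    scores : (n,) classifier scores, higher = more insulator-like
    labels : (n,) 1 for insulator, 0 for background
    A sample is predicted positive when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    negatives = np.sort(scores[labels == 0])[::-1]  # descending
    if max_fp >= negatives.size:
        threshold = -np.inf  # every negative may be accepted
    else:
        # Just above the (max_fp + 1)-th highest negative score.
        threshold = np.nextafter(negatives[max_fp], np.inf)
    predicted = scores >= threshold
    fp = int(np.sum(predicted & (labels == 0)))
    tp = int(np.sum(predicted & (labels == 1)))
    n_pos = max(int(np.sum(labels == 1)), 1)
    return tp / n_pos, fp

# Toy scores: 3 insulators and 3 background windows.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 0]

rate_fp0, fp0 = detection_rate_at_fp(scores, labels, max_fp=0)
rate_fp1, fp1 = detection_rate_at_fp(scores, labels, max_fp=1)
```

Applying the same `max_fp` to two detectors puts them at the same false-alarm budget, so their detection rates can be compared directly, which is how the rows of Table 7 should be read.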

In insulator inspection, backgrounds are cluttered and manual review is limited, so lower-FP settings (e.g., 20–80 FPs in our setup) are more practical. In this range, Log AdaBoost shows higher sensitivity and precision than M1 (see Table 7), which helps reduce false alarms in field use. We therefore view the low-FP gains as the more useful result, while the FP = 110 point reflects the expected behavior when the false-alarm budget is relaxed.

For the Log AdaBoost result (AUC = 0.82) in Fig. 12, our analysis is as follows. The ROC curve rises steeply near the origin and then levels off. This means the detector works well when few false positives (FPs) are allowed, but gains become smaller as the FP budget increases. This pattern fits the CPLID data: backgrounds include wires, towers, and vegetation, and insulators are small or partly occluded. Boundaries are clear in clean views but become unclear when insulator parts overlap with wires or bright sky, which limits further increases in the true positive rate at higher FP levels.
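The AUC discussed here can be computed directly from classifier scores as the probability that a randomly chosen positive scores higher than a randomly chosen negative. The sketch below uses this rank-based definition, which equals the area under the ROC curve; the function name `roc_auc` and the toy scores are illustrative and not taken from the experiments.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores.

    Equals the area under the ROC curve; ties count as one half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative pairs
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

# Toy example: one background window outranks two insulators.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 0]
auc = roc_auc(scores, labels)
```

Score overlap between foreground and background lowers the number of correctly ordered pairs, which is the mechanism by which a flat ROC tail caps the AUC even when the low-FP region is strong.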

Feature effects. Our Haar-like edge and line features, together with a 3 × 3 center–surround template N(x, y), work well at boundaries (see Sect. “Haar-like features and the improvement”) but are local and intensity-based. Under strong lighting changes, partial occlusion, or wire-like clutter inside the 3 × 3 window, these features can give similar scores for foreground and background. The resulting score overlap flattens the tail of the ROC, which explains the AUC of 0.82 even though sensitivity is strong in the low-FP range (see Table 7).

As shown in Table 8, Log AdaBoost is further compared with three lightweight non-deep learning models on the CPLID dataset, namely CNN-AdaBoost, SVM-AdaBoost and PSO-SVM. When the number of false detections is 20, the detection rate of Log AdaBoost reaches 48.91%, which is higher than 44.21% of CNN-AdaBoost, 36.29% of SVM-AdaBoost and 38.64% of PSO-SVM. As the allowed number of false positives increases from 40 to 110, the detection rate of all methods gradually improves, but Log AdaBoost always keeps the highest value at each operating point. In the low–false-positive region (20–80 FPs), Log AdaBoost improves the detection rate by about 4–5 percentage points over CNN-AdaBoost and by about 6–13 percentage points over SVM-AdaBoost and PSO-SVM. When more false positives are allowed (80–110), the gaps become smaller, but Log AdaBoost still outperforms the other three models. This comparison indicates that the proposed method has a clear advantage within the category of lightweight non-deep learning models for insulator detection.

Table 8 Detection Rate of different models for various numbers of false positives.

To further assess the robustness of the proposed Polylog loss, we additionally trained two variants of the cascade detector by replacing the Polylog loss with the Huber loss and Tukey’s biweight loss, respectively, while keeping all other hyper-parameters unchanged. The results are shown in Table 9. Under the same number of false positives (FP), the Polylog loss achieves the highest or a similar detection rate. Compared with the other two losses, when the allowable number of false detections is relatively small (20–80 FPs), the Polylog loss detects insulators more effectively against complex backgrounds, owing to its smoothness and the absence of additional parameters. These results are consistent with the theoretical analysis of the three loss functions in Sect. “Polylog loss function”, demonstrating the robustness advantage of the adopted Polylog loss function in insulator detection under complex backgrounds.

Table 9 Detection rate of Log AdaBoost with different robust losses on the CPLID dataset.

Remark on data augmentation. In this work we do not perform a full ablation over different augmentation factors such as 3 ×, 5 × or 10 × on CPLID. Instead, we treat the 7 × schedule described above as a practical design choice. On the one hand, it increases the amount of training data by roughly one order of magnitude, which helps the cascade classifier see enough pose, illumination and noise variations while keeping the total of 6,784 images manageable on a CPU-only platform. On the other hand, all compared methods in Sect. “Insulator Detection” share exactly the same augmented training set, so the relative gains of Log AdaBoost over AdaBoost.M1 and over different robust losses are not affected by the specific augmentation factor. A more systematic study of augmentation intensity and operators for insulator detection is an interesting direction for future work and may further refine the performance of lightweight detectors on CPLID.

Finally, to highlight how easily the model can be deployed for insulator detection, we compared the parameter count of Log AdaBoost with those of recent improved models from the mainstream YOLO series, as shown in Table 10. The results show that Log AdaBoost has a clear advantage in parameter count over the existing lightweight insulator detection models.

Table 10 Comparison of Model Parameter Counts.

Conclusion

This paper presented the basic ideas of the proposed Log AdaBoost. We used gradient descent to optimize a new objective loss, the Polylog function, so that the weights are adjusted along the negative gradient of the function. Compared with Real AdaBoost and Gentle AdaBoost, the proposed weight updating strategy is more moderate, and our method therefore shows better generalization performance than these two variants.

In our future work, we will continue to address remaining limitations in insulator detection. First, for heavily occluded insulators we plan to refine the feature extraction stage. We will extend the current Haar-like edge, line and neighborhood features to multi-scale and larger neighborhood templates, so that the detector can use more contextual pixels around partially visible caps or sheds. We also intend to introduce a simple region weighting mechanism in the cascade, so that image areas with stable neighborhood responses receive higher weights during classification, while areas that are severely occluded or highly cluttered are downweighted. Second, for different lighting conditions we will study illumination normalization and color correction before feature extraction, and we will add targeted data augmentation such as simulated shadows, over-exposure and under-exposure. This is expected to reduce the influence of strong light and backlighting on the handcrafted features and to stabilize the decision thresholds. Third, we will build new experimental sets that contain other insulator types (e.g., composite and ceramic) and evaluate whether the proposed feature templates and Log AdaBoost still maintain good performance, adjusting the templates if necessary to match different geometries and materials. Finally, we will further compare the method in this paper with more lightweight deep learning detectors on the same CPLID dataset, and analyze in detail the trade-off between accuracy, speed and model size.

In addition, we will investigate the real-time behavior of the proposed Log AdaBoost detector on practical embedded hardware used in power inspection. Specifically, we plan to port the cascade to typical ARM-based industrial platforms or low-power GPU devices, measure the achievable frame rate and latency under different image resolutions, and analyze the trade-off between detection accuracy, speed and resource consumption in comparison with other lightweight detectors. Moreover, we will extend the current study from the glass insulators in CPLID to multi-type insulator datasets that include composite and ceramic insulators. In future work, we plan to collect or adopt such datasets, analyze the shape and surface differences among these insulator types, and adjust the Haar-like feature templates and cascade thresholds accordingly. By retraining and evaluating Log AdaBoost on these multi-type datasets, we aim to verify and improve its adaptability across different insulator structures and materials, which is important for the practical promotion of the proposed method in real power inspection systems.