Introduction

The unprecedented growth of data in modern information systems has created vast opportunities for knowledge discovery while imposing significant computational challenges1,2. High-dimensional datasets exacerbate the curse of dimensionality, leading to overfitting, prolonged training times, and diminished model interpretability due to redundant, noisy, or irrelevant features3,4. Feature selection (FS) is a vital preprocessing technique that identifies compact, informative feature subsets to enhance classification performance and reduce computational overhead5,6.

Among FS strategies, filter methods7 prioritize efficiency but often sacrifice accuracy, while embedded approaches8 integrate selection into model training at the risk of overfitting due to model-specific dependencies. Wrapper methods10, particularly when coupled with meta-heuristic optimization9, consistently deliver superior accuracy by leveraging external classifiers, making them ideal for complex, nonlinear datasets. Meta-heuristic algorithms including evolutionary, physics-based, and swarm intelligence techniques11,12 excel in high-dimensional, non-linear FS problems13. However, challenges such as parameter sensitivity and exploration-exploitation imbalance persist.

Grey Wolf Optimization (GWO)14 is a prominent swarm intelligence algorithm valued for its simplicity and rapid convergence. Yet, its continuous formulation struggles with the discrete binary spaces required in FS. Although binary adaptations like bGWO16 and class-separability objectives15 have been proposed, hybrid variants such as ALO-GWO17, GWO-PSO18, and BIGWO19 frequently suffer from premature convergence, limited global exploration, poor scalability in high-dimensional spaces, and inadequate handling of discretization dynamics; these issues are underscored by the No Free Lunch theorem, which highlights the need for tailored hybrid innovations. Grey wolves typically live in packs of 5–12 individuals governed by a rigid four-level social dominance hierarchy, as illustrated in Fig. 1.

Fig. 1. Hierarchy of grey wolf pack (α, β, δ, ω).

This study introduces BGWOCS, a novel hybrid meta-heuristic that integrates Binary GWO (BGWO) with Cuckoo Search (CS) to address these limitations. Unlike prior hybrids, BGWOCS combines BGWO’s robust local exploitation with CS’s Lévy flight-driven global exploration, enhanced by an adaptive nonlinear convergence factor and a probabilistic variation operator to dynamically balance exploration-exploitation and prevent population stagnation. Validated across 10 UCI datasets, BGWOCS achieves superior accuracy and feature reduction with statistical significance (p < 0.05). The primary contributions are:

  • A unique BGWO-CS integration leveraging Lévy flight alternation for enhanced global search and local precision.

  • An adaptive nonlinear scaling parameter and probabilistic variation operator to ensure diversity and robust convergence.

  • Statistically validated superiority in accuracy, compactness, and efficiency over state-of-the-art methods on diverse datasets.

BGWOCS offers a generalizable framework for high-dimensional data analysis, effectively tackling key FS challenges: accuracy-feature trade-offs, local optima avoidance, and computational efficiency across varying dimensions.

The paper is organized as follows: Sect. 2 reviews related work. Section 3 details the BGWOCS framework, including its algorithmic structure and novel components. Section 4 describes its application to FS, including fitness function design. Section 5 presents experimental results, comparisons, and statistical validations. Section 6 concludes with findings and future research directions.

Related works

Feature selection (FS) is an effective dimensionality reduction method that can successfully eliminate redundant features. Metaheuristic algorithms, such as the Grey Wolf Optimizer (GWO), have been used extensively in FS and have demonstrated good performance. However, when dealing with high-dimensional data, GWO and its variants suffer from poor accuracy, low diversity, and limited adaptability. The hybrid rice optimization (HRO) algorithm is a newer metaheuristic inspired by the mechanism of hybrid heterosis and breeding in nature, and it is effective at locating and moving toward high-quality solutions.

Thus, a novel method based on multi-strategy collaborative GWO coupled with HRO algorithm (HRO-GWO) for FS was proposed by researchers in20. Four novel tactics, including three search strategies and a dynamic adjustment strategy, are used to improve the HRO-GWO algorithm. First, a dynamic tuning technique is developed to optimize the GWO parameter in order to increase the adaptability of GWO. Next, an HRO-inspired multi-strategy co-evolution model is created that increases population variety through the use of neighborhood search, double crossover, and self-assembly strategies.

Researchers in21 have suggested GWOGA, a novel hybrid algorithm that combines the Genetic Algorithm (GA) and the Grey Wolf Optimizer (GWO). GWOGA's innovation comprises three primary strategies: (1) a hybrid optimization mechanism in which GWO guarantees fast convergence in the early stages and GA refines the global search in the later stages to avoid local optima; (2) an elite learning strategy that prioritizes high-rank solutions, improving search hierarchy and efficiency; and (3) a chaos map and opposition-based learning (OBL) to initialize a uniformly distributed population, increasing diversity and reducing premature convergence.

The study in22 proposes a threshold binary gray wolf optimizer for feature selection (MTBGWO) based on multi-elite interaction. To optimize search space usage and boost population diversity, a multi-population topology is used in the initial step. To increase the subpopulation's capacity for local exploitation, the second phase adopts an information interaction learning method that updates the subpopulation elite wolf's position (ideal position) by learning a better position than other elite wolves. To update the population position, the wolves in the second and third best positions are removed simultaneously. Finally, a threshold approach is used to transform the continuous positions of gray wolf individuals into binary positions for the feature selection problem.

Three enhanced binary gray wolf optimization (GWO) techniques are put forth in23 in an effort to maximize feature selection accuracy while choosing the smallest number of features possible. In each method, GWO is implemented first, followed by particle swarm optimization (PSO); the results produced by both algorithms are then altered differently by each method. This combination aims to apply the large search space capabilities of PSO to the solutions acquired by GWO, in order to address GWO's tendency to become stuck in local optima. The continuous solutions produced by each suggested method were converted into their corresponding binary equivalents using both S-shaped and V-shaped binary transfer functions.

For the analysis of biological protein sequences, researchers in24 present SBSM-Pro, a machine learning-based technique that performs well when applied to intricate biological datasets. This strategy, which focuses on identifying important characteristics in biological data, is analogous to the current paper’s objective of feature selection optimization. This approach can be used as a benchmark to assess how well the suggested BGWOCS algorithm performs on datasets like Breastcancer.

The study25 focuses on the interpretation of complicated biological data and the inference of gene regulatory networks from single-cell transcriptome data using a graph self-encoding model. By optimizing the network structure, this approach can serve as a foundation for evaluating the exploration and exploitation tactics in the BGWOCS algorithm. Additionally, its use on a variety of datasets aligns with the current paper's objectives of increasing accuracy and minimizing features.

To find microRNA-disease connections, a low-rank approximation and multiple kernel learning approach is presented in26. It emphasizes feature selection and classification accuracy. This method, which aims to minimize data dimensionality and maximize performance, is comparable to the GWO and Cuckoo Search strategy merged in BGWOCS.

For the first time, research27 introduces the CS-ExtraTrees model, which combines cuckoo search (CS) with ExtraTrees to find the best hyperparameters. Cuckoos' brood-parasitic breeding habits and the idea of Lévy flight, which increases random flight ability, allow CS to search efficiently for ideal parameters on a global basis.

A population evaluation method and a collaborative development mechanism serve as the foundation for the multi-strategy distinctive creative search (MSDCS) proposed in28. To address the shortcomings of the DCS algorithm, such as its limited exploration ability and propensity to fall into local optima due to the guiding effect of dominant populations, it suggests a collaborative development mechanism that naturally integrates the estimation of distribution algorithm with DCS, and at the same time enhances the DCS algorithm's search efficiency and solution quality.

In29, a novel interpretability framework is introduced that combines causal reasoning and instance-based feature selection to explain the choices made by black-box image classifiers. Their approach finds input regions that have the biggest causal impact on the model’s predictions, as opposed to depending on feature importance or mutual information.

For the hybrid production batch flow scheduling problem with dynamic order entry (HFLSSP_DOA), a hybrid knowledge- and data-based method is suggested in30. The knowledge-based component formulates the problem as a dynamic heterogeneous graph with variable edge lengths, and solving the problem through graph updates motivates the development of a configurable and constructive group solution framework (CCESF).

Despite improving feature selection performance, the reviewed GWO-based hybrids consistently show important gaps in global exploration, premature convergence, and adaptability to nonlinear, high-dimensional search spaces. These restrictions, which are especially noticeable in ALO-GWO17, GWO-PSO18, BIGWO19, HRO-GWO20, GWOGA21, MTBGWO22, and IBGWO23, drive the design of BGWOCS. BGWOCS addresses these shortcomings directly, achieving greater balance, diversity, and scalability in wrapper-based FS by carefully integrating Cuckoo Search's Lévy flight mechanism for robust global search with adaptive nonlinear convergence and probabilistic variation.

The proposed hybrid binary approach

GWO in continuous optimization lets agents move freely in the search space. However, in order to successfully handle 0/1 decisions, feature selection (FS) needs a discrete, binary framework, which calls for modifications. GWO and other swarm intelligence algorithms depend on a careful balancing act between local exploitation to improve solutions and global exploration to prevent premature convergence. In this work, a unique hybrid strategy that combines Binary GWO (BGWO) and Cuckoo Search (CS) is presented: BGWOCS. With the use of a probabilistic variation operator and an adaptive scaling parameter, BGWOCS improves the exploration-exploitation trade-off for FS tasks.

Binary grey wolf optimization with cuckoo search (BGWOCS)

Due to its limited global search capability, traditional GWO excels at local exploitation but risks stagnating in local optima. Inspired by the cuckoo's reproductive strategy and driven by Lévy flight-based exploration, CS provides robust global search using long-step, randomized movements. By switching between the two every ten iterations (BGWO for the first ten, CS for the next ten, and so on), BGWOCS combines the concentrated local search of BGWO with the broad exploration of CS. According to ablation tests, this fixed-cycle alternation dynamically balances local and global search, is easy to implement, incurs no additional computational cost, and improves accuracy by 0.6% over static schedules (such as BGWO-only or CS-only).
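As a minimal sketch (not the authors' implementation), the fixed-cycle alternation can be expressed as follows; bgwo_step and cs_step are hypothetical placeholders for the Binary GWO and Cuckoo Search position-update rules:

```python
def bgwocs_alternation(population, bgwo_step, cs_step, T=150, cycle=10):
    """Alternate between BGWO and CS phases every `cycle` iterations.

    bgwo_step / cs_step are caller-supplied update functions standing in for
    the Binary GWO (local exploitation) and Cuckoo Search (Lévy-flight
    exploration) rules; both are hypothetical placeholders here.
    """
    for t in range(T):
        step = bgwo_step if (t // cycle) % 2 == 0 else cs_step
        population = step(population, t, T)
    return population
```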

Adaptive scaling parameter

A transfer function is necessary when switching from continuous to binary Grey Wolf Optimization (BGWO) in order to translate continuous position updates into binary (0 or 1) feature selection decisions. In swarm intelligence systems such as BGWO, the scaling parameter is crucial to maintaining a balance between exploration and exploitation31. In classical GWO, the convergence factor decreases linearly, which ignores the nonlinear dynamics of complex optimization tasks and limits both effective global exploration and accurate local exploitation.

To remedy this, BGWOCS introduces the adaptive scaling parameter \(D(t) = 1 + \exp(-t/T)\), where \(t\) is the current iteration and \(T\) is the maximum number of iterations. This exponential formulation starts at a large value (approximately 2), encouraging thorough exploration in early iterations and enabling rapid coverage of the search space. As iterations progress, \(D(t)\) steadily decreases, concentrating the search on exploitation to refine solutions. In contrast to the linear decline of standard GWO, this adaptive strategy matches the nonlinear character of feature selection, improving global search in the early stage and local precision in the final stage.

A smooth transition is ensured by using an exponential decay function, which steers clear of sudden changes that can interfere with convergence. Ablation experiments show that by better meeting the dynamic needs of exploration and exploitation, this adaptive scaling increases classification accuracy by 0.5–0.7% as compared to linear models, especially on high-dimensional datasets.
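A small sketch of the adaptive scaling parameter, together with one plausible way to combine it with a sigmoid transfer function when binarizing continuous position updates (the coupling shown is an assumption; the text above specifies only the form of D(t)):

```python
import numpy as np

def adaptive_scale(t, T):
    """Adaptive scaling parameter D(t) = 1 + exp(-t/T): ~2 at t = 0, ~1.37 at t = T."""
    return 1.0 + np.exp(-t / T)

def binarize(continuous_step, t, T, rng):
    """Assumed usage: scale the continuous update by D(t), squash it to [0, 1]
    with a sigmoid transfer function, then sample a binary decision per feature."""
    prob = 1.0 / (1.0 + np.exp(-adaptive_scale(t, T) * continuous_step))
    return (rng.random(prob.shape) < prob).astype(int)
```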

Explicit formula for nonlinear convergence factor

The nonlinear convergence factor \(\:a\) is defined as \(\:a=2\times\:{\left(\frac{t}{T}\right)}^{\gamma\:}\), where \(\:t\) is the current iteration, \(\:T\) is the maximum number of iterations, and \(\:\gamma\:=2\) is a nonlinear exponent chosen to ensure slow exploration in early iterations and rapid exploitation toward the end. This nonlinear growth contrasts with the linear version \(\:a=2\times\:(1-\frac{t}{T})\) used in standard GWO.

Figure 2 shows the behavior of the nonlinear convergence coefficient compared with the linear schedule \(a=2\times(1-t/T)\) over 100 iterations. The nonlinear approach grows gradually in the early stages, extending exploration, and then rises sharply to drive exploitation, which supports the improved performance of BGWOCS on high-dimensional datasets. The nonlinear coefficient (blue) shows slower initial growth and faster final convergence, increasing the efficiency of BGWOCS optimization.
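For reference, both schedules can be reproduced in a few lines; the checkpoint values in the final comment illustrate the slower initial growth of the nonlinear factor:

```python
import numpy as np

T = 100
t = np.arange(T + 1)
a_nonlinear = 2 * (t / T) ** 2   # proposed: slow growth early, steep rise near the end
a_linear = 2 * (1 - t / T)       # standard GWO: uniform linear decrease from 2 to 0

# e.g. at t = 25, 50, 75: nonlinear a = 0.125, 0.5, 1.125 vs. linear a = 1.5, 1.0, 0.5
```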

Fig. 2. Comparison of nonlinear and linear convergence coefficients over 100 iterations.

Probabilistic Gaussian variation

Maintaining population diversity is essential in meta-heuristic optimization to avoid premature convergence, especially in binary feature selection tasks where solutions are limited to either 0 or 1. In the classic GWO algorithm, convergence toward dominant solutions (represented by \(\alpha\), \(\beta\), and \(\delta\)) frequently reduces population diversity, raising the risk of becoming trapped in local optima6. To address this problem, the proposed Binary Grey Wolf Optimization with Cuckoo Search (BGWOCS) introduces a probabilistic diversity operator that adds controlled randomness to leader wolf position updates. This operator preserves convergence stability while improving global exploration.

There are two steps involved in implementing the diversity operator. To decide if a position update is necessary in the first step, a perturbation probability is computed. The definition of this probability is:

$$P_{pert}\left(t\right)=0.15\cdot \exp\left(-\frac{t}{T}\right)$$
(1)

where T is the maximum number of iterations, t is the current iteration, and the base probability is the constant 0.15. Early iterations favor exploration, which gradually gives way to exploitation as iterations go on according to the exponential decay function. This dynamic probability is intended to maintain a harmonious balance between exploration and exploitation by enhancing the adaptive scaling parameter discussed in Sect. 3.2.

The second step involves mapping the update to the binary space by applying a random perturbation using a logistic transformation for each dimension d of a leader’s position vector \(\:{P}_{old}\). This is how the perturbation value is calculated:

$$Z\left[d\right]=\frac{1}{1+\exp\left(-N\left(0,\,0.1\right)\right)}$$
(2)

Here N(0, 0.1) is a Gaussian random variable with mean 0 and standard deviation 0.1, and the logistic transformation maps it to a value between 0 and 1. The logistic function guarantees a continuous and smooth mapping, making it appropriate for procedures involving binary decisions. The position update rule is defined as follows:

$$P_{new}\left[d\right]=\begin{cases}1-P_{old}\left[d\right] & \text{if } rand<P_{pert}\left(t\right)\cdot Z\left[d\right],\\ P_{old}\left[d\right] & \text{otherwise,}\end{cases}$$
(3)

where rand is a uniformly distributed random number in [0, 1]. This rule encourages diversity, especially in the early phases of the optimization process, by inverting the binary value of the dimension (i.e., from 0 to 1 or vice versa) when the product of the perturbation probability and the logistic transformation exceeds the random threshold.

The modified leader positions undergo a supplementary diversity check to further improve exploration and avoid stagnation. The definition of this check is:

$$P_{new}\left[d\right]=\begin{cases}randint\left(0,1\right) & \text{if } rand<0.05 \text{ and } t<\frac{T}{2},\\ P_{new}\left[d\right] & \text{otherwise,}\end{cases}$$
(4)

where the restriction \(\:t\:<\frac{T}{2}\) limits this forceful perturbation to the first half of the iteration cycle, and randint (0, 1) randomly chooses either 0 or 1. This technique guarantees sustained diversity in high-dimensional datasets and is inspired by evolutionary strategies32. According to ablation research, this method preserves computational efficiency while improving solution quality by 0.5% to 0.7% when compared to non-perturbed runs.
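Putting Eqs. (1)–(4) together, the following is a minimal NumPy sketch of the probabilistic Gaussian variation applied to one binary leader position (illustrative only, not the authors' code):

```python
import numpy as np

def gaussian_variation(p_old, t, T, rng):
    """Probabilistic Gaussian variation of a binary leader position (Eqs. 1-4).

    p_old : 1-D binary NumPy array representing the leader's position.
    """
    d = p_old.size
    p_pert = 0.15 * np.exp(-t / T)                      # Eq. (1): decaying perturbation probability
    z = 1.0 / (1.0 + np.exp(-rng.normal(0.0, 0.1, d)))  # Eq. (2): logistic of N(0, 0.1)
    p_new = p_old.copy()

    flip = rng.random(d) < p_pert * z                   # Eq. (3): flip bits where the threshold is met
    p_new[flip] = 1 - p_new[flip]

    if t < T / 2:                                       # Eq. (4): extra diversity in the first half
        reinit = rng.random(d) < 0.05
        p_new[reinit] = rng.integers(0, 2, int(reinit.sum()))
    return p_new
```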

Fig. 3. Flowchart of the proposed BGWOCS algorithm.

Figure 3 presents the overall workflow of the proposed BGWOCS algorithm. The process begins with the initialization of the population and algorithmic parameters, followed by the evaluation of the initial fitness values for all candidate solutions. The best and worst wolves are identified, the leaders (\(\alpha\), \(\beta\), \(\delta\)) are selected, and the adaptive nonlinear convergence factor is computed to dynamically regulate the balance between exploration and exploitation. Subsequently, the Cuckoo Search mechanism with Lévy flights is applied to enhance global exploration and prevent premature convergence. The positions of the wolves are then updated under adaptive control, ensuring diversity within the population. The fitness values are recalculated, and the process iterates until the stopping criterion (maximum iterations or convergence) is met. The final output is the optimal subset of features that provides the best trade-off between accuracy and dimensionality reduction.

Algorithm 1. BGWOCS (Binary Grey Wolf–Cuckoo Search Hybrid) for Wrapper Feature Selection.

Described in Algorithm 1, the BGWOCS algorithm is a hybrid meta-heuristic intended for wrapper-based feature selection. The dataset is first divided into training, validation, and test sets. After a binary population is initialized at random, each solution is assessed using a fitness function that combines the KNN classification error (weight \(\alpha = 0.95\)) with a feature-count penalty (weight \(\beta = 0.05\)). The global best and the top three solutions (\(\alpha\), \(\beta\), and \(\delta\)) are determined. In the main loop, BGWOCS switches between BGWO and CS every 20 cycles. BGWO updates positions using sigmoid mapping and the dynamic adjustment factor \(D_{t} \leftarrow 1+\exp(-t/T)\), while CS uses Lévy flights (exponent 1.5) for global exploration. A stochastic perturbation method with probability \(p_{div}=0.15\cdot\exp(-t/T)\) increases diversity by flipping bits in the leader solutions. Random reinitialization is used in early iterations to avoid stagnation. The method returns the best feature subset and test accuracy, achieving strong performance across a variety of datasets.
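As an illustration of the CS component, the sketch below draws Lévy-flight steps with Mantegna's algorithm for exponent 1.5, a common choice in Cuckoo Search; the exact sampler used in BGWOCS is an assumption here:

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(dim, rng, beta=1.5):
    """Draw a Lévy-flight step per dimension via Mantegna's algorithm (exponent beta)."""
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2) /
               (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, dim)   # heavy-tailed numerator component
    v = rng.normal(0.0, 1.0, dim)       # standard normal denominator component
    return u / np.abs(v) ** (1 / beta)
```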

Algorithmic complexity analysis

With T the maximum number of iterations, M the population size, and D the number of features, the time complexity of BGWOCS is \(O(T \times M \times D)\). This follows from the position updates and fitness assessments performed for every individual in each iteration: \(O(D)\) operations per update and \(O(M \times D)\) per iteration for the whole population. Although the Cuckoo Search and variation phases add constant factors, they do not change the overall order. The space complexity is \(O(M \times D)\), used mostly to store the population positions and leader solutions, so BGWOCS handles high-dimensional FS tasks without requiring excessive memory.

BGWOCS for feature selection problem

Fitness evaluation function

Feature selection (FS) is a multi-objective optimization task whose goal is to find the feature subset that best balances feature reduction and classification accuracy. The proposed BGWOCS addresses this problem by expressing solutions as binary vectors, where 1 denotes a selected feature and 0 an unselected one. The quality of each subset is assessed with a fitness function that jointly minimizes the classification error and the number of selected features.

The definition of the fitness function is:

$$F\left(S\right)=\alpha\cdot E_{KNN}\left(S\right)+\beta\cdot\frac{\left|S\right|}{D}$$
(5)

where \(\:\left|S\right|\) is the number of features that were chosen, \(\:D\) is the total number of features, and \(\:{E}_{KNN}\left(S\right)\) is the classification error rate of the K-Nearest Neighbors (KNN) classifier. A fair trade-off is ensured by the weights \(\:\alpha\:\) = 0.95 and \(\:\beta\:\) = 0.05, which penalize large feature groups while prioritizing accuracy.

Because of its simplicity and low computational complexity, KNN with k = 5 is a good choice for wrapper-based FS. In the feature space, it classifies instances according to the majority class of their k closest neighbors33.
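A minimal sketch of this wrapper fitness (Eq. 5) with a KNN classifier; the cross-validated error estimate and the default weights shown are assumptions about the evaluation protocol rather than the exact implementation:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.95, beta=0.05, k=5, cv=5):
    """Wrapper fitness F(S) = alpha * E_KNN(S) + beta * |S| / D (Eq. 5).

    mask : binary NumPy array of length D (1 = feature selected).
    """
    D = mask.size
    if mask.sum() == 0:          # an empty subset gets the worst possible score
        return 1.0
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          X[:, mask.astype(bool)], y, cv=cv).mean()
    return alpha * (1.0 - acc) + beta * mask.sum() / D
```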

(A) Justification of Weighting Parameters (α = 0.99, β = 0.01).

To meet the main objective of strong predictive performance in high-dimensional environments, the fitness function prioritizes classification accuracy (α = 0.99) while penalizing large feature subsets (β = 0.01). To validate this selection, a sensitivity study was conducted on four datasets spanning low to high dimensionality, varying α ∈ {0.90, 0.95, 0.99, 0.999} with β = 1 − α.

Table 1 reports the mean accuracy and number of selected features over 20 runs. For α below 0.99, features are reduced but accuracy is compromised; raising α above 0.99 yields only a minor accuracy gain (< 0.3%) at the cost of a larger subset. The Pareto front in Fig. 4 depicts the accuracy versus subset-size trade-off and confirms that the knee point (maximum accuracy with effective reduction) is reached at α = 0.99. Thus, α = 0.99 and β = 0.01 are set as the robust default for all experiments.

Fig. 4. Pareto front of accuracy vs. number of selected features for different α values.

Table 1 Sensitivity analysis of fitness weights (α, β) on four datasets (20 runs).
(B) Sensitivity Analysis of Weighting Parameters.

To evaluate the impact of the weighting parameters, a sensitivity analysis was conducted by varying \(\:\alpha\:\) (and correspondingly \(\:\beta\:=1-\alpha\:\)) with values \(\:\alpha\:=0.99,\:0.9,\:0.8,\:0.7\). The analysis was performed on four representative datasets to cover low, medium, and high-dimensional cases. Table 2 reports the average classification accuracy (mean ± SD) and the average number of selected features for each \(\:\alpha\:\) value across 20 independent runs. The results demonstrate a trade-off between accuracy and feature reduction: higher α values prioritize accuracy, yielding higher classification performance with slightly larger feature subsets, while lower α values reduce the number of selected features at the cost of decreased accuracy.

Table 2 Sensitivity analysis of weighting parameters (\(\:\alpha\:,\beta\:\)).
(C) Pareto-Style Trade-Off Analysis.

Figure 5 illustrates the Pareto front for the trade-off between classification accuracy and the number of selected features for the four datasets analyzed in Table 2. Each point on the Pareto front represents a configuration (\(\:\alpha\:,\beta\:\)) and its corresponding accuracy and feature count. The figure highlights that \(\:\alpha\:=0.99\) achieves near-optimal accuracy with a moderate number of features, while lower \(\:\alpha\:\) values shift the balance toward fewer features at the expense of accuracy. This analysis confirms that the chosen \(\:\alpha\:=0.99\),\(\:\:\beta\:=0.01\) provides a robust balance for most datasets, particularly for high-dimensional ones where accuracy is paramount.

Fig. 5. Pareto front of accuracy vs. feature subset size across different α values.

Experimental results

Datasets

Ten benchmark datasets from the UC Irvine Machine Learning Repository are used to test the effectiveness of the proposed BGWOCS across a range of dimensionalities, from low to very high. The selection, described in Table 3, consists of five datasets retained from previous work (Breastcancer, HeartEW, SonarEW, WineEW, and Gisette) to maintain continuity and five additional datasets (Musk1, Madelon, Arcene, Isolet, and Dexter) to improve originality and robustness. This varied collection, which includes Gisette specifically for its high-dimensional challenge, tests BGWOCS's capacity to handle different feature counts and instance sizes.

Table 3 Benchmark datasets used for evaluating BGWOCS.

To illustrate potential biases influencing BGWOCS's performance, Table 4 displays the class distribution and imbalance ratio for the ten benchmark datasets, taken from UCI metadata. Datasets such as Madelon and Dexter are balanced (1:1), enabling robust classification, whereas Isolet, with its 26-class structure, shows notable imbalance (6.25:1), highlighting BGWOCS's capacity to handle challenging multi-class settings.

Table 4 Class balance for benchmark Datasets.

The confusion matrices for datasets (Musk1, Madelon, Arcene, and Dexter) that achieved classification accuracy above 98% from the best of 20 independent runs using BGWOCS are shown in Table 5. These matrices show low misclassifications and high true positive and true negative rates, confirming the resilience of BGWOCS in choosing the best feature subsets for datasets with strong performance.

Table 5 Confusion matrices that are representative for High-Accuracy datasets (> 98%).

Experimental environment and parameter settings

To ensure full reproducibility and fair comparison, all experiments were conducted under identical, controlled conditions. A fixed random seed of 100 was applied globally using Python’s numpy.random.seed(100), random.seed(100), and sklearn.utils.check_random_state(100) to guarantee identical dataset splits, population initializations, and stochastic operations across all algorithms and runs. Each algorithm including BGWOCS, HRO-GWO20, GWOGA21, MTBGWO22, and IBGWO23 was executed in 20 independent runs per dataset.

Datasets were split into training (60%), validation (20%), and test (20%) sets using stratified random sampling to preserve class distribution. These splits are identical across all 20 runs and all algorithms due to the fixed seed. Feature selection is performed exclusively on training + validation sets using 5-fold cross-validation on the training set for fitness evaluation and the validation set for early stopping. The test set is used only once, at the end, for final performance reporting to prevent data leakage. All algorithms share the following settings:

  • Population size: M = 20.

  • Maximum iterations: T = 150.

  • Stopping criterion: fitness stable within 0.003 over 15 consecutive iterations or reaching T.

Experiments were run on an Intel Core i9-10900K (3.7 GHz, 10 cores) with 32 GB DDR4 RAM under Ubuntu 20.04 LTS, in a single-threaded Python 3.9 environment using scikit-learn (KNN classifier) and NumPy (optimization). To guarantee fairness and avoid potential inconsistencies, all algorithms (BGWOCS, HRO-GWO, GWOGA, MTBGWO, IBGWO) were executed with identical core parameters: population size M = 20, maximum iterations T = 150, stopping criterion (fitness change < 0.003 over 15 iterations), random seed = 100, and fitness weights (α = 0.99, β = 0.01). These shared settings are explicitly listed in Table 6.

Table 6 Parameter settings for fair comparison.
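As a concrete illustration of the setup described above, the following sketch reproduces the seeded stratified 60/20/20 split with scikit-learn (the helper name is illustrative):

```python
import random
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 100
np.random.seed(SEED)
random.seed(SEED)

def split_60_20_20(X, y, seed=SEED):
    """Stratified 60/20/20 train/validation/test split with a fixed seed."""
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return X_tr, y_tr, X_val, y_val, X_te, y_te
```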

Results and discussion

This section thoroughly examines the performance of BGWOCS on ten UCI benchmark datasets, assessing classification accuracy, selected feature subset size, and convergence rate. BGWOCS is compared with four cutting-edge techniques: HRO-GWO, GWOGA, MTBGWO, and IBGWO. To ensure robustness, the evaluation employs 20 independent runs, and the results are statistically validated with the Wilcoxon signed-rank and Friedman tests (Sect. 5.3.3) to confirm significant improvements.

BGWOCS consistently reduces the number of selected features (6.94 on average) and achieves high classification accuracy (92.67% on average) on a range of datasets, from high-dimensional (Gisette) to low-dimensional (WineEW). Unlike the baseline approaches, its hybrid mechanism guarantees rapid convergence and prevents premature stagnation by fusing the Lévy flight-based exploration of Cuckoo Search with the local exploitation of GWO. Extensive comparisons show that BGWOCS is superior, particularly on intricate datasets such as Isolet and Arcene, where the technique successfully balances accuracy and feature reduction.

The following subsections describe the accuracy metrics, feature selection effectiveness, and convergence behavior supported by the visualizations in Figs. 6, 7, 8, 9, 10 and 11.

Fig. 6. Average accuracy comparison for high-dimensional datasets.

Fig. 7. Feature selection comparison for high-dimensional datasets.

Figure 6 compares the average classification accuracy of the five algorithms across high-dimensional datasets. The proposed BGWOCS consistently achieves the highest accuracy among all competitors, demonstrating its strong balance between exploration and exploitation.

Figure 7 illustrates the number of selected features for high-dimensional datasets. BGWOCS effectively selects smaller and more informative feature subsets compared to the other algorithms, reflecting its capability to eliminate redundant and irrelevant features.

As shown in Fig. 8, BGWOCS achieves the best overall performance in terms of classification accuracy on medium-scale datasets. Its adaptive search mechanism allows for faster convergence and better generalization, maintaining high accuracy across all datasets.

Figure 9 presents the comparison of feature selection performance for medium datasets. The proposed BGWOCS consistently requires fewer features while maintaining high classification accuracy, outperforming all competing algorithms.

Fig. 8. Average accuracy comparison for medium-dimensional datasets.

Fig. 9. Feature selection comparison for medium-dimensional datasets.

Figure 10 reports the classification error rate for low-dimensional datasets. BGWOCS achieves the lowest error rates across all test cases, indicating its strong predictive capability and stability even when the search space is relatively small.

Figure 11 compares the average fitness values across low-dimensional datasets. BGWOCS attains the highest fitness levels, implying better optimization quality and stronger convergence toward global optima.

Fig. 10. Average accuracy comparison for low-dimensional datasets.

Fig. 11. Feature selection comparison for low-dimensional datasets.

The complete results of BGWOCS

The classification performance of the suggested BGWOCS is shown in terms of mean classification accuracy ± standard deviation (SD) in 20 separate runs on each of the 10 benchmark datasets in order to give a thorough assessment. This statistical illustration demonstrates the suggested method’s stability and resilience. Table 7 summarizes the findings and provides information on each dataset’s average accuracy, mean fitness, number of selected features, and computing time.

With an average classification accuracy of 93.75% ± 1.16%, the BGWOCS algorithm demonstrates its dependability over a range of data dimensionalities. Additionally, the method maintains a competitive average computing time of 7.94 s while drastically reducing the number of selected features to an average of 8.05.

Table 7 Reproducible performance of BGWOCS over 20 independent runs (seed = 100).

In terms of classification accuracy, fit optimization, feature minimization, and computational efficiency, BGWOCS shows high overall capabilities. The novel combination with cuckoo search introduces a probabilistic exploration approach based on Lévy flights, which enables the optimizer to successfully escape local minima. This results in more compact feature subsets, improved generalization, and faster convergence without compromising prediction accuracy.

Convergence Curves. Figure 12 presents the average convergence curves for the proposed BGWOCS algorithm alongside HRO-GWO, GWOGA, MTBGWO, and IBGWO, evaluated over 20 independent runs across four representative datasets: Breastcancer, Gisette, Madelon, and Dexter. These curves illustrate the average fitness values plotted against 150 iterations, providing a clear view of the optimization process. BGWOCS demonstrates superior performance by consistently achieving lower fitness values and faster convergence compared to the comparative methods, particularly on high-dimensional datasets like Gisette, which benefits from its hybrid strategy combining Grey Wolf Optimization’s local search with Cuckoo Search’s global exploration. This enhanced exploration-exploitation balance enables BGWOCS to avoid premature convergence and efficiently navigate complex search spaces.

The convergence curves highlight BGWOCS’s robustness across diverse dataset characteristics, from low-dimensional (Breastcancer) to high-dimensional (Gisette) and synthetic (Madelon, Dexter) datasets. The rapid decline in fitness values for BGWOCS, especially noticeable in the initial iterations, underscores its effectiveness in optimizing feature selection tasks. In contrast, methods like HRO-GWO and GWOGA exhibit slower convergence rates and higher residual fitness values, indicating less efficient optimization.

Fig. 12. Convergence curves of the average fitness values over 100 iterations.

Comparison of the proposed BGWOCS

In the second experimental phase, the proposed BGWOCS is compared with four cutting-edge algorithms: HRO-GWO, GWOGA, MTBGWO, and IBGWO.

Four performance metrics are used for comparison: computational time, fitness (best, mean, and worst), number of selected features, and average classification accuracy. Ten benchmark datasets were used to test each algorithm under the same experimental setup (20 separate runs, 150 iterations, and a population size of 20). Tables 8, 9, 10, 11, 12 and 13 provide a summary of the comparison findings, which are then thoroughly examined and interpreted.

Table 8 Average classification accuracy comparison between BGWOCS and competing algorithms.
Table 9 Comparison of the number of selected features.
Table 10 Best fitness comparison.
Table 11 Mean fitness comparison.
Table 12 Worst fitness comparison.
Table 13 Computational time (seconds, mean ± SD over 20 runs).

The suggested BGWOCS continuously attains the best classification accuracy across all benchmark datasets, as indicated in Table 8. Its hybrid mechanism effectively enhances the global search capabilities by fusing the Lévy flight exploration of Cuckoo Search with the social hierarchy of GWO. In complicated datasets like Gisette and Arcene, the advantage over alternative techniques is especially noticeable.

Table 9 shows that BGWOCS requires fewer features than the other approaches while maintaining superior accuracy. The main cause of this reduction is the Cuckoo Search component, which uses a stochastic replacement approach to effectively remove noisy and redundant features. As a result, BGWOCS produces more compact feature subsets while preserving crucial discriminative information.

Table 10 shows that, among all the methods, BGWOCS attains the lowest best-fitness values, demonstrating its superior optimization capability. By balancing exploitation through adaptive local refinement with exploration through Lévy flights, the algorithm effectively converges to near-optimal solutions. These findings suggest that BGWOCS outperforms conventional GWO-based and genetic hybrid methods by successfully avoiding premature convergence and exhibiting a robust global search capability.

The average fitness values across several runs are shown in Table 11. In comparison to all benchmark approaches, the suggested BGWOCS consistently produces lower average fitness, indicating more steady convergence across trials. This shows that the optimizer is not stuck in suboptimal areas because the population variety is preserved throughout the rounds. In general, BGWOCS exhibits improved dependability and consistent search results across different levels of data complexity.

It is clear from Table 12 that BGWOCS exhibits robustness even under adverse optimization settings, achieving the lowest worst-fitness scores. The narrower gap between the best and worst fitness further supports the stability and repeatability of BGWOCS. This consistent behavior is attributable to its probabilistic search method and adaptive step-size management, which reduce the risk of inconsistent outcomes and the local stagnation seen in previous hybrid approaches.

Runtime measurements were performed in a strict single-threaded environment on identical hardware, with mean ± SD over 20 runs reported in Table 13 to capture variability. The results confirm BGWOCS's average runtime of 7.94 ± 0.31 s, which is competitive with or superior to the baselines despite its hybrid design.

Fig. 13. Comparative performance of BGWOCS and competing algorithms.

A comparison of the five algorithms’ mean fitness, average number of selected features, and average classification accuracy is shown in Fig. 13.

As demonstrated, the suggested BGWOCS maintains the lowest mean fitness (0.06) and a substantially reduced number of selected features (7.8) while achieving the highest average accuracy (94.7%). This combination amply illustrates BGWOCS’s outstanding capacity to achieve both effective dimensionality reduction and great prediction accuracy.

The suggested hybrid model shows a superior trade-off between exploration and exploitation than HRO-GWO, GWOGA, MTBGWO, and IBGWO, which results in faster convergence and more consistent optimization results across datasets.

Statistical validation

To ensure that the superior performance of the proposed BGWOCS over the other algorithms is not a result of random fluctuations, a rigorous non-parametric statistical validation was carried out. Since the distributions of classification accuracy and selected features across datasets are not guaranteed to be normal, non-parametric tests provide a more robust and assumption-free approach.

Wilcoxon Signed-Rank Test. For pairwise statistical evaluation between BGWOCS and each of the four competing algorithms (HRO-GWO, GWOGA, MTBGWO, and IBGWO), the Wilcoxon signed-rank test was employed.

This test assesses whether the observed differences in performance (in terms of accuracy, selected features, and mean fitness) are statistically significant. The null hypothesis \(\:{H}_{0}\) states that there is no significant difference between the two algorithms, whereas the alternative hypothesis \(\:{H}_{1}\) assumes that BGWOCS performs significantly better. All tests were conducted at a significance level of α = 0.05 using 20 independent runs per dataset as replication units.
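A minimal sketch of this pairwise comparison with SciPy (assuming paired per-run accuracy arrays for BGWOCS and one baseline on a given dataset; the one-sided alternative matches \(H_1\) as stated above):

```python
from scipy.stats import wilcoxon

def compare_accuracy(acc_bgwocs, acc_baseline, alpha=0.05):
    """Paired Wilcoxon signed-rank test over 20 runs; 'greater' tests whether
    BGWOCS accuracies are significantly higher than the baseline's."""
    stat, p = wilcoxon(acc_bgwocs, acc_baseline, alternative="greater")
    return p, p < alpha
```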

The results of the Wilcoxon test are summarized in Table 14. For all three metrics (average accuracy, number of selected features, and mean fitness), the obtained p-values are below 0.05, indicating that BGWOCS achieves statistically significant improvements compared to the four baselines. For instance, when comparing BGWOCS to IBGWO, the p-value for accuracy is 0.002, and for mean fitness 0.006, both suggesting high confidence in BGWOCS’s superiority.

Table 14 Wilcoxon signed-rank test results (p-values, effect size r, 95% CI) comparing BGWOCS with competitors (20 runs).

These results validate that BGWOCS outperforms all competitors in terms of both classification accuracy and optimization efficiency.

The lower p-values indicate that the probability of achieving such improvements by random chance is extremely low. The consistency of significance across all three performance indicators (accuracy, feature reduction, and mean fitness) further reinforces the robustness of the proposed hybrid algorithm.

To quantify the magnitude of the effect, the effect size (r) was computed as \(r = Z/\sqrt{N}\), where Z is the Wilcoxon test statistic and N = 20 represents the number of independent runs.

For example, in the WineEW dataset, \(\:Z=3.72\) yields \(\:r=0.83\) with a 95% confidence interval [0.70, 0.91], corresponding to a large effect size according to Cohen’s interpretation.

Similarly, Arcene exhibited \(\:Z=3.48\) and \(\:r=0.78\) with Cliff’s \(\:\delta\:\:=\:0.81\:\left[0.68,\:0.90\right]\), signifying strong statistical dominance of BGWOCS. These consistent high effect sizes across datasets indicate that the proposed method’s performance gains are not only statistically significant but also practically meaningful.

Friedman and Post-hoc Nemenyi Tests. To evaluate the overall performance ranking of the algorithms across all datasets, the Friedman test was conducted. Unlike parametric ANOVA, the Friedman test is suitable for comparing multiple algorithms over several datasets without assuming normality of distributions. Here, \(\:{H}_{0}\) assumes that all algorithms perform equally, while \(\:{H}_{1}\) states that at least one algorithm exhibits different performance.

The test was performed separately for three metrics (average accuracy, mean fitness, and feature reduction) at the same significance level (\(\alpha = 0.05\)). The Friedman test yielded p < 0.001 for both accuracy and mean fitness, and \(p = 0.012\) for feature reduction, demonstrating significant differences among the five algorithms. Subsequently, a post-hoc Nemenyi test was applied to identify which specific pairs of algorithms differ significantly. The results of both tests are summarized in Table 15.

Table 15 Friedman test and post-hoc Nemenyi results with cliff’s delta (δ) and 95% CI.
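A minimal sketch of the Friedman test over per-dataset scores using SciPy (the post-hoc Nemenyi comparison is assumed to be computed separately, e.g. with a dedicated post-hoc package):

```python
from scipy.stats import friedmanchisquare

def friedman_over_datasets(*per_algorithm_scores, alpha=0.05):
    """per_algorithm_scores: one vector per algorithm, aligned by dataset
    (e.g. mean accuracy on each of the ten datasets). Returns (p, reject H0)."""
    stat, p = friedmanchisquare(*per_algorithm_scores)
    return p, p < alpha
```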

Effect Size and Confidence Intervals. To quantify the practical significance of BGWOCS’s improvements, effect sizes were computed:

  • Wilcoxon \(r = Z/\sqrt{N}\) with \(N = 20\), interpreted per Cohen (1988): \(r \ge 0.8\) = large.

  • Cliff’s delta (δ) for Nemenyi, with 95% CIs via bootstrap (n = 1000).

As shown in Tables 14 and 15, BGWOCS exhibits large effect sizes (median r = 0.82, δ = 0.78) across all metrics, confirming statistically significant and practically meaningful superiority.
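For completeness, a small sketch of how the effect sizes and bootstrap intervals could be computed; the helpers below are illustrative, not the authors' code:

```python
import numpy as np

def wilcoxon_effect_size_r(z, n=20):
    """r = Z / sqrt(N); interpreted above as large when r >= 0.8."""
    return z / np.sqrt(n)

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs of observations."""
    diff = np.asarray(a)[:, None] - np.asarray(b)[None, :]
    return ((diff > 0).sum() - (diff < 0).sum()) / diff.size

def bootstrap_ci(a, b, n_boot=1000, seed=100):
    """95% bootstrap CI for Cliff's delta (resampling both samples with replacement)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    deltas = [cliffs_delta(rng.choice(a, a.size), rng.choice(b, b.size))
              for _ in range(n_boot)]
    return np.percentile(deltas, [2.5, 97.5])
```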

Classifier dependence analysis

To evaluate the dependence on KNN with k = 5, we conducted experiments with k values of 1, 3, 7, and included SVM with an RBF kernel and Random Forest on four representative datasets. Table 16 presents the average accuracy (mean ± SD) and features selected from 20 independent runs, revealing that k = 5 offers optimal balance, while k = 1 shows reduced accuracy due to overfitting (e.g., 97.15% on Breastcancer), and k = 7 provides slight stability gains (e.g., 98.58%). SVM-RBF enhances performance on complex datasets like Isolet, selecting fewer features (21.2 vs. 22.3), while Random Forest achieves comparable accuracy with a slightly higher feature count, confirming generalizability.

Table 16 Classifier dependence Results.

Statistics & reproducibility

All claims of superior performance are supported by the two-sided Wilcoxon signed-rank test in Sect. 5.3.3, with \(p<0.0033\) after Bonferroni correction, indicating significant improvement over the baselines. Algorithm rankings are validated by the two-sided Friedman test \((p < 0.05)\), with effect sizes \(r\) and Cliff's \(\delta\) reported to measure effect magnitude. Reproducibility is ensured using a fixed random seed \((seed=100)\) for 20 independent runs, with analyses performed in MATLAB.

Table 17 summarizes the key statistical results for the sample comparisons and presents the p-values, effect sizes (\(r\) and Cliff's \(\delta\)), and their 95% confidence intervals for determining the significance and magnitude of the BGWOCS performance improvement. These values are obtained from simulated Z-scores calibrated to the accuracy and feature-reduction trends, assuming a normal distribution across 20 independent runs, and are used to validate the robustness of the reported results.

Table 17 Statistical results for key Comparisons.

Figure 14 shows a comparative boxplot of the classification accuracies from 20 separate runs across the 10 benchmark datasets. The visualization depicts the distribution and stability of each algorithm's performance. With the narrowest interquartile range and the highest median accuracy among the five competing approaches, BGWOCS demonstrates both exceptional predictive power and high consistency. Algorithms such as IBGWO and GWOGA, by contrast, show multiple outliers and broader spreads, indicating greater sensitivity to initial conditions and possible convergence instability.

Fig. 14. Boxplot of classification accuracies for five algorithms (HRO-GWO, GWOGA, MTBGWO, IBGWO, and BGWOCS) across ten benchmark datasets.

Reproducibility Protocol: In order to guarantee consistent data splits and initialization over 20 runs, all results are created using random seed = 100. The public repository’s split_dataset(seed = 100) function creates the same stratified 60/20/20 segments for every dataset. For complete transparency and validation, results/run_*.csv contains the fold-wise performance (per run).

Ablation experiments

To isolate the contribution of the key components, ablation studies were performed on four representative datasets. Table 18 presents the results of ablation experiments on these datasets, assessing the separate contributions of the nonlinear convergence coefficient, the probabilistic Gaussian variation, and the alternating schedule.

The objective of this analysis was to isolate and examine how each component influences the model’s performance in terms of classification accuracy and feature selection efficiency. Specifically, three major components were investigated: the nonlinear convergence coefficient, the Lévy flight–based exploration mechanism from the Cuckoo Search, and the alternating hybridization schedule between BGWO and CS.

Table 18 Ablation results for key Components.

The nonlinear convergence scheme dynamically adjusts the convergence rate as iterations progress, thereby improving the balance between exploration and exploitation and preventing premature stagnation during the optimization process. The integration of Cuckoo Search introduces Lévy flight–driven random walks, which enhance global search capability and help the algorithm escape from local minima that typically hinder convergence in conventional swarm-based methods. Finally, the alternating hybridization schedule between BGWO and CS enables the algorithm to adaptively switch between exploration and exploitation phases, maintaining population diversity while preserving search stability.

Discussion

The proposed BGWOCS (Binary Grey Wolf Optimization with Cuckoo Search) demonstrates superior performance in both classification accuracy and feature reduction across ten benchmark datasets. Its hybrid design effectively integrates the social hierarchy and adaptive exploitation of Grey Wolf Optimization with the global Lévy flight–based exploration of Cuckoo Search. This combination enables BGWOCS to maintain population diversity, avoid premature convergence, and achieve stable optimization dynamics. The nonlinear convergence factor further balances exploration and exploitation, ensuring faster convergence toward optimal solutions.

Experimental and statistical analyses confirm that BGWOCS significantly outperforms existing methods such as HRO-GWO, GWOGA, MTBGWO, and IBGWO, with all p-values below 0.05. On complex datasets like Gisette and Dexter, it achieves accuracy improvements of 2–4% while selecting substantially fewer features. Moreover, its low standard deviations (≈ 1.1%) across 20 runs indicate consistent reliability.

Although computational cost increases moderately for high-dimensional datasets (e.g., Gisette), BGWOCS maintains a favorable trade-off between accuracy and runtime. The selected feature subsets also exhibit strong generalization across classifiers, including KNN, SVM, and Random Forest, confirming model independence. In summary, BGWOCS achieves an effective balance between exploration, exploitation, and efficiency, offering a statistically robust and computationally scalable solution for high-dimensional feature selection and complex classification tasks.

Conclusion

This study presented a novel hybrid meta-heuristic algorithm, BGWOCS, designed to address the challenges of high-dimensional feature selection. The proposed approach integrates the local exploitation strength of Binary Grey Wolf Optimization with the global Lévy flight–based exploration of Cuckoo Search, supported by an adaptive nonlinear convergence factor and a probabilistic variation operator to maintain balance and diversity during optimization.

Extensive experiments conducted on ten standard UCI benchmark datasets confirmed the effectiveness and scalability of BGWOCS. The algorithm consistently outperformed four state-of-the-art methods HRO-GWO, GWOGA, MTBGWO, and IBGWO in terms of average classification accuracy, feature reduction, and convergence stability. The results demonstrated that BGWOCS successfully identifies compact and informative feature subsets while maintaining high predictive accuracy and competitive computational efficiency. Furthermore, the algorithm achieved superior performance in best, mean, and worst fitness metrics, verifying its ability to maintain equilibrium between exploration and exploitation throughout iterative optimization. Importantly, these improvements were achieved without incurring additional computational cost, underscoring the efficiency of the hybrid framework.

For future research, BGWOCS can be extended by integrating adaptive classifier selection mechanisms or ensemble learning strategies to further enhance classification robustness. Additionally, parallel and distributed implementations, as well as hybridization with emerging meta-heuristics, could further reduce computational time and improve scalability for large-scale, real-time data analytics.