Introduction

Recently, the swift progress in high-throughput technologies has resulted in a significant growth in data, both in its complexity and the volume of samples. The challenge of managing this extensive and intricate data efficiently is becoming more pronounced. The conventional manual approaches to dealing with these data sets are now considered unfeasible. Consequently, data mining (DM) and machine learning (ML) methods have risen to the forefront, offering automated knowledge extraction and pattern identification solutions within this vast data.

A notable obstacle encountered in this procedure is the prevalent noise within the gathered data. This noise can result from multiple factors, including imperfections in the data collection technologies and the intrinsic characteristics of the data sources. For example, in medical imaging, any malfunction in the imaging devices can introduce noise into the data, which can interfere with further analysis. Furthermore, the rise of social media has shifted online users from merely consuming content to both producing and consuming it. The quality of data from social media platforms varies dramatically, from extremely valuable to spam or offensive content. Additionally, social media data often features informal language characterized by grammatical mistakes, typos, and incorrect punctuation. This diversity and lack of formality increase the difficulty of deriving meaningful knowledge and patterns from such broad and noisy datasets.

In classification for machine learning and data mining, the primary aim is to identify the category of each instance in a given dataset using a two-phase approach: training and testing. The classifier model is built during the training phase from the training set, which consists of the available labeled records. During the subsequent testing phase, the classifier's accuracy is evaluated on a testing set that was not used during training but whose class labels are known. High dimensionality can pose a significant obstacle and may hinder the effectiveness of the classification process. Datasets containing many features arise in practical applications such as medicine, bioinformatics, text mining, and image classification, yet some of these features may be irrelevant, redundant, or noisy.
Such characteristics in the dataset could result in over-fitting or create ambiguity in the learning mechanism1,2.

Feature Selection (FS) is commonly employed as a preprocessing step to improve the accuracy of a classification model. The core objective of FS is to identify the most relevant features that positively impact model performance while discarding irrelevant or harmful features at minimal cost3. Various algorithms have been created to identify the most effective set of features for a given dataset; however, when dealing with datasets containing many features, traditional algorithms struggle to identify the significant ones.

There are three types of FS algorithms: filter, wrapper, and embedded. In filter algorithms, the FS process and the classifier model are treated as distinct phases. During the first phase, specific metrics select the features that significantly impact the classification process while ignoring the others; only the chosen attributes are then used in the classification model. Wrapper algorithms, by contrast, modify the selected feature subsets dynamically depending on the accuracy of the classifier: subsets of features are generated using specific search methods, and their relevance is determined by running a classification algorithm. Embedded algorithms integrate the decision of which features to keep or remove directly into the training of the classifier4,5,6.

As noted in reference7, FS is widely regarded as a combinatorial optimization problem that is most likely NP-complete. A dataset with \(n\) features admits \(2^{n}\) candidate feature subsets, making it challenging and time-consuming to determine the most efficient subset. Moreover, references8,9 classify the FS problem as NP-hard, meaning that computational time grows exponentially with problem size. Hence, researchers have shown keen interest in meta-heuristic (MH) algorithms10, which excel at solving various optimization problems and fall into four main categories: Human-based algorithms, Swarm Intelligence algorithms (SI), Physics-based algorithms (PA), and Evolutionary Algorithms (EA).
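The exponential growth of the search space can be sketched directly; a minimal illustration (not taken from the cited works) counting and enumerating candidate feature subsets:

```python
from itertools import combinations

def count_subsets(n_features: int) -> int:
    # Each feature is either kept or dropped, giving 2**n candidate subsets
    # (2**n - 1 if the empty subset is excluded).
    return 2 ** n_features

def enumerate_subsets(features):
    # Exhaustive enumeration is feasible only for a handful of features.
    for r in range(1, len(features) + 1):
        yield from combinations(features, r)

print(count_subsets(10))   # 1024
print(count_subsets(100))  # ~1.27e30: exhaustive search is infeasible
print(len(list(enumerate_subsets(["f1", "f2", "f3"]))))  # 7 non-empty subsets
```

Even at 100 features the subset count dwarfs what any exhaustive search could visit, which is precisely why stochastic MH searches are attractive here.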

Swarms and animal behavioral patterns are the basis for SI algorithms11. A commonly employed algorithm in optimization problems is Particle Swarm Optimization (PSO), designed around the collective behavior of a swarm in which every individual represents a potential solution12. The Artificial Fish Swarm (AFS) algorithm replicates fish behaviors such as hunting, gathering in groups, and tracking to perform a localized search that drives individuals toward a global optimum13. Bacterial Foraging Optimization (BFO) draws inspiration from the foraging behavior of Escherichia coli bacteria; it involves competition and cooperation among bacterial populations and is employed as a global random search algorithm14. Ant Colony Optimization (ACO) is a well-known swarm intelligence algorithm that imitates the foraging behavior of ant species: in natural settings, ants deposit chemical pheromones to mark the most favorable path for colony members to follow15. Pigeon-inspired optimization is a swarm intelligence optimizer for air-robot path planning that combines a map-and-compass operator model, based on the magnetic field and the sun, with a landmark operator model that utilizes landmarks16. The bat algorithm is a metaheuristic based on the echolocation behavior of bats; it generates solutions for single- or multi-objective problems within a continuous solution space17. The grey wolf optimizer imitates the leadership hierarchy and hunting mechanism of grey wolves in nature and is likewise categorized as a swarm intelligence algorithm18.

To effectively search a given space, any search algorithm must balance exploring new areas within that space with exploiting already known areas. This means it must balance venturing into uncharted territory and focusing on areas near previously explored locations. By achieving an optimal balance between exploration and exploitation, a search algorithm is more likely to succeed in its search efforts19.

There have been multiple attempts to understand the mechanism that regulates the equilibrium between exploration and exploitation in search algorithms. However, owing to the lack of consistent knowledge, several metrics have been proposed to quantify the level of exploration and exploitation in metaheuristic schemes, typically by monitoring the current diversity of the population. Despite these indexes and ongoing proposals, there is as yet no definitive or objective way to measure a metaheuristic algorithm's exploration/exploitation rate20. Achieving success with metaheuristic algorithms requires a careful balance between exploration and exploitation throughout the evolutionary process, and tuning this balance effectively is therefore important21.

Many SI algorithms that show high performance in various optimization problems have been developed in the literature. Some of these algorithms include the sailfish optimizer (SFO)22, Chaotic Coyote Algorithm23, Modified Social-Spider Optimization Algorithm24, Cheetah Optimization Algorithm25, Migrating Birds Optimization26, Owl Optimization Algorithm27, Bacterial Foraging Optimization Algorithm28, Salp Swarm Algorithm (SSA)29.

Many metaheuristic algorithms are based on evolutionary behaviors that emulate biological processes such as mutation, crossover, and selection; they are known as EA algorithms. Examples include Differential Evolution (DE)30, the Genetic Algorithm (GA)31, the Invasive Tumor Growth Optimizer (ITGO)32, and the Biogeography-Based Optimizer (BBO)33. These algorithms have shown great efficiency in various optimization applications.

Optimization algorithms based on physical laws are called PA algorithms and include Big Bang-Big Crunch (BBBC)34, the Multi-verse Optimizer (MVO)35, and the Gravitational Search Algorithm (GSA)36.

Contribution

The proposed framework in this paper puts forward a hybrid algorithm that combines the DE algorithm with the SFO algorithm to handle the FS strategy. It offers novel contributions that can be summarized as follows:

  1. A new algorithm, called DESFO, has been created by integrating DE and SFO.

  2. A V-shaped transfer function (TF) is used to convert position values into binary format.

  3. The periodic mode boundary handling (PMBH) approach and a novel local search (LS) strategy are used to improve the exploration and exploitation processes.

  4. The DESFO algorithm is used for wrapper feature selection in supervised classification.

  5. DESFO's performance is evaluated through metrics such as average fitness rate, average accuracy rate, and average number of selected features.

  6. To assess the effectiveness of the proposed DESFO algorithm with the RF and K-NN classification algorithms, Wilcoxon's non-parametric rank-sum test (at a 5% significance level) is used to compare it with similar algorithms.
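As a sketch of how such a comparison can be run, the rank-sum test is available in SciPy; the per-run accuracy samples below are hypothetical placeholders, not results from this paper:

```python
from scipy.stats import ranksums

# Hypothetical per-run accuracies for DESFO and a competitor (10 runs each).
desfo_acc = [0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.91, 0.95, 0.92]
other_acc = [0.88, 0.90, 0.89, 0.87, 0.91, 0.88, 0.89, 0.90, 0.87, 0.88]

stat, p_value = ranksums(desfo_acc, other_acc)
# At the 5% significance level, reject the null hypothesis of equal medians.
significant = p_value < 0.05
print(f"z = {stat:.3f}, p = {p_value:.5f}, significant: {significant}")
```

In the paper's setting the two samples would be the 30 per-run accuracies of DESFO and of each competitor on the same dataset.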

Structure

The paper follows the structure outlined below:

  1. Section “Related works” reviews the recent state of the art and related works.

  2. Section “Preliminary work” provides preliminary material and explanations of the original DE and SFO algorithms.

  3. Section “Methodology of the proposed DESFO” introduces the methodology of the proposed DESFO algorithm, along with the related steps.

  4. Section “Experimental results and analysis” presents the experimental results of the DESFO algorithm and compares it with other MH algorithms.

  5. Section “Conclusion and future works” concludes the paper.

Related works

Numerous research studies have been conducted in feature selection utilizing metaheuristic algorithms. Some of these efforts are outlined below.

Rodrigues et al.37 introduced a binary cuckoo search algorithm called BCS, which uses a function to convert continuous variables to their binary form to obtain the optimal feature subset. The Optimum Path Forest classifier was used to apply BCS on two datasets related to theft detection in a power system. The results indicated that BCS was the most efficient and appropriate method for solving feature selection issues in industrial datasets while also being the fastest.

In their study, Emary et al.38 introduced the initial binary edition of the firefly algorithm (FFA) for addressing feature selection issues by utilizing a threshold value. The algorithm exhibited a high level of exploration quality, enabling it to swiftly identify a solution to the problem.

To tackle feature selection problems, Nakamura et al.39 developed a binary version of BA called BBA. They used a sigmoid function to confine the position of bats to binary variables. They employed the optimum path forest classifier and applied BBA to five datasets to evaluate the accuracy.

Zawbaa et al.40 proposed a binary version of the ALO algorithm to address the feature selection problem by applying a threshold value to continuous variables. In their study, Emary et al.41 employed the sigmoidal transfer function to obtain binary vectors, also known as bGWO. They evaluated the classification accuracy of these vectors using a K-NN classifier across eighteen distinct UCI datasets. The researchers also utilized small, random, and large initialization methods during the initialization phase to facilitate thorough exploration.

Hussien et al.42,43 utilized S and V-shaped transfer functions in conventional WOA to solve binary optimization problems. They also applied this method to solve feature selection problems with eleven UCI datasets. To ensure the relevance of the selected features for classification, they used the K-NN classifier.

Gad et al.44 introduced a new version of the sparrow search algorithm that combines random agent repositioning with a local search (LS) method to handle feature selection effectively in supervised classification tasks. The approach selects an optimal or near-optimal subset of attributes from a given dataset while maintaining maximum accuracy rates.

Ghosh et al.45 have presented a new variant of the latest and most powerful optimizer, the Sailfish Optimizer (SFO), called the Binary Sailfish (BSF) optimizer for solving FS problems. They utilized the sigmoid transfer function to convert the continuous search space of SFO into a binary one. They also incorporated adaptive β-hill climbing (AβHC), a recently proposed meta-heuristic algorithm, with the BSF optimizer to enhance its exploitation ability.

Emrah et al.46 proposed a new filter criterion inspired by mutual information, ReliefF, and Fisher Score. Rather than relying on mutual redundancy, this criterion selects the most highly ranked features determined by ReliefF and Fisher Score while ensuring mutual relevance between the features and class labels. Based on this criterion, they developed two novel differential evolution (DE) based filter approaches.

Bacanin et al.47 presented a diversity-oriented social network search to tackle the feature selection problem in detecting phishing websites. The authors aimed to enhance detection by refining an extreme learning model that leverages the most pertinent subset of features from the phishing websites dataset. A new algorithm was developed and integrated into a two-level cooperative framework, and its efficacy was evaluated against six other state-of-the-art metaheuristic algorithms.

Alrefai et al.48 proposed an effective method for cancer classification using ensemble learning. The study employed particle swarm optimization together with an ensemble learning method for feature selection and cancer classification. The findings indicate that the proposed method is effective for cancer classification based on microarray datasets and that its accuracy surpasses that of competing methods.

Gomez et al.49 proposed a technique called Two-Step Swarm Intelligence, which breaks the heuristic search carried out by agents into two stages: in the first phase, agents generate partial solutions, which are then used as starting states in the second phase. Their study assessed the effectiveness of this approach in solving the feature selection problem using Ant Colony Optimization and Particle Swarm Optimization, with feature selection based on the reduct concept of Rough Set Theory. The results demonstrate that Two-Step Swarm Intelligence improves the performance of the ACO and PSO metaheuristics in both computation time and the quality of the reducts produced.

Bezdan et al.50 proposed an algorithm based on a binary hybrid metaheuristic approach to select the optimal feature subset. Specifically, they combined the brainstorm optimization algorithm with the firefly algorithm to create a wrapper method for feature selection problems on classification data sets. The performance of the proposed algorithm was evaluated on 21 data sets and compared against 11 other metaheuristic algorithms. Additionally, the algorithm was applied to the coronavirus data set.

Gao et al.51 introduced Clustering Probabilistic Particle Swarm Optimization (CPPSO) to improve the traditional particle swarm optimization approach. CPPSO represents velocity with probabilities and incorporates an elitism mechanism. Additionally, it uses the K-means algorithm to cluster the population into two sub-populations based on Hamming distance, which enhances its performance. The effectiveness of CPPSO was evaluated against seven existing algorithms on twenty diverse datasets.

Latha et al.52 addressed the feature selection problem by implementing grey wolf optimization with decomposed random differential grouping (DrnDG-GWO) as a supervised learning technique. The study found that combining supervised machine learning with swarm intelligence techniques yielded the best feature optimization results.

Motivations

Storn et al.30 proposed the differential evolution (DE) algorithm in 1997, a powerful and straightforward stochastic search method operating on populations. DE is an effective global optimizer for continuous search problems and has been successfully applied in various domains, such as pattern recognition53, communication54, and mechanical engineering55,56.

The Sailfish Optimizer (SFO) is a highly effective, population-based optimization algorithm presented in 2019 by Shadravan et al.22. It mimics the hunting behavior of a group of sailfish attacking a school of sardines: the sailfish alternate between attacking the sardine group and retreating to capture their prey. The SFO algorithm has become popular in the optimization community due to its robustness and effectiveness. In this paper, an algorithm called DESFO that integrates both DE and SFO is proposed. By combining the strengths of the two, the proposed algorithm can attain satisfactory search accuracy, swift convergence, and improved stability.

Moreover, it can prevent getting stuck in local optima, which is an issue that still needs to be systematically addressed for the FS problem. On the other hand, compared to the state-of-the-art meta-heuristic techniques, including the original DE and SFO, the DESFO approach yields superior results by producing optimal or near-optimal outcomes for numerous problems. The proposed feature selection algorithm method was tested on 14 benchmarks using multi-scale attributes and records from the UCI machine learning repository. This implementation was carried out 30 times to validate its efficacy57. The average classification accuracy is calculated using two standard machine learning classification algorithms: Random Forest (RF) and k-nearest Neighbor (k-NN).

Preliminary work

As mentioned in the previous section, meta-heuristics have several benefits, but can existing methods adequately solve the FS problem? The No Free Lunch theorem (NFL)58 answers this question: no single algorithm can perfectly solve all optimization problems. In the case of FS, an algorithm may perform exceptionally well on one dataset but inadequately on another. Therefore, there is still a need for an advanced metaheuristic approach that can efficiently handle almost all FS dataset types, which remains an open research question. In the remainder of this section, the basic DE and SFO algorithms are explained; the two are then integrated in the DESFO algorithm to optimize the feature selection problem and enhance classification accuracy.

Differential evolution algorithm (DE)

In 1997, Storn et al.30 introduced the Differential Evolution (DE) algorithm, considered one of the most reliable Evolutionary Algorithms. It is known for its fast convergence, user-friendly nature, and ease of implementation. The same small set of parameters, namely population size (NP), crossover rate (Cr), and scaling factor (F), can be applied across various optimization problems. The process begins with a given set of solutions. Then, for each solution vector in the current set, a mutant solution is produced by adding the weighted difference between two candidate solutions to a third candidate solution. DE has proven effective and is widely applied to optimization problems in diverse scientific and engineering domains59.

The structure and primary search operators utilized by the DE algorithm are explained as the following:

Mutation

In every epoch (t), a mutation operator is applied by DE to generate a new donor vector, also known as a mutant vector, for each target solution. The mutation operator randomly selects three candidate solutions according to Eq. (1); it demonstrates that the donor vector is created by scaling the difference vector between two vectors and then adding the result to the third solution30.

$${V}_{i,G+1}={x}_{r1,G}+F\left({x}_{r2,G}-{x}_{r3,G}\right)$$
(1)

In this process, three distinct integers \(r1\), \(r2\), and \(r3\) are randomly selected from \([1, NP]\), where NP is a positive integer greater than or equal to four; these integers also differ from the running index i. The difference vector \(\left({x}_{r2,G}-{x}_{r3,G}\right)\) is then amplified by a constant factor F, which ranges from 0 to 2.
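Eq. (1) can be sketched in a few lines of NumPy; the population shape and the value F = 0.8 below are illustrative assumptions:

```python
import numpy as np

def de_mutation(pop: np.ndarray, i: int, F: float = 0.8) -> np.ndarray:
    """Donor vector per Eq. (1): V = x_r1 + F * (x_r2 - x_r3)."""
    NP = pop.shape[0]
    # r1, r2, r3 are distinct and different from the target index i.
    candidates = [idx for idx in range(NP) if idx != i]
    r1, r2, r3 = np.random.choice(candidates, size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

pop = np.random.uniform(-5.0, 5.0, size=(10, 4))  # NP = 10 solutions, D = 4
donor = de_mutation(pop, i=0)
print(donor.shape)  # (4,)
```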

Crossover

After mutation, a crossover search operator produces an offspring (trial) vector from the target solution. The exponential and binomial crossover operators are the most frequently used and the most straightforward. In binomial crossover, each decision variable (DV) \(j\) is taken from the donor vector whenever \(rand\left(j\right)\le {C}_{r}\):

$${u}_{i,j,G}=\left\{\begin{array}{ll}{V}_{i,j,G+1}& if\,rand\left(j\right)\le {C}_{r}\,or\,j={j}_{rand}\\ {x}_{i,j,G}& otherwise\end{array}\right.,\quad j=\text{1,2},\dots ,D$$
(2)

where \({j}_{rand}\) is an index chosen at random from \([1, D]\), which ensures that at least one decision variable is inherited from the donor vector; \(rand\left(j\right)\) is a uniform random number drawn from \([0, 1]\) for the \({j}^{th}\) variable; and the crossover rate \({C}_{r}\) controls how many variables are taken from the donor vector, guaranteeing that \({V}_{i,G+1}\) contributes at least one parameter to \({u}_{i,j,G}\).
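A sketch of the binomial crossover of Eq. (2), with Cr = 0.9 assumed for illustration:

```python
import numpy as np

def binomial_crossover(target: np.ndarray, donor: np.ndarray, Cr: float = 0.9):
    """Trial vector per Eq. (2): donor component if rand(j) <= Cr or j == j_rand."""
    D = target.size
    j_rand = np.random.randint(D)   # forces at least one donor component
    mask = np.random.rand(D) <= Cr
    mask[j_rand] = True
    return np.where(mask, donor, target)

target = np.zeros(6)
donor = np.ones(6)
trial = binomial_crossover(target, donor)
print(trial)  # at least one component is taken from the donor
```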

Selection

A selection operator determines which vector survives by comparing the objective function values of the parent and the offspring. If the offspring has a lower objective function value, it is preserved for subsequent iterations; otherwise, the parent vector is retained for that generation:

$${x}_{i,G+1}=\left\{\begin{array}{ll}{u}_{i,G}& if\,f\left({u}_{i,G}\right)\le f\left({x}_{i,G}\right)\\ {x}_{i,G}& otherwise\end{array}\right.$$
(3)

To determine whether it joins generation G + 1, the trial vector \({u}_{i,G}\) is evaluated against the target vector \({x}_{i,G}\) using the greedy criterion: if the trial vector yields a lower cost function value than the target vector, the trial vector \({u}_{i,G}\) replaces the target vector \({x}_{i,G}\); otherwise, the original target vector \({x}_{i,G}\) is kept.
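The greedy selection of Eq. (3) can be sketched as follows, using a toy sphere objective (an illustrative choice, not from the paper):

```python
import numpy as np

def de_selection(pop: np.ndarray, trials: np.ndarray, f) -> np.ndarray:
    """Eq. (3): a trial replaces its target only if it does not worsen f."""
    survivors = pop.copy()
    for i in range(pop.shape[0]):
        if f(trials[i]) <= f(pop[i]):
            survivors[i] = trials[i]
    return survivors

sphere = lambda x: float(np.sum(x ** 2))          # toy minimization objective
pop = np.random.uniform(-5.0, 5.0, size=(8, 3))
trials = pop * 0.5                                # never worse under sphere
pop = de_selection(pop, trials, sphere)           # every trial is accepted here
```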

The sailfish optimizer (SFO)

Shadravan et al.22 developed a unique algorithm called sailfish optimizer (SFO) in 2019, which is based on swarm intelligence and is a population-based algorithm. To devise this technique, the scientists took cues from a pack of predatory sailfish. The approach involves the use of two distinct populations. The sailfish population is responsible for intensifying the search around the current best solution, while the sardine population diversifies the search space. The sailfishes are considered potential solutions, and their positions in the search space represent the problem's variables. The algorithm aims to randomize all search agents’ movement (sailfish and sardine) to the greatest extent possible. Sailfishes are dispersed throughout the search space, while the positions of sardines aid in discovering the optimal solution in the search space.

The algorithm identifies the sardine with the best fitness value as the ‘injured’ fish, with its position denoted as (\({P}_{srdinj}^{i}\)) at the \({i}^{th}\) iteration. During each iteration, the positions of both sardines and sailfishes are updated. For the \({i}^{th}\) iteration, the position of a sailfish is updated using the ‘elite’ sailfish \({P}_{Slfbest}^{i}\) and the ‘injured’ sardine based on a specific criterion.

The positions of sailfishes and sardines are modified at each iteration. The elite sailfish \({P}_{Slfbest}^{i}\) and the injured sardine \({P}_{srdinj}^{i}\) are used to update a sailfish's position \({P}_{Slf}^{i}\) to a new position \({P}_{Slf}^{i+1}\) according to Eq. (4)22:

$${P}_{Slf}^{i+1}={P}_{Slfbest}^{i}-{\mu }_{i}\left(rand*\frac{{P}_{Slfbest}^{i}+{P}_{srdinj}^{i}}{2}- {P}_{Slf}^{i}\right)$$
(4)

where \(rand\in (\text{0,1})\) is a random value, and the coefficient \({\mu }_{i}\) is generated by Eq. (5):

$${\mu }_{i}=\left(3*rand*PrD-PrD\right)$$
(5)

In each iteration, the prey density \(PrD\), which represents the number of available prey, is determined using Eq. (6). As the number of prey decreases during group hunting, the value of \(PrD\) decreases accordingly.

$$PrD=1-\frac{{N}_{Slf}}{{N}_{Slf}+{N}_{srd}}$$
(6)

The numbers of sailfish and sardines are represented by \({N}_{Slf}\) and \({N}_{srd}\), respectively. \({N}_{Slf}\) can be calculated according to Eq. (7):

$${N}_{Slf}= {N}_{srd}*Prcent$$
(7)

Please keep in mind that (\(Prcent\)) refers to the percentage of the sardine population that constitutes the initial sailfish population. It is also assumed that the initial number of sardines exceeds the number of sailfish.

The positions of the sardines are updated in each iteration according to Eq. (8):

$${P}_{Srd}^{i+1}=rand*({P}_{Slfbest}^{i}-{P}_{Srd}^{i}+ATK)$$
(8)

The old and updated positions of a sardine are represented by \({P}_{Srd}^{i}\) and \({P}_{Srd}^{i+1}\), respectively, while \(ATK\) represents the power of the sailfish attack at iteration \(i\) and is calculated by Eq. (9):

$$ATK=A*\left(1-\left(2*itr*k\right)\right)$$
(9)

where \(A\) and \(k\) are constant coefficients and \(itr\) is the current iteration number. \(ATK\) is crucial in determining how many sardines update their positions and by how much they are displaced; decreasing \(ATK\) facilitates the convergence of the search agents. Based on the \(ATK\) parameter, the number of sardines that update their position, \(\gamma\), and the number of variables updated, \(\delta\), are computed using Eqs. (10) and (11):

$$\gamma =ATK*{N}_{Srd}$$
(10)
$$\delta =ATK*v$$
(11)

where \({N}_{Srd}\) and \(v\) denote the number of sardines and the number of variables, respectively. If a sardine surpasses the fitness of any sailfish, that sailfish adjusts its position to the sardine's location, and the sardine is removed from its population.

To explore the search space effectively, it’s important to select both sailfishes and sardines randomly. Sailfishes have a decreasing attack power after each iteration, allowing sardines to escape from the most aggressive sailfish. This helps to balance the exploration and exploitation of the search space. The \(ATK\) parameter is used to find the optimal balance between both of them.
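Eqs. (4) through (9) can be gathered into one sketched iteration. The coefficients A and k of Eq. (9) are assumed values (the text does not fix them here), and for simplicity every sardine is updated rather than only the \(\gamma\) best:

```python
import numpy as np

def sfo_step(sailfish, sardines, fitness, itr, A=4.0, k=0.001):
    """One simplified SFO iteration sketching Eqs. (4)-(9)."""
    fit_sf = np.array([fitness(s) for s in sailfish])
    fit_sd = np.array([fitness(s) for s in sardines])
    elite = sailfish[fit_sf.argmin()].copy()     # best sailfish (minimization)
    injured = sardines[fit_sd.argmin()].copy()   # 'injured' sardine

    n_sf, n_sd = len(sailfish), len(sardines)
    prd = 1.0 - n_sf / (n_sf + n_sd)             # prey density, Eq. (6)

    # Eqs. (4)-(5): move each sailfish around the elite/injured midpoint.
    for i in range(n_sf):
        mu = 3.0 * np.random.rand() * prd - prd  # Eq. (5)
        sailfish[i] = elite - mu * (
            np.random.rand() * (elite + injured) / 2.0 - sailfish[i]
        )

    # Eqs. (8)-(9): sardines flee with linearly decreasing attack power.
    atk = A * (1.0 - 2.0 * itr * k)              # Eq. (9)
    for j in range(n_sd):
        sardines[j] = np.random.rand() * (elite - sardines[j] + atk)

    return sailfish, sardines

sphere = lambda x: float(np.sum(x ** 2))
sf = np.random.uniform(-5.0, 5.0, size=(5, 3))
sd = np.random.uniform(-5.0, 5.0, size=(20, 3))
sf, sd = sfo_step(sf, sd, sphere, itr=1)
```

Note how the sardine population (20) exceeds the sailfish population (5), matching the assumption under Eq. (7).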

Methodology of the proposed DESFO

Improving the accuracy of classifiers involves focusing on pertinent features. Recent research studies1,60 suggest utilizing feature selection (FS) to substitute a sizable quantity of insignificant features with a more concise and applicable subset. FS marks features as essential (1) or non-essential (0). This paper presents a hybrid algorithm named DESFO, which combines differential evolution (DE) and the sailfish optimizer (SFO) to implement FS. The algorithm comprises several stages: initialization, position updating, binary conversion, exploration optimization via a new strategy, and exploitation optimization.

Table 2 displays the number of iterations allocated to each algorithm, which is 100. For the proposed DESFO algorithm, this budget was split equally between DE and SFO, with 50 iterations each: DE runs for the first 50 iterations to obtain a good solution, which is then passed to SFO to refine the selected relevant features and achieve the best classification accuracy. The following sections explain each stage in detail.

Initial population generation

The first step in using the DESFO algorithm is generating an initial population of X positions representing potential solutions in a D-dimensional space. The population size is determined using a specific formula.

$$X=Round\left(10+2*\sqrt{D}\right).$$
(12)

X signifies the overall number of positions, while D represents the problem's dimensionality. The position matrix is defined as:

$$M=\left[\begin{array}{c}{m}_{\text{1,1}}, {m}_{\text{1,2}}, \dots {m}_{1,p}\\ {m}_{\text{2,1}}, {m}_{\text{2,2}},\dots {m}_{2,p}\\ \vdots \,\,\,\,\,\,\,\,\,\vdots \,\,\,\,\,\,\,\ddots \,\,\,\,\,\vdots \\ {m}_{X,1}, {m}_{X,2}, \dots {m}_{X,p}\end{array}\right]$$

The \({i}^{th}\) solution is represented by the \({i}^{th}\) row of \(M\), with \({M}_{i,j}\) denoting its \({j}^{th}\) component. The initial population \(M\) is generated within predefined bounds as:

$${M}_{i}^{u}=u\left(\text{0,1}\right)*\left(UB-LB\right)+LB$$
(13)
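Eqs. (12) and (13) together can be sketched as:

```python
import numpy as np

def init_population(D: int, lb: float, ub: float) -> np.ndarray:
    """Eq. (12) sizes the population; Eq. (13) samples it inside [lb, ub]."""
    X = round(10 + 2 * np.sqrt(D))                                    # Eq. (12)
    return np.random.uniform(0.0, 1.0, size=(X, D)) * (ub - lb) + lb  # Eq. (13)

M = init_population(D=25, lb=-1.0, ub=1.0)
print(M.shape)  # (20, 25), since round(10 + 2*sqrt(25)) = 20
```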

Position update in DESFO

Updating the position involves using the equations of DE and SFO as described in subsections 3.1 and 3.2. After updating the position, it goes through binary conversion, as explained in Subsection 4.3. The fitness function then assesses the binary-transformed vector to calculate the classification error while keeping the original format of the vector for future updates.

Position binary conversions

Converting the position values from continuous to binary is necessary before assessing fitness using the FS method. This is because DESFO operates on continuous position values, whereas FS requires a binary framework, making it impossible to apply the continuous values directly to binary/discrete problems.

The FS method uses a vector of binary values in which selected features are represented by 1s and non-selected features by 0s. The length of the solution vector equals the number of features in the original dataset.

A transfer function (TF) suggested by Fang et al.61 has been utilized in the proposed algorithm; it has a V-shaped curve and is known for its exceptional global search capability. The function is expressed as follows:

$$v\left(y\right)=\alpha *\frac{\text{arctan}\left(y\right)*\frac{\pi }{\sqrt{1+{y}^{2}}}}{\pi }$$
(14)

Here \(y\) is the obtained position value, and \(\alpha\) is a coefficient (less than 0.64) that keeps the TF output within the range [0, 1]. The update rule for DESFO's binary position is based on the following equation:

$${Y}_{i}^{bin}=\left\{\begin{array}{ll}1, & \text{if } rand < v({Y}_{i})\\ 0, & \text{otherwise}\end{array}\right.$$
(15)
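The binary conversion of Eqs. (14)–(15) can be sketched as follows. The α value, the absolute value taken to keep the TF output non-negative, and the random seed are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

ALPHA = 0.64  # scaling coefficient for Eq. (14); value assumed for illustration

def v_transfer(y):
    """V-shaped TF of Eq. (14); the pi factors cancel, leaving
    alpha * arctan(y) / sqrt(1 + y^2). abs() keeps it usable as a probability."""
    return np.abs(ALPHA * np.arctan(y) / np.sqrt(1.0 + y ** 2))

def to_binary(positions, rng):
    """Eq. (15): mark a feature selected (1) when rand < v(y), else 0."""
    return (rng.random(positions.shape) < v_transfer(positions)).astype(int)

rng = np.random.default_rng(42)
bits = to_binary(np.array([-2.0, -0.1, 0.0, 0.1, 2.0]), rng)
```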

Fitness evaluation

The DESFO framework incorporates k-NN and RF as evaluative mechanisms in the proposed FS-based technique. The k-NN method62 selects the most common class among the closest neighbors to predict the classification of new instances. The RF, explained in44, uses decision trees to recursively divide the training data into small sets, which helps optimize the classification task by using an impurity criterion such as information gain or the “gini” index63. These classifiers are particularly efficient in handling high-dimensional data and require minimal computational effort, as stated in62.
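As a rough sketch of the wrapper-evaluation idea, a minimal k-NN (Euclidean distance, majority vote) can be written directly in NumPy. The toy data below is hypothetical, and this is a sketch, not the implementation of ref.62:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Minimal k-NN: for each test point, take the majority class among
    the k nearest training points by Euclidean distance."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

# toy sanity check: two well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X, y, np.array([[0.05, 0.05], [5.05, 5.05]]), k=3)
```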

Achieving the right balance between accuracy and feature set size is crucial in DESFO. While opting for a smaller feature set can reduce the computational cost of classifiers such as k-NN and RF, it may also compromise accuracy64. The number of selected features and the classification accuracy are thus inversely related, which means there is a potential trade-off between accuracy and feature set size. Therefore, the PMBH method is vital in balancing feature selection and classification accuracy65.

When assessing the effectiveness of an algorithm, it is essential to consider the trade-off between precision and feature size. This trade-off can be mathematically represented as:

$$FIT={\alpha }_{1}*\left(1-accuracy\right)+{\alpha }_{2}*\frac{\left|{D}^{*}\right|}{\left|D\right|}$$
(16)

In the given equation, there are two weight coefficients, α1 and α2, where α1 is a value between 0 and 1, and α2 is determined by subtracting α1 from 1. These values have been determined through extensive testing, as mentioned in reference, and the expression \(\frac{\left|{D}^{*}\right|}{\left|D\right|}\) represents the ratio of the number of selected features to the total number of features in the original dataset. The main objective of this design is to increase precision while reducing the length of the feature set, as suggested in reference38. The value |D*| represents the size of the selected feature set, while |D| represents the total number of features in the original dataset.
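A minimal sketch of Eq. (16), assuming the weighting α1 = 0.99 and α2 = 1 − α1 that is common in FS papers (the exact values used here are not stated):

```python
ALPHA1 = 0.99           # accuracy weight; assumed value, not from the paper
ALPHA2 = 1.0 - ALPHA1   # feature-ratio weight

def fitness(accuracy, n_selected, n_total):
    """Eq. (16): FIT = a1*(1 - accuracy) + a2*(|D*| / |D|); lower is better."""
    return ALPHA1 * (1.0 - accuracy) + ALPHA2 * (n_selected / n_total)

# e.g. 95% accuracy with 10 of 40 features selected
f = fitness(0.95, 10, 40)
```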

Improving exploration

Search agents like meerkats tend to explore outside their assigned search areas to find optimal solutions. However, issues may arise when using boundary-handling techniques to keep an agent within the initial search territory, as discussed in61. The two primary traditional methods for boundary handling are Boundary and Random modes. In Boundary mode, if a solution's dimension d goes beyond the search space S, it gets repositioned to the nearest boundary, either lower bound L or upper bound U. Conversely, dimension d of S receives random value mutations in Random mode. These traditional methods, however, have limitations in fully exploring the search space. Therefore, Periodic Mode Boundary Handling (PMBH) was developed as per61, aiming to improve the exploration phase. PMBH allows for infinite search space for agent movement, consisting of periodic replicas of the original space S, maintaining the same fitness landscape, as shown in Fig. 1.
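The periodic replication that PMBH describes can be sketched as a modular wrap: an out-of-bound coordinate re-enters the space from the opposite side. The bounds below match the [−1, 1] search domain used later in the experiments; the implementation is an assumption based on the description in ref.61, not the authors' code:

```python
import numpy as np

def pmbh(position, LB=-1.0, UB=1.0):
    """Periodic Mode Boundary Handling: the search space is treated as an
    infinite sequence of periodic replicas, so a coordinate outside [LB, UB]
    wraps around modularly while the fitness landscape repeats unchanged."""
    span = UB - LB
    return LB + np.mod(position - LB, span)

wrapped = pmbh(np.array([1.5, -1.25, 0.3]))
```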

Figure 1
figure 1

PMBH70.

Exploitation optimization

This section presents the updated local search (LS) principles of the enhanced DESFO. These principles aim to improve algorithmic efficiency and ensure better exploitation by generating a fresh population with optimal positions while maintaining the essential structure.

Three main principles guide the proposed approach. Firstly, to address the limitation of the original algorithm that lacks a mechanism to recall and preserve the best solutions over iterations, a binary matrix has been introduced to store the top solutions obtained previously. Secondly, repetitive best solution patterns resulting from binary conversion can reduce exploitation effectiveness, which can be improved by incorporating distinct solutions in the binary matrix. Lastly, the LS strategy relies on identifying solutions close to the best discovered by converting continuous positions into binary format and following a constrained normal distribution, as shown in Eq. (17).

$${x}_{d}^{l+1}={x}^{L}+{\beta x}^{L}$$
(17)

The solution obtained through minor mutation slightly deviates from the current best, due to a random factor \(\beta\) drawn from a normal distribution \(N(0.0, 0.4)\). The optimal solution is initially added to an empty set to find local search solutions. The set has a fixed maximum size, \({LS}_{max}\). Then, a new solution is generated by applying Eq. (17) to the current \({g}_{best}\), which is then converted to binary and assessed for fitness. If this new solution outperforms the current best, it replaces it as the best solution.
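The LS loop around the current best (Eq. 17) can be sketched as follows; the sphere fitness, loop count, and seed are illustrative assumptions, and the binary-conversion step is omitted for brevity:

```python
import numpy as np

def local_search(g_best, fitness_fn, ls_max=10, sigma=0.4, seed=0):
    """LS around the current best (Eq. 17): x_new = x_best + beta * x_best,
    with beta ~ N(0, 0.4); a mutant is kept only if it improves fitness
    (minimization)."""
    rng = np.random.default_rng(seed)
    best, best_fit = g_best.copy(), fitness_fn(g_best)
    for _ in range(ls_max):
        beta = rng.normal(0.0, sigma, size=best.shape)
        cand = best + beta * best
        cand_fit = fitness_fn(cand)
        if cand_fit < best_fit:
            best, best_fit = cand, cand_fit
    return best, best_fit

# toy fitness: sphere function (minimum at the origin)
x0 = np.array([0.5, -0.5])
xb, fb = local_search(x0, lambda x: float(np.sum(x ** 2)))
```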

The flowchart and Pseudo code of DESFO

In Fig. 2, the steps of the proposed DESFO algorithm are demonstrated.

Figure 2
figure 2

DESFO flowchart.

Algorithm 1
figure a

Differential Evolution with Sailfish Optimizer (DESFO)

Complexity analysis

In analyzing the complexity of the DESFO, we can delve deeper into the computational processes involved. This includes looking at the computational demands of evaluating classifiers and the benefits of using combined methods in terms of efficiency.

  • Complexity Breakdown by Component

    1. Initialization: Initializes NP individuals, each possessing D features. This operation has a complexity of O(NP × D).

    2. Differential Evolution Operations:

      • Mutation: Executing the mutation step for one individual involves choosing three different individuals and computing their vector differences, which amounts to a complexity of O(D). Consequently, the total complexity of the mutation step over all individuals is O(NP × D).

      • Crossover: Crossover is applied per dimension for each individual, governed by the probability CR, which leads to O(NP × D).

      • Selection: Evaluating and selecting the better individual between the target and trial vectors typically involves fitness computation, which can be a significant factor depending on the complexity of the fitness function. The selection step itself has a complexity of O(NP).

    3. Sailfish Optimizer Updates:

      • Position Update: Each sailfish updates its position based on the positions of the elite and injured sardines; the complexity per generation is O(NP).

    4. Binary Conversion and Fitness Evaluation:

      • Binary Conversion: Each of the D features of each of the NP individuals is converted from a real number to a binary value based on a transfer function, totaling O(NP × D).

      • Fitness Evaluation: The evaluation of fitness depends on the classification algorithm used. For k-NN or RF, the time complexity depends on the number of features D and possibly the sample size if a wrapper method is used. The complexity is therefore O(NP × f(D)), where f(D) represents the computational complexity of evaluating one individual.

    5. Local Search:

      • Local Search Operations: Assume that local search is applied to a subset of the population (say, the k best individuals) and each local search operation has a complexity of O(g(D)), where g(D) might involve multiple evaluations of minor variations of the individual. If LS iterations of local search are performed for each of these individuals, the complexity for this part is O(k × LS × g(D)).

  • Overall Complexity

Combining all these elements, the total complexity per generation of the DESFO algorithm is: O(NP × D + NP × D + NP + NP × D + NP × f(D) + k × LS × g(D))

This simplifies to: O(3 × NP × D + NP × f(D) + k × LS × g(D))

For all generations, MaxGens, the overall complexity becomes: O(MaxGens × (3 × NP × D + NP × f(D) + k × LS × g(D)))

  • Comparing the complexity of DESFO with DE and SFO shows that the total complexity of DESFO is O(MaxGens × (3 × NP × D + NP × f(D) + k × LS × g(D))), whereas the SFO complexity is O(MaxGens × 4 × NP) and the DE complexity is O(MaxGens × 3 × NP). This means that DESFO has a higher computational complexity due to its integrated steps and phases.

Experimental results and analysis

The following part of the paper presents the results from the proposed DESFO algorithm and compares them with those reported in prior studies. To verify the proposed algorithm, 14 multi-scale benchmarks were utilized, and the mean values of the evaluation metrics are reported. The datasets employed in all experiments are elaborated in subsection 5.1, and the main parameters of the metaheuristic techniques used in this paper are outlined in subsection 5.2. Evaluation measures are explained in subsection 5.3. In subsection 5.4, the proposed DESFO algorithm is evaluated and compared with the k-NN and RF algorithms. Subsection 5.5 compares the outcomes of the proposed DESFO algorithm with those of other methods, and convergence graphs are depicted in subsection 5.6. In subsection 5.7, Wilcoxon's test is conducted to assess the significance of differences in fitness rates between the proposed DESFO algorithm and its counterparts, and the final subsection 5.8 discusses the results.

Benchmarks description

The proposed algorithm’s performance is demonstrated using 14 benchmarks with multi-domain features and instances. These benchmarks are obtained from the UCI machine learning repository57. The variety of attributes and instances in each benchmark is beneficial in validating the proposed algorithm. Table 1 provides an overview of the benchmarks used in this paper, along with their respective properties and descriptions. The datasets shown in Table 1 are sorted in descending order according to the number of features.

Table 1 Dataset characteristics.

Parameters configuration

The DESFO algorithm proposed in this study was evaluated against several meta-heuristic algorithms, including the two original algorithms that were combined, the Differential Evolution (DE) algorithm30 and the sailfish optimization (SFO) algorithm22, as well as nine other algorithms: Harris Hawks Optimization (HHO)66, Particle Swarm Optimization (PSO)67, Bat Algorithm (BA)17, Whale Optimization Algorithm (WOA)68, Grasshopper Optimization Algorithm (GOA)69, Grey Wolf Optimization (GWO)18, Bird Swarm Algorithm (BSA)70, Henry gas solubility optimization (HGSO)71, and Artificial Bee Colony (ABC)11. In this work, the ML classifiers' primary parameters have been established as follows: the k-NN classifier uses the Euclidean distance metric with k set to 5, based on the outcomes obtained from previous papers, such as72. The Random forest (RF) classifier73 is a popular machine-learning algorithm often used for complex tasks such as time-series forecasting, image classification, facial expression recognition, action recognition and detection, visual tracking, label distribution learning, and more. Every method is evaluated on each dataset by conducting 30 distinct experiments, and the results are reported according to the mean performance measures. To maintain equality in the evaluation process, each method had a population size of 10 and a maximum of 100 iterations. The size of the datasets used was proportional to the complexity of the problem. The exploration of the continuous search space was confined yet extensive, with the search domain established as [−1, 1].

A validation process is necessary to assess the optimality achieved by the outcomes in the framework, so a tenfold cross-validation method is employed. This ensures that the values obtained are reliable. Each benchmark is randomly split into two subsets, with 80% of the benchmark used for training and the remainder for testing purposes3. During the learning process of the machine learning classifier, the training subset is used and optimized, while the test subset is used to evaluate the selected features. Table 2 displays the standard configurations for all techniques and the parameter settings for each method, which were determined based on the original variants and the data included in their initial publications. Python is used to run the processes on a computer system equipped with an Intel i7 CPU, 16 GB of RAM, and an NVIDIA GTX 1050i GPU.
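The tenfold cross-validation indexing described above can be sketched as follows (a minimal version for illustration, not the authors' exact pipeline):

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation:
    the shuffled indices are split into k folds, and each fold serves
    as the test set once while the rest form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# tenfold split of a toy dataset with 100 samples
splits = list(kfold_indices(100, k=10))
```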

Table 2 Parameter configuration for all algorithms.

Metrics of performance

The DESFO algorithm performance is compared to other methods, and each approach is assessed independently in 30 runs per benchmark. The evaluation of the FS strategy employs certain measures to conduct this assessment.

Mean accuracy: The accurate data classification rate (\({Mean}_{acc}\)) can be determined by executing the method independently for 30 runs:

$${Mean}_{acc}=\frac{1}{30}\frac{1}{m} \sum_{k=1}^{30}\sum_{r=1}^{m}match({PL}_{r},{AL}_{r})$$
(18)

where mean accuracy is represented by \({Mean}_{acc}\), the number of samples in the testing subset is denoted by m, the predicted class label for a sample is denoted by PLr, and the actual class label is denoted by ALr. A function called match(PLr, ALr) compares these labels: when PLr is equal to ALr, the value of match(PLr, ALr) is 1; otherwise, it is 0.
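Eq. (18) amounts to averaging label matches over the independent runs; a minimal sketch with hypothetical labels:

```python
import numpy as np

def mean_accuracy(pred_runs, actual):
    """Eq. (18): fraction of predicted labels matching the actual labels,
    averaged over all runs. pred_runs is a (runs, m) array of predictions."""
    pred_runs = np.asarray(pred_runs)
    return float(np.mean(pred_runs == np.asarray(actual)))

# two runs over four samples: run 1 matches 4/4, run 2 matches 3/4
acc = mean_accuracy([[1, 0, 1, 1], [1, 0, 0, 1]], [1, 0, 1, 1])
```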

Mean fitness value: The metric (MeanFit) measures the average fitness achieved by running the recommended approach independently for 30 runs. It reflects how both decreasing the number of chosen features and lowering the classification error rate reduce the fitness value, as per Eq. (16). The best result is indicated by the minimum value, which is evaluated based on fitness as:

$${Mean}_{Fit}=\frac{1}{30}\sum_{k=1}^{30}{f}_{*}^{k},$$
(19)

The \({Mean}_{Fit}\) denotes the mean or average fitness value, while \({f}_{*}^{k}\) indicates the best fitness outcome attained in the k-th of the 30 runs.

The mean number of features selected: This metric, denoted by MeanFeat, represents the average count of chosen features obtained by performing the technique independently for 30 runs and is defined as:

$${Mean}_{Feat}=\frac{1}{30}\sum_{k=1}^{30}\frac{\left|{d}_{*}^{k}\right|}{\left|D\right|},$$
(20)

where \(\left|{d}_{*}^{k}\right|\) denotes the number of selected features in the optimal solution for the k-th of the thirty runs, while |D| denotes the total number of features in the benchmark.

  • Wilcoxon’s rank-sum test: To gain a deeper insight into the importance of the discussed method, statistical evidence must demonstrate its effectiveness. Therefore, the efficacy of the results derived from the methods used is often validated by employing the Wilcoxon rank-sum non-parametric test. This is favored for its ability to statistically distinguish the significance and dependability of various competing methods74. In this study, the focus is on evaluating the proposed DESFO method in comparison with other algorithms. A null hypothesis is put forward, suggesting no difference in performance between the DESFO algorithm and the others when compared pairwise. Conversely, if proven otherwise, the DESFO algorithm outperforms the rest. The assessment hinges on the calculation of a p-value through the Wilcoxon rank-sum test, which helps analyze the differences in outcomes from 30 separate executions of both the DESFO and competing algorithms.
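The rank-sum computation underlying this test can be sketched as follows. In practice, `scipy.stats.ranksums` also supplies the p-value; the minimal version below computes only the pooled rank sums, with tied values given their average rank:

```python
import numpy as np

def rank_sum_statistic(a, b):
    """Wilcoxon rank-sum: rank the pooled samples (average ranks for ties),
    then sum the ranks belonging to each sample."""
    pooled = np.concatenate([a, b])
    order = np.argsort(pooled)
    ranks = np.empty(len(pooled), dtype=float)
    ranks[order] = np.arange(1, len(pooled) + 1)
    for val in np.unique(pooled):          # average ranks for tied values
        mask = pooled == val
        ranks[mask] = ranks[mask].mean()
    return ranks[: len(a)].sum(), ranks[len(a):].sum()

# toy fitness samples: all values of `a` are smaller than those of `b`
r_a, r_b = rank_sum_statistic(np.array([0.10, 0.12, 0.11]),
                              np.array([0.20, 0.22, 0.21]))
```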

The results of ML classifiers (k-NN and RF) and DESFO

In this subsection, the mean accuracy (\({Mean}_{acc}\)) is used to compare the performance of the ML classifiers (RF and k-NN) with the proposed methods (DESFO-RF and DESFO-K-NN), and the mean number of selected features (\({Mean}_{Feat}\)) is also given. This was done to evaluate the effectiveness and scope of the DESFO approach.

Comparisons of DESFO- K-NN and K-NN

In Table 3, a comparison between the DESFO-K-NN technique and the basic K-NN algorithm is demonstrated. The evaluation is centered on two metrics to measure performance: the average accuracy of classification (MeanAcc) and the average count of selected features (MeanFeat).

Table 3 Comparison of MeanAcc and MeanFeat for DESFO-K-NN & the basic K-NN.

After analyzing Table 3, it is worth mentioning that the DESFO–K-NN technique led to an increase in MeanAcc on all benchmarks. The increase was more than 15% on four of them. Moreover, MeanAcc had a score of over 93% on nine out of the total fourteen benchmarks, and 100% MeanAcc was achieved on four of them. The MeanFeat decreased on 93% of the benchmarks due to implementing the suggested DESFO–K-NN method; however, the DESFO–K-NN method could not improve the MeanFeat on the Tic-tac-toe benchmark. Finally, it was found that the DESFO–K-NN technique outperformed the basic K-NN in terms of MeanAcc on most of the benchmarks. Likewise, in terms of MeanFeat, the DESFO–k-NN approach has shown promising results in feature selection compared to the basic k-NN on the chosen datasets.

Comparisons of DESFO- RF and RF

In Table 4, a comparison between the DESFO-RF algorithm and the basic RF algorithm is demonstrated. The comparison is based on two performance metrics: the mean accuracy of classification (MeanAcc) and the mean number of chosen features (MeanFeat).

Table 4 Comparison of MeanAcc and MeanFeat for DESFO-RF & the basic RF.

After analyzing Table 4, it is worth mentioning that the DESFO–RF technique led to an increase in MeanAcc on 93% of the benchmarks. The increase was more than 15% on four of them. Moreover, MeanAcc had a score of over 92% on nine out of the total fourteen benchmarks, and 100% MeanAcc was achieved on three of them. DESFO-RF and the basic RF achieved equal accuracy on one benchmark, WineEW. It is worth mentioning that the MeanFeat decreased on 100% of the benchmarks due to implementing the suggested DESFO–RF method. Finally, it was found that the DESFO–RF method outperformed the original RF algorithm in terms of MeanAcc on most of the benchmarks and in terms of MeanFeat. The suggested DESFO–RF approach has shown promising results in feature selection compared to the basic RF on the chosen benchmarks.

DESFO results versus other MH algorithms

To prove the effectiveness of the DESFO variants DESFO-RF and DESFO-K-NN, which rely on the RF and k-NN classifiers, respectively, a comparison was made between DESFO and other meta-heuristic methods, namely DE, SFO, ABC, PSO, BA, GWO, WOA, GOA, HHO, BSA, and HGSO, all conducted under identical conditions. The comparison results were measured in terms of mean fitness value (MeanFit), mean accuracy (MeanAcc), and mean number of features selected (MeanFeat).

Comparisons based on the RF classifier

Table 5 presents the fitness values obtained from the proposed DESFO-RF meta-heuristic optimization algorithm, compared with those of other advanced optimization techniques in addressing the FS issue. Table 5 shows that DESFO-RF performed better than the other methods. In the FS problem, it scored the highest in 8 benchmarks and achieved the same score as the others in 2 benchmarks, giving it the advantage in 10 out of the 14 benchmarks, equivalent to 71% of all the benchmarks. Furthermore, the benchmarks employed in this research vary in size, demonstrating the ability of DESFO-RF to deliver consistent performance across the entire range of benchmarks, regardless of their size. It was observed that DESFO–RF missed out on 4 benchmarks, but its results were very close to those of SFO and ABC when the mean fitness values were compared. This indicates that DESFO–RF achieves better outcomes than its competitors. Except for SFO, none of the competing methods ranked first on any benchmark, which provides further evidence of the effectiveness of the proposed method over the competing techniques.

Table 5 Results comparison of the mean fitness value (MeanFit) based on the RF classifier for DESFO with other MH methods.

Table 6 compares the classification accuracy means of the presented DESFO-RF with other advanced metaheuristic optimization algorithms in tackling the FS issue, as per the empirical findings. It’s worth mentioning that, according to Table 6, the DESFO-RF approach showed better performance than all other methods in terms of accuracy mean across seven benchmarks. Moreover, it delivered equivalent results to other methods across five benchmarks but failed to outperform them in two benchmarks. Overall, the DESFO-RF approach was significantly more effective than the other methods in 12 out of 14 benchmarks, equivalent to 85.7% of all the benchmarks. Also, it's worth noting that the SFO method was ranked second on several benchmarks. It showed a slight improvement of 0.0034% on the Lymphography benchmark and 0.0020% on the M-of-n benchmark while achieving the same score as the top performer on five other benchmarks.

Table 6 Results comparison of the mean accuracy (MeanAcc) based on RF classifier for DESFO with other MH methods.

Table 7 compares the mean number of selected features between the DESFO-RF method and other popular meta-heuristic optimization algorithms commonly used for feature selection (FS) strategy. When Table 7 is analyzed, the observation shows that DESFO-RF and SFO produce similar results regarding the number of selected features, and both outperform the other algorithms. These two techniques won in two benchmarks and tied in three benchmarks, surpassing the other algorithms: DE, ABC, PSO, BA, GWO, WOA, GOA, HHO, BSA, and HGSO. However, it is important to note that this does not necessarily imply a tie in classification accuracy between DESFO and SFO. DESFO has demonstrated superiority over other algorithms. Furthermore, it should be kept in mind that choosing the smallest number of characteristics may negatively impact classification accuracy.

Table 7 Results comparison of the mean number of features selected (MeanFeat) based on the RF classifier for DESFO with other MH methods.

Comparisons based on the K-NN classifier

Table 8 compares the average fitness values between the proposed DESFO-K-NN and other advanced MH optimization algorithms in addressing the FS problem. After examining Table 8, the DESFO-K-NN outperformed all other methods in 9 benchmarks and tied in 2 benchmarks in the FS problem. This indicates that DESFO-K-NN had a significantly better impact on 11 out of 14 benchmarks, accounting for 78.6% of all benchmarks. Additionally, the study employed both large- and small-scale benchmarks, indicating that DESFO-K-NN can deliver consistent performance across the entire range of benchmarks, irrespective of their size. For the remaining benchmarks, it has been noted that DESFO-K-NN produced almost equivalent outcomes to the other techniques in terms of mean fitness values. This highlights the superior results of DESFO-K-NN. Except for SFO in two benchmarks (vote and zoo), none of the competing methods ranked first compared to DESFO-K-NN. Hence, it is evident that DESFO-K-NN is superior to the competing methods. In addition, the results of the comparison between DESFO-K-NN and other metaheuristic optimization algorithms in terms of classification accuracy values for the feature selection strategy are presented in Table 9. The table shows the empirical outcomes of this comparison.

Table 8 Results comparison of the mean fitness value (MeanFit) based on the K-NN classifier for DESFO with other MH methods.
Table 9 Results comparison of the mean accuracy (MeanAcc) based on the K-NN classifier for DESFO with other MH methods.

From Table 9, it is essential to note that DESFO-K-NN outperformed all other methods regarding accuracy mean values across seven benchmarks. In the remaining seven benchmarks, its results were similar to those achieved by the other methods. DESFO-K-NN thus performed at least as well as every other method on all 14 benchmarks, accounting for 100% of the benchmarks, which is a remarkable improvement compared to the other methods. Additionally, in Table 10, a comparison of the mean number of selected features between the DESFO-K-NN method and other established meta-heuristic optimization algorithms is given. This comparison helps us understand the effectiveness of the DESFO-K-NN method in addressing the FS strategy.

Table 10 Results comparison of the mean number of features selected (MeanFeat) based on the K-NN classifier for DESFO with other MH methods.

Based on the results shown in Table 10, it can be inferred that the DESFO-K-NN algorithm has better exploration capabilities compared to other algorithms, as it has the lowest mean selected features number among all the algorithms tested (winning in 5 out of 7 cases and tying in 2 cases). This performance is superior to DE, PSO, GWO, GOA, BSA, and HGSO algorithms. It is worth mentioning that even though SFO selected fewer irrelevant features compared to DESFO-K-NN and other methods on only a few benchmarks (lymphography, vote, and Zoo), and achieved the same performance as DESFO-K-NN on two benchmarks (WineEw and BreastCancer), it did not outperform DESFO-K-NN in terms of mean accuracy. When selecting a minimal number of characteristics for classification, it is important to note that this approach can harm accuracy. The DESFO-K-NN algorithm has been proposed to efficiently identify the pertinent attributes and reduce the feature search area without compromising the classification accuracy. The algorithm achieves optimal results by discarding insignificant search areas and concentrating on the most viable ones.

Analysis and visualization

An analysis of DESFO–RF and DESFO-K-NN, used for handling the FS strategy, has been performed in this section by examining their convergence behavior. To validate their convergence capabilities, the proposed techniques were applied to 14 widely used benchmark datasets, and their performance has been compared against their peers under identical conditions, including the iteration number and population size. Figures 3 and 4 demonstrate the convergence ability of these methods in comparison to their counterparts.

Figure 3

The convergence graphs comparing the suggested DESFO approach with other methods using the RF Classifier.

Figure 4

The convergence graphs comparing the suggested DESFO approach with other methods using the K-NN Classifier.

Based on the results depicted in Fig. 3, the DESFO-RF approach showcases rapid yet effective convergence across eight benchmarks, including PenglungEW, IonosphereEW, SonarEW, WaveformEW, KrVsKpEW, BreastEW, Zoo, and Exactly2. On the other hand, Fig. 4 highlights that the DESFO-K-NN model outperforms the competition on several benchmarks, namely PenglungEW, IonosphereEW, SonarEW, WaveformEW, KrVsKpEW, BreastEW, Lymphography, and Exactly2. It’s worth noting that both of the proposed algorithms (DESFO-RF and DESFO-K-NN) balance exploration and exploitation, ensuring the timely acquisition of the optimal solution.

Figures 5, 6, and 7 show the performance of DESFO and the other methods regarding mean fitness function values with RF and K-NN. The box plots with swarm plots are demonstrated in Figs. 5 and 6, showing the superiority of DESFO over the other algorithms. The plots reveal no outliers with both DESFO-RF and DESFO-K-NN, unlike the DE, PSO, and HGSO algorithms. The swarm plots demonstrate that most values fall within the boxplot's interquartile range (IQR). Figure 7 shows the KDE plots, demonstrating the performance of DESFO and the other algorithms on the 14 UCI benchmarks.

Figure 5

Box and swarm plot of the performance of DESFO-RF and other algorithms in terms of fitness value.

Figure 6

Box and swarm plot of the performance of DESFO-K-NN and other algorithms in terms of fitness value.

Figure 7

KDE plot diagram of DESFO and other Algorithms performance in terms of fitness value.

Figures 8, 9, and 10 show the performance of DESFO and the other methods regarding mean classification accuracy with RF and K-NN. Figures 8 and 9 illustrate the box plots with swarm plots, highlighting the superior performance of DESFO over the other algorithms. A noticeable observation from the plots is that no outliers exist for DESFO-RF and DESFO-K-NN, unlike other algorithms such as DE, PSO, BA, BSA, GOA, and HGSO. The swarm plots indicate that for DESFO with RF and K-NN, most of the values are located in the interquartile range (IQR) and near the maximum of the boxplot. Additionally, Fig. 10 shows KDE plots that depict the performance of DESFO and the other algorithms on the 14 UCI benchmarks.

Figure 8

Box and swarm plot of the performance of DESFO-RF and other algorithms in terms of classification accuracy.

Figure 9

Box and swarm plot of the performance of DESFO-K-NN and other algorithms in terms of classification accuracy.

Figure 10

KDE plot diagram of the performance of DESFO and other algorithms in terms of classification accuracy.

Wilcoxon’s analysis

The statistical significance of the analysis can be observed in Tables 11 and 12, where the Wilcoxon test was conducted as a pair-wise assessment. This test helped to determine if there was a significant difference between the fitness results achieved by the proposed DESFO algorithm and its counterparts74.

Table 11 Wilcoxon’s test for DESFO-RF versus other algorithms.
Table 12 Wilcoxon’s test for DESFO-K-NN versus other algorithms.

The Wilcoxon test is a statistical test often used in hypothesis testing situations. The test involves ranking the differences between the results of two paired algorithms on a set of problems. The calculation of ranks is based on the absolute values of the differences. Next, the positive and negative ranks are summed separately as R+ and R−, and the smaller of the two sums is recorded. If the significance level of the recorded results is less than 5%, the null hypothesis is rejected; otherwise, it is not rejected.

After analyzing the data presented in Tables 11 and 12, it can be concluded that the DESFO-RF and DESFO-k-NN algorithms outperformed all other algorithms in all the tested scenarios. In Tables 11 and 12, the indicated p values are below 5%, implying that the proposed method’s results are statistically significant. This strong evidence against the null hypothesis suggests that the outcomes obtained are not due to chance.

Discussion

According to the results of the empirical analysis, the DESFO algorithm stands out among recent algorithms in terms of its reliability in feature selection for classification tasks. This algorithm makes use of k-NN and RF classifiers. Among all the benchmarks, DESFO-K-NN produced the best results in terms of mean accuracy, followed by DESFO-RF. Additionally, the DESFO optimizer demonstrated a more pronounced exploration and exploitation behavior than its counterparts. On the other hand, the DESFO method exhibits a limitation in that it selects more features than its competitors across various datasets. Specifically, when compared with other methods, DESFO–RF selects a greater number of features in 9 out of the 14 datasets (PenglungEW, IonosphereEW, WaveformEW, KrVsKpEW, Lymphography, Vote, Exactly2, BreastCancer, and Tic-tac-toe), while DESFO–K-NN does so in 7 datasets (SonarEW, WaveformEW, Lymphography, Vote, Zoo, Exactly2, and Tic-tac-toe).

Conclusion and future works

The DESFO algorithm, a combination of the DE and SFO algorithms, has been proposed in this paper to handle FS strategies. The LS strategy has also been incorporated to improve the optimal results after each algorithm iteration. The algorithm has exhibited satisfactory performance and capability with significantly enhanced results. To evaluate the chosen feature subsets, RF and K-NN classifiers were used to calculate the classification accuracy. The DESFO algorithm was tested on several benchmarks with multi-scale attributes and records to assess its effectiveness. The results were compared with binary versions of 11 different meta-heuristic methods. The performance has been evaluated based on various metrics, such as mean fitness rate, mean accuracy rate, and mean number of features selected. The findings indicated that the two algorithms proposed in the study (DESFO–RF and DESFO–K-NN) outperformed their counterparts in managing FS strategies. DESFO–RF was the most effective method across all benchmarks regarding mean accuracy results, followed by DESFO–K-NN.

Additionally, the DESFO optimizer demonstrated greater exploration and exploitation abilities than its counterparts. According to Wilcoxon's test (with a significance level of α = 0.05), it was evident that the DESFO algorithm with RF and k-NN classifiers outperformed the other methods. This algorithm achieved exceptional classification accuracy up to 100% in some benchmarks and also resulted in a reduced feature size.

The DESFO technique has one limitation: it tends to choose more features than its rivals across different datasets. Specifically, in comparison with other methods, DESFO–RF selects more features in 9 out of the 14 datasets, and DESFO–K-NN does so in 7 datasets. Therefore, to improve the proposed algorithm, it would be beneficial to implement a new selection strategy to reduce the number of features selected, particularly for high-dimensional datasets with few instances. This opens up avenues for further research in the future.

Integrating the DESFO algorithm with various other optimization techniques merits exploration for future works. Additionally, the application of different classifiers, such as Artificial Neural Networks (ANNs), Decision Trees (DT), support vector machines (SVM), and others, could further examine DESFO’s capability in feature selection for classification. The adaptation of other transfer functions, such as S-shape functions, could also be explored. Given its feature selection (FS) efficacy, DESFO presents significant potential across various domains, such as healthcare, the Internet of Things (IoT), and intrusion detection systems. Furthermore, employing DESFO in the context of CEC benchmark functions could also be explored.