Introduction

Machine learning algorithms form the backbone of artificial intelligence research, playing a pivotal role in predictive analytics, pattern classification, image recognition, and system forecasting. Over the past two decades, there has been sustained and growing interest in neural network-based approaches for tackling pattern recognition and regression tasks, with applications expanding across a wide spectrum of domains. Foundational work by Knerr et al.1 and LeCun et al.2 introduced various single-layer learning rules as a means of dividing complex tasks into manageable subtasks. The LIRA model, a dynamic neural architecture, is widely used for image identification in automated medical diagnostics and autonomous vehicle systems. Its layered architecture achieves fast training convergence and computational efficiency, making it valuable for real-time traffic flow management and precision agriculture. New activation functions continue to enhance the performance of ML models in both general and domain-specific applications. Kussul and Baidyk3 proposed the Limited Receptive Area (LIRA) classifier, a neural classifier developed specifically for image recognition problems. The LIRA framework is organized into three distinct layers: sensor, associative, and output. The sensor layer feeds into the associative layer through fixed, randomly initialized connections, while the associative layer projects to the output layer via learnable weights. Neural-network approaches are particularly attractive here because they combine powerful learning dynamics, seamless scalability to large datasets, high-fidelity approximation of complex functions, and inherently parallel architectures.

The hybridization of neural networks with machine-learning principles has given rise to a range of hybrid approaches4. In a comparative performance study, Kumar and Bhattacharya5 presented Artificial Neural Network (ANN) and Linear Discriminant Analysis (LDA) models. Using a fully connected backpropagation ANN with three neuron layers, their work showed that ANNs outperform LDA models on both training and test datasets. Moreover, ANNs proved more robust than LDA models when handling missing data. In this respect, Abe et al.6 developed a new approach that used objective indexes to evaluate rules when post-processing mined data. Applications of evolutionary algorithms (EAs) to learning and evolution in ANNs were reviewed by Yao7. The combinations considered in that review include evolving ANN connection weights, topologies, learning rules, and input features using EAs, and the review concluded that such combinations often lead to more successful intelligent systems than those using either ANNs or EAs in isolation. For adaptive control of strict-feedback nonlinear systems, MNNs were developed by Zhang et al.8, and the effectiveness of the approach was verified through simulation experiments. To overcome some limitations of traditional neural networks, Huang et al.9,10,11 and Bai et al.12 proposed the Extreme Learning Machine (ELM) and related methods. ELM offers an intrinsically resistant solution to overfitting and is less affected by outliers, thanks to the random initialization of the input weights combined with the analytical optimization of the output weights. However, as the number of hidden neurons increases, the physical structure of the network grows more complex, which is a serious shortcoming of ELM. Researchers have improved the efficiency and robustness of ELM in many ways. Barreto and Barros13, Sze et al.14, Horata et al.15, Man et al.16, and Zhang and Luo17 developed modifications that make the basic ELM more robust against outliers. Unfortunately, most of those developments rely heavily on hidden-layer neurons, making the physical structure of the network excessively large.

In addressing this challenge, Man et al.16 devised an optimal weight learning machine for handwritten image recognition that strategically employs fewer hidden nodes, thereby expediting the learning process in industrial applications. Das et al.18 also proposed a backward-forward ELM approach for input weight enhancement, using an orthogonal matrix for the ideal input weights and generating half of the weights randomly to reduce the risk of overfitting. The proposed backward-forward ELM algorithm thus outperformed traditional ELM models in both accuracy and computational economy across different types of activation functions. Ensemble learning has grown in popularity because it combines numerous expert classifiers to increase accuracy. Khellal et al.19 tackled object recognition by integrating a convolutional neural network with a stacking ensemble of Extreme Learning Machine (ELM) classifiers. Earlier, Cao et al.20 showed that a voting-based ELM employing a sigmoid activation function outstripped the original ELM’s performance. By amalgamating the complementary strengths of multiple learners, ensemble classifiers generally surpass individual models in accuracy. In line with this, recent investigations have made noteworthy progress in refining ELM architectures through both ensemble strategies and advanced optimization techniques. For example, Lan et al.21 and Mansoori and Sara22 demonstrated competitive accuracy on the Satimage dataset by combining multiple ELMs using traditional activation functions such as sigmoid and RBF. In a broader context, Kiani et al.23 and Palomino-Echeverria24 conducted comprehensive surveys of ELM-based approaches for outlier detection, identifying key developments in robust loss functions, data preprocessing, and ensemble training frameworks. Expanding on these advances, Tang et al.25 introduced a two-stage ensemble ELM architecture optimized via the Sparrow Search Algorithm for software defect prediction, demonstrating how metaheuristic parameter tuning can substantially boost accuracy on real-world datasets. Similarly, Sumathi et al.26,27,28 developed a unified, hybrid metaheuristic-optimized intrusion-detection framework that combines Harris Hawks Optimization and Particle Swarm Optimization (PSO) with Grey Wolf Optimization (GWO) to select and tune features and parameters across Backpropagation, Multilayer Perceptron, Self-Organizing Map, and SVM classifiers; validated on the NSL-KDD and UNSW-NB15 datasets, it achieved superior distributed denial-of-service detection accuracy, high F1 scores, and minimal false-alarm rates.

Despite these advancements, no prior work has focused on developing activation functions for ELMs based on M-estimation theory. Our study is the first to explore the integration of redescending M-estimator-based ψ-functions as activation functions within an ELM ensemble. These activation functions preserve the key mathematical characteristics of conventional activations while offering enhanced resilience to noise and outliers, resulting in a more adaptable and noise-tolerant learning framework for classification tasks. Building on the work of Khan et al.29, we extend the concept by exploring more robust activation functions based on M-estimation. The ψ-function, chosen for its flexible and distinctive non-linear characteristics, is used to produce a richer learned feature space for the final classification. As summarized in Tables 1 and 2, a core objective of this study is to formally introduce and systematize the integration of redescending M-estimation \(\psi\)-functions as activation mechanisms within the Extreme Learning Machine framework. These \(\psi\)-functions, owing to their inherent robustness and flexibility, are introduced not only as alternatives to traditional activation functions but as a means to enhance learning stability and generalization in the presence of noise and outliers. Moreover, to fully exploit these activation functions, we propose a resilient ensemble classification framework in which multiple base ELMs, each utilizing a different \(\psi\)-based activation, are judiciously combined via a least-squares fusion scheme. This architecture preserves model diversity while mitigating instability, thereby enhancing both the accuracy and robustness of the composite classifier. This work therefore establishes a novel theoretical and computational foundation for integrating robust statistical principles into neural architectures for complex classification tasks.

Novelty and significance of the study

This paper presents a robust and efficient ensemble Extreme Learning Machine-based learning framework that offers a new paradigm for tackling some long-standing problems of ML, such as data contamination, instability due to random weight initialization, and inconsistent classifier performance. The present work is further strengthened by newly adopted activation functions inspired by M-estimation and re-descending M-estimation techniques. These functions improve robustness and capability in unstructured, high-uncertainty, data-laden environments, yielding substantial accuracy gains over baseline models within different stacking-generalization frameworks and producing stable, state-of-the-art ensembles from an architecture that mitigates the shortcomings of single, classical, and previously superior ELM architectures.

The significance of this study is not confined to algorithmic contributions alone but finds real-world practicality in domains where accuracy and reliability are critical. For instance:

  • Cybersecurity: It enhances the anomaly detection and incident response in Software-Defined Networking (SDN), where there is a growing need for timely and precise detection of security threats.

  • Healthcare: The ensemble demonstrates excellent performance on diagnostic tasks such as lung nodule detection and automated analysis of medical imaging data.

  • Financial Systems: In fraud detection, the framework’s strong classification capabilities enable real-time identification of rogue transactions, supporting financial stability.

  • Autonomous Systems: Applications to real-time decision-making in autonomous driving and precision agriculture demonstrate its adaptability and efficiency in high-stakes scenarios.

This research bridges the gap between theoretical innovation and practical applicability to overcome some of the critical challenges traditional ML algorithms face. Unlike standard ELM frameworks, which suffer from instability and sensitivity to initialization, the proposed architecture ensures consistent system performance. Extensive validation against state-of-the-art methods shows its superiority in classification accuracy with lower variance and its adaptability to different datasets and application domains. The study further contributes to the growing area of ethics and trust in AI through an algorithmic framework that is transparent, interpretable, and scalable. This aligns with the current demand for machine learning systems that not only perform well but can also be trusted for their reliability and fairness. By combining theoretical advancements with practical solutions to pressing, real-world challenges, this work sets a new bar for ensemble learning methodologies and is positioned to make a substantial impact on both academic research and industrial practice.

Enhancing classification accuracy

We present an effective ensemble of Extreme Learning Machines (ELMs) designed to extract a variety of significant information from data in order to increase accuracy and dependability. The ensemble exploits diversity by using different initial weights drawn from a predetermined distribution and by employing novel activation functions. The outputs of the base classifiers are combined via the least-squares technique to determine the final prediction. The Methodology section contains comprehensive details of the proposed ensemble. Before outlining the methodology, we provide a brief summary of the state-of-the-art models currently in use, such as ELM, BFELM, and several ELM ensemble techniques from the literature. The purpose of this discussion is to set the context and highlight the progress made with the proposed methodology.

The extreme learning machine (ELM) algorithm

Algorithm

Input: the dataset X with corresponding target values T, the number of hidden nodes N, and the activation function g(·).

Output: ELM parameters (input weights, biases, and output weights).

Steps:

1. Randomly initialize the input weights (W) and biases (b).

2. Compute the hidden layer feature matrix \(H=g\left(XW+b\right)\), where g(·) is the activation function.

3. Compute the output weights \(\beta\) by minimizing the error between the predicted and actual target values. The analytical solution, via the Moore-Penrose pseudo-inverse, is \(\hat{\beta}={\left({H}^{T}H\right)}^{-1}{H}^{T}T={H}^{\dagger}T\).

4. During the testing step, evaluate the model on new data using the learned ELM parameters (input weights W, biases b, and output weights \(\beta\)); a minimal implementation sketch follows.
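For illustration, the following minimal NumPy sketch implements steps 1–4 above; the function names, default activation, and omission of regularization are our assumptions rather than details specified in this work.

```python
import numpy as np

def train_elm(X, T, n_hidden, activation=np.tanh, seed=None):
    """Train a basic ELM: fixed random hidden layer, analytic output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # step 1: random input weights
    b = rng.standard_normal(n_hidden)                 # step 1: random biases
    H = activation(X @ W + b)                         # step 2: hidden feature matrix H
    beta = np.linalg.pinv(H) @ T                      # step 3: Moore-Penrose solution of H beta = T
    return W, b, beta

def predict_elm(X, W, b, beta, activation=np.tanh):
    """Step 4: apply the fixed hidden layer and learned output weights to new data."""
    return activation(X @ W + b) @ beta
```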

Voting based ELM20

Misclassification rates often surge when outputs fall near decision boundaries. To address this and secure accurate labeling of previously unseen instances, Cao et al.20 proposed an ensemble of Extreme Learning Machines that employs a voting-based fusion strategy. This method increases prediction reliability by using trained classifiers to collectively determine the class of an unknown item. Diversity was an important factor in the ensemble model’s construction. However, in Cao et al.20, every base classifier employed an identical activation function, a fixed number of hidden neurons, and initial weights sampled from the same continuous distribution. Additionally, to further minimize misclassification risk, the ensemble parameter k was set in advance.

The process for the Voting-based ELM is as follows:

1. Initialize base classifiers: randomly generate input weights and biases for each base classifier, using the same activation function for all.

2. Train classifiers: each base classifier is trained on the supplied dataset using a varying number of hidden neurons. Within the voting framework, each trained model then casts a class prediction for every unlabeled sample, and the final class is determined by a majority vote among the ensemble’s classifiers. Pre-fixing k (the number of distinct classifiers) reduces variability and potential misclassification around decision boundaries.

This ensemble voting method shows that it is possible to aggregate predictions from multiple models to improve overall classification accuracy, especially for data at challenging decision boundaries.

Extreme learning machine (ELM) with a voting-based input scheme

Given: a training dataset \(\left\{\left({x}_{i},{t}_{i}\right)\mid {x}_{i}\in {\mathbb{R}}^{p},\ {t}_{i}\in {\mathbb{R}}^{c},\ i=1,2,\dots,N\right\}\), a specified number of hidden nodes, and an activation function \(G(\cdot)\). Let \(M\) denote the number of independent classifiers, and initialize the vote-count vector \(S\in {\mathbb{R}}^{c}\) to zero.

Training Phase:

1. Set \(m=1\).

2. While \(m\le M\) do:

  • (a) Randomly generate the weights and biases (\(\:{w}_{i}^{m},{b}^{m}\)).

  • (b) Compute the output weights \({\beta}^{m}\) using the formula:

$${\beta}^{m}={\left({H}^{T}H\right)}^{-1}{H}^{T}T$$
  • (c) Increment \(m\) by 1.

3. End While.

Testing Phase:

For any test sample \(\:x^{test},\) perform the following:

For each classifier (i.e., while \(m\le M\)):

  • (a) Employ the parameters \(\:({w}_{i}^{m},{b}^{m},{\beta\:}^{m})\) to predict the label for xᵗᵉˢᵗ.

  • (b) Update the vote count for the class predicted by classifier \(m\): \(S\left(i\right)=S\left(i\right)+1\), where \(i\) is the class index predicted for \({x}^{test}\).

  • (c) Increment \(m\) by 1.

3. Determine the final predicted class \(Ac{c}^{test}\) by selecting the index corresponding to the maximum value of the vote-count vector \(S\) for \({x}^{test}\) (see the sketch following this pseudocode), i.e.,

  • \(Ac{c}^{test}=\underset{i}{argmax}\ S\left(i\right)\)
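A minimal sketch of the voting rule above, assuming the M base ELMs were trained independently (e.g., with the `train_elm` helper from the earlier sketch) on one-hot targets; the function name is illustrative.

```python
import numpy as np

def predict_voting_elm(x_test, classifiers, activation=np.tanh):
    """Majority vote over M independently trained base ELMs.

    `classifiers` is a list of (W, b, beta) tuples; each beta maps the hidden
    layer to c class scores, so argmax gives that classifier's predicted class.
    """
    n_classes = classifiers[0][2].shape[1]
    S = np.zeros(n_classes)                          # vote-count vector S for x_test
    for W, b, beta in classifiers:
        scores = activation(x_test @ W + b) @ beta   # class scores of one base ELM
        S[np.argmax(scores)] += 1                    # this classifier casts one vote
    return int(np.argmax(S))                         # Acc_test: class with the most votes
```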

ELM based ensemble for classification

Khellal et al.19 introduced an ensemble of extreme learning machines, with each base classifier trained using a sigmoid activation function, to capture the nonlinear patterns within data for a classification problem (refer to Fig. 1). They adopted the ordinary least squares principle to optimize the contribution of each base classifier. Their algorithm’s pseudo-code is as follows:

Algorithm: training procedure for the ELM-based ensemble for classification

Input: \(\left\{X,T,M,N\right\}\), where \(X\) is the dataset, \(T\) the target matrix, \(M\) the number of individual models, and \(N\) the number of hidden nodes.

Output: parameters of the ELM-based ensemble.

Procedure:

1. For each model \(m=1\) to \(M\):

(a) Randomly generate the input weights \({W}^{m}\) and biases \({b}^{m}\).

(b) Compute the hidden layer matrix:

$${H}^{m}=G\left(X{W}^{m}+{b}^{m}\right)$$

(c) Determine the output weights using the pseudoinverse of \({H}^{m}\):

$$\:{\beta\:}^{m}={\left({H}^{m}\right)}^{\dagger}T$$

(d) Calculate the model output:

$$\:{O}^{m}={H}^{m}{\beta\:}^{m}$$
2. Form the global hidden matrix by concatenating the outputs of all individual models:

$${H}_{g}=\left[\,{O}^{\left(1\right)}\ \ {O}^{\left(2\right)}\ \cdots\ {O}^{\left(M\right)}\,\right]$$
3. Compute the fusion parameters by applying the pseudoinverse of \({H}_{g}\):

$$\:F={\left({H}_{g}\right)}^{\dagger}T$$

4. Return the ensemble parameters: the complete set comprises \(\{{W}^{\left(m\right)},{b}^{\left(m\right)},{\beta}^{\left(m\right)}\}\) for m = 1, 2, …, M, along with the fusion parameters F. Here, \({O}^{\left(m\right)}\) denotes the output of the \(m\)-th model, and \({H}_{g}\) represents the global hidden matrix. The uniqueness of the ELM-based ensemble is inherently determined by these parameters.
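The following NumPy sketch mirrors the training procedure above: a random hidden layer and pseudoinverse output weights per base model, followed by a pseudoinverse fusion layer over the concatenated model outputs. The shared activation and the function names are illustrative assumptions.

```python
import numpy as np

def train_elm_ensemble(X, T, n_hidden, M, activation, seed=None):
    """Train M base ELMs and a least-squares fusion layer (steps 1-4 above)."""
    rng = np.random.default_rng(seed)
    models, outputs = [], []
    for _ in range(M):
        W = rng.standard_normal((X.shape[1], n_hidden))   # (a) random input weights
        b = rng.standard_normal(n_hidden)                 # (a) random biases
        H = activation(X @ W + b)                         # (b) hidden layer matrix H^m
        beta = np.linalg.pinv(H) @ T                      # (c) output weights beta^m
        models.append((W, b, beta))
        outputs.append(H @ beta)                          # (d) model output O^(m)
    Hg = np.hstack(outputs)                               # global hidden matrix H_g
    F = np.linalg.pinv(Hg) @ T                            # fusion parameters F
    return models, F

def predict_elm_ensemble(X, models, F, activation):
    """Fuse the base-model outputs with the learned fusion parameters F."""
    Og = np.hstack([activation(X @ W + b) @ beta for W, b, beta in models])
    return Og @ F                                         # fused class scores
```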

Fig. 1. ELM-based ensemble topological structure19.

Methodology

To address the instability in Extreme Learning Machines (ELMs) caused by random weight initialization and to improve their learning capabilities, we offer a modified approach: the Efficient Ensemble of Extreme Learning Machines (EEoELM). This technique substantially enhances generalization by combining many ELMs, each with novel activation functions. Unlike conventional aggregation methods such as majority voting, our approach optimizes the weights of all classifiers through an additional ensemble layer, whose weights are derived in a manner analogous to the output-layer weights of a standard ELM. Figure 1 depicts the general scheme of the proposed EEoELM. The main novelty in designing the current framework is the addition of one more hyper-parameter, the number of classifiers (M), as reported in the works of19,22. Khellal et al.19 developed an efficient ensemble learning strategy that aggregated simple classifiers fed with identical activation functions, with the weights between the input and hidden layers assigned randomly. This framework was further developed in22 by including different types of activation functions while leveraging the Moore-Penrose inverse technique to optimize the weights of the ensemble layer. The proposed EEoELM uses Pi neurons in the hidden layer, with either identical or distinct activation functions. Using different types of activation functions allows the ensemble to approximate the class decision boundaries more precisely. Because of the randomness in weight initialization and the different activation functions across classifiers, different ELMs may generate different feature mappings; this diversity enhances the reliability of the base classifiers and thereby improves the overall performance of the ensemble. In this paper, we introduce an approach founded on ensembles of extreme learning machines comprising diverse classifiers, each employing distinct randomly initialized weights and novel activation functions inspired by M-estimation and re-descending M-estimation theories (refer to Tables 1 and 2). The proposed EEoELM method augments the robustness and dependability of ensemble learning in extreme learning machines by leveraging the variance inherent among the base classifiers. One major drawback of traditional ELM and its derivatives is that they utilize random initial weights drawn from a single continuous probability distribution. The proposed EEoELM relaxes this restriction by training base classifiers with different activation functions and initial weights drawn from both normal and uniform distributions with different parameter values. In addition, the ensemble classifier performs an analytical optimization of its parameters using different stacking or bagging approaches, which generally yields considerably better robustness and performance. For the complete methodology, refer to Fig. 3.
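As a purely illustrative sketch of the diversity mechanism described above, the configuration below pairs each base ELM with its own activation function and weight-initialization distribution; the specific pairings, distribution parameters, and names are assumptions for illustration, not values prescribed by this study.

```python
import numpy as np

# Hypothetical per-classifier configuration: psi-based and classical activations
# mixed with normal/uniform weight initializations of different scales.
base_configs = [
    {"activation": "welsch_psi",   "init": ("normal",  0.0, 1.0)},
    {"activation": "bisquare_psi", "init": ("uniform", -1.0, 1.0)},
    {"activation": "cauchy_psi",   "init": ("normal",  0.0, 0.5)},
    {"activation": "sigmoid",      "init": ("uniform", -0.5, 0.5)},
    {"activation": "tanh",         "init": ("normal",  0.0, 1.0)},
    {"activation": "sine",         "init": ("uniform", -1.0, 1.0)},
]

def draw_weights(shape, init, rng):
    """Sample input weights from the classifier's own initialization distribution."""
    kind, a, b = init
    if kind == "normal":
        return rng.normal(loc=a, scale=b, size=shape)
    return rng.uniform(low=a, high=b, size=shape)
```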

Table 1 Prevalent activation functions characterized by both continuity and differentiability.
Table 2 The objective function \(\rho(\cdot)\), influence function \(\psi(\cdot)\), and corresponding weight function \(w(\cdot)\) for M-estimators and re-descending M-estimators.

Motivation for using M and re-descending M-based psi functions as activation functions

In practice, activation functions introduce non-linearity and increase accuracy through gradient computation in neural networks; the traditionally desirable properties of an activation function are being zero-centered, differentiable, non-linear, and conducive to early convergence. The tangent sigmoid possesses the zero-centered property, while the sigmoid function does not. Keeping these basic properties in mind while surveying traditional activation functions, we were motivated to examine the mathematical and graphical behavior of the ψ-functions of M and re-descending M estimation, which are zero-centered, differentiable, and non-linear, and, notably, unsaturated in nature; for details, see Fig. 2. Table 1 lists traditional activation functions such as sigmoid, tanh, and sine that are commonly used in single-layer feedforward networks such as Extreme Learning Machines (ELMs) to introduce non-linearity into the hidden layer. However, as Fig. 2 shows, these functions exhibit monotonic saturation behavior and S-shaped or bounded oscillatory transformations, which inherently impose non-local generalization.

This leads to several challenges, particularly in ELM where the hidden-layer parameters are randomly initialized and fixed. First, standard activations respond non-discriminatively to extreme values, so outliers can dominate the feature space and distort the learning process. Second, the global nonlinearity of sigmoid/tanh functions limits the model’s ability to represent complex, localized data structures: for large input magnitudes, derivatives vanish (gradient saturation), reducing the expressiveness of neurons in capturing meaningful variation. Redescending ψ-functions derived from robust M-estimators (Tukey’s biweight, Welsch, Cauchy), listed in Table 2, address these limitations by providing bounded and adaptive non-linear transformations with local sensitivity and global suppression. Figure 2 clearly distinguishes the proposed activations from the existing ones: the proposed activation functions are zero-centered, S-type functions emerge as their special case, and their gradients do not diminish in the regions where traditional activations saturate. Their main advantages are as follows. First, redescending ψ-functions down-weight extreme values by design, ensuring that inputs mapped to extreme regions due to random initialization do not distort the feature space. Second, these functions emphasize central data trends and suppress outliers, enabling better separation of classes, more compact and informative hidden-layer representations, and increased local adaptability in complex data environments. Third, by nullifying the contribution of large residuals, redescending activations reduce model complexity, prevent overfitting to noise or rare data patterns, and improve stability across different training subsets. Furthermore, we leverage the flexible nature of the proposed activation functions in our proposed ELM ensembles: each base classifier’s diversity depends on the hidden-layer mapping, and ψ-based activations (see Fig. 3) induce more varied yet meaningful feature transformations, improving ensemble fusion performance when the fusion parameters are estimated optimally.
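For concreteness, the ψ-functions of the Welsch, Tukey biweight (bisquare), and Cauchy M-estimators can be applied element-wise as hidden-layer activations, as sketched below; the tuning constants shown are the conventional robust-statistics defaults and are assumptions rather than values fixed by this study.

```python
import numpy as np

def welsch_psi(u, c=2.9846):
    """Welsch psi: smooth redescender that exponentially down-weights large inputs."""
    return u * np.exp(-(u / c) ** 2)

def bisquare_psi(u, c=4.685):
    """Tukey's biweight psi: exactly zero beyond |u| > c, nullifying extreme inputs."""
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

def cauchy_psi(u, c=2.3849):
    """Cauchy psi: bounded influence that decays softly for large |u|."""
    return u / (1 + (u / c) ** 2)
```

Each of these functions is zero-centered, differentiable, and approximately linear near the origin while suppressing large inputs, which is precisely the behavior exploited above.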

Fig. 2. Mathematical behavior of traditional and proposed activation functions.

Fig. 3. Flowchart of the study.

Applications

For benchmarking, five datasets were chosen from the UCI and Kaggle repositories, with thorough descriptions provided in Table 4. Data preparation was used to handle missing values and outliers where relevant. Furthermore, categorical variables were encoded using the one-hot encoding technique, with the presence of a given category represented as 1 and all others as 0. The datasets were divided into training and testing sets, with 70% used for training and 30% for testing. Weights for each classifier were randomly assigned from a standard normal distribution. The output parameters were then analytically optimized using the square loss function under specified hyperparameters, covering the proposed activation functions as well as other existing activation functions. Different activation functions, initial weights, and hidden neuron counts were used across the six base classifiers during training. To ensure accuracy and consistency, all base classifier outputs were combined using the ordinary least squares method. Classification accuracy was then assessed for the proposed ensemble and other state-of-the-art methods by testing the trained ensemble classifiers on the reserved test data.

The entire process was repeated 50 times to ensure robustness, and the mean % accuracy and standard deviation (SD) were reported. Table 5 shows the results of the proposed ensemble using the psi-based activation functions. For comparison, the mean classification accuracy and SD of various state-of-the-art ensembles from the literature are also discussed, highlighting the superior performance and dependability of the proposed ensemble for classification tasks.

Experiments on real world data

To evaluate the relative efficiency of the proposed approach in terms of both accuracy and precision, we employed two widely recognized high-dimensional numerical datasets: SatImage and Emails, obtained from the UCI Machine Learning Repository and Kaggle, respectively. The proposed ensemble model comprises six base ELM classifiers. Each classifier is initialized with distinct weight vectors drawn from a standard normal distribution, while maintaining a consistent number of hidden nodes and employing the same activation function across the ensemble. In the proposed ensemble framework, we incorporate newly designed activation functions, as detailed in Table 2, and evaluate their performance against established state-of-the-art ensemble methods reported in the literature. Following data preprocessing, each experimental setup is executed over 50 independent runs, and the average performance is assessed based on accuracy and precision metrics. The key statistical characteristics of the datasets utilized are summarized in Table 4.

Grid search optimization of hidden nodes

The number of hidden nodes h is critical to ELM performance. For each dataset and activation function, we perform grid search over a plausible range:

• Satimage: 500–1000.

• Emails Spamdexing: 300–600.

• Breast Cancer: 100–400.

• Musk: 100–200.

• Iris: 20–100.

The optimal number of hidden nodes is selected based on the minimum Brier score on the test set, balancing calibration and accuracy (refer to Fig. 4).

The same optimal number of hidden nodes is then used across all base classifiers in the ensemble to maintain consistency and reduce hyperparameter variability; a sketch of this selection step follows.
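A minimal sketch of this selection step, reusing the `train_elm`/`predict_elm` helpers from the earlier ELM sketch; because the conversion of raw ELM scores into probabilities for the Brier score is not specified here, the normalization below is an assumption.

```python
import numpy as np

def brier_score(probs, T_onehot):
    """Mean squared difference between predicted class probabilities and one-hot targets."""
    return float(np.mean(np.sum((probs - T_onehot) ** 2, axis=1)))

def grid_search_hidden_nodes(X_tr, T_tr, X_te, T_te, grid, activation, seed=None):
    """Select the hidden-layer size with the lowest Brier score on the test set."""
    best_h, best_bs = None, np.inf
    for h in grid:                                   # e.g. range(500, 1001, 100) for Satimage
        W, b, beta = train_elm(X_tr, T_tr, h, activation, seed)   # from the earlier ELM sketch
        scores = predict_elm(X_te, W, b, beta, activation)
        probs = np.clip(scores, 1e-12, None)
        probs = probs / probs.sum(axis=1, keepdims=True)          # assumed score-to-probability step
        bs = brier_score(probs, T_te)
        if bs < best_bs:
            best_h, best_bs = h, bs
    return best_h, best_bs
```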

Fig. 4. Grid search optimizers for hidden nodes.

Performance metrics

We evaluate models using:

• Test Accuracy (%).

• Standard Deviation: Across 50 replications per method.

$$\%SD=\sqrt{\frac{1}{J}\sum_{j=1}^{J}{\left(Accurac{y}_{j}-\overline{Accuracy}\right)}^{2}},$$

• Brier Score: Measures probabilistic calibration.

• Kruskal-Wallis Tests: To assess statistical significance of accuracy differences across methods.

• Computational Cost (Relative Time Complexity): Estimated based on number of ELMs and matrix inversion operations.
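A short sketch of how the %SD and the Kruskal-Wallis test can be computed, assuming a vector of 50 replicate accuracies per method; SciPy's `kruskal` provides the Kruskal-Wallis H-test, and the variable names are placeholders.

```python
import numpy as np
from scipy.stats import kruskal

def percent_sd(accuracies):
    """Standard deviation of the J replicate accuracies, matching the %SD formula above."""
    acc = np.asarray(accuracies, dtype=float)
    return float(np.sqrt(np.mean((acc - acc.mean()) ** 2)))

# Kruskal-Wallis test across methods: each argument is one method's vector of
# 50 replicate accuracies (placeholder names).
# h_stat, p_value = kruskal(acc_proposed, acc_sigmoid_elm, acc_voting_elm)
```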

Experimental protocol

1. Preprocessing: Normalize inputs to [0,1].

2. Label Encoding: One-hot encoding for multi-class targets.

3. Repeated Simulation: 50 runs per configuration to account for ELM randomness.

4. Ensemble Fusion: Use least-squares to combine outputs from base ELMs.

5. Grid Search: Tune the hidden layer size for a single classifier using the Brier score, then apply the selected size to the ensemble ELM.

6. Statistical Testing: Perform the Kruskal-Wallis test across activation functions and display the accuracy distributions of all methods using boxplots. A compact driver sketch for the repeated-simulation and fusion steps follows.
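A compact driver for the repeated-simulation and fusion steps, reusing the ensemble and metric helpers sketched earlier; dataset loading, preprocessing, and one-hot encoding are assumed to have been done already, and all names are illustrative.

```python
import numpy as np

def run_protocol(X_tr, T_tr, X_te, T_te, n_hidden, activation, M=6, n_runs=50):
    """Repeat ensemble training/testing n_runs times to average out ELM randomness (step 3)."""
    accuracies = []
    for seed in range(n_runs):
        models, F = train_elm_ensemble(X_tr, T_tr, n_hidden, M, activation, seed)  # step 4: LS fusion
        scores = predict_elm_ensemble(X_te, models, F, activation)
        acc = np.mean(np.argmax(scores, axis=1) == np.argmax(T_te, axis=1))        # one-hot targets
        accuracies.append(acc)
    return float(np.mean(accuracies)), percent_sd(accuracies)                      # mean accuracy, %SD
```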

Trade-off analysis

The trade-off between accuracy and computational complexity is often resolved differently depending on the problem under study: mathematically complex functions are adopted to build heavier models that achieve stability and high accuracy in sensitive prediction settings, where even a small error can make sound decisions impossible. We analyze the computational cost of base ELMs with lightweight traditional activation functions, which offer fast training, and the effect of the proposed activation functions on the base classifiers; ensemble methods built on the more complex robust functions to further improve stability increase the time cost (1.2x-1.4x, see Table 3) but significantly improve performance on complex datasets such as Satimage and Emails. For comparison, simpler datasets are also considered, where even single ELMs are often sufficient, although the ensemble methods still benefit from greater stability.

Table 3 Test accuracy and relative computational cost.
Table 4 Descriptive statistics of the datasets employed to evaluate classifier performance.

Results

The following tables and figures present the numerical comparisons between the proposed and existing methods.

Table 5 Classification accuracy of the proposed ELM-based ensemble methods under a bagging framework.
Fig. 5. Mean classification accuracy and standard deviation of the proposed and existing ELM ensembles.

Fig. 6. Boxplot with Kruskal-Wallis test of significance.

Table 6 Empirical accuracy vs. Complexity Trade-off.
Fig. 7. Empirical accuracy vs. complexity trade-off.

Table 7 Post-hoc analysis summary across datasets.
Fig. 8. Comparison with 5 state-of-the-art models.

Discussion and conclusion

To evaluate the effectiveness of our proposed ensemble framework of Extreme Learning Machines (ELMs), we conducted comprehensive experiments across five benchmark datasets: Satimage, Email Spamdexing, Breast Cancer, Musk, and Iris. Our approach integrates both traditional activation functions and novel ones inspired by robust M-estimation theory, such as Welsch, Bisquare, Lmstf, and Cauchy, aiming to improve classification accuracy, robustness, and generalization (see Figs. 5 and 6). We compared our results against several established ensemble ELM methods. For instance, Lan et al. (2009) implemented an ensemble of 25 online sequential ELMs on the Satimage dataset using sigmoid and radial basis function (RBF) activations. Their models achieved testing accuracies of 88.01% (SD: 0.0029) for sigmoid and 88.28% (SD: 0.0037) for RBF, using 400 hidden nodes per ELM. In contrast, Mansoori and Sara22 utilized 97 ELMs with only 7 hidden nodes each, obtaining an accuracy of 87.85% (SD: 0.26) via weighted averaging. Our proposed ensemble model clearly outperformed these benchmarks in both accuracy and stability, as summarized in Table 5. Among all tested activation functions, the Psi-based estimator consistently delivered the best performance. Our results also surpassed other competitive models, such as ReliefF Random Forest (Zhang et al., 2023), Lexicase Selection (Pleshkova and Stanovov30), and the method by Ren et al. (2022). The proposed method also outperformed other state-of-the-art methods such as ANN, SVM, logistic regression, KNN, the Mansoori & Sara ensemble, Naive Bayes, Decision Trees, Random Forest (RF), Lan et al. RBF, Lan et al. sigmoid, and GBM (refer to Fig. 8).

Satimage & Emails (High-Dimensional Datasets): These datasets posed greater challenges due to their complexity and high dimensionality. Our ensemble achieved superior mean accuracy and reduced standard deviation compared to traditional single and ensemble ELMs. On the Emails dataset, functions like Logistic, Welsch, and Cauchy exceeded 95% accuracy.

Musk (Noisy and Imbalanced): The ensemble model demonstrated its robustness by performing well on this challenging dataset. The use of re-descending M-estimators proved effective in handling noise and imbalance.

Breast Cancer & Iris (Simpler Datasets): While individual robust models performed competitively on these datasets, the ensemble versions improved overall stability and generalization. Table 6 and Fig. 7 summarize the accuracy and time complexity, where the proposed methods outperformed the alternatives with only a small compromise on time complexity.

Our ensemble construction involves training base ELMs with different activation functions and optimizing hidden nodes through Brier score-based grid search. This ensures each model is well-calibrated for its respective dataset. Rather than relying on majority voting, we applied least squares estimation to determine fusion weights, offering a more refined combination of predictions. This weighted strategy consistently outperformed standard voting ensembles. While our ensemble incurs higher computational costs, as shown in Table 6 and Fig. 7, this is a typical trade-off in ensemble learning. The gain in accuracy and robustness, particularly for high-stakes or noisy applications, justifies the added complexity. Table 7 reports Kruskal-Wallis and Dunn tests across all datasets; significant differences among models were detected (p < 0.00001), confirming the need for deeper post-hoc analysis. Dunn’s test revealed that proposed models such as Proposed 3, 4, and 10 significantly outperformed traditional activation functions (e.g., Sigmoid, Sine, Tan-Sig, and RAF) on multiple datasets, especially Satimage and Breast Cancer. However, in some pairwise comparisons, differences, though favorable to our models, were not statistically significant after correction. For example:

• On Satimage, Proposed 4 vs. Proposed 6 showed modest Z-scores.

• On Emails, Proposed 10 outperformed Proposed 4 and 3 numerically but lacked statistical support.

• On Breast Cancer, Proposed 2 performed better than baseline models in mean accuracy but not significantly so.

These findings indicate that while statistical significance is essential, practical performance indicators, such as mean accuracy, standard deviation, and ranking, also hold critical value, especially under varying data distributions and noise conditions. Models that consistently perform well, even when p-values are borderline, are still valuable and may generalize better in real-world scenarios. This study introduces a robust and efficient ensemble learning framework based on ELMs, leveraging both conventional and newly proposed activation functions grounded in M-estimation theory. The ensemble is enhanced through Brier score-guided grid search for hidden node optimization and least squares-based fusion weighting. Across all evaluated datasets, our approach consistently delivered high accuracy and robustness, outperforming or matching existing state-of-the-art techniques. For future work, we plan to study the robustness and efficiency of the proposed algorithm with novel activation functions in intrusion detection, extending the idea so that the activations not only preserve the core properties of traditional activation functions but also offer enhanced robustness and adaptability. Unlike prior studies that focused on deep learning31, SVMs with hybrid optimization28, or hybrid neuro-optimizers (Sumathi & Rajesh, 2024), our approach aims to strengthen the hidden layer’s learning capacity through statistically grounded, flexible activation functions for improved classification in noisy environments. Overall, the results validate the potential of robust activation functions in enhancing ELM ensembles and open new directions for scalable, resilient classification frameworks in machine learning.