Introduction

Machine learning algorithms form the backbone of artificial intelligence research, playing a pivotal role in predictive analytics, pattern classification, image recognition, and system forecasting. Over the past two decades, there has been sustained and growing interest in neural network-based approaches for tackling pattern recognition and regression tasks, with applications expanding across a wide spectrum of domains. Foundational work by Knerr et al.1 and LeCun et al.2 introduced various single-layer learning rules as a means of dividing complex tasks into manageable subtasks. The LIRA model, a dynamic neural architecture, is widely used for image identification in automated medical diagnostics and autonomous vehicle systems. Its layered architecture achieves fast training convergence and computational efficiency, making it valuable for real-time traffic flow management and precision agriculture. New activation functions continue to enhance the performance of ML models in both general and domain-specific applications. Kussul and Baidyk3 proposed the Limited Receptive Area (LIRA) classifier, a neural classifier developed specifically for image recognition problems. The LIRA framework is organized into three distinct layers: sensor, associative, and output. The sensor layer feeds into the associative layer through fixed, randomly initialized connections, while the associative layer projects to the output layer via learnable weights. Neural-network approaches are particularly attractive here because they combine powerful learning dynamics, seamless scalability to large datasets, high-fidelity approximation of complex functions, and inherently parallel architectures.

The hybridization of neural networks with machine-learning principles has given rise to a range of hybrid approaches4. In a comparative performance study, Kumar and Bhattacharya5 presented Artificial Neural Network (ANN) and Linear Discriminant Analysis (LDA) models. Using a fully connected backpropagation ANN with three neuron layers, their work showed that ANNs outperform LDA models on both training and test datasets. Moreover, ANNs proved more robust than LDA models when handling missing data. In this respect, Abe et al.6 developed a new approach that used objective indexes to evaluate rules when post-processing mined data. Applications of evolutionary algorithms (EAs) to learning and evolution in ANNs were reviewed by Yao7. The combinations considered in that review include evolving ANN connection weights, topologies, learning rules, and input features using EAs, and the review concluded that such combinations often lead to more successful intelligent systems than those using either ANNs or EAs in isolation. For adaptive control of strict-feedback nonlinear systems, MNNs were developed by Zhang et al.8, and the effectiveness of the approach was verified through simulation experiments. To overcome some limitations of traditional neural networks, Huang et al.9,10,11 and Bai et al.12 proposed the Extreme Learning Machine (ELM) and related methods. ELM offers an intrinsically resistant solution to overfitting and is less affected by outliers, thanks to the random initialization of the input weights combined with the analytical optimization of the output weights. However, as the number of hidden neurons increases, the physical structure of the network grows more complex, which is a serious shortcoming of ELM. Researchers have improved the efficiency and robustness of ELM in many ways. Barreto and Barros13, Sze et al.14, Horata et al.15, Man et al.16, and Zhang and Luo17 developed modifications that make the basic ELM more robust against outliers. Unfortunately, most of those developments rely heavily on hidden-layer neurons, making the physical structure of the network excessively large.

In addressing this challenge, Man et al.16 devised an optimal weight learning machine for handwritten image recognition that strategically employs fewer hidden nodes, thereby expediting the learning process in industrial applications. Das et al.18 also proposed a backward-forward ELM approach for input weight enhancement, using an orthogonal matrix for the ideal input weights and generating half of the weights randomly to reduce the risk of overfitting. The proposed backward-forward ELM algorithm thus outperformed traditional ELM models in both accuracy and computational economy across different types of activation functions. Ensemble learning has grown in popularity because it combines numerous expert classifiers to increase accuracy. Khellal et al.19 tackled object recognition by integrating a convolutional neural network with a stacking ensemble of Extreme Learning Machine (ELM) classifiers. Earlier, Cao et al.20 showed that a voting-based ELM employing a sigmoid activation function outstripped the original ELM’s performance. By amalgamating the complementary strengths of multiple learners, ensemble classifiers generally surpass individual models in accuracy. In line with this, recent investigations have made noteworthy progress in refining ELM architectures through both ensemble strategies and advanced optimization techniques. For example, Lan et al.21 and Mansoori and Sara22 demonstrated competitive accuracy on the Satimage dataset by combining multiple ELMs using traditional activation functions such as sigmoid and RBF. In a broader context, Kiani et al.23 and Palomino-Echeverria24 conducted comprehensive surveys of ELM-based approaches for outlier detection, identifying key developments in robust loss functions, data preprocessing, and ensemble training frameworks. Expanding on these advances, Tang et al.25 introduced a two-stage ensemble ELM architecture optimized via the Sparrow Search Algorithm for software defect prediction, demonstrating how metaheuristic parameter tuning can substantially boost accuracy on real-world datasets. Similarly, Sumathi et al.26,27,28 developed a unified, hybrid metaheuristic-optimized intrusion-detection framework that combines Harris Hawks Optimization and Particle Swarm Optimization (PSO) with Grey Wolf Optimization (GWO) to select and tune features and parameters across Backpropagation, Multilayer Perceptron, Self-Organizing Map, and SVM classifiers; validated on the NSL-KDD and UNSW-NB15 datasets, it achieved superior distributed denial-of-service detection accuracy, high F1 scores, and minimal false-alarm rates.

Despite these advancements, no prior work has focused on developing activation functions for ELMs based on M-estimation theory. Our study is the first to explore the integration of redescending M-estimator-based ψ-functions as activation functions within an ELM ensemble. These activation functions preserve the key mathematical characteristics of conventional activations while offering enhanced resilience to noise and outliers, resulting in a more adaptable and noise-tolerant learning framework for classification tasks. Building on the work of Khan et al.29, we extend the concept by exploring more robust activation functions based on M-estimation. The ψ-function, chosen for its flexible and distinctive non-linear characteristics, is used to produce a richer learned feature space for the final classification. As summarized in Tables 1 and 2, a core objective of this study is to formally introduce and systematize the integration of redescending M-estimation \(\psi\)-functions as activation mechanisms within the Extreme Learning Machine framework. These \(\psi\)-functions, owing to their inherent robustness and flexibility, are introduced not only as alternatives to traditional activation functions but as a means to enhance learning stability and generalization in the presence of noise and outliers. Moreover, to fully exploit these activation functions, we propose a resilient ensemble classification framework in which multiple base ELMs, each utilizing a different \(\psi\)-based activation, are judiciously combined via a least-squares fusion scheme. This architecture preserves model diversity while mitigating instability, thereby enhancing both the accuracy and robustness of the composite classifier. This work therefore establishes a novel theoretical and computational foundation for integrating robust statistical principles into neural architectures for complex classification tasks.

Novelty and significance of the study

This paper presents a robust and efficient ensemble Extreme Learning Machine-based learning framework that offers a new paradigm for tackling some long-standing problems of ML, such as data contamination, instability due to random weight initialization, and inconsistent classifier performance. The present work is further strengthened by newly adopted activation functions inspired by M-estimation and re-descending M-estimation techniques. These functions improve robustness and capability in unstructured, high-uncertainty, data-laden environments, yielding substantial accuracy gains over baseline models within different stacking-generalization frameworks and producing stable, state-of-the-art ensembles from an architecture that mitigates the shortcomings of single, classical, and previously superior ELM architectures.

The significance of this study is not confined to algorithmic contributions alone but finds real-world practicality in domains where accuracy and reliability are critical. For instance:

  • Cybersecurity: It enhances the anomaly detection and incident response in Software-Defined Networking (SDN), where there is a growing need for timely and precise detection of security threats.

  • Healthcare: The ensemble demonstrates excellent performance on diagnostic tasks such as lung nodule detection and automated analysis of medical imaging data.

  • Financial Systems: In fraud detection, the framework’s strong classification capabilities enable real-time identification of rogue transactions, supporting financial stability.

  • Autonomous Systems: Applications to real-time decision-making in autonomous driving and precision agriculture demonstrate its adaptability and efficiency in high-stakes scenarios.

This research bridges the gap between theoretical innovation and practical applicability to overcome some of the critical challenges traditional ML algorithms face. Unlike standard ELM frameworks, which suffer from instability and sensitivity to initialization, the proposed architecture ensures consistent system performance. Extensive validation against state-of-the-art methods shows its superiority in classification accuracy with lower variance and its adaptability to different datasets and application domains. The study further contributes to the growing area of ethics and trust in AI through an algorithmic framework that is transparent, interpretable, and scalable. This aligns with the current demand for machine learning systems that not only perform well but can also be trusted for their reliability and fairness. By combining theoretical advancements with practical solutions to pressing, real-world challenges, this work sets a new bar for ensemble learning methodologies and is positioned to make a substantial impact on both academic research and industrial practice.

Enhancing classification accuracy

We present an effective ensemble of Extreme Learning Machines (ELMs) designed to extract a variety of significant information from data in order to increase accuracy and dependability. The ensemble exploits diversity by using different initial weights drawn from a predetermined distribution and by employing novel activation functions. The outputs of the base classifiers are combined via the least-squares technique to determine the final prediction. The Methodology section contains comprehensive details of the proposed ensemble. Before outlining the methodology, we provide a brief summary of the state-of-the-art models currently in use, such as ELM, BFELM, and several ELM ensemble techniques from the literature. The purpose of this discussion is to set the context and highlight the progress made with the proposed methodology.

The extreme learning machine (ELM) algorithm

Algorithm

Input: the dataset X with corresponding target values T, the number of hidden nodes N, and the activation function g(·).

Output: ELM parameters (input weights, biases, and output weights).

Steps:

1. Randomly initialize the input weights (W) and biases (b).

2. Compute the hidden layer feature matrix \(H=g\left(XW+b\right)\), where g(·) is the activation function.

3. Compute the output weights \(\beta\) by minimizing the error between the predicted and actual target values. The analytical solution, via the Moore-Penrose pseudo-inverse, is \(\hat{\beta}={\left({H}^{T}H\right)}^{-1}{H}^{T}T={H}^{\dagger}T\).

4. During the testing step, evaluate the model on new data using the learned ELM parameters (input weights W, biases b, and output weights \(\beta\)); a minimal implementation sketch follows.
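For illustration, the following minimal NumPy sketch implements steps 1–4 above; the function names, default activation, and omission of regularization are our assumptions rather than details specified in this work.

```python
import numpy as np

def train_elm(X, T, n_hidden, activation=np.tanh, seed=None):
    """Train a basic ELM: fixed random hidden layer, analytic output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # step 1: random input weights
    b = rng.standard_normal(n_hidden)                 # step 1: random biases
    H = activation(X @ W + b)                         # step 2: hidden feature matrix H
    beta = np.linalg.pinv(H) @ T                      # step 3: Moore-Penrose solution of H beta = T
    return W, b, beta

def predict_elm(X, W, b, beta, activation=np.tanh):
    """Step 4: apply the fixed hidden layer and learned output weights to new data."""
    return activation(X @ W + b) @ beta
```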

Voting based ELM20

Misclassification rates often surge when outputs fall near decision boundaries. To address this and secure accurate labeling of previously unseen instances, Cao et al.20 proposed an ensemble of Extreme Learning Machines that employs a voting-based fusion strategy. This method increases prediction reliability by using trained classifiers to collectively determine the class of an unknown item. Diversity was an important factor in the ensemble model’s construction. However, in Cao et al.20, every base classifier employed an identical activation function, a fixed number of hidden neurons, and initial weights sampled from the same continuous distribution. Additionally, to further minimize misclassification risk, the ensemble parameter k was set in advance.

The process for the Voting-based ELM is as follows:

1. Initialize base classifiers: randomly generate input weights and biases for each base classifier, using the same activation function for all.

2. Train classifiers: each base classifier is trained on the supplied dataset using a varying number of hidden neurons. Within the voting framework, each trained model then casts a class prediction for every unlabeled sample, and the final class is determined by a majority vote among the ensemble’s classifiers. Pre-fixing k (the number of distinct classifiers) reduces variability and potential misclassification around decision boundaries.

This ensemble voting method shows that it is possible to aggregate predictions from multiple models to improve overall classification accuracy, especially for data at challenging decision boundaries.

Extreme learning machine (ELM) with a voting-based input scheme

Given: a training dataset \(\left\{\left({x}_{i},{t}_{i}\right)\mid {x}_{i}\in {\mathbb{R}}^{p},\ {t}_{i}\in {\mathbb{R}}^{c},\ i=1,2,\dots,N\right\}\), a specified number of hidden nodes, and an activation function \(G(\cdot)\). Let \(M\) denote the number of independent classifiers, and initialize the vote-count vector \(S\in {\mathbb{R}}^{c}\) to zero.

Training Phase:

1. Set \(m=1\).

2. While \(m\le M\) do:

  • (a) Randomly generate the weights and biases (\(\:{w}_{i}^{m},{b}^{m}\)).

  • (b) Compute the output weights \({\beta}^{m}\) using the formula:

$${\beta}^{m}={\left({H}^{T}H\right)}^{-1}{H}^{T}T$$
  • (c) Increment \(m\) by 1.

3. End While.

Testing Phase:

For any test sample \(\:x^{test},\) perform the following:

For each classifier (i.e., while \(m\le M\)):

  • (a) Employ the parameters \(\:({w}_{i}^{m},{b}^{m},{\beta\:}^{m})\) to predict the label for xᵗᵉˢᵗ.

  • (b) Update the vote count for the class predicted by classifier \(m\): \(S\left(i\right)=S\left(i\right)+1\), where \(i\) is the class index predicted for \({x}^{test}\).

  • (c) Increment \(m\) by 1.

3. Determine the final predicted class \(Ac{c}^{test}\) by selecting the index corresponding to the maximum value of the vote-count vector \(S\) for \({x}^{test}\) (see the sketch following this pseudocode), i.e.,

  • \(Ac{c}^{test}=\underset{i}{argmax}\ S\left(i\right)\)
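A minimal sketch of the voting rule above, assuming the M base ELMs were trained independently (e.g., with the `train_elm` helper from the earlier sketch) on one-hot targets; the function name is illustrative.

```python
import numpy as np

def predict_voting_elm(x_test, classifiers, activation=np.tanh):
    """Majority vote over M independently trained base ELMs.

    `classifiers` is a list of (W, b, beta) tuples; each beta maps the hidden
    layer to c class scores, so argmax gives that classifier's predicted class.
    """
    n_classes = classifiers[0][2].shape[1]
    S = np.zeros(n_classes)                          # vote-count vector S for x_test
    for W, b, beta in classifiers:
        scores = activation(x_test @ W + b) @ beta   # class scores of one base ELM
        S[np.argmax(scores)] += 1                    # this classifier casts one vote
    return int(np.argmax(S))                         # Acc_test: class with the most votes
```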

ELM based ensemble for classification

Khellal et al.19 introduced an ensemble of extreme learning machines, with each base classifier trained using a sigmoid activation function, to capture the nonlinear patterns within data for a classification problem (refer to Fig. 1). They adopted the ordinary least squares principle to optimize the contribution of each base classifier. Their algorithm’s pseudo-code is as follows:

Algorithm: training procedure for the ELM-based ensemble for classification

Input: \(\left\{X,T,M,N\right\}\), where \(X\) is the dataset, \(T\) the target matrix, \(M\) the number of individual models, and \(N\) the number of hidden nodes.

Output: parameters of the ELM-based ensemble.

Procedure:

1. For each model \(m=1\) to \(M\):

(a) Randomly generate the input weights \({W}^{m}\) and biases \({b}^{m}\).

(b) Compute the hidden layer matrix:

$${H}^{m}=G\left(X{W}^{m}+{b}^{m}\right)$$

(c) Determine the output weights using the pseudoinverse of \({H}^{m}\):

$$\:{\beta\:}^{m}={\left({H}^{m}\right)}^{\dagger}T$$

(d) Calculate the model output:

$$\:{O}^{m}={H}^{m}{\beta\:}^{m}$$
2. Form the global hidden matrix by concatenating the outputs of all individual models:

$${H}_{g}=\left[\,{O}^{\left(1\right)}\ \ {O}^{\left(2\right)}\ \cdots\ {O}^{\left(M\right)}\,\right]$$
3. Compute the fusion parameters by applying the pseudoinverse of \({H}_{g}\):

$$\:F={\left({H}_{g}\right)}^{\dagger}T$$

4. Return the ensemble parameters: the complete set comprises \(\{{W}^{\left(m\right)},{b}^{\left(m\right)},{\beta}^{\left(m\right)}\}\) for m = 1, 2, …, M, along with the fusion parameters F. Here, \({O}^{\left(m\right)}\) denotes the output of the \(m\)-th model, and \({H}_{g}\) represents the global hidden matrix. The uniqueness of the ELM-based ensemble is inherently determined by these parameters.
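The following NumPy sketch mirrors the training procedure above: a random hidden layer and pseudoinverse output weights per base model, followed by a pseudoinverse fusion layer over the concatenated model outputs. The shared activation and the function names are illustrative assumptions.

```python
import numpy as np

def train_elm_ensemble(X, T, n_hidden, M, activation, seed=None):
    """Train M base ELMs and a least-squares fusion layer (steps 1-4 above)."""
    rng = np.random.default_rng(seed)
    models, outputs = [], []
    for _ in range(M):
        W = rng.standard_normal((X.shape[1], n_hidden))   # (a) random input weights
        b = rng.standard_normal(n_hidden)                 # (a) random biases
        H = activation(X @ W + b)                         # (b) hidden layer matrix H^m
        beta = np.linalg.pinv(H) @ T                      # (c) output weights beta^m
        models.append((W, b, beta))
        outputs.append(H @ beta)                          # (d) model output O^(m)
    Hg = np.hstack(outputs)                               # global hidden matrix H_g
    F = np.linalg.pinv(Hg) @ T                            # fusion parameters F
    return models, F

def predict_elm_ensemble(X, models, F, activation):
    """Fuse the base-model outputs with the learned fusion parameters F."""
    Og = np.hstack([activation(X @ W + b) @ beta for W, b, beta in models])
    return Og @ F                                         # fused class scores
```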

Fig. 1. ELM-based ensemble topological structure19.

Methodology

To address the instability in Extreme Learning Machines (ELMs) caused by random weight initialization and to improve their learning capabilities, we offer a modified approach: the Efficient Ensemble of Extreme Learning Machines (EEoELM). This technique substantially enhances generalization by combining many ELMs, each with novel activation functions. Unlike conventional aggregation methods such as majority voting, our approach optimizes the weights of all classifiers through an additional ensemble layer, whose weights are derived in a manner analogous to the output-layer weights of a standard ELM. Figure 1 depicts the general scheme of the proposed EEoELM. The main novelty in designing the current framework is the addition of one more hyper-parameter, the number of classifiers (M), as reported in the works of19,22. Khellal et al.19 developed an efficient ensemble learning strategy that aggregated simple classifiers fed with identical activation functions, with the weights between the input and hidden layers assigned randomly. This framework was further developed in22 by including different types of activation functions while leveraging the Moore-Penrose inverse technique to optimize the weights of the ensemble layer. The proposed EEoELM uses Pi neurons in the hidden layer, with either identical or distinct activation functions. Using different types of activation functions allows the ensemble to approximate the class decision boundaries more precisely. Because of the randomness in weight initialization and the different activation functions across classifiers, different ELMs may generate different feature mappings; this diversity enhances the reliability of the base classifiers and thereby improves the overall performance of the ensemble. In this paper, we introduce an approach founded on ensembles of extreme learning machines comprising diverse classifiers, each employing distinct randomly initialized weights and novel activation functions inspired by M-estimation and re-descending M-estimation theories (refer to Tables 1 and 2). The proposed EEoELM method augments the robustness and dependability of ensemble learning in extreme learning machines by leveraging the variance inherent among the base classifiers. One major drawback of traditional ELM and its derivatives is that they utilize random initial weights drawn from a single continuous probability distribution. The proposed EEoELM relaxes this restriction by training base classifiers with different activation functions and initial weights drawn from both normal and uniform distributions with different parameter values. In addition, the ensemble classifier performs an analytical optimization of its parameters using different stacking or bagging approaches, which generally yields considerably better robustness and performance. For the complete methodology, refer to Fig. 3.
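As a purely illustrative sketch of the diversity mechanism described above, the configuration below pairs each base ELM with its own activation function and weight-initialization distribution; the specific pairings, distribution parameters, and names are assumptions for illustration, not values prescribed by this study.

```python
import numpy as np

# Hypothetical per-classifier configuration: psi-based and classical activations
# mixed with normal/uniform weight initializations of different scales.
base_configs = [
    {"activation": "welsch_psi",   "init": ("normal",  0.0, 1.0)},
    {"activation": "bisquare_psi", "init": ("uniform", -1.0, 1.0)},
    {"activation": "cauchy_psi",   "init": ("normal",  0.0, 0.5)},
    {"activation": "sigmoid",      "init": ("uniform", -0.5, 0.5)},
    {"activation": "tanh",         "init": ("normal",  0.0, 1.0)},
    {"activation": "sine",         "init": ("uniform", -1.0, 1.0)},
]

def draw_weights(shape, init, rng):
    """Sample input weights from the classifier's own initialization distribution."""
    kind, a, b = init
    if kind == "normal":
        return rng.normal(loc=a, scale=b, size=shape)
    return rng.uniform(low=a, high=b, size=shape)
```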

Table 1 Prevalent activation functions characterized by both continuity and differentiability.
Table 2 The objective function \(\rho(\cdot)\), influence function \(\psi(\cdot)\), and corresponding weight function \(w(\cdot)\) for M-estimators and re-descending M-estimators.

Motivation for using M and re-descending M-based psi functions as activation functions

In practice, activation functions introduce non-linearity and increase accuracy through gradient computation in neural networks; the traditionally desirable properties of an activation function are being zero-centered, differentiable, non-linear, and conducive to early convergence. The tangent sigmoid possesses the zero-centered property, while the sigmoid function does not. Keeping these basic properties in mind while surveying traditional activation functions, we were motivated to examine the mathematical and graphical behavior of the ψ-functions of M and re-descending M estimation, which are zero-centered, differentiable, and non-linear, and, notably, unsaturated in nature; for details, see Fig. 2. Table 1 lists traditional activation functions such as sigmoid, tanh, and sine that are commonly used in single-layer feedforward networks such as Extreme Learning Machines (ELMs) to introduce non-linearity into the hidden layer. However, as Fig. 2 shows, these functions exhibit monotonic saturation behavior and S-shaped or bounded oscillatory transformations, which inherently impose non-local generalization.

This leads to several challenges, particularly in ELM where the hidden-layer parameters are randomly initialized and fixed. First, standard activations respond non-discriminatively to extreme values, so outliers can dominate the feature space and distort the learning process. Second, the global nonlinearity of sigmoid/tanh functions limits the model’s ability to represent complex, localized data structures: for large input magnitudes, derivatives vanish (gradient saturation), reducing the expressiveness of neurons in capturing meaningful variation. Redescending ψ-functions derived from robust M-estimators (Tukey’s biweight, Welsch, Cauchy), listed in Table 2, address these limitations by providing bounded and adaptive non-linear transformations with local sensitivity and global suppression. Figure 2 clearly distinguishes the proposed activations from the existing ones: the proposed activation functions are zero-centered, S-type functions emerge as their special case, and their gradients do not diminish in the regions where traditional activations saturate. Their main advantages are as follows. First, redescending ψ-functions down-weight extreme values by design, ensuring that inputs mapped to extreme regions due to random initialization do not distort the feature space. Second, these functions emphasize central data trends and suppress outliers, enabling better separation of classes, more compact and informative hidden-layer representations, and increased local adaptability in complex data environments. Third, by nullifying the contribution of large residuals, redescending activations reduce model complexity, prevent overfitting to noise or rare data patterns, and improve stability across different training subsets. Furthermore, we leverage the flexible nature of the proposed activation functions in our proposed ELM ensembles: each base classifier’s diversity depends on the hidden-layer mapping, and ψ-based activations (see Fig. 3) induce more varied yet meaningful feature transformations, improving ensemble fusion performance when the fusion parameters are estimated optimally.
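For concreteness, the ψ-functions of the Welsch, Tukey biweight (bisquare), and Cauchy M-estimators can be applied element-wise as hidden-layer activations, as sketched below; the tuning constants shown are the conventional robust-statistics defaults and are assumptions rather than values fixed by this study.

```python
import numpy as np

def welsch_psi(u, c=2.9846):
    """Welsch psi: smooth redescender that exponentially down-weights large inputs."""
    return u * np.exp(-(u / c) ** 2)

def bisquare_psi(u, c=4.685):
    """Tukey's biweight psi: exactly zero beyond |u| > c, nullifying extreme inputs."""
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

def cauchy_psi(u, c=2.3849):
    """Cauchy psi: bounded influence that decays softly for large |u|."""
    return u / (1 + (u / c) ** 2)
```

Each of these functions is zero-centered, differentiable, and approximately linear near the origin while suppressing large inputs, which is precisely the behavior exploited above.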

Fig. 2. Mathematical behavior of traditional and proposed activation functions.

Fig. 3. Flowchart of the study.

Applications

For benchmarking, five datasets were chosen from the UCI and Kaggle repositories, with thorough descriptions provided in Table 4. Data preparation was used to handle missing values and outliers where relevant. Furthermore, categorical variables were encoded using the one-hot encoding technique, with the presence of a given category represented as 1 and all others as 0. The datasets were divided into training and testing sets, with 70% used for training and 30% for testing. Weights for each classifier were randomly assigned from a standard normal distribution. The output parameters were then analytically optimized using the square loss function under specified hyperparameters, covering the proposed activation functions as well as other existing activation functions. Different activation functions, initial weights, and hidden neuron counts were used across the six base classifiers during training. To ensure accuracy and consistency, all base classifier outputs were combined using the ordinary least squares method. Classification accuracy was then assessed for the proposed ensemble and other state-of-the-art methods by testing the trained ensemble classifiers on the reserved test data.

The entire process was repeated 50 times to ensure robustness, and the mean % accuracy and standard deviation (SD) were reported. Table 5 shows the results of the proposed ensemble using the psi-based activation functions. For comparison, the mean classification accuracy and SD of various state-of-the-art ensembles from the literature are also discussed, highlighting the superior performance and dependability of the proposed ensemble for classification tasks.

Experiments on real world data

To evaluate the relative efficiency of the proposed approach in terms of both accuracy and precision, we employed two widely recognized high-dimensional numerical datasets: SatImage and Emails, obtained from the UCI Machine Learning Repository and Kaggle, respectively. The proposed ensemble model comprises six base ELM classifiers. Each classifier is initialized with distinct weight vectors drawn from a standard normal distribution, while maintaining a consistent number of hidden nodes and employing the same activation function across the ensemble. In the proposed ensemble framework, we incorporate newly designed activation functions, as detailed in Table 2, and evaluate their performance against established state-of-the-art ensemble methods reported in the literature. Following data preprocessing, each experimental setup is executed over 50 independent runs, and the average performance is assessed based on accuracy and precision metrics. The key statistical characteristics of the datasets utilized are summarized in Table 4.

Grid search optimization of hidden nodes

The number of hidden nodes h is critical to ELM performance. For each dataset and activation function, we perform grid search over a plausible range:

• Satimage: 500–1000.

• Emails Spamdexing: 300–600.

• Breast Cancer: 100–400.

• Musk: 100–200.

• Iris: 20–100.

The optimal number of hidden nodes is selected based on the minimum Brier score on the test set, balancing calibration and accuracy (refer to Fig. 4).

The same optimal number of hidden nodes is then used across all base classifiers in the ensemble to maintain consistency and reduce hyperparameter variability; a sketch of this selection step follows.
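A minimal sketch of this selection step, reusing the `train_elm`/`predict_elm` helpers from the earlier ELM sketch; because the conversion of raw ELM scores into probabilities for the Brier score is not specified here, the normalization below is an assumption.

```python
import numpy as np

def brier_score(probs, T_onehot):
    """Mean squared difference between predicted class probabilities and one-hot targets."""
    return float(np.mean(np.sum((probs - T_onehot) ** 2, axis=1)))

def grid_search_hidden_nodes(X_tr, T_tr, X_te, T_te, grid, activation, seed=None):
    """Select the hidden-layer size with the lowest Brier score on the test set."""
    best_h, best_bs = None, np.inf
    for h in grid:                                   # e.g. range(500, 1001, 100) for Satimage
        W, b, beta = train_elm(X_tr, T_tr, h, activation, seed)   # from the earlier ELM sketch
        scores = predict_elm(X_te, W, b, beta, activation)
        probs = np.clip(scores, 1e-12, None)
        probs = probs / probs.sum(axis=1, keepdims=True)          # assumed score-to-probability step
        bs = brier_score(probs, T_te)
        if bs < best_bs:
            best_h, best_bs = h, bs
    return best_h, best_bs
```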

Fig. 4. Grid search optimizers for hidden nodes.

Performance metrics

We evaluate models using:

• Test Accuracy (%).

• Standard Deviation: Across 50 replications per method.

$$\%SD=\sqrt{\frac{1}{J}\sum_{j=1}^{J}{\left(Accurac{y}_{j}-\overline{Accuracy}\right)}^{2}},$$

• Brier Score: Measures probabilistic calibration.

• Kruskal-Wallis Tests: To assess statistical significance of accuracy differences across methods.

• Computational Cost (Relative Time Complexity): Estimated based on number of ELMs and matrix inversion operations.
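A short sketch of how the %SD and the Kruskal-Wallis test can be computed, assuming a vector of 50 replicate accuracies per method; SciPy's `kruskal` provides the Kruskal-Wallis H-test, and the variable names are placeholders.

```python
import numpy as np
from scipy.stats import kruskal

def percent_sd(accuracies):
    """Standard deviation of the J replicate accuracies, matching the %SD formula above."""
    acc = np.asarray(accuracies, dtype=float)
    return float(np.sqrt(np.mean((acc - acc.mean()) ** 2)))

# Kruskal-Wallis test across methods: each argument is one method's vector of
# 50 replicate accuracies (placeholder names).
# h_stat, p_value = kruskal(acc_proposed, acc_sigmoid_elm, acc_voting_elm)
```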

Experimental protocol

1. Preprocessing: Normalize inputs to [0,1].

2. Label Encoding: One-hot encoding for multi-class targets.

3. Repeated Simulation: 50 runs per configuration to account for ELM randomness.

4. Ensemble Fusion: Use least-squares to combine outputs from base ELMs.

5. Grid Search: Tune the hidden layer size for a single classifier using the Brier score, then apply the selected size to the ensemble ELM.

6. Statistical Testing: Perform the Kruskal-Wallis test across activation functions and display the accuracy distributions of all methods using boxplots. A compact driver sketch for the repeated-simulation and fusion steps follows.
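A compact driver for the repeated-simulation and fusion steps, reusing the ensemble and metric helpers sketched earlier; dataset loading, preprocessing, and one-hot encoding are assumed to have been done already, and all names are illustrative.

```python
import numpy as np

def run_protocol(X_tr, T_tr, X_te, T_te, n_hidden, activation, M=6, n_runs=50):
    """Repeat ensemble training/testing n_runs times to average out ELM randomness (step 3)."""
    accuracies = []
    for seed in range(n_runs):
        models, F = train_elm_ensemble(X_tr, T_tr, n_hidden, M, activation, seed)  # step 4: LS fusion
        scores = predict_elm_ensemble(X_te, models, F, activation)
        acc = np.mean(np.argmax(scores, axis=1) == np.argmax(T_te, axis=1))        # one-hot targets
        accuracies.append(acc)
    return float(np.mean(accuracies)), percent_sd(accuracies)                      # mean accuracy, %SD
```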

Trade-off analysis

The trade-off between accuracy and computational complexity is often resolved differently depending on the problem under study: mathematically complex functions are adopted to build heavier models that achieve stability and high accuracy in sensitive prediction settings, where even a small error can make sound decisions impossible. We analyze the computational cost of base ELMs with lightweight traditional activation functions, which offer fast training, and the effect of the proposed activation functions on the base classifiers; ensemble methods built on the more complex robust functions to further improve stability increase the time cost (1.2x-1.4x, see Table 3) but significantly improve performance on complex datasets such as Satimage and Emails. For comparison, simpler datasets are also considered, where even single ELMs are often sufficient, although the ensemble methods still benefit from greater stability.

Table 3 Test accuracy and relative computational cost.
Table 4 Descriptive statistics of the datasets employed to evaluate classifier performance.

Results

The following tables and figures present the numerical comparisons between the proposed and existing methods.

Table 5 Classification accuracy of the proposed ELM-based ensemble methods under a bagging framework.
Fig. 5. Mean classification accuracy and standard deviation of the proposed and existing ELM ensembles.

Fig. 6. Boxplot with Kruskal-Wallis test of significance.

Table 6 Empirical accuracy vs. Complexity Trade-off.
Fig. 7. Empirical accuracy vs. complexity trade-off.

Table 7 Post-hoc analysis summary across datasets.
Fig. 8. Comparison with 5 state-of-the-art models.

Discussion and conclusion

To evaluate the effectiveness of our proposed ensemble framework of Extreme Learning Machines (ELMs), we conducted comprehensive experiments across five benchmark datasets: Satimage, Email Spamdexing, Breast Cancer, Musk, and Iris. Our approach integrates both traditional activation functions and novel ones inspired by robust M-estimation theory, such as Welsch, Bisquare, Lmstf, and Cauchy, aiming to improve classification accuracy, robustness, and generalization (see Figs. 5 and 6). We compared our results against several established ensemble ELM methods. For instance, Lan et al. (2009) implemented an ensemble of 25 online sequential ELMs on the Satimage dataset using sigmoid and radial basis function (RBF) activations. Their models achieved testing accuracies of 88.01% (SD: 0.0029) for sigmoid and 88.28% (SD: 0.0037) for RBF, using 400 hidden nodes per ELM. In contrast, Mansoori and Sara22 utilized 97 ELMs with only 7 hidden nodes each, obtaining an accuracy of 87.85% (SD: 0.26) via weighted averaging. Our proposed ensemble model clearly outperformed these benchmarks in both accuracy and stability, as summarized in Table 5. Among all tested activation functions, the Psi-based estimator consistently delivered the best performance. Our results also surpassed other competitive models, such as ReliefF Random Forest (Zhang et al., 2023), Lexicase Selection (Pleshkova and Stanovov30), and the method by Ren et al. (2022). The proposed method also outperformed other state-of-the-art methods such as ANN, SVM, logistic regression, KNN, the Mansoori & Sara ensemble, Naive Bayes, Decision Trees, Random Forest (RF), Lan et al. RBF, Lan et al. sigmoid, and GBM (refer to Fig. 8).

Satimage & Emails (High-Dimensional Datasets): These datasets posed greater challenges due to their complexity and high dimensionality. Our ensemble achieved superior mean accuracy and reduced standard deviation compared to traditional single and ensemble ELMs. On the Emails dataset, functions like Logistic, Welsch, and Cauchy exceeded 95% accuracy.

Musk (Noisy and Imbalanced): The ensemble model demonstrated its robustness by performing well on this challenging dataset. The use of re-descending M-estimators proved effective in handling noise and imbalance.

Breast Cancer & Iris (Simpler Datasets): While individual robust models performed competitively on these datasets, the ensemble versions improved overall stability and generalization. Table 6 and Fig. 7 summarize the accuracy and time complexity, where the proposed methods outperformed the alternatives with only a small compromise on time complexity.

Our ensemble construction involves training base ELMs with different activation functions and optimizing hidden nodes through Brier score-based grid search. This ensures each model is well-calibrated for its respective dataset. Rather than relying on majority voting, we applied least squares estimation to determine fusion weights, offering a more refined combination of predictions. This weighted strategy consistently outperformed standard voting ensembles. While our ensemble incurs higher computational costs, as shown in Table 6 and Fig. 7, this is a typical trade-off in ensemble learning. The gain in accuracy and robustness, particularly for high-stakes or noisy applications, justifies the added complexity. Table 7 reports Kruskal-Wallis and Dunn tests across all datasets; significant differences among models were detected (p < 0.00001), confirming the need for deeper post-hoc analysis. Dunn’s test revealed that proposed models such as Proposed 3, 4, and 10 significantly outperformed traditional activation functions (e.g., Sigmoid, Sine, Tan-Sig, and RAF) on multiple datasets, especially Satimage and Breast Cancer. However, in some pairwise comparisons, differences, though favorable to our models, were not statistically significant after correction. For example:

• On Satimage, Proposed 4 vs. Proposed 6 showed modest Z-scores.

• On Emails, Proposed 10 outperformed Proposed 4 and 3 numerically but lacked statistical support.

• On Breast Cancer, Proposed 2 performed better than baseline models in mean accuracy but not significantly so.

These findings indicate that while statistical significance is essential, practical performance indicators, such as mean accuracy, standard deviation, and ranking, also hold critical value, especially under varying data distributions and noise conditions. Models that consistently perform well, even when p-values are borderline, are still valuable and may generalize better in real-world scenarios. This study introduces a robust and efficient ensemble learning framework based on ELMs, leveraging both conventional and newly proposed activation functions grounded in M-estimation theory. The ensemble is enhanced through Brier score-guided grid search for hidden node optimization and least squares-based fusion weighting. Across all evaluated datasets, our approach consistently delivered high accuracy and robustness, outperforming or matching existing state-of-the-art techniques. For future work, we plan to study the robustness and efficiency of the proposed algorithm with novel activation functions in intrusion detection, extending the idea so that the activations not only preserve the core properties of traditional activation functions but also offer enhanced robustness and adaptability. Unlike prior studies that focused on deep learning31, SVMs with hybrid optimization28, or hybrid neuro-optimizers (Sumathi & Rajesh, 2024), our approach aims to strengthen the hidden layer’s learning capacity through statistically grounded, flexible activation functions for improved classification in noisy environments. Overall, the results validate the potential of robust activation functions in enhancing ELM ensembles and open new directions for scalable, resilient classification frameworks in machine learning.