Introduction

Automatic Modulation Classification (AMC) plays a pivotal role in modern wireless communication systems by enabling the automatic identification of the modulation schemes employed during signal transmission1,2. This capability is particularly critical in both military and commercial applications, where software-defined radios (SDRs) must recognize modulation types rapidly and accurately within limited time windows to ensure efficient spectrum utilization and signal processing2. With the increasing diversity and complexity of wireless communication environments, characterized by a wide variety of modulation schemes and dynamic channel conditions, there is an escalating demand for advanced AMC techniques capable of classifying signals robustly and precisely to infer their origins.

Generally, AMC methods are categorized into two main classes: feature-based and likelihood-based approaches4. Likelihood-based methods operate by evaluating the likelihood functions of the received signal against a set of known modulation templates, often yielding high classification accuracy, especially in multi-channel scenarios5. However, these methods tend to become computationally prohibitive in real-world situations because they must estimate or adapt to unknown parameters such as carrier frequency offsets, fading channel effects, and varying coding rates6. Feature-based methods, by contrast, focus on extracting distinctive characteristics of the signal, including asynchronous delay sampling, higher-order statistics, and time-frequency domain features, which are subsequently employed by pattern recognition algorithms for modulation identification7. Traditional machine learning classifiers such as decision trees and support vector machines utilize these engineered features but often struggle with the nonlinearity and complexity inherent in modern communication signals, resulting in suboptimal performance9.

The advent of artificial intelligence, coupled with substantial improvements in computational power, has catalyzed the adoption of machine learning (ML) and deep learning (DL) techniques within the AMC domain10,11. Deep learning models, characterized by their hierarchical multi-layer architectures, learn complex feature representations directly from raw data, thereby minimizing the reliance on manual feature engineering12,13. Architectures including Residual Networks (ResNet-50), VGG1628, Convolutional Neural Networks (CNN), and Google’s Inception V3 have demonstrated remarkable efficacy in modulation classification tasks. Typically, the deep features extracted by these networks are further processed by Extreme Learning Machines (ELM)39,40 for final classification decisions. To optimize classification performance, ELM41 parameters such as weights and biases can be fine-tuned using metaheuristic algorithms like Moth Flame Optimization (MFO), which efficiently navigate the search space for optimal solutions29.

Furthermore, deploying these ML/DL models on cloud computing platforms offers significant advantages, including scalable resource management, seamless integration of large-scale datasets, reliable data storage, and continuous service availability26,27,35. Such cloud-based frameworks facilitate accelerated processing and improved accuracy in AMC by enabling extensive data collection, storage, and remote processing.
In this work, we propose a comprehensive cloud-based feature detection system designed for AMC, which integrates optimized MFO-ELM for feature selection and classification. The system is evaluated in both standalone and cloud environments, demonstrating enhanced execution speed and classification accuracy, thereby underscoring the benefits of cloud integration for modern AMC applications.

Aim and objectives

The aim of this study is to enhance the performance of automatic modulation classification using a hybrid deep-learning and optimization-based model deployable in cloud environments. The specific objectives are:

  • To extract rich features using pre-trained deep learning models (InceptionV3, ResNet50, VGG16).

  • To fuse these feature representations into a unified vector for better classification capability.

  • To optimize the performance of the ELM classifier using the Moth-Flame Optimization (MFO) algorithm.

  • To implement and evaluate the system in both standalone and scalable cloud environments.

Motivation and scope

The rise in wireless communication devices and demand for efficient spectrum use motivates the need for robust, low-latency modulation classifiers. Traditional methods struggle with generalization and interpretability. Our model provides:

  • A flexible framework suitable for real-time deployment on cloud platforms.

  • High classification accuracy using optimized ELM with deep fused features.

  • Insights through explainable AI for system transparency and trustworthiness.

The organization of this paper is as follows: Section Related work reviews the related work, Section Proposed work presents the proposed work, Section Research materials and methods describes the experimental results and discussion, Section Explainable AI analysis analyzes the model’s decisions using state-of-the-art explainable AI tools, and Section Conclusion concludes the paper.

Related work

Modulation classification plays a crucial role in wireless communication systems, involving essential tasks such as feature extraction, classifier development, and learning training processes. Zeng et al.13 proposed a convolutional neural network (CNN) model for automatic modulation recognition, building on the significant influence of image classification techniques in classical machine learning across various applications14,22. Traditionally, modulation classification relied on extracting either individual or sets of features, often utilizing Support Vector Machine (SVM) classifiers. However, with the rise of Artificial Neural Networks (ANN), more advanced feature extraction methods have emerged, leveraging multilayer architectures to effectively train on large datasets for enhanced feature learning and image classification16,17. Numerous studies have demonstrated the gradual replacement of conventional artificial feature extraction and classical machine learning techniques by deep learning approaches in modulation classification tasks18.

For example, Ali et al.19 introduced an ANN-PCA-based method that normalizes received signals to form support vectors for improved feature extraction, while Chen et al.20 developed a deep multi-scale CNN to boost recognition performance, especially in low Signal-to-Noise Ratio (SNR) conditions. Beisun et al.21 proposed a modulation classification technique using Software Defined Radio (SDR) by adapting kernel sizes within an Inception-ResNet architecture. Similarly, Zhang et al.16 addressed overfitting challenges by employing a dual neural network framework composed of student and mentor networks. Zhou et al.24 emphasized the importance of transforming raw signal data into tensors to meet deep learning model input requirements, while Yin et al.25 introduced a CNN with offline pre-training on limited samples to facilitate independent SNR estimation. Sathyanarayanan et al.26 further compared basic CNNs and residual CNN architectures to evaluate modulation classification accuracy.

Building upon these advancements, recent deep learning techniques have substantially improved Automatic Modulation Classification (AMC) performance. Baishya et al.42 demonstrated edge-efficient deep learning models that optimize computational resources without sacrificing accuracy, proving their utility in constrained environments. El-Haryqy et al.43 enhanced AMC frameworks by incorporating additive attention mechanisms to focus on relevant signal features, improving recognition outcomes. Ouamna et al.44 utilized spectrogram-based deep learning approaches to robustly classify modulations in complex signal scenarios, while Luu et al.45 introduced uncertainty-aware incremental learning with Bayesian neural networks to adaptively improve AMC confidence and flexibility. Complementing these, Alade et al.46 showed that metaheuristic optimization techniques like Moth-Flame Optimization (MFO) can significantly enhance Extreme Learning Machine (ELM) classification accuracy and reduce training times. Additionally, models such as the T2FCATR by Nagendranth et al.33 employ Type II fuzzy clustering and improved ant colony optimization to achieve secure and energy-efficient routing in mobile ad hoc networks (MANETs), while the PPRDA-FC model34 integrates moth flame optimization with levy flight for secure cluster formation and Fog Computing-based data transmission in vehicular ad hoc networks (VANETs), further enhanced by deep neural networks for malicious vehicle detection.
Despite the widespread use of CNNs across diverse domains including medical image analysis30,31,32, current CNN-based modulation classification models face challenges such as high computational overhead limiting real-time deployment, lack of interpretability reducing insight into their decision-making processes, and limited scalability across heterogeneous environments like cloud and edge devices. These limitations motivate the development of lightweight, explainable, and scalable alternatives such as the proposed MFOP-ELM approach, which aims to address these challenges effectively.

This study incorporates a wide variety of modulation techniques including Binary Phase Shift Keying (BPSK), Amplitude Modulation Double Side Band (AM-DSB), 8 Phase Shift Keying (8PSK), Wideband Frequency Modulation (WBFM), 4 Pulse Amplitude Modulation (PAM4), Gaussian Frequency Shift Keying (GFSK), Quadrature Phase Shift Keying (QPSK), Continuous Phase Frequency Shift Keying (CPFSK), Amplitude Modulation Single Side Band (AM-SSB), Offset Quadrature Phase Shift Keying (OQPSK), Gaussian Minimum Shift Keying (GMSK), Amplitude Modulation Double Side Band With Carrier (AM-DSB-WC), Amplitude Modulation Single Side Band Suppressed Carrier (AM-SSB-SC), 128 Amplitude and Phase Shift Keying (128APSK), On-Off Keying (OOK), 8 Amplitude Shift Keying (8ASK), 4 Amplitude Shift Keying (4ASK), 16 Phase Shift Keying (16PSK), 32 Phase Shift Keying (32PSK), 16 Amplitude and Phase Shift Keying (16APSK), 32 Amplitude and Phase Shift Keying (32APSK), 64 Amplitude and Phase Shift Keying (64APSK), 16 Quadrature Amplitude Modulation (16QAM), 32 Quadrature Amplitude Modulation (32QAM), 64 Quadrature Amplitude Modulation (64QAM), 128 Quadrature Amplitude Modulation (128QAM), and 256 Quadrature Amplitude Modulation (256QAM). According to the literature, most computer-aided detection (CAD) systems suffer from slow computation and limited classification accuracy, making them unsuitable for real-time deployment. Existing CAD models often rely on extensive feature sets and conventional machine learning techniques, making classifier and feature selection a persistent challenge. Extreme Learning Machines (ELMs) have gained popularity in several domains due to their fast convergence, ability to avoid local minima, and superior computational speed. Therefore, this work implements a Moth Flame Optimization (MFO) based ELM method to achieve improved classification accuracy with low computational overhead.

The following table summarizes recent approaches to automatic modulation classification in terms of datasets used, feature extraction methods, classifier models, performance results, and their limitations.

Table 1 Comparison of related work with the proposed model.

Proposed work

Architecture of proposed network

In this study, we propose an efficient and robust model for automatic modulation classification (AMC) by leveraging the strengths of multiple deep feature extractors and an optimized lightweight classifier. The input signal is first transformed from the time domain to the image domain using the proposed polar transformation, which encodes amplitude and phase variations into a 2D visual representation. This transformation enhances feature separability and increases robustness under noisy channel conditions. The core novelty of the proposed architecture lies in the effective utilization of three complementary pre-trained deep convolutional neural networks: InceptionV3, ResNet50, and VGG16. These models have been selected due to their diverse architectural designs and distinct feature extraction capabilities: InceptionV3 excels in capturing multi-scale patterns, ResNet50 offers deep residual learning for edge-sensitive features, and VGG16 provides uniform receptive fields with strong texture representation. By concatenating the deep feature vectors from these networks, we construct a comprehensive and discriminative feature representation that captures both local and global structures in the transformed signal.

The fusion of features from InceptionV3, ResNet50, and VGG16 is performed using simple feature-level concatenation. Specifically, the output feature vectors from the final global average pooling (GAP) layers of each CNN are concatenated into a single 1D vector. This vector combines the multi-scale, residual, and texture-based features into a unified representation. Mathematically, if \(f_{I}\), \(f_{R}\), and \(f_{V}\) represent the feature vectors from InceptionV3, ResNet50, and VGG16 respectively, the final fused vector \(f_{\text {fused}}\) is given by:

$$f_{\text {fused}} = [f_{I} \parallel f_{R} \parallel f_{V}]$$

where \(\parallel\) denotes concatenation. Empirical evaluation in Section Ablation study confirms that this fusion improves classification accuracy across SNR levels.

While feature fusion may introduce risks of redundancy or overfitting, empirical validation in our ablation study demonstrates that fused features consistently outperform single-model features. Additionally, we incorporate an Extreme Learning Machine (ELM) classifier optimized via Moth-Flame Optimization (MFO), which enhances generalization and mitigates the curse of dimensionality by fine-tuning hidden layer parameters. As illustrated in Fig. 1, the transformed image data is first passed through the three pre-trained CNN backbones, followed by feature-level fusion. The fused representation is then classified by the MFOP-ELM module. We evaluate and compare the performance of individual CNN features and their fused representation under both standalone and cloud-based deployment environments using identical datasets, ensuring a fair and reproducible assessment.

Fig. 1
figure 1

Proposed network model.
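For concreteness, the fusion step described above can be sketched in a few lines of Keras. This is a minimal illustration rather than the exact training script: it assumes ImageNet-pretrained backbones with their classification heads removed, GAP pooling appended via `pooling="avg"`, and the grayscale polar images replicated across three channels to match the networks’ expected input.

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3, ResNet50, VGG16
from tensorflow.keras.applications.inception_v3 import preprocess_input as prep_inc
from tensorflow.keras.applications.resnet50 import preprocess_input as prep_res
from tensorflow.keras.applications.vgg16 import preprocess_input as prep_vgg

# include_top=False drops the classifier head; pooling="avg" appends the GAP
# layer, so each backbone emits a single 1D feature vector per image.
backbones = [
    (InceptionV3(weights="imagenet", include_top=False, pooling="avg"), prep_inc),
    (ResNet50(weights="imagenet", include_top=False, pooling="avg"), prep_res),
    (VGG16(weights="imagenet", include_top=False, pooling="avg"), prep_vgg),
]

def fused_features(gray):
    """gray: (n, 224, 224) polar images in [0, 1] -> fused 1D feature vectors."""
    rgb = np.repeat(gray[..., None] * 255.0, 3, axis=-1)  # replicate channels
    parts = [net.predict(prep(rgb.copy()), verbose=0) for net, prep in backbones]
    return np.concatenate(parts, axis=1)  # f_fused = [f_I || f_R || f_V]
```

With these settings the GAP outputs are 2048-, 2048-, and 512-dimensional, giving a 4608-dimensional fused vector per frame.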

Proposed cloud-based model

The cloud-based network model is used to achieve better classification accuracy by monitoring different datasets. It is built on the same pre-trained deep learning models (Inception V3, ResNet 50, and VGG 16), combined with the MFO-ELM classifier and deployed in the cloud environment. The experimental setup of the cloud MFO-ELM model is shown in Fig. 2.

Fig. 2
figure 2

Amazon EC2 cloud system.

Dataset and preprocessing

We assess the modulation recognition problem using the RADIOML2016.10A (11 modulation classes) and RADIOML2018.01A (24 modulation classes) datasets, which enables a direct comparison with the effectiveness of existing work. The datasets were produced using GNU Radio and contain both analogue and digital modulations. Table 2 describes the details of the datasets used in this work.

The RADIOML2016.10A and RADIOML2018.01A datasets are preprocessed by normalizing each sample to [0,1] for both In-phase (I) and Quadrature-phase (Q) components. The data is randomly shuffled and then split in a stratified manner into 60% training and 40% testing. Care is taken to prevent correlation-based leakage, as adjacent frames in DeepSig datasets may carry temporal or frequency-domain similarities.

To ensure robust model training and avoid overfitting, a 60/40 train-test split was selected. This split offers a good balance by providing sufficient samples for both learning and evaluation. Furthermore, to verify the stability of results, each experiment was repeated 10 times with different random splits and weight initializations. The reported performance metrics represent the average values across these runs, along with standard deviations to indicate consistency. The preprocessing pipeline involves the following steps:

  • Each I/Q sample is normalized to the [0,1] range.

  • A polar transformation is applied to convert raw signals into 2D grayscale images by encoding amplitude and phase as pixel intensities.

  • The resulting images are resized to a uniform shape of 224\(\times\)224 to be compatible with input requirements of CNNs.

  • Dataset balancing is ensured across SNR levels and modulation classes.

  • Finally, data augmentation techniques such as Gaussian noise injection, random rotations, and flips are selectively applied to improve generalization.
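The paper’s polar transformation is specified at the level of the list above rather than as a closed-form formula, so the following sketch shows one plausible realization under that caveat: each normalized I/Q pair is treated as a complex sample whose amplitude selects a radius and whose phase selects an angle, and the sample density is accumulated into a 224 × 224 grayscale image.

```python
import numpy as np
from PIL import Image

def polar_image(iq, size=224):
    """Map one I/Q frame to a 2D grayscale image (a plausible sketch only;
    the exact mapping used in the paper is an assumption here)."""
    i, q = iq                                 # each of shape (n_samples,), in [0, 1]
    z = (i - 0.5) + 1j * (q - 0.5)            # recenter the constellation at the origin
    amp = np.abs(z) / (np.abs(z).max() + 1e-9)
    phase = np.angle(z)                       # in [-pi, pi]
    # Polar -> Cartesian pixel coordinates: amplitude as radius, phase as angle
    r = (amp * (size // 2 - 1)).astype(int)
    x = (size // 2 + r * np.cos(phase)).astype(int)
    y = (size // 2 + r * np.sin(phase)).astype(int)
    img = np.zeros((size, size), dtype=np.float32)
    np.add.at(img, (y, x), 1.0)               # accumulate sample density per pixel
    img /= img.max() + 1e-9                   # rescale intensities to [0, 1]
    return Image.fromarray((img * 255).astype(np.uint8))
```

Because the output is already 224 × 224, no further resizing is needed before the CNN backbones.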

Modulation types: RADIOML2018.01A contains 24 modulations, including BPSK, QPSK, 8PSK, 16QAM, 64QAM, AM-DSB, AM-SSB, FM, GMSK, OOK, 4ASK, 8ASK, and CPFSK. RADIOML2016.10A includes 11 modulations, such as BPSK, QPSK, 8PSK, AM-DSB, AM-SSB, WBFM, and GFSK.

SNR range: Both RADIOML2018.01A and RADIOML2016.10A cover a range from -20 dB to +18 dB in 2 dB steps.

Table 2 Specification of the data sets.

Extreme learning machine

An Extreme Learning Machine (ELM) assigns random weights between the input and hidden layers and then analytically determines the output weights. This method is generally used to overcome slow gradient-based techniques for training a single hidden-layer feed-forward neural network. Because only the linearly connected output weights are solved for, the system is simpler, quicker, and more generalizable than conventional learning systems. Select \(N\) distinct learning samples \((x_i, y_i)\) arbitrarily, where

$$\begin{aligned} x_i = [x_{i1}, x_{i2}, \dots , x_{iL}]^T \in {\mathbb {R}}^L, \end{aligned}$$
(1)
$$\begin{aligned} y_i = [y_{i1}, y_{i2}, \dots , y_{iC}]^T \in {\mathbb {R}}^C. \end{aligned}$$
(2)

Here, \(N\) denotes the number of samples, \(L\) the dimensionality of each input vector, and \(C\) the number of classes. Consider the activation function \(\mu (\cdot )\), and \(M_h\) as the total number of hidden nodes. The ELM algorithm is as follows:

  1. Initialize the hidden node parameters \((D_i^h, G_i)\) randomly for \(i = 1, 2, 3, \dots , M_h\).

  2. Using the activation function, compute the hidden-layer output matrix H of size \((N \times M_h)\).

  3. Calculate the output weight matrix using the minimal norm least-squares method:

    $$\begin{aligned} D^0 = H^\dagger t \end{aligned}$$
    (3)

Here, \(D_i^h = [D_{i1}^h, D_{i2}^h, D_{i3}^h, \dots , D_{iL}^h]^T\) is the weight vector connecting the input to the \(i\)-th hidden neuron, and \(D_i^o = [D_{i1}^o, D_{i2}^o, D_{i3}^o, \dots , D_{iC}^o]^T\) is the weight vector connecting the \(i\)-th hidden neuron to the output nodes. The Moore-Penrose (MP) generalized inverse of the matrix H is represented by \(H^\dagger\), and t is the target matrix. Because the output weights are obtained in a single analytic step, this learning method trains faster than standard iterative algorithms.
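A minimal NumPy sketch of this training procedure (sigmoid activation assumed; `T` is the one-hot target matrix) reads:

```python
import numpy as np

class ELM:
    """Single hidden-layer ELM: random input weights, analytic output weights."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the hidden layer (the H matrix, N x M_h)
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, T):
        # Step 1: random input weights D^h and biases G, drawn once and frozen
        self.W = self.rng.uniform(-1, 1, (X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1, 1, self.n_hidden)
        # Steps 2-3: hidden matrix H, then D^0 = H^dagger t via the pseudo-inverse
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        # Class decision: index of the largest output node
        return np.argmax(self._hidden(X) @ self.beta, axis=1)
```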

MFO algorithm

This algorithm is a meta-heuristic approach inspired by the natural behavior of moths. For optimization, the moths are treated as candidate solutions and are initialized at random positions. In the MFO technique, flames act as artificial lights, and moths are drawn toward them based on their positions. The different steps of the MFO algorithm are as follows:

  1. Moths are initialized at random positions in the search space.

  2. A random number \(r \in [a, 1]\) is initialized, where \(a\) is a convergence constant that decreases linearly from -1 to -2 over the iterations, and \(b\) is the spiral shape constant, set to 1.

  3. Initially, the number of flames is set equal to the number of moths.

  4. The fitness value of each moth is determined, and the moths are sorted from best to worst (maximum to minimum for maximization problems, or vice versa).

  5. The number of flames (NF) at iteration \(I\) is determined using the equation below:

    $$\begin{aligned} NF = \text {round}\left( N - I \cdot \frac{N - 1}{T}\right) \end{aligned}$$
    (4)

  6. The NF flames are initialized from the sorted list of moths.

  7. The distance \(d_i\) between moth \(m_i\) and its flame \(f\) is determined using the following equation:

    $$\begin{aligned} d_i = |f - m_i| \end{aligned}$$
    (5)

  8. The values of \(a\) and \(r\) are updated using the following equations:

    $$\begin{aligned} a = -1 + I \times \left( \frac{-1}{T} \right) \end{aligned}$$
    (6)

    $$\begin{aligned} r = (a - 1) \times \text {rand()} + 1 \end{aligned}$$
    (7)

  9. The moth’s position \(m_i\) in relation to the corresponding flame \(f_j\) is updated using the logarithmic spiral function:

    $$\begin{aligned} S_p(m_i, f_j) = d_i \, e^{b r} \cos (2 \pi r) + f_j \end{aligned}$$
    (8)

  10. The best flame is returned as the best solution found.
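Under the update rules above, a compact NumPy implementation of the optimizer might look as follows; the population size, bounds, and the convention that surplus moths share the last flame are illustrative assumptions taken from the standard MFO formulation.

```python
import numpy as np

def mfo(fitness, dim, n_moths=30, T=100, lb=-1.0, ub=1.0, rng=None):
    """Minimize `fitness` over [lb, ub]^dim with Moth-Flame Optimization."""
    rng = rng or np.random.default_rng(0)
    moths = rng.uniform(lb, ub, (n_moths, dim))
    b = 1.0                                           # spiral shape constant
    flames = flame_fit = None
    for I in range(1, T + 1):
        fit = np.array([fitness(m) for m in moths])
        if flames is None:                            # first iteration: sort moths
            order = np.argsort(fit)
            flames, flame_fit = moths[order].copy(), fit[order]
        else:                                         # keep the best N of moths + flames
            pool = np.vstack([flames, moths])
            pool_fit = np.concatenate([flame_fit, fit])
            order = np.argsort(pool_fit)[:n_moths]
            flames, flame_fit = pool[order].copy(), pool_fit[order]
        NF = round(n_moths - I * (n_moths - 1) / T)   # eq. (4): flame decrement
        a = -1.0 + I * (-1.0 / T)                     # eq. (6)
        for i in range(n_moths):
            j = min(i, NF - 1)                        # surplus moths share the last flame
            d = np.abs(flames[j] - moths[i])          # eq. (5)
            r = (a - 1.0) * rng.random(dim) + 1.0     # eq. (7)
            moths[i] = d * np.exp(b * r) * np.cos(2 * np.pi * r) + flames[j]  # eq. (8)
            moths[i] = np.clip(moths[i], lb, ub)
    return flames[0], flame_fit[0]                    # best flame = best solution
```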

Proposed MFOP-ELM techniques

The conventional ELM is built upon randomly assigned input weights and hidden biases. Firstly, a large number of neurons is required to achieve high precision. Secondly, because the weights and biases are chosen arbitrarily, the resulting output matrix can vary considerably between runs. To overcome these challenges, population-based optimization models such as PSO and Genetic Algorithms (GA) have been used in recent years to tune the hidden node parameters of the traditional ELM. In this paper, we employ the MFO algorithm to train the ELM by optimizing the hidden node parameters, namely the weights and biases, so as to maximize the performance of the ELM.

The objective of MFO-ELM is to minimize the prediction error between the predicted and true labels by optimizing the hidden node weights and biases:

$$\begin{aligned} \min \mu (m_j) = \sum _{j=1}^{N_v} \sum _{i=1}^{N_o} |O_{ij} - D_{ij}| \end{aligned}$$
(9)

Here, \(N_h\) and \(N_i\) denote the total number of nodes in the ELM’s hidden and input layers, respectively. Using the equation

$$\begin{aligned} M_{\text {opt}} = (N_i + 1) \times N_h, \end{aligned}$$
(10)

the number of variables to be optimized, \(M_{\text {opt}}\), is determined. The steps of the MFOP-ELM method are as follows:

  1. Initialize the candidate solutions by randomly selecting the moth positions. Set the input weights and hidden biases in the range [-1, 1] to create a potential solution:

    $$\begin{aligned} M_j = \left[ D_{11}^h, D_{12}^h, D_{13}^h, \dots , D_{1L}^h, \dots , D_{i1}^h, D_{i2}^h, \dots , D_{iL}^h, G_1, G_2, \dots , G_L \right] . \end{aligned}$$
    (11)

  2. Set the current iteration counter \(I = 1\), the spiral constant \(b = 1\), and T as the maximum number of iterations.

  3. For each moth, calculate the prediction error \(\mu (m_j)\):

    $$\begin{aligned} \mu (m_j) = \sum _{j=1}^{N_v} \sum _{i=1}^{N_o} |O_{ij} - D_{ij}| \end{aligned}$$
    (12)

  4. Update the number of flames using equation (4).

  5. Calculate the gap between each potential solution and its flame using equation (5).

  6. Update the parameters a and r using equations (6) and (7), and update the position of each moth using equation (8).

  7. Repeat steps 3 through 6 until the maximum number of iterations is reached. To find a global solution while balancing local exploitation and global exploration, the MFO algorithm relies on the logarithmic spiral search of equation (8) and the flame decrement strategy of equation (4).

Finally, the optimized hidden node parameters are used to evaluate the testing samples and produce the proposed model’s classifications.

The summary of all the notations used above is as follows:

  • \(N_h\): Number of hidden neurons in the ELM

  • \(N_i\): Number of input features

  • \(N_v\): Number of validation samples

  • \(N_o\): Number of output nodes

  • \(O_{ij}\): Output of the ELM for the j-th sample and the i-th output class

  • \(D_{ij}\): Desired output (ground truth)

  • \(M_{\text {opt}}\): Number of optimized parameters

  • a, r: MFO spiral search constants

  • T: Maximum number of MFO iterations

The pseudocode of the proposed model is given in Algorithm 1.

Algorithm 1
figure a

MFOP-ELM: Moth-Flame Optimized extreme learning machine.
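To make the coupling concrete, the sketch below (reusing the `mfo` routine and sigmoid hidden layer from the earlier sketches) builds the fitness function that decodes a moth into ELM input weights and biases, solves the output weights analytically, and returns the validation error of equation (12). The training/validation split used inside the fitness is an assumption for illustration.

```python
import numpy as np

def make_fitness(X_tr, T_tr, X_val, T_val, n_inputs, n_hidden):
    """Build the MFO objective: decode a moth of length (N_i + 1) * N_h
    (eq. 10) into input weights W and biases b, solve the output weights
    analytically, and score the moth by the validation error of eq. (12)."""
    def fitness(m):
        W = m[: n_inputs * n_hidden].reshape(n_inputs, n_hidden)
        b = m[n_inputs * n_hidden:]
        hidden = lambda X: 1.0 / (1.0 + np.exp(-(X @ W + b)))
        beta = np.linalg.pinv(hidden(X_tr)) @ T_tr   # analytic output weights
        return np.abs(hidden(X_val) @ beta - T_val).sum()
    return fitness

# best_moth, best_err = mfo(
#     make_fitness(X_tr, T_tr, X_val, T_val, n_inputs, n_hidden),
#     dim=(n_inputs + 1) * n_hidden)
```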

VGG-16

In 2014, Karen Simonyan and Andrew Zisserman introduced the VGG-16 architecture. The term VGG stands for Visual Geometry Group, and 16 refers to the number of weight layers in the network20. Essentially, this is a simple model designed for visual object recognition, trained on the ImageNet dataset of 14 million images spanning 1000 classes. VGG-16 is an advanced version of the AlexNet architecture. Instead of using large kernel sizes, VGG-16 stacks multiple 3 \(\times\) 3 kernel filters one after another; this deepens the network and provides more non-linearity while requiring fewer parameters than a single large kernel of equivalent receptive field. The model has a fixed input image size of 224 \(\times\) 224. The images are passed through a stack of convolutional layers, each using 3 \(\times\) 3 filters to capture features in all directions. It also uses 1 \(\times\) 1 convolution filters for linear transformations of the input channels. The convolution stride and padding are fixed at one pixel. The architecture includes five max-pooling layers, each following a block of convolutional layers and using a stride of 2. The convolutional stack is followed by three fully connected layers: the first two consist of 4096 channels each, and the third performs the 1000-way classification for the ImageNet Large Scale Visual Recognition Challenge. The final layer is a softmax layer. The overall architecture of VGG-16 is shown in Fig. 3.

Fig. 3
figure 3

VGG-16 architecture model.

There are some drawbacks of VGG-16, such as slow training times and large network weights.

Table 3 VGG-16 architecture details and output sizes.

The choice of 13 convolutional and 5 max-pooling layers reflects the proven balance between model complexity and performance in VGG16. The small 3\(\times\)3 kernels capture fine-grained features while keeping the parameter count manageable, and max pooling progressively reduces spatial dimensions, enabling hierarchical feature abstraction. The architectural details are specified in Table 3.

ResNet-50

ResNet50 refers to the Residual Network designed to mitigate gradient-related problems in deep architectures. The problems of gradient disappearance (vanishing gradients) and gradient explosion can cause the following issues in deep networks: (1) a long training period with difficult or impossible convergence, and (2) the degradation problem, where network performance gradually saturates and even declines as the network deepens. To address these challenges, the ResNet architecture was developed, providing high efficiency and better performance even as the number of layers increases. Deep Convolutional Neural Networks (CNNs) are known for their ability to detect high, mid, and low-level features in images. However, stacking additional layers to improve accuracy can cause training difficulties. The authors of ResNet addressed this issue by creating a deep residual learning framework21 that uses shortcut connections to perform identity mappings.

In ResNet-50, skip connections are used, as shown in Fig. 4. The input can be connected directly to the output, skipping a few layers. For example, in Fig. 4, the input X is carried forward through an identity connection past the skipped layers, so the block computes \(F(X) + X\). The final output is \(H(X) = \text {ReLU}(F(X) + X)\).

Fig. 4
figure 4

Architecture overview of ResNet-50.
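A simplified two-layer residual block illustrating \(H(X) = \text{ReLU}(F(X) + X)\) can be written in Keras as follows. This is a sketch: the actual ResNet-50 uses 1×1-3×3-1×1 bottleneck blocks with batch normalization, and here `x` is assumed to already have `filters` channels so the identity addition is valid.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Identity-shortcut block: output = ReLU(F(x) + x)."""
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)   # F(X), no activation yet
    return layers.ReLU()(layers.Add()([f, x]))         # ReLU(F(X) + X)
```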

Inception V3

One successful way to enhance network depth and learning ability is the inception architecture. The network is built from repeated inception modules. Each inception module has four parallel paths, as indicated in Fig. 5, whose outputs are merged by filter concatenation at the second stage. The first path consists of a 1\(\times\)1 convolution, which passes the data forward without spatial transformation. The second and third paths contain a 1\(\times\)1 convolution followed by a bank of 5\(\times\)5 and 3\(\times\)3 convolutions, respectively24. To obtain the overall training loss, the inception modules are connected to a softmax classifier25.

Fig. 5
figure 5

Architecture of inception V3.
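A Keras sketch of such a four-path module with filter concatenation is given below; it follows the classic GoogLeNet layout for clarity, whereas Inception V3 additionally factorizes the larger convolutions.

```python
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fp):
    """Four parallel paths merged by filter concatenation (simplified sketch)."""
    p1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)          # 1x1 pass-through
    p2 = layers.Conv2D(f3, 1, padding="same", activation="relu")(x)
    p2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(p2)         # 1x1 -> 3x3
    p3 = layers.Conv2D(f5, 1, padding="same", activation="relu")(x)
    p3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(p3)         # 1x1 -> 5x5
    p4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    p4 = layers.Conv2D(fp, 1, padding="same", activation="relu")(p4)         # pool -> 1x1
    return layers.Concatenate()([p1, p2, p3, p4])                            # filter concat
```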

Research materials and methods

The research is conducted in two environments: a standalone environment and a cloud environment. The normalized dataset is split into training (60%) and testing (40%) data. The experimental results are compared with various classifiers such as SVM, BPNN, KNN, ELM, and MFO-ELM. The proposed model workflow is shown in Fig. 6.

Fig. 6
figure 6

Work flow diagram of proposed model.

Hyperparameter settings

The experimental settings for the MFOP-ELM model were chosen based on empirical tuning to balance accuracy and computational efficiency. The table below lists the key hyperparameters used in all experiments.

Table 4 MFOP-ELM hyperparameter settings.

Cloud environment

In this paper, the MFO-ELM model is tested on Amazon Elastic Compute Cloud (EC2), an Infrastructure-as-a-Service (IaaS) platform, and compared with the standalone system. The primary goal of using the cloud environment is to reduce latency and improve accuracy. The virtual machines in the cloud are provisioned with the Linux operating system. The results from both environments are then compared to analyze the performance.

Standalone environment

Table 5 Determination of parameters using MFO-ELM with different hidden layers in standalone system.
Fig. 7
figure 7

Model accuracy for standalone system.

The performance of the standalone environment is evaluated using a system with the following configuration: 8 GB RAM, an 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz processor, and a 1 TB HDD. Python 3.1 and the PyCharm IDE are used for the development and execution of the various classification models.

Performance of cloud-based environment

The proposed MFO-ELM model, tested on various hidden node configurations in a standalone system, outperformed traditional models when implemented in a cloud environment. By varying the hardware specifications, we determined the performance of MFO-ELM in the cloud environment. We experimented with hidden node counts ranging from 1 to 250. The same model was evaluated on cloud computing virtual machines with distinct configurations: vCPU-4 16 GB RAM, vCPU-8 32 GB RAM, and vCPU-16 64 GB RAM. Each virtual machine executed the proposed model with identical hidden node settings. The simulation outcomes are documented in Tables 6, 7, and 8. Our findings indicate that the vCPU-16 64 GB RAM virtual machine achieved the highest accuracy of 94.19% with 250 hidden nodes compared to the other virtual machines. The outcomes are visually represented in Figs. 8, 9, and 10.

Each reported metric is averaged over 5 independent trials, and we report mean ± standard deviation. For example: Accuracy: \(94.19\% \pm 0.31\), Sensitivity: \(89.56\% \pm 0.25\), Specificity: \(88.76\% \pm 0.37\).

Fig. 8
figure 8

Performance of MFO-ELM in cloud environment on vCPU-4 16 GB RAM.

Table 6 Performance of MFO-ELM in cloud environment with vCPU-4 16 GB RAM.
Fig. 9
figure 9

Performance of MFO-ELM in cloud environment on vCPU-8 32 GB RAM.

Table 7 Performance of MFO-ELM in cloud environment with vCPU-8 32 GB RAM.
Fig. 10
figure 10

Performance of MFO-ELM in cloud environment on vCPU-16 64 GB RAM.

Table 8 Performance of MFO-ELM in cloud environment with vCPU-16 64 GB RAM.

Comparison of MFO-ELM in standalone environment versus cloud systems

This section compares the performance of the proposed model in both standalone and cloud environments. The models were tested with varying numbers of hidden nodes, ranging from 1 to 250. As shown in Table 9, the cloud environment, particularly the vCPU-16 64 GB RAM virtual machine, achieved the highest classification accuracy of 94.19%, compared to the standalone system and the smaller virtual machine configurations.

Figure 11 depicts the confusion matrix of the best performing model.

Table 9 Comparison of classification accuracy of different methods with existing work.
Fig. 11
figure 11

Confusion matrix for the proposed MFOP-ELM model on the RADIOML2016.10A dataset.

Ablation study

To evaluate the effectiveness of each component of our proposed MFOP-ELM model, we conducted an ablation study using the RADIOML2018.01A dataset. This study systematically removes or replaces modules to quantify their individual contributions toward classification performance. All experiments were conducted under identical hardware settings (vCPU-16 with 64 GB RAM) and dataset splits.

Experimental setups

We considered the following configurations:

  1. ELM (baseline): Extreme Learning Machine without any optimization or deep features.

  2. MFO-ELM: ELM with Moth Flame Optimization to tune weights and biases.

  3. Single backbone models: Feature extraction from only one pre-trained model (InceptionV3, ResNet50, or VGG16) fed into MFO-ELM.

  4. Feature fusion (without optimization): Concatenated features from all three models input to a vanilla ELM.

  5. MFOP-ELM (proposed): Our full model with fused features and MFO-optimized ELM.

Results and analysis

Table 10 Ablation study results on RADIOML2018.01A (vCPU-16, 64 GB RAM).

The ablation results clearly show the benefit of combining both feature fusion and MFO optimization. While single backbone models perform reasonably well, the fusion strategy provides a richer feature space. Moreover, optimization with MFO leads to consistent improvements across all configurations. Compared to the end-to-end CNN, our MFOP-ELM is not only more accurate but also lighter and faster to train, making it highly suitable for real-time cloud deployments.

Comparison with CNN models and traditional classifiers

To validate the advantages of the proposed MFOP-ELM approach, we benchmarked it against both traditional end-to-end CNN classifiers and conventional machine learning classifiers using fused deep features. The CNN models evaluated include VGG-16, ResNet50, InceptionV3, DenseNet121, MobileNetV2, and EfficientNet-B0, all trained on the same dataset (RADIOML2018.01A at SNR = 0 dB) with identical preprocessing steps, excluding any feature fusion or ELM integration.

In addition, the fused features extracted from these CNNs were classified using standard classifiers such as Support Vector Machine (SVM), k-Nearest Neighbors (KNN), and Backpropagation Neural Network (BPNN). The performance comparison in terms of classification accuracy is presented in Table 11.

Table 11 Comparison of classification metrics (RADIOML2018.01A, SNR = 0 dB).

The results clearly demonstrate that the proposed MFOP-ELM outperforms both individual CNN models and conventional classifiers in terms of classification accuracy. This improvement is attributed to the effective fusion of deep features and the efficient learning capability of the ELM, which benefits from its non-iterative nature and rapid training time. These findings highlight the suitability of MFOP-ELM for complex modulation recognition tasks involving high-dimensional data. The classification results for the proposed MFOP-ELM model were statistically evaluated over multiple runs to ensure robustness and that the observed performance was not due to chance. The mean accuracy was found to be 88.7%, with a root mean square error (RMSE) of 1.2% and a standard deviation of 1.5%, indicating consistent and reliable model performance.
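For reference, the conventional baselines can be reproduced with scikit-learn along the following lines. The hyperparameters here are illustrative assumptions, `MLPClassifier` stands in for the BPNN, and `X_tr`, `y_tr`, `X_te`, `y_te` denote the fused feature matrices and modulation labels produced by the earlier pipeline.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# X_tr, X_te: fused deep features; y_tr, y_te: labels (assumed precomputed)
baselines = {
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "BPNN": MLPClassifier(hidden_layer_sizes=(256,), max_iter=500),
}
for name, clf in baselines.items():
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```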

Robustness considerations for real-world deployment

Although our experimental setup focuses on evaluating performance using static SNR conditions from the RADIOML datasets, real-world wireless environments are affected by channel impairments such as multipath fading, Doppler shifts, impulsive noise, and co-channel interference. These factors can significantly impact modulation classification accuracy. To enhance the applicability of our model for practical deployments, future work will include testing under dynamic fading models such as Rayleigh and Rician channels, and evaluating the model’s robustness under varying SNR levels, interference patterns, and mobile transmitter-receiver conditions. Such evaluation will help quantify the generalization capability of the MFOP-ELM framework in realistic spectrum environments.

Explainable AI analysis

While the quantitative results reported in Section Research materials and methods confirm that MFOP-ELM achieves state-of-the-art modulation-classification accuracy, practical deployment demands transparency beyond raw numbers. To this end, we conducted a post-hoc, model-agnostic explainability study using LIME, examining both the global importance of the concatenated deep features and the instance-level reasoning for individual frames.

Figure 12 shows the global feature importance over a stratified validation subset of 500 frames. By averaging the absolute LIME weights, we see that InceptionV3 features (blue bars) dominate the attribution mass, with the single most influential feature contributing over 20% of the total. ResNet-50 (orange) and VGG-16 (green) features account for the remaining importance, collectively confirming that texture-like descriptors from InceptionV3 are the primary drivers of the MFOP-ELM decision. To illustrate local reasoning, Fig. 13 presents the LIME explanation for one correctly classified frame (sample 165). Here, negative contributions from features f299, f2517, and f3424 reduce the logit for the true class, whereas positive contributions from f4140, f3967, and f1543 reinforce it. This behaviour aligns with domain knowledge: attention is focused on constellation-cluster and spectral-peak descriptors rather than noise.

Fig. 12
figure 12

Global feature importance (Top 20).

Fig. 13
figure 13

Local LIME explanation.

We employed LIME for local instance-level explanations and SHAP for global contribution insights. LIME was selected due to its model-agnostic flexibility and ability to reveal input feature weights affecting single predictions.
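A minimal sketch of the tabular LIME setup, assuming `X_tr` and `X_te` hold the fused feature matrices and `predict_proba` wraps the MFOP-ELM’s class-probability output, is:

```python
from lime.lime_tabular import LimeTabularExplainer

# X_tr / X_te: fused feature matrices; predict_proba: a wrapper returning
# class probabilities from the MFOP-ELM (both assumed available upstream).
explainer = LimeTabularExplainer(
    X_tr,
    feature_names=[f"f{i}" for i in range(X_tr.shape[1])],
    mode="classification",
)

# Local explanation for one frame (sample 165 in the text)
exp = explainer.explain_instance(X_te[165], predict_proba, num_features=10)
print(exp.as_list())  # signed local contributions, e.g. ("f4140 > 0.7", +0.08)
```

Global importance scores of the kind shown in Fig. 12 follow by averaging the absolute per-feature weights over a stratified validation subset.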

Insights from XAI:

  • Features from InceptionV3 dominate both local and global attribution.

  • Class-specific features (e.g., clustered spectral peaks) align with domain understanding of modulation symbols.

  • Masking important features reduced accuracy significantly, confirming model reliance on fused representations.

System optimization based on XAI:

  • Reduce feature dimensionality by selecting top-k features.

  • Perform targeted retraining on underperforming SNR slices.

Two additional sanity checks corroborate trustworthiness. First, masking the top 10% of globally important features reduced accuracy from 94.2% to 72.5%, while masking the bottom 10% caused only a 3.1% drop, demonstrating causal relevance. Second, when LIME explanations were recomputed within narrow SNR strata, the ordering of the top-5 features was preserved in 92% of cases, indicating stability across channel conditions. Beyond transparency, these findings suggest concrete optimizations: pruning or quantizing the latent vector for edge deployment, and targeted data augmentation in vulnerable SNR ranges. In summary, the LIME-based analysis validates MFOP-ELM’s credibility and uncovers actionable levers for future pipeline refinements.

Conclusion

This study presents the implementation of an automated performance accuracy analysis utilising an MFO-ELM framework in conjunction with a cloud-based model. The cloud environment provides continuous services that are accessible at any time and from any location. The MFO-ELM method demonstrates a capacity to circumvent local minima, facilitates rapid convergence, and exhibits relative simplicity when compared to alternative classifiers. The proposed model underwent testing in both standalone and cloud environments utilising various virtual machine configurations. The cloud environment, using a virtual machine with 16 virtual CPUs and 64 GB of RAM, demonstrated an enhanced classification accuracy of 94.19% in comparison to the standalone system. Additionally, SHAP values were calculated to offer insights into the explainability of artificial intelligence by elucidating the contribution of each deep feature to the predictions made by the ELM. In future research, the utilisation of additional resources may contribute to an enhancement in accuracy. Several parameters associated with the MFO-ELM have also been adjusted to enhance the effectiveness of the proposed model, and subsequent investigations may involve the application of alternative optimisation techniques to enhance the generalizability of the system’s performance. The proposed model is well-suited for real-time deployments in cognitive radio, IoT-based spectrum monitoring, and defense communications. Its low-complexity structure (using ELM) and cloud readiness enable easy deployment in constrained and scalable environments. Moreover, the integration of explainable AI offers system developers insights into failure modes, aiding robustness and future improvements.