Introduction

The Internet of Medical Things (IoMT), which stems from IoT, is transforming healthcare through the interconnection of medical devices, sensors, and IT systems, facilitating remote monitoring, diagnostics, and personalized therapy1. These devices form ad hoc networks used to transmit biomedical data, composing a multilayer architecture that may include sensors, network modules, storage units, and AI-based platforms2. Despite their benefits, IoMT networks also suffer from security vulnerabilities due to their wide inter-connectivity and heterogeneous device setups.

Recent research has proposed several intrusion detection system (IDS) frameworks specifically designed for the constantly evolving context of IoMT networks. These include hybrid architectures based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs)3, ensemble models with meta-learning to improve adaptability and detection rate4, and transformer models that offer explainability and robustness5. Deep reinforcement learning has also enabled adaptive IDSs, such as a hybrid CNN-LSTM with Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) that provided real-time threat detection across numerous IoMT datasets6. Another noteworthy model combines federated learning with reinforcement agents to secure IoMT while keeping data private7.

The incorporation of the Internet of Medical Things (IoMT) into healthcare systems has enabled real-time patient monitoring and proactive treatment interventions. Conventional security approaches (particularly static signature-based IDSs) are limited in addressing the resulting challenges, which highlights the need for an adaptive and dynamic threat management paradigm. Due to resource-constrained environments and vulnerable communication protocols, IoMT devices have become prime targets for cyber threats and unauthorized data breaches. This research investigates these security concerns through a structured three-phase methodology.

In Phase I, we conduct a preliminary classification with a supervised learning model based on the C4.5 decision tree algorithm, targeted at a fast, low-latency response to known threats. To improve detection accuracy in this phase, we utilize feature importance, ensemble methods, and class balancing with SMOTE. In Phase II, the model embeds the C4.5 classifier in a DQN reinforcement learning loop, forming a hybrid IDS. The C4.5 classifier makes rapid, interpretable classifications, while the DQN agent learns dynamic threat patterns over time8,9. This hybrid approach adapts to changing threats by learning new policies through feedback from the environment, hence providing real-time security10. In Phase III, we tested the model’s generalization capabilities. The hybrid IDS was validated across multiple heterogeneous IoMT datasets: WUSTL-EHMS, ECU-IoHT, DF_IOMT, and CICIoT2023. We observed that the hybrid IDS demonstrated consistently high performance across these datasets, confirming the flexibility of the framework for deployment and real-world use in unfamiliar scenarios. The proposed IDS is thus a holistic approach encompassing low-latency supervised detection, the adaptability of deep reinforcement learning, and proven generalization across diverse threat landscapes.

Contributions of the current study

The contributions of this study can be summarized as follows:

  1. A deployment-aware and lightweight IDS designed primarily for Internet of Medical Things (IoMT) environments, motivated by the need for high detection accuracy at low inference latency.

  2. We introduce a hybrid IDS that merges a pre-trained decision tree (C4.5) with a DQN to build adaptability and self-learning capability into the model, providing enhanced detection of evolving threats.

  3. We demonstrate how this hybrid IDS can efficiently capitalize on supervised learning for initial classification while gaining the continuous adaptation and reward optimization of reinforcement learning, producing improved long-term performance and resilience.

The structure of this paper is as follows. Section Related Work reviews related work and discusses security aspects of IDS for IoMT. Section Dataset describes the datasets and their composition. Section Proposed Methodology explains the proposed methodology, including model training and design aspects. Section Results presents results and analyses. Section Discussion discusses the obtained results. Section Limitations explains the limitations observed in executing the proposed pipeline. Finally, Section Conclusion and Future Work concludes with key findings, contributions, and future directions.

Related work

Security challenges in IoMT networks

IoMT devices are uniquely at risk due to their resource limitations and the lack of a security baseline for organizations deploying them11. Threats commonly affecting IoMT include Denial of Service (DoS), Spoofing, Reconnaissance, Man-in-the-Middle (MiTM) attacks, Distributed Denial of Service (DDoS), and Brute Force. These are compounded by weak authentication, inadequate encryption practices, and limited firmware updates, exposing IoMT systems to data breaches. Such threats affect components across the entire IoMT architecture, from the perception layer to the application layer. The varying nature of devices and the sensitivity of the data increase the difficulty of developing a standard security approach. In the event of a security breach, trust can deteriorate, which may disincentivize technology adoption12. Privacy concerns are escalated given the nature of the data being collected, e.g., sleeping habits, diet, etc.

Intrusion detection systems (IDS) for IoMT

Intrusion detection systems (IDS) protect networks from compromise and are especially important for IoMT networks. IDSs rely on two general approaches to monitor networks for anomalies and known threats: signature-based detection and anomaly-based detection13. Signature-based systems are very accurate when detecting known threats but perform poorly with unknown ones. Anomaly-based detection utilizes machine learning (ML) algorithms to identify unauthorized activity, but suffers from a high false positive rate. Hybrid intrusion detection systems can combine signature-based and anomaly-based detection to exploit the strengths of both. IDSs are essential components of network defenses for IoMT, but existing solutions are often less effective in real time due to dynamic traffic and high false positive rates, which is particularly problematic in a medical context14.

Advanced AI/ML-based IDS solutions for IoMT

AI/ML-based IDSs can recognize and respond to threats in real time and adapt to new threats. Several algorithms, such as Support Vector Machines (SVM), Random Forests (RF), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks, have shown high detection accuracy, while ensemble-based models (e.g., eXtreme Gradient Boosting, ERT) have exhibited better performance than individual classifiers15.

Existing hybrid IDS models

Recently developed hybrid IDS models attempt to combine static classifiers with reinforcement learning to facilitate adaptability and improve detection performance. For instance, Shaikh et al. (2025) presented a hybrid CNN-LSTM design with a Deep Q-Network (DQN) for real-time detection of threats specific to IoMT environments and recorded decent detection rates, albeit with comparatively high inference latency6. Another study presented a hybrid approach referred to as HDRL-IDS, which integrated Deep Deterministic Policy Gradient (DDPG) with deep learning based strategies for intrusion detection in 5G-enabled medical networks; however, it identified interpretability and model complexity issues16. Other architectures, such as MLP + PPO, offer significant adaptability but suffer from over-sensitivity to parameter tuning and a lack of transparency in their decision processes.

These problems are further compounded by the hyper-parameter tuning burden of DQN and the intrinsic lack of interpretability of neural networks. Due to these issues, we seek to create a C4.5-DQN hybrid, which integrates the explainability and low latency of C4.5 with the low-complexity dynamic policy learning that DQN affords, while aiming to remain computationally lightweight for IoMT settings. Meta-learning has demonstrated the capacity to adjust classifiers’ weights dynamically in ensemble settings. However, meta-learning approaches are computationally intensive and make classification times unsuitable for IoMT specifications4, therefore requiring lightweight models and edge/fog computing. Realistic datasets, such as CICIoMT2024, are critical to further build upon existing work and to develop models and solutions that generalize. Table 1 summarizes key strengths and limitations of common ML-based models used in IDS, particularly in the IoMT context.

Table 1 Comparison of AI/ML-based IDS techniques in IoMT Environment.

Existing and emerging hybrid IDS models


Much recent work has evolved in the area of intrusion detection systems (IDS) for Internet of Medical Things (IoMT) environments. For example, a recent study22 proposed a multi-attention Deep Convolutional Recurrent Neural Network (DeepCRNN)-based cyberattack detector that was efficient for devices in smart IoMT environments. Another work23 presented a hybrid deep learning method incorporating an Autoencoder (AE) and a Long Short-Term Memory (LSTM) network for intrusion detection under imbalanced Industrial Internet of Things (IIoT) traffic, where attack samples are rare compared to benign traffic. Similarly, a survey24 examined Internet of Things (IoT) security models, including a Firefly Algorithm (FA)-optimized feature selection with LSTM for intrusion detection, while other researchers25,26 explored genetic algorithm (GA)-driven and metaheuristic-optimized LSTM models for intrusion detection at IoT-edge interfaces. In the Cyber-Physical Systems-Industrial Internet of Things (CPS-IIoT) domain, attention-based, explainable, privacy-preserving architectures have been implemented in recent work27,28, addressing both resilience and interpretability. Motivated by these issues, we intend to create a hybrid model that integrates the low latency of a suitable static classifier with the dynamic policy learning capabilities of DQN, while remaining computationally lightweight for IoMT settings.

Emerging paradigms for IoMT security

Edge and fog computing reduce latency while improving privacy by processing data locally. Blockchain delivers transparency, immutability, and distributed authentication; however, it is not suitable for high data throughput. Federated Learning (FL) allows collaborators to train algorithms without sharing raw data and reduces the communication overhead and latency of sharing updates29. Each concept addresses its own set of challenges, and when combined, they create powerful hybrid security architectures. Explainable AI (XAI) has become both a necessity and a focus area in healthcare, as clinician acceptance and adoption of AI models rely on their transparency. Ensemble learning architectures can achieve higher accuracy through model diversity, but can make interpretation ambiguous because decision outcomes are spread across multiple learners30,31. Post hoc methods such as SHAP and LIME aim to counteract this ambiguity, though they may produce inconsistent explanations across sub-models of the ensemble32. Deep reinforcement learning (DRL) adds even more opacity, since its deep policy networks must be interrogated sequentially, an option that is not always available unless integrated into formal verification frameworks for health decisions33. Given these limitations, some studies have examined hybrid approaches, such as hierarchical RL with a built-in interpretability layer, to achieve acceptable and accurate interpretation of AI models suitable for IoMT deployments. Features like flow duration and biometric information serve as valuable inputs for anomaly-based IDS models. Models that offer greater transparency are more likely to be accepted and demonstrate accountability, both foundational principles for clinical use34.

Research gaps and opportunities

The Internet of Medical Things provides healthcare systems with a pathway to e-health services. However, the high-risk nature of IoMT presents significant security problems, in particular the absence of lightweight options capable of working in resource-constrained IoMT environments, poor adaptability to changing threats, and insufficient evaluation across diverse real-world datasets35. Most existing IDS approaches either incur high computational costs or are unable to adjust dynamically to evolving attack surfaces. In addition, generalizability across different operational contexts is often ignored, further limiting real-world deployment. These problems emphasize the need to develop IDS frameworks that are computationally efficient and generalize robustly to unseen contexts36,37. This research addresses these limitations by developing a hybrid, lightweight, and generalizable intrusion detection system suited to IoMT environments.

Dataset and preprocessing

A diverse collection of datasets covering many kinds of attacks is used to validate the proposed model. Each selected dataset reflects different aspects of IoT/IoMT environments, such as varying network characteristics and features drawn from different network layers, ensuring a complete assessment of the model’s performance across a broad spectrum of attacks.

CICIoMT2024 Dataset: The rationale for selecting the CICIoMT2024 dataset for this work is its comprehensiveness: it was specifically designed to assess security solutions in the Internet of Medical Things (IoMT) space. It includes 14 different cyberattack scenarios targeting 40 IoMT devices, incorporating both actual and emulated hardware to simulate realistic threat events. These attacks are grouped into binary (2-class), categorical (5-class), and multiclass (14-class) labelings, as given in Table 2.

Table 2 Attack types for binary, categorical, and multiclass classification problems.

Additionally, the CICIoMT2024 dataset is ideal for machine learning research, having been constructed to support and measure the performance of machine learning-based intrusion detection systems11.

WUSTL-EHMS: This dataset is produced from an enriched healthcare monitoring system comprising a total of 44 features, including 35 network flow features and 8 biometric patient data features. It is primarily concerned with Man-in-the-Middle (MiTM) attacks such as spoofing or data injection intended to corrupt medical telemetry, and is especially useful for evaluating IDS performance under stealthy attacks38.

ECU-IoHT: Designed to address the lack of publicly available IoHT attack datasets, this dataset simulates cyberattacks against healthcare Internet of Things (IoT) infrastructure, including body wearables, infusion pumps, etc. It is well suited for generalization testing of IDS frameworks under medical conditions. However, the literature lacks any comprehensive description of its features, so special effort was made to understand and assess the features and the information they convey39.

DF_IoMT: Feature descriptions are specific to each dataset, and DF_IoMT is unique in that it provides new testing scenarios based on distinct devices, distinct forms of communication, and different forms of attack. Including this dataset adds diversity to the datasets used in this study for benchmarking cross-dataset generalization in the IoMT context40.

CICIoT2023: This dataset covers 33 types of attacks divided into 7 categories, i.e., DoS, DDoS, Brute Force, Reconnaissance, Web-based, Spoofing, and botnet attacks. CICIoT2023 is one of the most extensive, feature-rich, publicly available datasets on IoT security. Its data were collected from 105 real IoT devices, which expands the scope of research from IoMT-specific IDSs to generic IoT systems41. A summary of the datasets used and the corresponding attacks covered is given in Table 3.

Table 3 Datasets used for validating generalization of the proposed IDS.

Proposed methodology

The proposed methodology employs a systematic multi-stage framework for threat detection and adaptive primary response, consisting of data processing and supervised classification (Phase I), layered with a Deep Q-Network (deep reinforcement learning) that learns a dynamic policy (Phase II), and finally a thorough evaluation measuring generalization on real-world datasets (Phase III). Across these three interdependent parts, the objective is to jointly maximize accuracy, adaptation, and readiness based on characterization of the threats and the system environment. The overall methodology is depicted in Fig. 1, which shows the sequential application of all parts of the methodology.

The classification stage provides accurate anomaly detection, which feeds into the reinforcement agent for accurate, real-time decisions. The generalization stage is intentionally positioned as a demonstration of system validation: earlier results from the classification and reinforcement parts would not be credible without showing consistent performance on various datasets. Consequently, Phase III demonstrates the transferability of the system as well as its practical use.

Fig. 1 Execution workflow.

Phase I: Initial supervised classification

The first phase of the methodology pertains to building a high-performing classification model to accomplish maximum accuracy with minimum latency. It involves several critical sub-stages.

Data preprocessing

Data preprocessing is a necessary step to maximize machine learning outcomes, accounting for noise, missing values, and duplicated records. Preprocessing minimizes noise and thus maximizes the model’s ability to generalize over meaningful patterns, especially in high-dimensional or sensor-based datasets, where raw inconsistencies lead to poorer-than-expected output8. Without prior preprocessing, there is a high likelihood of developing a biased model, which affects the overall accuracy of machine learning classifiers.
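As a minimal illustration of such a cleaning pass (the column names below are hypothetical, not taken from the datasets in this study), duplicates can be dropped and missing numeric values imputed with pandas:

```python
import pandas as pd
import numpy as np

# Hypothetical traffic frame; column names are illustrative only.
df = pd.DataFrame({
    "flow_duration": [1.2, 1.2, np.nan, 3.4],
    "rst_count":     [0,   0,   2,      5],
    "label":         ["Benign", "Benign", "DoS", "DoS"],
})

df = df.drop_duplicates()  # remove duplicated records
# Impute missing values with the column median (one common choice).
df["flow_duration"] = df["flow_duration"].fillna(df["flow_duration"].median())
```

Median imputation is only one option; the right strategy depends on the feature's distribution and the fraction of missing entries.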

Data encoding and balancing

Class imbalance is a common issue in supervised ML-based anomaly detection, where rare or critical classes are ignored during model development. SMOTE addresses this problem by generating new synthetic samples for the minority classes. In general, SMOTE-generated minority samples greatly improve recall and F1 score without overfitting42. Improved and enhanced versions of SMOTE further optimize sample generation (keeping synthetic points close to the original data), contributing to overall accuracy and robustness in detecting rare events in real-world datasets43.
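SMOTE's core mechanism, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines. In practice a library implementation (e.g., imbalanced-learn's `SMOTE`) would be used; this minimal version only illustrates the idea, with synthetic stand-in data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Minimal SMOTE sketch: new points lie on the line segment between
    a minority point and one of its k nearest minority neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[i][0] is the point itself
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(idx[i][1:])         # a random true neighbour
        gap = rng.random()                 # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).normal(size=(20, 10))  # 20 minority samples
X_new = smote_sample(X_min, n_new=80)                   # up-sample toward parity
```

Only minority-class rows are passed in, so synthetic points remain inside the minority region of feature space.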

Feature selection and engineering

Not all features in high-dimensional network traffic datasets aid classification, and irrelevant or highly correlated features can increase computational cost and reduce model performance. Therefore, 10 features were carefully selected for training by considering prior domain knowledge, similar studies in the literature, and empirical feature importance. This feature engineering step improves the interpretability of the model and reduces processing time due to lower dimensionality. Feature importance was assessed with two different methodologies, one using a Random Forest classifier and the other using permutation importance. Random Forest assigns greater importance to features based on the improvement in node purity during tree construction, while permutation importance measures a feature’s contribution by the drop in performance when its values are randomly shuffled44,45. The 10 selected features are given in Fig. 2.

Fig. 2 Features selected after the feature selection step.
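The two importance measures described above can be sketched with scikit-learn; the synthetic data and model settings here are illustrative, not the study's actual configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a traffic dataset with 10 candidate features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# Impurity-based ranking: gain in node purity accumulated per feature.
impurity_rank = np.argsort(rf.feature_importances_)[::-1]

# Permutation ranking: score drop when a feature's values are shuffled.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1]
```

Comparing the two rankings helps flag features whose impurity importance is inflated by correlation rather than genuine predictive value.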

Relevance of feature mapping to network layers for IDS development

The success of an intrusion detection system (IDS) in identifying threats depends on its ability to recognize relevant network traffic features and understand their significance at particular points in the network layer stack. This section correlates the ten selected network traffic features with their corresponding layers in both the OSI and TCP/IP models, and describes how they can be used to detect a multitude of network intrusions46,47,48. An IDS can take advantage of features like ‘Header_Length’, ‘rst_count’, and ‘Tot size’ to gain insight into particular layers of the network stack, such as the Network and Transport layers. These features can help identify threats such as TCP reset attacks, malformed packets, and unusual traffic. However, concentrating on just a few layers can create blind spots, especially since modern cyber attacks frequently span multiple layers, from Application Layer exploitation to Transport Layer data exfiltration, and often take advantage of lower-layer characteristics to hide from detection. Given a large dataset, an IDS that cannot evaluate many layers at once will typically not see the full context of an attack49,50.

Correlating data across all layers of the network stack is essential for effective threat detection. Features such as ‘IAT’ capture timing anomalies observable at the Physical/Link Layer, while ‘Protocol Type’ provides information about Application Layer activity. Aggregated features such as ‘Tot sum’ can produce complete behavioral profiles that reveal multi-stage attacks which would otherwise remain hidden. Both the OSI and TCP/IP models can be used to organize the features, but it is essential that insights at different layers draw on data from each layer to provide comprehensive and accurate IDS capabilities51,52. Table 4 details the relevance of the selected features to the network layers.

Table 4 Relevance of selected features to network layers.

Train-test splitting

To objectively assess model performance, the dataset was split into two sets, 80% for training and 20% for testing. Stratified sampling was employed to create partitions with proportionate class distributions. This approach is especially important in imbalanced classification because stratified sampling ensures that minority class instances are included in both the training and testing sets55.
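A stratified 80/20 split of this kind can be expressed with scikit-learn's `train_test_split`; the imbalanced synthetic data below is illustrative:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced stand-in data: roughly 9:1 benign-to-attack ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# stratify=y preserves the class ratio in both partitions.
print(Counter(y_tr), Counter(y_te))
```

Without `stratify`, a small minority class can end up badly under-represented (or absent) in the 20% test partition.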

Model training & selection

A comparative analysis employing six different machine learning classifiers was conducted to identify the one with the best combination of accuracy and latency. This comparative analysis is a first step towards achieving the above performance targets. The deliberate choice of a wide-ranging set of machine learning models, from easily interpretable decision trees to more complex ensemble methods and neural networks, ensured a thorough comparison.

Voting ensemble

This combines the predictions of several individual base models. Drawing on many diverse perspectives improves overall robustness and usually achieves better accuracy than any individual component’s predictions56.
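A soft-voting ensemble over a few heterogeneous base learners might be sketched as follows; the particular base models and the `voting="soft"` setting are assumptions for illustration, not the paper's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Soft voting averages predicted class probabilities across members.
vote = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft")
vote.fit(X, y)
```

Hard voting (majority of predicted labels) is the alternative when some members cannot produce calibrated probabilities.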

Decision tree (C4.5)

The C4.5 decision tree algorithm is an extension of the ID3 algorithm. It is a widely used and well-understood algorithm that can accommodate both continuous and discrete attributes. It can also handle missing values and prune the decision tree to limit overfitting57,58.
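scikit-learn does not ship a native C4.5 implementation; a common stand-in, assumed here for illustration, is its CART-based `DecisionTreeClassifier` with the entropy (information-gain) criterion plus cost-complexity pruning:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Entropy-based splitting mirrors C4.5's information-gain criterion;
# ccp_alpha enables pruning to limit overfitting (as C4.5 does natively).
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.005,
                              random_state=0).fit(X_tr, y_tr)
score = tree.score(X_te, y_te)
```

Note that C4.5 proper uses gain ratio and its own pruning scheme, so this approximation differs in detail while keeping the same interpretable, low-latency character.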

XGBoost

eXtreme Gradient Boosting (XGBoost) is a powerful, distributed, open-source machine learning library. It has the capability and performance to handle even large datasets. Moreover, XGBoost was designed with regularization as a built-in component59.

Neural network/complex MLP

Multi-Layer Perceptrons (MLPs) are a type of artificial neural network consisting of multiple hidden layers and non-linear activation functions (ReLU in this case), allowing them to learn and model complex, non-linear relationships in the data60. Two versions of neural networks (NN) are implemented in this study: a simple NN (an MLP with one hidden layer) and a complex NN (an MLP with three hidden layers). The inclusion of both options reflects the possibility of highly complex relationships in the underlying data that may not be captured by simpler models.
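The two variants could be sketched with scikit-learn's `MLPClassifier`; the hidden-layer widths below are assumptions, since the text specifies only the layer counts:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Simple MLP: one hidden layer. Complex MLP: three hidden layers.
# Layer widths (64 / 128-64-32) are illustrative assumptions.
simple_mlp = MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                           max_iter=500, random_state=0).fit(X_tr, y_tr)
complex_mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32),
                            activation="relu", max_iter=500,
                            random_state=0).fit(X_tr, y_tr)
```

The deeper network trades extra capacity for longer training and inference time, which matters for the latency comparison later in this section.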

Random Forest

Random Forest is an ensemble learning technique that builds a set of decision trees during training, then outputs the mode of the predicted classes (for classification) or the mean prediction (for regression). It uses bagging (bootstrap aggregation) and feature randomness to reduce the correlation between individual trees, leading to better robustness against overfitting and the ability to work with more types of data61,62.

The primary reasons for selecting these classifiers are their previously documented success in intrusion detection and the diverse range of ML model families they represent63. All models were trained on the SMOTE-balanced training set and tuned using either default or experimentally inferred hyper-parameters to mitigate the risk of overfitting.

Model evaluation

To evaluate model performance, accuracy, F1 score, precision, recall, and log loss are used. These metrics were calculated for each of the three classification settings, i.e., binary, categorical, and multiclass, using a single evaluation function to ensure consistent performance assessment when comparing classifiers. In addition, training and inference times were captured to assess computational viability in an IoMT context. Furthermore, we qualitatively compared models based on visual summaries such as confusion matrices, log loss plots, and inference time for binary classification.
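A single evaluation function of the kind described, reusable across binary, categorical, and multiclass runs, might look like this sketch:

```python
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             precision_score, recall_score)

def evaluate(y_true, y_pred, y_proba, average="weighted"):
    """One evaluation function reused for every classification setting,
    so comparisons across classifiers stay consistent."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average,
                                     zero_division=0),
        "recall":    recall_score(y_true, y_pred, average=average,
                                  zero_division=0),
        "f1":        f1_score(y_true, y_pred, average=average,
                              zero_division=0),
        "log_loss":  log_loss(y_true, y_proba),
    }

# Toy binary example: 3 of 4 predictions correct.
metrics = evaluate([0, 1, 1, 0], [0, 1, 0, 0],
                   [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])
```

The `average="weighted"` choice makes the same function valid for the 5-class and 14-class settings; per-class scores can be obtained with `average=None`.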

Evaluation of inference time

Inference time was evaluated and compared for each model to assess suitability for real-time deployment in an IoMT setting. C4.5 and the simple neural network achieved the shortest average inference times (9.64 ms and 25.24 ms, respectively), representing the best case for latency-sensitive environments. In contrast, the Voting Ensemble and complex neural network required significantly longer inference times (exceeding 600 ms), limiting their real-time operational potential. These results underline the importance of balancing detection performance against computational expense when designing IDSs within the resource constraints of healthcare systems.
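Latency figures of this kind can be obtained by timing repeated `predict` calls; this sketch uses a wall-clock average over illustrative data, not the study's actual measurement harness:

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

def mean_inference_ms(model, X, repeats=20):
    """Average wall-clock time of a full-batch predict call, in ms."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        model.predict(X)
        times.append((time.perf_counter() - t0) * 1000.0)
    return float(np.mean(times))

latency = mean_inference_ms(model, X)
```

Averaging over several repeats smooths out scheduler jitter; per-sample latency follows by dividing by the batch size.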

Cross-validation and generalization

To further test that the models generalize to unseen data, k-fold cross-validation was employed with k = 5. In each iteration, the dataset was partitioned into k equal parts, with k-1 parts used for training and the remaining part for testing. This was repeated k times, and model performance was estimated by averaging over the k runs55. Cross-validation reduces bias in the estimation of model performance and allows the detection of overfitting, especially when minority samples are a small fraction of the dataset. This step was particularly useful for the neural network based models, since neural networks are prone to overfitting when sample quantity is low and classes are imbalanced.
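The 5-fold protocol can be sketched with scikit-learn's `StratifiedKFold`, which keeps the minority class represented in every fold; the data and classifier here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced stand-in data (~15% minority class).
X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=0)

# Stratified folds preserve the class ratio in each of the 5 splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=cv, scoring="f1")
mean_f1 = scores.mean()
```

A large gap between the fold scores and the training score is the overfitting signal this step is meant to expose.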

Phase II: Deep Q-Network (DQN) for adaptive detection and response

The fundamental part of the proposed hybrid system is the use of a pre-trained C4.5 model within a Deep Q-Network (DQN) reinforcement learning loop, which is critical for the system’s adaptive capabilities. Before explaining the hybrid framework in detail, it is vital to formally define the DQN components for clarity and reproducibility. The environment components are represented in Table 5:

Table 5 Reinforcement learning components and their connection to C4.5.
Table 6 Hyper parameters used for training the DQN agent.

The working of Deep Q-network is shown in Fig. 3. It explains the role of each of these components in training the model using DQN.

Fig. 3 Deep Q-network architecture.

The Deep Q-Network (DQN) algorithm uses a neural network to approximate Q-values, where each value represents the expected cumulative future reward for taking a specific action in a given state. Key features of DQN include experience replay: past interactions (state, action, reward, next state) are stored in a replay buffer and sampled randomly during training. This technique limits correlations between successive entries, stabilizing the learning process64. The DQN agent training parameters and architecture settings used for adaptive learning within the IoMT environment are given in Table 6. For action selection, the epsilon-greedy exploration strategy was used. The initial exploration rate \(\epsilon\) was set to 1.0 and then decayed by a factor of 0.995 each episode until it reached a minimum of 0.01. This decay schedule allows the agent to explore different actions early on and then increasingly exploit the best behavior it has learned in later episodes. This adaptive exploration is vital for the agent to continually learn from the dynamic IoMT environment and identify evolving attack patterns65.
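The epsilon-greedy schedule just described (start at 1.0, multiply by 0.995 per episode, floor at 0.01) reduces to a few lines; the two-action Q-vector below is a placeholder for the agent's actual Q-network output:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995  # schedule from the text

def select_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # random action
    return int(np.argmax(q_values))              # greedy action

for episode in range(1000):
    action = select_action(np.array([0.2, 0.7]), epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)  # decay after each episode
```

With this schedule, epsilon reaches the 0.01 floor after roughly 920 episodes, so late training is almost entirely exploitation.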

The outputs of the pre-trained C4.5 model, such as predicted class labels, confidence scores, or even the feature vector after processing through C4.5's internal nodes, are used as part of the state representation for the DQN agent. The benefit of this hybrid learning through an RL agent is that it provides a more informed starting point than raw network traffic data alone, capitalizing on the learned patterns and efficiency of the supervised model. Recent works have started exploring the deployment of DQN in IoMT network security for better anomaly detection, showing its potential for adaptive threat detection9. The target Q-values are calculated using the Bellman equation as shown in Eq. 1:

$$\begin{aligned} Q^*(s, a) = {\mathbb {E}} \left[ r_{t+1} + \gamma \max _{a'} Q^*(s_{t+1}, a') \,\Bigg |\, s_t = s, a_t = a \right] \end{aligned}$$
(1)

where \(Q^*(s, a)\) is the optimal Q-value for taking action a in state s. \({\mathbb {E}}\) denotes the expected value of the total reward, and \(\gamma\) is the discount factor (\(0 \le \gamma \le 1\)), which determines the importance of future rewards in decision making.
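The Bellman target in Eq. 1 reduces, for a sampled batch, to a simple vectorized computation; the toy values below are illustrative, not taken from the paper's experiments:

```python
# Sketch of the batched Bellman target from Eq. 1; the Q-values,
# rewards, and done flags below are toy numbers for illustration.
import numpy as np

def bellman_targets(rewards, next_q, dones, gamma=0.99):
    """Target = r + gamma * max_a' Q(s', a'), zeroed at terminal states."""
    return rewards + gamma * next_q.max(axis=1) * (1.0 - dones)

# Toy batch: Q-values for two actions in each of three next states.
next_q = np.array([[0.2, 0.8],
                   [1.5, 0.1],
                   [0.0, 0.0]])
rewards = np.array([2.0, -3.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])   # third transition ends the episode

# Targets: 2 + 0.99*0.8 = 2.792, -3 + 0.99*1.5 = -1.515, terminal -> 1.0
targets = bellman_targets(rewards, next_q, dones, gamma=0.99)
print(targets)
```

In training, these targets are compared against the online network's predicted Q-values for the actions actually taken, and the difference drives the gradient update.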

Let the original dataset be denoted by S, the training set as \(S_{\text {train}}\), and the testing set as \(S_{\text {test}}\). The division is expressed in Eq. 2:

$$\begin{aligned} S = S_{\text {train}} \cup S_{\text {test}} \end{aligned}$$
(2)

Here, the ground truth label is the actual class label of the data point in the dataset, and \(R_d(s, a)\) is the reward for taking action a in state s. Let A be the action taken by the RL agent, and \(A_{\text {actual}}\) be the actual class label of the data point66. The reward \(R_d\) is then given by Eq. 3:

$$\begin{aligned} R_d = {\left\{ \begin{array}{ll} +2, & \text {if } A = A_{\text {actual}} \text { and } A = 1 \\ +1, & \text {if } A = A_{\text {actual}} \text { and } A = 0 \\ -3, & \text {if } A = 0 \text { and } A_{\text {actual}} = 1 \\ -1, & \text {if } A = 1 \text { and } A_{\text {actual}} = 0 \\ \end{array}\right. } \end{aligned}$$
(3)

This reward function takes the following factors into account: a +2 reward is given for correctly identifying an attack (true positive), and a +1 reward for correctly identifying benign activity (true negative). A larger penalty of -3 is applied for missing a real attack (false negative), owing to the potentially serious consequences of undetected threats in healthcare systems. The penalty for a false alarm (false positive) is -1, as its consequences are less severe than those of an undetected threat.

Phase III—Generalization

This section validates the generalization capability of the proposed hybrid model by testing it on multiple unseen datasets. Generalization is critical for validating model robustness and reliability in dynamic domains such as cybersecurity, which is characterized by ever-evolving threats66. Unlike models that perform well only on training data, models that generalize across domains are suited to real-world tasks67. Evaluating across diverse datasets increases external validity and limits the probability of overfitting, making it a critical step before any deployment68. The implementation of the generalization evaluation in this work is shown in Fig. 4.

Fig. 4 Workflow—evaluation of generalization aspects.

Data preparation for generalization

A uniform preprocessing pipeline is applied to each dataset to ensure consistency across datasets. This includes:

  • Data loading and feature/label separation: The dataset on which generalization is to be evaluated is loaded and formatted so that features and target labels are separated, enabling a consistent input structure69.

  • Missing value imputation: Missing values were filled using mean imputation while preserving input distributions70.

  • Train-test split (80/20): All datasets used in the generalization evaluation process used the same train-test split uniformly (80% train and 20% test). This was intended to enable the fairest possible baseline for comparing each model’s performance.

Standardized pre-processing avoids bias from dataset-specific handling and ensures that measured performance reflects generalization rather than variance in pre-processing71.
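The three preprocessing steps above can be sketched with scikit-learn; the synthetic data and injected missing values are placeholders for the real datasets:

```python
# Sketch of the uniform pipeline: mean imputation, then an 80/20 split.
# Data is synthetic; the real pipeline operates on the IoMT datasets.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.05] = np.nan          # inject some missing values
y = rng.integers(0, 2, size=100)

# Mean imputation preserves each feature's average value.
X_imp = SimpleImputer(strategy="mean").fit_transform(X)

# Identical 80/20 split for every dataset in the generalization study;
# stratification (an assumption here) keeps class ratios comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X_imp, y, test_size=0.20, stratify=y, random_state=42)

print(X_train.shape, X_test.shape, np.isnan(X_imp).sum())
```

Applying the identical pipeline to every dataset is what makes the later cross-dataset comparisons attributable to the model rather than to preprocessing choices.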

Model application and training

The C4.5–DQN classifier is uniformly applied to all datasets to maintain consistency in the generalization process. SMOTE is used to balance class distributions, and Random Forest-based feature selection is performed to ensure the inclusion of the most relevant attributes in the classification process. The C4.5 model is then trained on each processed dataset and saved for later inference. Using the same model structure across datasets ensures comparability and a genuine test of generalization72. Additionally, research demonstrates that keeping the model selection and feature-selection pipeline fixed lowers variance in evaluations across datasets73.

Model inference and DQN integration

The trained C4.5 models (as classifiers) are each considered to produce inference on their respective test sets and have their predictions used for the Deep Q-Network (DQN) framework. The DQN agent is framed in a way that it considers the classifier output to be the environmental state input. For every dataset, the RL environment is created, initialized, and then the agent interacts within the environment to learn detection-response behaviors. This allows us to merge static classification with adaptive policy learning, where the DQN agent can learn response patterns against unexpected attacks within a range of IoMT environments74. An additional use for DQN is that it is more robust to noisy or imbalanced test data; the agent only needs to learn a strategy for maximizing reward, which may not involve classification75.
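A hypothetical sketch of how classifier output can be framed as the RL state, mirroring Gym 0.26's reset()/step() conventions without requiring the library: each step presents one C4.5 prediction (label, confidence) and pays out the Eq. 3 reward. The two-feature state layout is our assumption, not a detail given in the paper:

```python
# Hypothetical environment: each step exposes one C4.5 prediction as the
# state; stepping through the test set forms one episode. The state
# layout (predicted label, confidence) is an illustrative assumption.
import numpy as np

class IDSEnv:
    def __init__(self, preds, confs, labels):
        self.preds, self.confs, self.labels = preds, confs, labels
        self.i = 0

    def reset(self):
        self.i = 0
        return self._obs(), {}

    def _obs(self):
        return np.array([self.preds[self.i], self.confs[self.i]], dtype=np.float32)

    def step(self, action):
        truth = self.labels[self.i]
        # Eq. 3 reward scheme: +2 TP, +1 TN, -3 FN, -1 FP.
        reward = (2 if truth == 1 else 1) if action == truth else (-3 if truth == 1 else -1)
        self.i += 1
        terminated = self.i >= len(self.labels)
        obs = np.zeros(2, dtype=np.float32) if terminated else self._obs()
        return obs, reward, terminated, False, {}

env = IDSEnv(preds=[1, 0, 1], confs=[0.9, 0.8, 0.6], labels=[1, 0, 0])
obs, _ = env.reset()
total, terminated = 0, False
while not terminated:
    obs, r, terminated, _, _ = env.step(int(obs[0]))  # act on C4.5's label
    total += r
print(total)  # 2 + 1 - 1 = 2
```

Blindly following the classifier's label earns +2 and +1 on the two correct predictions but -1 on the false alarm; the DQN agent's job is to learn when to override such predictions.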

This staged hybrid design leverages both the bias–variance trade-off and state-space reduction: C4.5 (high bias, low variance) provides stable, explainable classifications, while DQN (low bias, high variance) fine-tunes only ambiguous cases. Such gating reduces dimensionality and stochasticity for DRL, mitigating overfitting and aligning adaptive updates with interpretable Phase I boundaries76,77. Empirical studies confirm that these hybrid IDS architectures have higher performance metrics than standalone DRL and significantly improve convergence speed78, making them well-suited for security-sensitive IoMT applications.

Evaluation metrics in DQN environment

The DQN agent is trained in a generalization environment. Cumulative reward is calculated to indicate the system's ability to detect and respond effectively. Accuracy per episode approximates episode-level detection accuracy. Simulation logs provide a qualitative view of agent behavior and learning trends. This shift from static metrics such as the F1-score to dynamic, reward-based metrics (accuracy per episode and cumulative reward) is in line with established RL practice for assessing model capability and long-term strategy success73. The RL-driven evaluation also improves the model's operational performance in emergent IoMT environments79.
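The two episode-level metrics named above reduce to simple aggregations over an episode's decisions; the reward scheme is Eq. 3 and the example actions/labels are illustrative:

```python
# Minimal sketch of the episode-level metrics: cumulative reward and
# accuracy per episode, computed over one episode's decisions.
def reward_fn(a, t):
    """Eq. 3 reward scheme (1 = attack, 0 = benign)."""
    return (2 if t == 1 else 1) if a == t else (-3 if t == 1 else -1)

def episode_metrics(actions, labels):
    total_reward = sum(reward_fn(a, t) for a, t in zip(actions, labels))
    accuracy = sum(a == t for a, t in zip(actions, labels)) / len(labels)
    return total_reward, accuracy

# Toy episode: one missed attack among four decisions.
r, acc = episode_metrics([1, 0, 0, 1], [1, 0, 1, 1])
print(r, acc)  # 2 + 1 - 3 + 2 = 2, accuracy 0.75
```

Note how a single false negative drags the cumulative reward down far more than the accuracy figure, which is exactly why the reward curve is the more security-sensitive of the two signals.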

Tools and experimental setup required

Various tools and frameworks were utilized for the experiments conducted in this study. The DRL component was trained and evaluated on a system with an Intel Core i5-8250U CPU, Intel UHD Graphics 620 (integrated GPU), 24 GB DDR4 RAM, and a 512 GB SK Hynix SATA SSD. Training the DQN phase took approximately 30 minutes for 1500 episodes, while inference latency per packet flow averaged 23.35 ms. The C4.5 phase runs entirely on the CPU with negligible latency (< 10 ms). Despite running on modest, non-dedicated GPU hardware, the hybrid model stays within the compute envelope typical of IoMT edge servers, demonstrating its suitability for lightweight, deployment-ready environments.

Basic and advanced ensemble ML methods

Scikit-learn 1.6.1 was employed for the traditional machine learning models: Random Forest, XGBoost, Decision Tree (C4.5), and Voting Ensemble (RF + GB).

Deep learning/deep reinforcement learning models

PyTorch 2.7.1 was employed for designing and training the Multilayer Perceptron (MLP) architectures and the DQN neural network. The 'Gym 0.26.2' library (a standard API for reinforcement learning with a diverse collection of reference environments) was used for the RL environment41. Development and model experimentation were conducted in Visual Studio Code (VSCode) with the Jupyter Notebook extension for prototyping on the local machine. For GPU-accelerated training, the Voting Ensemble and Complex MLP used cloud resources provided via Google Colab, which delivered a significant speedup in model training and the capacity to handle larger model architectures. Python 3.12 was used throughout, and all code and results were version-controlled and archived to ensure reproducibility and to track experimental changes.

Results and analysis

This section offers a comprehensive study of the application of six machine learning classifiers to find the optimum model in terms of accuracy and latency time. The evaluation metrics involve classification performance, inference latency, model complexity, and trade-off identification. Each classifier is also evaluated in terms of real-world implementation feasibility, particularly in resource-constrained IoMT environments.

High accuracy performance across classifiers

The performance metrics of all models were evaluated. Ensemble-based methods, specifically the Voting Classifier and XGBoost, achieved the highest scores across all measures, with accuracies of approximately 99.0% and 98.8%, respectively. Notably, lightweight models such as C4.5 also produced strong results and may serve as a viable alternative in resource-constrained environments. Figure 5 depicts detailed results for each parameter (Accuracy, Precision, Recall & F1 Score) for binary classification. Similarly, Figs. 6 and 7 represent the same for Categorical and Multiclass, respectively.

Fig. 5 Performance metrics for all classifiers—binary (2 classes).

Fig. 6 Performance metrics for all classifiers—categorical (5 classes).

Fig. 7 Performance metrics for all classifiers—multiclass (14 classes).

Inference latency evaluation for IoMT feasibility

The average inference times for each model, computed over 200 prediction runs for binary classification only (2 classes), are shown in Fig. 8.

Fig. 8 Average inference time (binary classification) for each model.

The lightweight models (C4.5 and Simple Neural Network) had the lowest latencies of 9.64 ms and 25.24 ms, respectively, suggesting that these models would work best for real-time applications. Meanwhile, the Ensemble (RF + GB), Complex Neural Network, and Random Forest had latencies exceeding 600 ms, which may hinder adoption in latency-sensitive healthcare contexts. This inference-time benchmark is a critical measure for recognizing which classifiers meet the timing constraints of edge-based IoMT intrusion detection systems.
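The benchmark can be sketched as a simple averaged timing loop; the model and data are placeholders, and the absolute milliseconds will of course differ from the paper's reported figures depending on hardware:

```python
# Illustrative sketch of the latency benchmark: average single-sample
# inference time over 200 runs. Model and data are placeholders.
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)

n_runs, sample = 200, X[:1]
start = time.perf_counter()
for _ in range(n_runs):
    clf.predict(sample)
avg_ms = (time.perf_counter() - start) / n_runs * 1000

print(f"average inference time: {avg_ms:.2f} ms")
```

Using `time.perf_counter` and averaging over many runs smooths out scheduler jitter, which matters when the quantities being compared differ by only a few milliseconds.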

Performance of the reinforcement learning agent

The Total Reward per Episode plot shows the DQN agent's learning development in the IoMT intrusion detection environment. The starting rewards are negative because the agent is still exploring its environment. As episodes progress, however, the rewards rise and settle around 750–780 by roughly episode 920, as represented in Fig. 9, indicating convergence toward an optimal policy. This is also visible in the region of stabilization, which suggests the agent has learned to separate attack and benign behaviors successfully. The high and stable total reward corresponds to many correct classifications and few misclassifications, reflecting effective intrusion detection in practice. The maximum theoretical cumulative reward per episode is given by Eq. 4.

$$\begin{aligned} R_{\text {max}} = 2 \times \#\text {Attack Samples} + 1 \times \#\text {Benign Samples} \end{aligned}$$
(4)

The maximum reward (Rmax) can reach 1000 only if all 500 samples in an episode are attacks, which is statistically unlikely, since the maximum cumulative reward depends on the class distribution of the 500 data points sampled per episode. Given our balanced dataset and random sampling, most episodes contain approximately 250 attack samples, yielding a theoretical maximum reward of about 750. Due to random variation, some episodes contain slightly more attacks, raising the theoretical maximum to about 780, which confirms that the agent reaches near-optimal performance.
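Eq. 4 instantiated for the 500-sample episodes discussed above:

```python
# Eq. 4: R_max = 2 * (#attack samples) + 1 * (#benign samples),
# evaluated for a 500-sample episode.
def r_max(n_attack, n_total=500):
    return 2 * n_attack + 1 * (n_total - n_attack)

print(r_max(250))  # balanced episode -> 750
print(r_max(280))  # attack-heavy random draw -> 780
print(r_max(500))  # all attacks (statistically unlikely) -> 1000
```

The observed 750–780 plateau in Fig. 9 therefore sits essentially at the theoretical ceiling for typical episode compositions.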

This is consistent with findings in previous studies using DQN for IoT intrusion detection, where cumulative reward trends directly indicate improved learning and threat detection80. To provide additional context and strengthen the comparison, we included two baseline policies: a random policy, represented as a red dashed line, and a static policy, represented by green dots. Both baselines yield consistently low and highly variable rewards over all episodes, clearly showing that they cannot adapt to dynamic attack scenarios. The DQN agent learns incrementally and consistently outperforms both baselines. The value of the static policy remains constant at -652, while the value of the random policy fluctuates between -268 and -38 over the course of 1500 episodes. The stabilization in rewards supports the view that DQN is a self-improving approach to adaptive intrusion detection in a dynamic environment.

Fig. 9 Total reward administered by DQN.

The “Accuracy per Episode (%)” plot indicates a sharp increase in performance from 50% to around 99%, as illustrated in Fig. 10, suggesting the agent became increasingly proficient in decision making. This high degree of accuracy is stable and near perfect, suggesting the hybrid model generalizes to many different and ubiquitous types of threat patterns. Performance-wise, this model surpasses or performs similarly to static models such as the Voting Ensemble (99.0% accuracy) and XGBoost (98.8% accuracy), while continuing to learn. The dynamic nature of the training allows new or changing threats to be detected, an important strength over traditional static models81,82. The accuracy curve suggests the long-term stability of the model as well as its fit for use in security-sensitive healthcare settings.

Fig. 10 Accuracy per episode graph—DQN.

The “Action Distribution” chart denotes how many times the agent identifies inputs as either “Benign (0)” or “Attack (1)” in its training. A consistent and symmetrical distribution demonstrates that the agent discovered a reliable classification policy as shown in Fig. 11.

As long as the agent consistently detects “Attack (1)” cases and does not overfit on benign behavior, the agent demonstrates the ability to generalize. The action distributions align with other behavioral indicators (i.e., reward and accuracy) and suggest the model’s efficacy for real world deployment. Similar distributions were observed in multi-agent reinforcement learning IDS studies that reported stable action policies once the model matured81,83. The observations confirm clear evidence of strong learning and decision making exhibited by the RL agent84.

Fig. 11 Action distribution graph.

The amalgamation of the pre-trained C4.5 model and DQN offers a unique solution to the trade-off between detection accuracy and deployment readiness. C4.5 alone achieves low latency (~9.6 ms), but its performance is static at any given point in time, whereas the ensemble models achieved higher accuracy at the cost of high inference time (> 600 ms). The proposed hybrid system (C4.5-DQN) exploits the efficiency of C4.5 in the first instance, contributing a low-to-moderate overall inference time for the system's adaptive footprint. When deployed, the DQN component provides continuous learning and adaptation, so the system can dynamically maintain high accuracy and detect zero-day or evolving attacks that static systems cannot64.

Trade-off between detection accuracy and deployment readiness

In IDS research, accuracy often outweighs real-time suitability; however, in many clinical IoMT situations, real-time effectiveness is equally important. Table 7 integrates both factors. Models such as the Voting Ensemble and Complex Neural Network achieve higher accuracy but much greater inference times, and are therefore unsuitable for immediate threat detection. C4.5 and the Simple Neural Network provide a good compromise between accuracy and time. These results reinforce that, for an IDS suited to IoMT, a fast static classifier (C4.5) coupled with the dynamic threat-detection ability of Deep Q-Networks offers an optimal trade-off: accurate and timely detection of all types of threats on the go, which is critical for real traffic routing through IoMT networks.

Table 7 Comparison of models in terms of accuracy and real-time suitability.

Cross-dataset generalization results

To validate the robustness and adaptability of the proposed hybrid C4.5-DQN model, it was tested on a variety of real-world datasets from the IoT and IoMT domains. The goal was to examine whether the model could generalize to data, scenarios, and contexts beyond the training set and still perform well in new heterogeneous environments. WUSTL-EHMS: accuracy improved from 90.35% to 96.40%, as illustrated in Fig. 12, verifying the model's ability to adapt to a smart healthcare environment. ECU-IoHT (binary classification): a modest yet important improvement from 99.00% to 99.20% shows stability in precision-critical environments (Fig. 13).

Fig. 12 C4.5-DQN results on WUSTL-EHMS dataset.

Fig. 13 C4.5–DQN results—ECU-IoHT.

DF-IoMT: The model retained its full 100% accuracy, as represented in Fig. 14, which establishes resilience to unseen distributions of threat data.

Fig. 14 C4.5–DQN results—IoMT.

CICIoT23: Performance improved from 94.23% to 99.20%, demonstrating a solid degree of robustness on a modern, large-scale, diverse attack dataset (Fig. 15). ECU-IoHT (multiclass): The hybrid system also performed excellently on multiclass classification, improving from 97.28% to 99.35% and demonstrating versatility across granular threat categories. These improvements across multiple independent datasets support the system's ability to generalize. Unlike static models that operate well only under fixed training conditions, the hybrid architecture maintains a dynamic learning backbone that adapts to new attack patterns and data characteristics. This adaptability, coupled with low latency and strong detection accuracy, makes the C4.5-DQN system a feasible and scalable solution for real-time IoMT security deployments where resilience and generalizability matter. An equally important aspect of IDS performance in IoMT environments is generalization across heterogeneous environments and new threat patterns85 (Fig. 16).

Fig. 15 C4.5-DQN results—CICIoT23.

Fig. 16 C4.5–DQN results—ECU-IoHT (multiclass).

Comparison with recent Hybrid IDSs

To put the performance and novelty of the C4.5-DQN hybrid IDS into perspective, we expanded our review to assess not only standard classifiers but also the latest hybrid and adaptive intrusion detection systems from the literature. Details are given in Table 8.

Table 8 Comparison of proposed Hybrid C4.5–DQN IDS with selected recent IDS approaches.

Discussion and future recommendations

The findings in this section indicate that when applying machine learning classifiers to intrusion detection in Internet of Medical Things (IoMT) ecosystems, selecting a classifier is not simply a matter of finding the most accurate detector; it also requires weighing the practical aspects of deployment and thus how usable the model will be in practice11. Each classifier produced different levels of accuracy, inference time, and latency. Ensemble models such as the Voting Ensemble and XGBoost tended to have the highest accuracy but also the longest inference times, which matters for real-time medical applications that integrate with other IoMT systems in near real time: the lower the latency in detecting an intrusion, the better for patient safety and care15.

By comparison, the decision tree (C4.5) and the Simple Neural Network provided the best compromise, with adequately high accuracy and much lower inference times of 9.64 ms and 25.24 ms, respectively. These attributes make C4.5 and the Simple Neural Network valuable options for real-world intrusion detection in resource-constrained environments. Beyond this experimental paradigm, this work also examined hybrid modeling to improve adaptability. The proposed hybrid C4.5-DQN intrusion detection system combined the rapid decisions afforded by a pre-trained C4.5 classifier with the adaptability and self-learning behavior of Deep Q-Networks (DQN). The hybrid model continuously adjusted its detection policy based on interaction with the environment, with reward and accuracy increasing over successive episodes. This self-adaptive quality allows the model to maintain effective performance under changing cyberattack conditions.

Furthermore, the results provide evidence that the proposed model produced meaningful performance gains over baseline mechanisms in cross-domain evaluations on the unseen datasets WUSTL-EHMS, CICIoT23, DF-IoMT and ECU-IoHT. The consistent performance across a variety of heterogeneous datasets further supports the framework’s robustness and real-world applicability. These findings also validate recent observations showing how generalization continues to be among the most challenging aspects of IDS systems due to the heterogeneous nature of IoMT data and attack behavior volatility.

Future research recommendations include advancing the adaptability and robustness of IoMT intrusion detection systems by exploring more sophisticated Deep Reinforcement Learning (DRL) architectures, such as Policy Gradient and Actor-Critic models. Emphasis should also be placed on improving generalization to entirely new attack types through comprehensive stress testing and robustness evaluations. Additionally, conducting real-world pilot deployments across diverse IoMT network conditions and communication protocols will be essential for validating system performance and minimizing dataset-specific dependencies. Finally, incorporating formal statistical hypothesis testing using paired t-tests when normality assumptions are satisfied90,91 or the Wilcoxon signed-rank test for non-normal or small-sample cases92 will further enhance the reliability and rigor of model evaluation.

Limitations

Although this study offers a thorough comparative evaluation of machine learning classifiers for intrusion detection in IoMT environments, several limitations were identified through experimentation and system design.

Long training time with complex models

One significant hurdle encountered throughout this study was the lengthy training time associated with some of the classifiers. The Voting Ensemble model used for multi-class classification took roughly 1476 minutes (24.6 hours) to complete a single training cycle, even on systems with strong computational power. This limitation is a glaring obstacle to real-time model retraining or updates on IoMT systems.

Protocol dependency and latency variation

IoMT networks can contain an array of heterogeneous devices communicating over different protocols (for example, Bluetooth, MQTT, WiFi), each with its own latency characteristics, which makes it more challenging for an IDS model to generalize its performance across protocol layers. Recent work has shown that protocol heterogeneity leads to disparate data patterns and timing behaviors, which can reduce the consistency of IDS detection for the same context derived from different device types43.

Adaptability to unseen real-world attacks

While the proposed system establishes robust generalization across datasets, the most difficult hurdle remains directly adapting to novel or evolving attack strategies, especially in real-time operational environments. Most training datasets do not represent the full scale of diverse attack vectors related to IoMT ecosystems, which may impact adaptability upon deployment. Research has suggested that reinforcement learning agents are, in theory, potentially effective, but they require extensive live interaction with the system to genuinely generalize across non-stationary attack patterns93,94.

Conclusion

This research presented a comprehensive comparative analysis of six machine learning classifiers: Random Forest, XGBoost, C4.5, Voting Ensemble, Simple Neural Network, and Complex Neural Network for IoMT intrusion detection using the CICIoMT2024 dataset. Ensemble models such as the Voting Ensemble and XGBoost achieved the highest classification accuracies of up to 99%, but they were considerably slower and more computationally complex than the lightweight models. Lightweight models such as C4.5 and the Simple Neural Network demonstrated the next highest level of accuracy with considerably lower latencies (9.64 ms and 25.24 ms, respectively). The proposed hybrid C4.5–DQN model further enhanced adaptability, maintaining stable performance under evolving threat conditions. In summary, this study finds that the optimal IoMT intrusion detection systems are those that combine accuracy, speed, and adaptability while guaranteeing reliability and safety in resource-constrained healthcare settings.