Introduction

Cardiovascular diseases (CVDs) represent the foremost global health challenge, responsible for approximately 17.9 million deaths in 2019, accounting for 32% of all global fatalities1. About 85% of these deaths are attributed to heart attacks and strokes, highlighting the critical impact of cardiac conditions. Alarmingly, over 75% of CVD-related mortalities occur in low- and middle-income countries, exacerbating global health disparities2. Moreover, CVDs account for 38% of premature deaths (under 70 years of age) due to non-communicable diseases3. Cardiac arrhythmias, including ventricular flutter and ventricular fibrillation, are closely linked to the onset of CVDs and can lead to severe events such as sudden death, hemodynamic collapse and cardiac arrest4. Therefore, the timely and precise detection of these arrhythmias is paramount for preventing life-threatening outcomes. Electrocardiogram (ECG) signals are an indispensable non-invasive diagnostic tool for monitoring heart health and identifying cardiovascular abnormalities5. An ECG records the electrical activity of the heart, with a typical waveform comprising P waves, QRS complexes and T waves. The QRS complex, in particular, is critical for arrhythmia detection, as abnormalities in this segment often indicate irregular heart rhythms6. Despite their diagnostic utility, the inherent volume, complexity and variability of ECG data, influenced by subject differences, time and environmental conditions, pose significant challenges for accurate analysis and interpretation by healthcare practitioners.

Traditional ECG monitoring methods, including Holter monitors, telemetry and episodic monitors, face significant challenges in the detection of arrhythmias7. These methods often involve manual interpretation, which can delay diagnosis, and they rely heavily on patient compliance. Furthermore, their use is typically limited to clinical settings, making them less effective for long-term or remote monitoring8. For example, although wearable ECG devices improve monitoring convenience, traditional systems still struggle to handle the variability of ECG signals across patients and environments9. Implantable devices offer continuous monitoring for arrhythmias but are invasive and not suitable for all patients7. Recent innovations in deep learning (DL), notably Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs), offer a powerful way to address these challenges. These models excel at processing complex, high-dimensional data and are adept at capturing subtle temporal patterns within ECG signals. Their ability to automatically extract meaningful features from raw data has demonstrably improved diagnostic accuracy in various medical applications, including arrhythmia detection and classification10. By reducing the need for manual interpretation, these models also enable continuous monitoring outside clinical environments. Furthermore, DL-powered systems can help monitor and analyze ECG signals more efficiently, addressing issues related to noise, data complexity and the variability of ECG signals across different patients11. This shift towards automated analysis is making ECG monitoring more accessible, cost-effective and adaptable to non-clinical settings, offering significant improvements over traditional methods7.

Despite the advancements made in DL for ECG analysis, generalization and robustness continue to be significant challenges. Many models tend to overfit on specific datasets, which compromises their ability to perform well on diverse or unseen ECG data, a key concern for real-world applicability12. For instance, while noise elimination techniques have improved, their ability to generalize across various types of noise and signal distortions remains a challenge. Traditional methods often fail to handle multi-source noise effectively, whereas DL models can adapt but still face limitations in noisy environments13. Moreover, computational complexity is a major hurdle for advanced DL models, such as hybrid LSTM-CNNs, which are highly accurate but require significant computational resources14. This makes them less suited for real-time applications or use on resource-constrained devices, such as mobile or wearable ECG monitors. Additionally, dataset balancing techniques like SMOTE (Synthetic Minority Over-sampling Technique) help mitigate class imbalance but sometimes introduce synthetic data that may not accurately reflect real-world variations, leading to potential biases in the model’s predictions15. Feature extraction is another area where DL methods still face challenges. Existing techniques may not consistently identify the most relevant features from complex ECG signals, leading to the omission of critical patterns that are vital for accurate arrhythmia detection16. Furthermore, the black-box nature of DL models limits their transparency. This lack of interpretability is a significant barrier to their clinical adoption, as healthcare professionals require an understanding of the model’s decision-making process to trust and use it in patient care17.

To overcome the aforementioned limitations, this study introduces a novel, explainable DL framework for accurate and reliable arrhythmia detection from ECG signals. The proposed approach integrates two DL models, a CNN and a DNN, within a multi-stage pipeline that comprises data preparation, signal preprocessing and multi-strategy data balancing to improve model performance and robustness. To address the interpretability gap and support clinical applicability and trust, the framework also incorporates explainable AI (XAI) methods. This combination aims to deliver strong detection performance together with insight into the model’s decision-making.

Contributions

The key contributions of this study are as follows:

  • Novel explainable deep learning framework: We propose an innovative framework that seamlessly integrates cutting-edge DL architectures (CNN and DNN) with a meticulously designed data processing pipeline and crucial XAI techniques (SHAP, LIME and FIA). This integration provides both high-performance arrhythmia detection and transparency in model predictions, a critical aspect for clinical adoption.

  • Advanced ECG data preparation and preprocessing: The study employs a systematic data preparation pipeline, including dynamic loading from multiple benchmark ECG datasets (MITDB, NSTDB, PTBDB), 1-second segment extraction around R-peaks and parallel processing for efficiency. This is coupled with established signal processing techniques (Butterworth bandpass, 50 Hz notch and high-pass filtering, followed by Z-score normalization) to reduce noise and enhance signal quality for more effective analysis.

  • Robust multi-strategy data balancing: To effectively mitigate the inherent class imbalance in ECG datasets, our approach comprehensively investigates and leverages multiple sophisticated data balancing techniques, including ADASYN, SMOTE, SMOTETomek and ROS. The selection of the most effective method ensures a balanced dataset, addressing bias and improving the robustness and sensitivity of the model towards rare but critical arrhythmia events.

  • Comprehensive and interpretable evaluation: The proposed framework is rigorously evaluated on publicly available arrhythmia datasets using a diverse set of performance metrics. Beyond quantitative performance, the integration of XAI techniques provides qualitative interpretability, demonstrating the effectiveness and reliability of our approach while offering actionable insights into the model’s decisions, which enhances clinical utility relative to traditional methods.

The remainder of this paper is organized as follows. Section “Related works” reviews existing research in arrhythmia detection, highlighting its strengths and limitations. Section “Methodology” details the proposed explainable AI framework, including data preparation, preprocessing, balancing techniques and deep learning model architecture. Section “Result analysis” presents the experimental results and a thorough analysis of the framework’s performance, complemented by insights from the XAI methods. Section “Discussion” compares the proposed work with state-of-the-art approaches. Finally, Section “Conclusion” summarizes our findings, discusses the implications of our work and outlines directions for future research.

Related works

This section reviews recent advancements in arrhythmia detection from ECG signals, focusing on studies that have utilized the benchmark datasets relevant to our research. We categorize these works by the primary dataset employed, highlighting their methodologies, key contributions and reported performance metrics.

Arrhythmia detection on MITDB dataset

Srivastava et al.18 introduced rECGnition_v1.0, a multi-modal DL model that integrates ECG morphological features with patient characteristics (age, gender, BMI) for enhanced arrhythmia classification. Their model, combining a CNN for ECG feature extraction and a Squeeze and Excitation-based Patient characteristic Encoding Network (SEPcEnet), achieved an accuracy of 98.56% on the MITDB, demonstrating improved performance through the correlation of patient-specific data with ECG morphology.

Kim et al.19 presented a Local-Global Temporal Fusion Network incorporating an attention mechanism for multi-class arrhythmia classification from single-lead ECGs. This framework integrated temporal convolutional networks (TCN) with multiscale temporal information fusion (TIF) and temporal multi-head attention (MHA), achieving an F1-score of 96.45% for duration classification and 96.31% for episode classification on the MITDB. While effective, the model was noted for its high computational cost and hardware requirements.

Tudjarski et al.20 explored a transformer-based approach for Atrial Fibrillation (AFIB) detection. They utilized a bidirectional transformer model (RoBERTa), pre-trained on a large unlabeled ECG dataset and fine-tuned on a smaller labeled one. This method achieved a notable accuracy of 98.81%, a sensitivity of 98.81% and an F1 score of 91.57% on the MITDB, showcasing the power of transformer models with self-supervised pre-training.

Zhou et al.21 introduced mRMEBP, a unified framework for online AF detection that combines statistical inference and probabilistic modeling of cardiac interbeat intervals. Their model leverages five robust features analyzed by a Back Propagation Neural Network (BPNN). The mRMEBP achieved an accuracy of 95.42% for the MITDB, demonstrating the importance of robust feature selection for improved accuracy in online monitoring scenarios.

He et al.22 proposed a dynamic ECG signal quality assessment method based on a hybrid CNN-LSTM model. Utilizing the MITDB, their model achieved an accuracy of 98.65% for classifying signal quality into excellent, qualified and failed categories. This work underscores the efficacy of combining CNN and LSTM for noise reduction and enhanced diagnostic accuracy in ECG signals.

El-Ghaish et al.23 introduced ECGTransForm, a DL framework that integrates multi-scale convolutions, channel recalibration and a bidirectional transformer for arrhythmia classification. This architecture adeptly models both past and future temporal dependencies, thereby enhancing the detection of subtle arrhythmic patterns. Evaluated on the MITDB and PTBDB datasets, ECGTransForm achieved an impressive accuracy of 99.35% and a macro F1-score of 94.26%. Furthermore, the incorporation of Context-Aware Loss (CAL) significantly improved class balance, enabling robust and reliable detection across a wide spectrum of arrhythmia types.

Di et al.24 introduced a multimodal CNN with adaptive attention for ECG arrhythmia classification. Using Hilbert space-filling curves and recurrence plots to convert ECG signals into images, their model achieved 98.48% accuracy and an F1 score of 81.91% for interpatient classification and 99.70% accuracy with a 97.64% F1 score for intrapatient classification on the MITDB. Dual-lead input (MLII and V1) and attention refinement notably improved detection of supraventricular arrhythmias.

Islam et al.25 proposed CAT-Net, a hybrid DL model combining convolution, channel attention and transformer encoders for single-lead ECG arrhythmia classification. On the MITDB, it achieved 99.14% accuracy and a macro F1-score of 94.69%. Using SMOTETomek for class balancing, CAT-Net improved the detection of minority classes like supraventricular and fusion beats. Its lightweight design and single-lead input make it suitable for real-time wearable and IoT-based applications.

Berrahou et al.26 proposed a 1D CNN-based model for arrhythmia detection that integrates morphological ECG features with RR interval and entropy rate descriptors. Evaluated on the MITDB, the model achieved 99.17% accuracy for intra-patient and 98.73% for inter-patient classification, as well as 98.20% accuracy on the INCART dataset. The approach demonstrated strong generalization and effective handling of class imbalance across diverse ECG sources.

Issa et al.27 proposed a deep neural network with residual blocks (DNN-RB) for single-lead ECG heartbeat classification. Using MLII signals from the MITDB, their model achieved 99.51% accuracy, 99.70% sensitivity and 98.20% specificity. The architecture outperformed existing methods and proved effective for mobile ECG devices and real-time monitoring, even under class imbalance conditions.

Anitha et al.28 developed a hybrid DL model combining an ensemble CNN-RNN for feature extraction with a bidirectional capsule network (Bi-CapsNet) for arrhythmia classification. Tested on the MITDB, the model achieved 97.19% accuracy, outperforming CNN (89.87%), FTBO (85%) and standalone capsule networks (97.0%). Its robustness against noisy ECG signals and ability to capture spatial-temporal features make it suitable for clinical deployment.

Kumar et al.29 introduced an improved Hawks Optimizer (HO)-based stacked ensemble model for CVD classification. The framework incorporated a Neural Network Reasoning component and addressed class imbalance in the Kaggle CVD dataset. Evaluated on this dataset, the model achieved 97% accuracy. The HO optimizer enhanced global search capability, leading to superior predictive performance compared to benchmark models, with notable improvements in Matthews Correlation Coefficient (MCC), accuracy and F-measure.

Arunachalam et al.30 developed a novel ML-based model for CVD risk assessment, using K-Nearest Neighbor (KNN) as a baseline alongside ensemble methods such as XGBoost, AdaBoost and Random Subspace. With the aid of Linear Support Vector Feature Measure (LSVFM) for feature prediction, the system demonstrated strong performance on the MITDB, achieving 96% accuracy and 97% precision. The results highlighted the model’s effectiveness as a potential clinical decision-support tool.

Saranya et al.31 introduced DenseNet-ABiLSTM, a hybrid DL model combining densely connected convolutional networks with Attention-based Bidirectional LSTM for multiclass arrhythmia detection from ECG signals. The model uses 1D convolutional kernels for multiscale feature extraction and an attention-enhanced BiLSTM for temporal analysis. On the MITDB, it achieved an average F1 score of 87.74% and an accuracy of 89.14%, outperforming traditional ECG-based approaches.

Finally, Rajagopal et al.32 investigated the role of unsupervised dimensionality reduction (DR) methods in arrhythmia classification using ECG signals. By comparing techniques such as PCA, fastICA, kernel PCA, hierarchical nonlinear PCA and principal polynomial analysis (PPA) with a probabilistic neural network (PNN) classifier, the study highlighted the advantages of nonlinear DR.

Arrhythmia detection on PTBDB dataset

Research on the PTB Diagnostic ECG Database (PTBDB) has also seen significant advancements in arrhythmia detection and ECG signal analysis. Mondal et al.33 proposed a lightweight Convolutional Neural Network (CNN) that utilizes derivative ECG (dECG) signals for automatic ECG signal quality assessment (ECG-SQA). Their dECG-based CNN, designed to mitigate noise challenges in wearable devices, achieved an accuracy of 97.59% on the PTBDB, demonstrating its feasibility for real-time applications, including on a Raspberry Pi platform.

Khan et al.34 introduced a hybrid GRU-CNN model for cardiac abnormality prediction using the PTBDB. By combining Gated Recurrent Units (GRU) with CNNs, their model effectively processed long sequences and learned non-linear features from ECG signals. They further enhanced performance with an RB-GRU-CNN model, achieving an RMSE of 0.02679, which highlighted the effectiveness of incorporating residual bias for error reduction in time-series ECG data.

Subhiyakto et al.35 focused on addressing class imbalance in ECG classification using a CNN-based model on the PTBDB. They extensively experimented with CNN, Transformer and LSTM architectures in conjunction with various SMOTE techniques, including SMOTE Borderline, ADASYN, Tomek and ENN. The highest accuracy of 99.36% was achieved with CNN combined with SMOTE Borderline, underscoring the importance of resampling methods in imbalanced datasets.

Bai et al.36 developed a hybrid deep learning model named CBGM, which integrates CNN, Bidirectional Gated Recurrent Units (BiGRU) and a multi-head attention mechanism. This model was validated on both MITDB and PTBDB, achieving an accuracy of 98.82% on the latter. The CBGM model effectively captures both spatial and temporal features, making it suitable for real-time ECG screening and clinical decision support.

Lee et al.37 proposed a novel cross-database learning framework for ECG arrhythmia classification, utilizing a two-dimensional beat-score-map (BSM) representation. Their approach addressed generalization challenges across heterogeneous databases by employing both fine-grained and coarse-grained annotations. While validated on PTB-XL (a related dataset) among others, their framework demonstrated an F1 score of 0.9267, showcasing improved generalization performance for complex cardiac arrhythmias.

Finally, Padmavathi et al.38 explored hybrid deep learning models for automated cardiovascular disease identification. They proposed a 1D CNN combined with a Recurrent Hopfield Neural Network (RHNN) and another with a Residual Network (ResNet). Evaluated on the PTBDB, the 1D-CNN-RHNN model achieved a 96.62% accuracy for a 4-class classification system, demonstrating the potential of such hybrid approaches in improving ECG signal analysis for real-time medical diagnostics.

Arrhythmia detection on NSTDB dataset

Singh et al.39 introduced an Attention-Based Convolutional Denoising Autoencoder (ACDAE) model, enhanced with a lightweight channel attention (ECA) module, for robust ECG signal denoising and arrhythmia classification. This model employs skip-layer connections to minimize information loss during reconstruction and demonstrated high performance under noise stress across four ECG databases, achieving an impressive 98.88% accuracy for ECG beat classification.

Kumari et al.40 developed a computational model for classifying ECG signals into normal and abnormal categories, tested on both the MITDB and NSTDB. Their methodology integrates Superlet Transform (SLT) for pre-processing and noise filtering, VGG18 for feature extraction via transfer learning and KNN for classification. This approach achieved an impressive 99.46% accuracy in noisy environments, highlighting its robustness for real-world arrhythmia detection.

He et al.22 proposed a dynamic ECG signal quality assessment method based on a hybrid CNN and LSTM network. Aimed at improving diagnostic accuracy in noisy environments, their model categorizes ECG signals into three quality levels and was validated on the MITDB and NSTDB. It achieved an accuracy of 98.65% with a macro-averaged F1 score of 98.50%, demonstrating its effectiveness in robust ECG signal analysis for heart disease diagnosis.

Lee et al.37 presented a robust method for arrhythmia detection from wearable ECG devices, specifically addressing noise contamination. Their system combines an adaptive-threshold QRS detector with a hybrid neural network comprising LSTM and Artificial Neural Networks (ANN), along with SMOTE for class imbalance. Evaluated on the MITDB and NSTDB, the model achieved 97.38% sensitivity and 97.08% precision on the NSTDB, showcasing strong performance even under noisy conditions.

Wei et al.41 proposed a DL-based denoising model for multichannel ECG signals. Their Fully Convolutional Network-based Denoising Autoencoder (FCN-DAE) with Jacobian regularization aimed at noise removal while preserving critical local information. The model achieved up to 97.02% noise removal accuracy and outperformed traditional methods, providing a robust solution for clinical applications by effectively preserving vital ECG features like the QRS complex.

Finally, Nurmaini et al.42 introduced a DL-based stacked denoising autoencoder (DAE) and autoencoder (AE) model integrated with Deep Neural Networks (DNNs) for ECG heartbeat classification. Tested on the MITDB and NSTDB under varying noise levels, their model achieved high performance with an accuracy of 99.34%, demonstrating that the DAE and AE architecture significantly improved feature extraction and denoising capabilities compared to conventional ML models.

Methodology

This study proposes a framework based on Explainable Artificial Intelligence (XAI) and Deep Learning (DL) for detecting arrhythmias from ECG signals, as illustrated in Fig. 1. The methodology consists of several key stages: data acquisition from publicly available datasets, comprehensive preprocessing, and the development and evaluation of DL models. To address the potential class imbalance commonly found in medical datasets, various data balancing techniques were explored, and the most effective method was selected to optimize model performance. Specifically, Dense Neural Networks (DNN) and Convolutional Neural Networks (CNN) were employed to learn the discriminative features necessary for arrhythmia detection. This systematic approach emphasizes robust data preparation, proper data balancing, effective model building, and rigorous evaluation to ensure a reliable framework for detecting arrhythmias.

Fig. 1. The proposed XAI-DL model framework for arrhythmia detection.

Dataset description

To develop a robust and generalizable arrhythmia classification model, we utilized three benchmark ECG datasets: the MIT-BIH Arrhythmia Database (MITDB), the Noise Stress Test Database (NSTDB) and the PTB Diagnostic ECG Database (PTBDB). These datasets were selected for their clinical relevance, diverse arrhythmia types and varying levels of signal quality. Together, they provide a comprehensive training and evaluation foundation for DL-based ECG analysis under both ideal and noisy conditions.

The MITDB is the primary dataset used in this study. It contains 48 half-hour, two-channel ambulatory ECG recordings from 47 subjects, sampled at 360 Hz. Each recording is annotated beat-by-beat by expert cardiologists, covering both normal and a wide range of arrhythmic classes such as left/right bundle branch blocks (L/R), atrial premature beats (A) and premature ventricular contractions (V). This dataset offers a rich variety of real-world heartbeat morphologies, making it highly suitable for supervised arrhythmia classification tasks.

The NSTDB provides clean ECG signals combined with various types of synthetic noise, including baseline wander, muscle artifact and electrode motion. All signals are sampled at 360 Hz to match MITDB. Although NSTDB does not include annotated arrhythmia labels, it is valuable for testing model robustness under noise conditions. By adding NSTDB noise to MITDB signals during training and testing, we can simulate real-world ECG interference and enhance the model’s noise resilience.
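One way to realise the noise mixing described above is to scale a noise record so that the mixture reaches a chosen signal-to-noise ratio (SNR). The sketch below is illustrative only: the function name `mix_noise_at_snr`, the power-based SNR definition and the example SNR value are assumptions, not details reported in this study.

```python
import numpy as np


def mix_noise_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so that the mixture has the requested SNR (dB).

    Hypothetical helper: SNR is defined here as the ratio of mean-removed
    signal power to mean-removed noise power, which is an assumption.
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise record so it covers the whole clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Measure power after removing the mean so DC offsets do not bias it.
    p_clean = np.mean((clean - clean.mean()) ** 2)
    p_noise = np.mean((noise - noise.mean()) ** 2)
    # Gain that brings the noise power to p_clean / 10^(snr_db / 10).
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


fs = 360                                   # Hz, shared by MITDB and NSTDB
t = np.arange(10 * fs) / fs
ecg_like = np.sin(2 * np.pi * 1.2 * t)     # stand-in for a clean ECG trace
noise_rec = np.random.default_rng(0).normal(size=5 * fs)  # stand-in noise
noisy = mix_noise_at_snr(ecg_like, noise_rec, snr_db=6.0)
```

In practice the stand-in arrays would be replaced by a clean ECG channel and one of the NSTDB noise recordings, and the mixing would be repeated over a range of SNR values.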

The PTBDB contains high-resolution ECG recordings (sampled at 1,000 Hz) from 290 patients and healthy volunteers. It includes multiple ECG leads and covers a wide range of cardiac conditions such as myocardial infarction, conduction blocks and hypertrophy. For consistency, the signals were resampled to 360 Hz. Although PTBDB does not provide beat-level annotations like MITDB, its pathological variety helps in augmenting training data and validating generalization across patient populations. This dataset strengthens the model’s capacity to detect arrhythmias beyond the scope of MITDB alone.
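The resampling step can be sketched with a polyphase resampler, because 360/1000 reduces to the rational factor 9/25. The helper name `resample_to` and the choice of `scipy.signal.resample_poly` are assumptions; the study does not state which resampling routine was used.

```python
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly


def resample_to(signal, fs_in, fs_out=360):
    """Resample a 1-D signal from fs_in to fs_out (both integer Hz)."""
    # Reduce the rate ratio to lowest terms, e.g. 360/1000 -> 9/25.
    ratio = Fraction(int(fs_out), int(fs_in))
    # resample_poly upsamples by `up`, applies an anti-aliasing FIR filter,
    # then downsamples by `down`.
    return resample_poly(np.asarray(signal, dtype=float),
                         up=ratio.numerator, down=ratio.denominator)


# Ten seconds of a 5 Hz tone at 1,000 Hz becomes 3,600 samples at 360 Hz.
x_1000 = np.sin(2 * np.pi * 5 * np.arange(10_000) / 1000)
x_360 = resample_to(x_1000, fs_in=1000)
```

A polyphase resampler filters before decimation, which avoids aliasing content above the new Nyquist frequency of 180 Hz.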

Dataset preparation

Dataset preparation involved systematic processing of the three ECG databases (MITDB, NSTDB and PTBDB) to create a unified, well-structured input for the DL models. The WFDB Python library was used to load raw ECG signals and their corresponding heartbeat annotations from each dataset dynamically, avoiding hardcoding and improving flexibility.

For each dataset, heartbeat-centered signal segments were extracted using a fixed time window of one second. This window size ensures capturing sufficient temporal context around each annotated beat. Heartbeats were filtered and labeled according to a predefined mapping scheme for binary classification, where normal beats are labeled as 0 and various arrhythmias are grouped under label 1. Beats outside these categories were ignored to maintain label consistency and reduce noise.

To efficiently handle large amounts of data, parallel processing was employed using the joblib library, which sped up segment extraction across multiple records. After segmentation, each ECG window was flattened into a feature vector and combined into a single DataFrame with corresponding class labels. This DataFrame formed the basis for model training and evaluation, ensuring consistent representation across all three databases.
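The parallel extraction and DataFrame assembly described above could look roughly like the following. This is a minimal sketch under stated assumptions: the `process_record` body is a hypothetical stand-in that returns random data instead of reading files, `build_dataframe` is an illustrative name, and the thread-based backend is chosen only to keep the example lightweight.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

SEGMENT_LEN = 360  # samples in a 1-second window at 360 Hz


def process_record(record_id):
    """Hypothetical stand-in returning (segments, labels) for one record.

    A real version would read the record with wfdb and cut beat-centred
    windows; random data keeps this sketch self-contained.
    """
    rng = np.random.default_rng(record_id)
    n_beats = 5
    segments = rng.normal(size=(n_beats, SEGMENT_LEN))
    labels = rng.integers(0, 2, size=n_beats)
    return segments, labels


def build_dataframe(record_ids, n_jobs=2):
    """Process records in parallel and stack the output into one DataFrame."""
    results = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(process_record)(rid) for rid in record_ids
    )
    # Flatten each window into one row of features.
    X = np.vstack([seg.reshape(len(seg), -1) for seg, _ in results])
    y = np.concatenate([lab for _, lab in results])
    df = pd.DataFrame(X)
    df["Target"] = y  # class label: 0 = normal, 1 = arrhythmic
    return df


df = build_dataframe([100, 101, 102])
```

Each row of `df` holds one flattened beat window, with the binary label in the `Target` column.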

ECG signal extraction process

To extract ECG signal segments reliably, we used several Python libraries: wfdb for reading waveform and annotation files, NumPy for numerical operations, pandas for data structuring and joblib for parallel processing.

Step-by-step process:

  1. Read ECG and annotation data: For each record, the ECG signal was read using wfdb.rdrecord() and corresponding annotations were retrieved with wfdb.rdann().

  2. Windowing: A 1-second window (WINDOW_SEC = 1) was applied around each annotated beat. Each segment was extracted symmetrically around the annotated R-peak. The number of samples was computed as w = int(WINDOW_SEC * rec.fs) based on the sampling frequency.

  3. Class mapping: Each annotation symbol was mapped to a binary label using the dictionary LABEL_MAP = {'N': 0, 'L': 1, 'R': 1, 'A': 1, 'V': 1}. This merges all arrhythmic types into a single abnormal class.

  4. Segment extraction: For each annotated beat of interest, a segment was extracted using indexing. If the window extended beyond signal boundaries, zero-padding was used to maintain segment length consistency.

  5. Parallel processing: The function process_record was executed in parallel across all records using joblib.Parallel to accelerate data loading.

  6. Flattening and storage: All extracted segments were concatenated, reshaped into flat feature vectors and stored in a pandas DataFrame. Each row represents a single beat segment, with the class label stored in the Target column.

This structured pipeline ensures uniform segment length, consistent sampling and accurate beat alignment for input into DL models.
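The windowing, class mapping and zero-padding steps above can be sketched as follows. The function name `extract_segments` and its argument names are illustrative assumptions; only `WINDOW_SEC`, `LABEL_MAP` and the window-length formula come from the description above. The inputs are plain arrays so the example runs without any data files.

```python
import numpy as np

WINDOW_SEC = 1
LABEL_MAP = {"N": 0, "L": 1, "R": 1, "A": 1, "V": 1}


def extract_segments(signal, beat_samples, beat_symbols, fs):
    """Cut a zero-padded window centred on every beat found in LABEL_MAP.

    With wfdb, the inputs would typically come from
    rec = wfdb.rdrecord(path) and ann = wfdb.rdann(path, "atr"), e.g.
    rec.p_signal[:, 0], ann.sample, ann.symbol and rec.fs (using the first
    channel is an assumption).
    """
    signal = np.asarray(signal, dtype=float)
    w = int(WINDOW_SEC * fs)          # window length in samples
    half = w // 2
    segments, labels = [], []
    for center, symbol in zip(beat_samples, beat_symbols):
        if symbol not in LABEL_MAP:
            continue                  # ignore beats outside the mapping
        start, stop = center - half, center - half + w
        lo, hi = max(start, 0), min(stop, len(signal))
        segment = np.zeros(w)
        # Copy the in-bounds part; samples outside the signal stay zero.
        segment[lo - start: hi - start] = signal[lo:hi]
        segments.append(segment)
        labels.append(LABEL_MAP[symbol])
    return np.array(segments).reshape(-1, w), np.array(labels, dtype=int)


# Toy record: a ramp signal with four annotations, one of them unmapped.
X, y = extract_segments(np.arange(1000.0), [10, 500, 990, 600],
                        ["N", "V", "A", "+"], fs=360)
```

In this sketch the annotated sample always lands at index `w // 2` of its window, so beats near the start or end of a record remain aligned with the others after padding.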

Data preprocessing

Effective preprocessing of ECG signals is critical to enhance signal quality and improve the performance of DL models for arrhythmia classification. Raw ECG data often contain various types of noise and artifacts, such as baseline wander, powerline interference and high-frequency noise, which can obscure important cardiac features. To address these issues, a series of filtering and normalization steps was applied systematically.

Bandpass filtering: A Butterworth bandpass filter with cutoff frequencies at 0.5 Hz and 45 Hz was used to remove baseline drift and high-frequency noise. The low cutoff frequency of 0.5 Hz helps in eliminating slow baseline wander caused by respiration and movement, while the high cutoff at 45 Hz removes muscle noise and other high-frequency artifacts. This range preserves the relevant ECG frequency components critical for detecting arrhythmic patterns.

Notch filtering: Powerline interference at 50 Hz is a common source of noise in ECG recordings, especially in regions where the electrical grid operates at this frequency. To mitigate this, a notch filter centered at 50 Hz was applied. This filter selectively attenuates the narrowband noise without significantly affecting the ECG signal, ensuring cleaner recordings.

Baseline wander removal: Although the bandpass filter reduces baseline drift, residual slow fluctuations may remain. To further correct this, a high-pass Butterworth filter with a cutoff frequency of approximately 0.5 Hz was applied. This step ensures that baseline drift, which can interfere with accurate heartbeat delineation, is minimized.

Normalization: Following noise and artifact removal, each ECG segment was normalized using Z-score standardization. This technique centers the data by subtracting the mean and scales it by the standard deviation, resulting in features with zero mean and unit variance. Normalization is essential to bring all input features to a common scale, improving the stability and convergence speed of DL models during training.

All these preprocessing steps were implemented as a pipeline, consistently applied to all ECG segments to produce clean, normalized data ready for model training and evaluation.
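A minimal sketch of such a pipeline, built from SciPy filter-design routines, is given below. The cutoff frequencies follow the text; the filter order, the notch quality factor, the use of zero-phase (forward-backward) filtering and the function name `preprocess_segment` are assumptions, since the study does not report them.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch, sosfiltfilt

FS = 360  # Hz


def preprocess_segment(x, fs=FS, order=2, notch_q=30.0):
    """Band-pass, notch, high-pass and z-score a 1-D ECG array."""
    x = np.asarray(x, dtype=float)
    # 1) Butterworth band-pass, 0.5-45 Hz (second-order sections for
    #    numerical stability at the low cutoff).
    sos_bp = butter(order, [0.5, 45.0], btype="bandpass", fs=fs, output="sos")
    x = sosfiltfilt(sos_bp, x)
    # 2) Notch filter centred on 50 Hz powerline interference.
    b_notch, a_notch = iirnotch(50.0, notch_q, fs=fs)
    x = filtfilt(b_notch, a_notch, x)
    # 3) Extra high-pass at 0.5 Hz for residual baseline wander.
    sos_hp = butter(order, 0.5, btype="highpass", fs=fs, output="sos")
    x = sosfiltfilt(sos_hp, x)
    # 4) Z-score normalization: zero mean, unit variance.
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()


# Synthetic test signal: 8 Hz tone + 50 Hz hum + offset + slow drift.
t = np.arange(10 * FS) / FS
raw = np.sin(2 * np.pi * 8 * t) + 0.8 * np.sin(2 * np.pi * 50 * t) + 2.0 + 0.5 * t
clean = preprocess_segment(raw)
```

Zero-phase filtering is used here so that the filters do not shift the QRS complex in time; a causal implementation would introduce a delay that has to be accounted for when aligning beats.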

Figure 2 illustrates the impact of the preprocessing pipeline on ECG heart signals, highlighting changes in signal quality across two representative samples.

Fig. 2. Before and after data preprocessing.

Data balancing

Imbalanced datasets, where some arrhythmia classes are significantly underrepresented compared to others, pose a major challenge for developing robust DL models. Without addressing this imbalance, models tend to be biased towards the majority class, resulting in poor detection of rare but clinically important arrhythmia events. To mitigate this, several data balancing techniques were applied to the training dataset to synthetically increase minority class samples and improve class distribution.

Adaptive synthetic sampling (ADASYN): ADASYN generates synthetic samples for minority classes by adaptively focusing on harder-to-learn examples near class boundaries. By emphasizing these complex samples, ADASYN helps the model better distinguish subtle arrhythmic patterns that are often overlooked, improving sensitivity to rare classes.

Synthetic minority over-sampling technique (SMOTE): SMOTE creates new synthetic samples for minority classes by interpolating between existing minority instances. This approach balances the dataset without simply duplicating samples, which helps reduce overfitting and enhances the generalization capability of the model.

SMOTE combined with Tomek links (SMOTETomek): This hybrid method combines SMOTE oversampling with Tomek Links undersampling to both synthesize minority samples and remove borderline majority samples that overlap with minorities. The result is a cleaner and more balanced dataset, which can improve classifier performance by reducing class overlap.

Random over-sampling (ROS): ROS balances the dataset by duplicating existing minority class samples, ensuring that the minority class is adequately represented and helping models avoid strong bias toward the majority class. Because no synthetic data is introduced, the original feature space and the natural distribution of the minority samples are preserved, which makes results more interpretable. ROS is also simpler and faster to implement than SMOTE, ADASYN and SMOTETomek. Although duplication makes it prone to overfitting, ROS avoids the higher computational cost and complexity of the other oversampling methods and serves as a strong baseline for comparison.

By applying these balancing methods, the dataset’s class distribution was adjusted to reduce bias and improve the DL models’ ability to accurately detect and classify arrhythmic events across all classes.
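As a minimal sketch of the simplest of these methods, random over-sampling can be written in a few lines. The function name, seed and toy data below are hypothetical, and this is not the authors' implementation, which may rely on a library such as imbalanced-learn.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class
    is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            j = rng.choice(idx)      # sample with replacement
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out

# Toy data: 5 "normal" (0) rows and 2 "arrhythmia" (1) rows; values are illustrative.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [1.5], [1.7]]
y = [0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
```

Consistent with the text, resampling of this kind would be applied to the training split only, so that duplicated samples never appear in the held-out test set.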

The impact of data balancing with the ROS technique used in our proposed framework is demonstrated in Figs. 3, 4 and 5, which illustrate the class distributions before and after balancing across the MITDB, PTBDB and NSTDB datasets.

Fig. 3 Before and after data balancing of MITDB dataset.

Fig. 4 Before and after data balancing of PTBDB dataset.

Fig. 5 Before and after data balancing of NSTDB dataset.

Deep learning models

For arrhythmia classification, two DL architectures were developed and evaluated: a fully connected Dense Neural Network (DNN) and a one-dimensional Convolutional Neural Network (CNN). Both models were designed for binary classification to distinguish normal from arrhythmic ECG segments.

Dense neural network (DNN):

The DNN model consists of an input layer matching the feature dimension of the preprocessed ECG segments, followed by three dense layers with 128, 64 and 32 neurons, respectively. Each hidden layer uses the ReLU activation function to introduce non-linearity. Dropout layers with rates of 0.3 and 0.2 were added after the first and second dense layers, respectively, to reduce overfitting by randomly disabling neurons during training. The output layer contains a single neuron with a sigmoid activation function, providing a probability score for binary classification. The model was compiled with the Adam optimizer and binary cross-entropy loss, tracking accuracy and the Area Under the ROC Curve (AUC) as evaluation metrics. Training ran for a maximum of 30 epochs with a batch size of 32, and early stopping based on validation loss with a patience of 10 epochs was employed to prevent overfitting.
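The early stopping rule described above can be sketched independently of any DL library. The function name and the list of validation losses are hypothetical stand-ins for a real training loop. The stopping rule follows the common convention of halting after `patience` consecutive epochs without improvement, which is an assumption about how the criterion was configured.

```python
def train_with_early_stopping(val_losses, patience=10, max_epochs=30):
    """Return (epochs_run, best_epoch) given per-epoch validation losses.

    `val_losses` stands in for a real training loop: element k is the
    validation loss observed after epoch k + 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # Improvement: remember it, which resets the waiting period.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop.
            break
    return epochs_run, best_epoch

# Hypothetical curve: improves for 5 epochs, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.45] * 25
epochs_run, best_epoch = train_with_early_stopping(losses)
```

With a 30-epoch cap and a patience of 10, this rule can only trigger when the best validation loss occurs at or before epoch 20. Otherwise training simply runs to the cap.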

Convolutional neural network (CNN):

The CNN model leverages the temporal structure of ECG signals by applying one-dimensional convolutional layers. Input segments were reshaped to include a channel dimension, enabling convolutional operations. The architecture comprises two convolutional layers with 128 and 64 filters respectively, each followed by max-pooling layers that downsample the feature maps, reducing computational complexity and extracting dominant features. After flattening the output, a dropout layer with a rate of 0.3 was applied before a dense layer of 32 neurons with ReLU activation. The final output layer uses a sigmoid activation function for binary classification. The CNN was compiled and trained with the same optimizer, loss function and early stopping criteria as the DNN model.
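To make the convolution and pooling operations concrete, the following sketch applies one single-channel filter and one max-pooling step to a short signal in plain Python. The kernel values, the absence of padding, the pooling window of 2 and the ReLU after the convolution are all assumptions, since the text does not state these details.

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """Single-channel 1-D convolution without padding, followed by ReLU.

    As in most DL libraries, the kernel is slid over the signal without
    being flipped (cross-correlation).
    """
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        window = signal[start:start + k]
        value = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, value))  # ReLU (assumed activation)
    return out

def max_pool1d(features, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + pool_size])
            for i in range(0, len(features) - pool_size + 1, pool_size)]

# Hypothetical 8-sample signal with one sharp peak, and a peak-detecting kernel.
signal = [0.0, 0.1, 0.0, 1.0, -0.4, 0.0, 0.1, 0.0]
kernel = [-1.0, 2.0, -1.0]
feature_map = conv1d_valid(signal, kernel)   # length 8 - 3 + 1 = 6
pooled = max_pool1d(feature_map)             # length 6 // 2 = 3
```

The largest pooled value sits where the sharp peak occurs, which illustrates how a learned filter can respond to a local waveform shape while pooling halves the sequence length.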

Both models were evaluated on a held-out test set. Predictions were thresholded at 0.5 to convert probabilities into binary class labels, and performance was assessed with a broad set of metrics to compare the two models. The CNN typically benefits from capturing local temporal dependencies in ECG data, while the DNN exploits global features from the flattened input. Their comparative results provide insights into the efficacy of feature extraction approaches for arrhythmia detection.

Model size and memory footprint discussion

The DNN model comprises approximately 103K trainable parameters, with a total memory footprint (including optimizer states) of around 1.17 MB. This relatively small size makes it computationally efficient and suitable for deployment in resource-constrained environments. The fully connected architecture effectively learns from flattened input features but may have limited ability to capture temporal dependencies in ECG signals (Table 1).

In contrast, the CNN model contains about 390K trainable parameters and requires approximately 4.46 MB of memory, including optimizer states. The increased complexity is largely due to the convolutional layers and the large dense layer following flattening, which enable the model to automatically extract hierarchical temporal features from ECG data. This capacity often translates into improved classification performance but requires more computational resources and memory (Table 2).

The trade-off between model size and performance should be considered when selecting an architecture. The DNN offers faster training and lower memory usage, while the CNN provides more powerful feature extraction at the cost of increased resource demand.
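The reported sizes can be roughly cross-checked with a short calculation. The sketch below is an illustration under several assumptions that the text does not state: an input length of 720 samples, a kernel size of 3, unpadded convolutions, a pooling window of 2, 32-bit weights, and a footprint that counts the weights plus two Adam moment buffers. These values were chosen only because they give totals close to the reported ones.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out                      # weights + biases

def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters

def footprint_mb(n_params, bytes_per_value=4, copies=3):
    # copies = 3: the weights plus two Adam moment buffers (assumed).
    return n_params * bytes_per_value * copies / 2**20

INPUT_LEN = 720   # assumed segment length (not stated in the text)
KERNEL = 3        # assumed Conv1D kernel size
POOL = 2          # assumed pooling window

# DNN: input -> 128 -> 64 -> 32 -> 1
dnn = (dense_params(INPUT_LEN, 128) + dense_params(128, 64)
       + dense_params(64, 32) + dense_params(32, 1))

# CNN: Conv(128) -> pool -> Conv(64) -> pool -> flatten -> Dense(32) -> Dense(1)
length = (INPUT_LEN - KERNEL + 1) // POOL            # after conv 1 + pool 1
length = (length - KERNEL + 1) // POOL               # after conv 2 + pool 2
flatten_dense = dense_params(length * 64, 32)        # dense layer after flattening
cnn = (conv1d_params(KERNEL, 1, 128) + conv1d_params(KERNEL, 128, 64)
       + flatten_dense + dense_params(32, 1))

print(dnn, round(footprint_mb(dnn), 2))
print(cnn, round(footprint_mb(cnn), 2))
```

Under these assumptions the totals land close to the reported ~103K and ~390K parameters. The dense layer after flattening accounts for roughly 94% of the CNN's parameters, in line with the observation above that this layer drives the size difference. The agreement suggests, but does not confirm, that these were the settings used.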

Table 1 Summary of DNN model architecture and parameters.
Table 2 Summary of CNN model architecture and parameters.

Model evaluation metrics

To rigorously assess the effectiveness and reliability of the proposed arrhythmia detection models, a comprehensive suite of performance metrics was utilized. These metrics offer a holistic evaluation of the models’ classification capabilities, particularly in the context of binary classification for distinguishing between normal and abnormal heartbeats. The definitions of these metrics, alongside their mathematical formulations, are provided below. In these equations, TP represents True Positives, TN denotes True Negatives, FP stands for False Positives and FN signifies False Negatives. n is the number of data points, \(y_i\) is the actual value and \(\hat{y}_i\) is the predicted value. A summary of the performance metric equations is presented in Table 3.

Table 3 Performance metrics and their formulas.
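As a companion to the definitions in Table 3, the metrics can be computed from TP, TN, FP and FN with their standard textbook formulas. The function name, the choice of `>=` at the threshold and the toy inputs are hypothetical. Computing MAE, MSE and RMSE on predicted probabilities rather than hard labels is an assumption.

```python
import math

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Standard binary-classification metrics from labels and probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(y_true)

    def safe_div(a, b):
        return a / b if b else 0.0

    accuracy = (tp + tn) / n
    precision = safe_div(tp, tp + fp)
    tpr = safe_div(tp, tp + fn)                      # recall / sensitivity
    tnr = safe_div(tn, tn + fp)                      # specificity
    f1 = safe_div(2 * precision * tpr, precision + tpr)
    mcc = safe_div(tp * tn - fp * fn,
                   math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = safe_div(accuracy - p_e, 1 - p_e)
    # Error metrics on probabilities (assumed), not on thresholded labels.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_prob)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / n
    return {"accuracy": accuracy, "precision": precision, "tpr": tpr,
            "tnr": tnr, "fpr": 1 - tnr, "fnr": 1 - tpr, "f1": f1,
            "mcc": mcc, "kappa": kappa,
            "mae": mae, "mse": mse, "rmse": math.sqrt(mse)}

# Toy labels and probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.1, 0.2, 0.6, 0.7, 0.3]
metrics = binary_metrics(y_true, y_prob)
```

Two relations are useful when reading the result tables. RMSE is the square root of MSE, so an MSE of 0.23% corresponds to an RMSE of about 4.8%, which suggests the reported error values are percentages. For a class-balanced test set, Cohen's Kappa reduces to 2 × Accuracy − 1, which matches reported pairs such as 99.74% accuracy with 99.48% Kappa.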

Explainable AI (XAI) techniques

To enhance the transparency and interpretability of our DL models for arrhythmia detection, this study integrates three Explainable AI (XAI) techniques. These methods provide critical insights into the model’s decision-making process, fostering greater trust and facilitating clinical adoption. The specific XAI techniques employed are described below:

SHAP (SHapley additive exPlanations)

SHAP is a game-theoretic approach that assigns an importance value to each feature for a particular prediction. It attributes the difference between the actual prediction and the average prediction to individual features by averaging each feature’s marginal contribution over all possible feature coalitions. This “Shapley value” ensures fairness by accounting for feature interactions and providing a consistent measure of influence. In the context of ECG analysis, SHAP values can highlight which specific segments or derived features of an ECG signal are most influential in classifying a particular arrhythmia, offering localized explanations for individual predictions.
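The underlying Shapley computation can be shown exactly on a tiny model. This sketch enumerates every coalition, which is only feasible for a handful of features. Practical SHAP explainers rely on approximations, and the text does not state which explainer was used. The toy model, the all-zero baseline and all names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`; features outside a
    coalition are replaced by their `baseline` value."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    # One additive term and one interaction term, for illustration only.
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the additive feature receives its full contribution, the two interacting features split their joint contribution equally, and the values sum to the difference between the prediction at `x` and at the baseline.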

LIME (Local interpretable model-agnostic explanations)

LIME is a model-agnostic XAI technique that focuses on explaining individual predictions of any black-box model. It works by locally perturbing the input data and observing how the model’s prediction changes. A simpler, interpretable model (e.g., linear regression) is then trained on these perturbed samples and their corresponding predictions. This local model approximates the complex model’s behavior around the specific instance being explained, providing an intuitive understanding of why a particular prediction was made. For ECG signals, LIME can identify which parts of the waveform (e.g., QRS complex, P-wave) are most critical for a given classification decision for a specific patient.
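The perturb-and-fit procedure can be sketched as a locally weighted linear regression. This is a simplified illustration, not the LIME library: it omits feature selection and discretization. The Gaussian perturbation scale, the kernel width, the toy black-box model and all names are assumptions.

```python
import math
import random

def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def lime_explain(predict, x, n_samples=500, sigma=0.1, kernel_width=0.5, seed=0):
    """Fit a locally weighted linear surrogate around x; return feature slopes."""
    rng = random.Random(seed)
    d = len(x)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, sigma) for xi in x]           # perturb the instance
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        weights.append(math.exp(-dist2 / kernel_width ** 2))   # nearer samples weigh more
        rows.append([1.0] + z)                                 # intercept + features
        targets.append(predict(z))
    # Weighted normal equations: (X^T W X) w = X^T W y
    A = [[sum(wt * r[i] * r[j] for wt, r in zip(weights, rows))
          for j in range(d + 1)] for i in range(d + 1)]
    b = [sum(wt * r[i] * t for wt, r, t in zip(weights, rows, targets))
         for i in range(d + 1)]
    return solve_linear(A, b)[1:]                              # drop the intercept

def black_box(z):
    # Hypothetical model: depends on features 0 and 1, ignores feature 2.
    return 1.0 / (1.0 + math.exp(-(3.0 * z[0] - 2.0 * z[1])))

slopes = lime_explain(black_box, [0.2, 0.1, 0.5])
```

The surrogate's slopes give the sign and relative strength of each feature's local influence. Here the first feature pushes the prediction up, the second pushes it down, and the ignored third feature receives a slope close to zero.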

Feature importance analysis (FIA)

FIA assesses the global contribution of each input feature to the model’s overall predictive performance. While specific implementations may vary (e.g., permutation importance, impurity-based importance in tree models, or weight analysis in neural networks), the core idea is to quantify how much a model’s performance degrades when a specific feature’s information is removed or randomized. In the context of DL models like CNNs and DNNs, FIA can reveal which raw ECG lead signals, or which derived high-level features learned by the network, consistently contribute most to accurate arrhythmia classification across the entire dataset. This provides a macroscopic view of feature relevance, complementing the localized explanations offered by SHAP and LIME.
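One of the variants named above, permutation importance, can be sketched as follows: shuffle one feature column at a time and record the resulting drop in accuracy. The text does not specify which FIA variant was actually used. The toy classifier, the data and all names are hypothetical.

```python
import random

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(1 for r, t in zip(rows, y) if predict(r) == t) / len(y)

    base = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in X]
            rng.shuffle(column)                      # break the feature-label link
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, column)]
            drops.append(base - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_classifier(row):
    # Hypothetical model that uses feature 0 only.
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_classifier(r) for r in X]
importances = permutation_importance(toy_classifier, X, y)
```

Shuffling the feature the classifier depends on causes a large accuracy drop, whereas shuffling the unused feature leaves accuracy unchanged, giving it an importance of exactly zero.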

Result analysis

Performance analysis on MITDB dataset

Table 4 presents a comprehensive performance analysis of our proposed DL models (CNN and DNN) on the MITDB, evaluated across various data balancing techniques: ADASYN, SMOTE, SMOTETomek and ROS. The evaluation encompasses a broad range of metrics, including Accuracy, Precision, TPR, TNR, FPR, FNR, F1-Score, AUC, MCC, Cohen’s Kappa, MAE, MSE and RMSE.

A thorough examination of the results reveals that both CNN and DNN models achieve remarkably high performance across all data balancing strategies. Notably, the ROS technique, when coupled with the CNN model, yielded superior performance, establishing new benchmarks for arrhythmia detection on the MITDB dataset. This combination achieved an outstanding Accuracy of 99.74%, a Precision of 99.75%, a TPR of 99.73% and an F1-Score of 99.74%. Furthermore, it demonstrated exceptional discriminatory power with an AUC of 99.98% and strong agreement with MCC and Cohen’s Kappa both at 99.48%. The error metrics for this configuration were also the lowest, with MAE at 0.31, MSE at 0.23 and RMSE at 4.78, indicating minimal prediction errors.

While other balancing techniques also performed strongly, ROS consistently provided a slight edge, particularly in conjunction with the CNN architecture. For instance, SMOTETomek with CNN achieved an Accuracy of 99.60%, slightly lower than ROS+CNN. The DNN model also showed robust performance across techniques, with its best Accuracy of 99.60% achieved with ROS. These results underscore the effectiveness of our advanced preprocessing pipeline and the inherent capabilities of both CNN and DNN architectures, with ROS proving to be the most effective strategy for mitigating class imbalance in this context, leading to highly accurate and reliable arrhythmia detection.

Table 4 Performance analysis of MITDB dataset.

The analysis of our DL models on the MITDB dataset, as visually represented in Fig. 6, provides compelling evidence of their high efficacy for arrhythmia detection. In Fig. 6(a), both CNN and DNN models consistently achieve remarkably high Accuracy and Precision scores across all tested data balancing techniques. Notably, the ROS technique, when coupled with the CNN model, yields the best performance, with an Accuracy of 99.74% and a Precision of 99.75%, establishing a new benchmark. While other balancing methods, such as SMOTE and SMOTETomek, also enable both models to achieve performance scores consistently above 99.50% (e.g., SMOTE CNN Accuracy 99.52%, SMOTETomek CNN Accuracy 99.60%), the ROS+CNN combination stands out. Correspondingly, Fig. 6(b) highlights the minimized prediction errors achieved by our framework. The CNN model, particularly with ROS, exhibits the lowest MAE and MSE (0.31 and 0.23, respectively), indicating exceptionally low prediction discrepancies. Although other combinations, such as ADASYN with CNN, show slightly higher error rates (MAE 0.67, MSE 0.41), the overall trend underscores the robust and reliable performance of our proposed models, with ROS proving to be the most effective data balancing strategy for achieving highly accurate and trustworthy arrhythmia detection.

Fig. 6 Result analysis of DL models on MITDB dataset.

Performance analysis on PTBDB dataset

Table 5 summarizes the comprehensive performance evaluation of our proposed DL models (CNN and DNN) applied to the PTBDB. Consistent with the results observed in the MITDB dataset, the CNN model combined with the ROS balancing technique delivered the highest performance on PTBDB. This configuration attained an outstanding Accuracy of 99.43%, a Precision of 99.57%, a TPR of 99.29% and an F1-Score of 99.43%. The AUC value peaked at 99.98%, underscoring the model’s excellent discriminative capability. Additionally, the MCC and Cohen’s Kappa coefficients both reached 98.86%, demonstrating strong classification reliability. The CNN with ROS also yielded the lowest error rates, with MAE at 0.92, MSE at 0.43 and RMSE at 6.56, indicating highly precise predictions with minimal deviation.

A pronounced performance gap is evident between the CNN and DNN models. While CNN maintains high accuracies exceeding 98% across all balancing methods, the DNN’s accuracies hover between approximately 96.15% and 96.81%. For example, under the ROS technique, the DNN model achieves an Accuracy of 96.17% and Precision of 96.63%, markedly lower than the CNN counterpart. The DNN’s error rates are also substantially higher, with MAE of 6.03, MSE of 2.96 and RMSE of 17.19 in the ROS setting, which highlights the CNN’s superior ability to capture complex spatial features within ECG signals compared to the fully connected DNN architecture.

Table 5 Performance analysis of PTBDB dataset.

The analysis of our DL models on the PTBDB dataset, visually summarized in Fig. 7, demonstrates their robust efficacy for arrhythmia detection. In Fig. 7(a), both CNN and DNN models achieve high Accuracy and Precision scores across all tested data balancing techniques, though the CNN consistently outperforms the DNN. Notably, the ROS technique paired with the CNN model delivers the best performance, achieving an Accuracy of 99.43% and a Precision of 99.57%, setting a new benchmark for this dataset. Other balancing methods such as SMOTE and SMOTETomek also enable strong CNN performance (e.g., CNN with SMOTE: Accuracy 99.12%, Precision 99.47%; CNN with SMOTETomek: Accuracy 99.10%, Precision 99.24%). Meanwhile, the DNN models show comparatively lower results; for instance, DNN with ROS attains an Accuracy of 96.17% and a Precision of 96.63%.

Correspondingly, Fig. 7(b) highlights the prediction errors of the models. The CNN with ROS exhibits the lowest error rates, with a MAE of 0.92 and MSE of 0.43, indicating highly precise predictions with minimal deviations. Conversely, the DNN shows consistently higher error rates across balancing methods, such as MAE 6.03 and MSE 2.96 with ROS, underscoring the CNN’s superior accuracy. This comprehensive evaluation confirms the CNN’s effectiveness as a highly accurate and reliable model for arrhythmia detection on the PTBDB dataset, with ROS emerging as the most effective data balancing technique to optimize model performance and reduce prediction errors.

Fig. 7 Result analysis of DL models on PTBDB dataset.

Performance analysis on NSTDB dataset

For external validation of our framework’s robustness and generalization capability, we evaluated performance on the NSTDB, with detailed results presented in Table 6. Consistent with prior findings, the CNN combined with the ROS technique delivered the best overall performance on NSTDB, achieving an Accuracy of 99.98%, a Precision of 99.97%, a TPR of 100.00% and an F1-Score of 99.98%. Both CNN and DNN models attained an AUC of 100.00% across all balancing techniques, indicating that the two classes are almost completely separable at the reported precision. The ROS+CNN configuration also recorded extremely low prediction errors, with MAE at 0.08, MSE at 0.02 and RMSE at 1.37, demonstrating highly accurate and reliable predictions. The DNN model also exhibited excellent performance, particularly with ADASYN and ROS, achieving an Accuracy of 99.95% in both cases.

While the CNN consistently maintains the edge, the minimal performance differences across all balancing techniques and models underscore both the intrinsic quality of the NSTDB dataset and the superior capability of our proposed models to handle noise and stress in ECG signals. These findings confirm the strong generalizability and resilience of our framework for robust arrhythmia detection in noisy clinical environments.

Table 6 Performance analysis of NSTDB dataset.

The analysis of our DL models on the NSTDB dataset, as illustrated in Fig. 8, demonstrates exceptionally high performance for arrhythmia detection. In Fig. 8(a), both CNN and DNN models consistently achieve near-perfect Accuracy and Precision scores across all tested data balancing techniques. Notably, the CNN model combined with ROS attains the highest Accuracy of 99.98% with a Precision of 99.97%, while the DNN model with ADASYN follows closely with an Accuracy of 99.95% and a Precision of 99.97%. Other balancing techniques such as SMOTE and SMOTETomek also support very high performance, with both models maintaining Accuracy and Precision scores above 99.86% and 99.82%, respectively, underscoring their robustness and consistent classification power.

Correspondingly, Fig. 8(b) highlights the remarkably low prediction errors achieved by our framework. The CNN model with ROS records the lowest errors, with an MAE of 0.08 and an MSE of 0.02. The DNN model with ADASYN matches this MAE of 0.08 with a slightly higher MSE of 0.04, indicating similarly precise and reliable predictions. Other combinations, such as the CNN with SMOTETomek, show slightly higher but still very low error values (e.g., MAE of 0.21, MSE of 0.11), confirming the overall effectiveness of our DL framework for robust arrhythmia detection under noisy conditions represented by NSTDB.

Fig. 8 Result analysis of DL models on NSTDB dataset.

The CNN model outperforms the DNN by effectively capturing spatial patterns and temporal dependencies within ECG signals through convolutional layers. Its hierarchical feature extraction enables better generalization and improved classification accuracy for complex arrhythmic patterns.

Classification reports analysis

The classification reports on the MITDB dataset (Table 7) reveal consistently high performance for both CNN and DNN models across all data balancing techniques, with ROS achieving the best overall results. For the CNN model with ROS, Precision, Recall and F1-Score exceed 99.70% for both Normal and Arrhythmia classes, demonstrating balanced and robust classification. Similarly, the DNN model with ROS attains slightly lower but still excellent scores above 99.50% across all metrics for both classes. Other balancing methods such as ADASYN, SMOTE and SMOTETomek also yield strong and nearly symmetrical precision and recall values between the two classes, reflecting the models’ effectiveness in accurately detecting arrhythmias while minimizing false positives and negatives. This high degree of consistency underscores the reliability and precision of the proposed DL framework for arrhythmia detection on the MITDB dataset.

Table 7 Classification reports of DL models on MITDB dataset.

The classification reports for the PTBDB dataset (Table 8) demonstrate the superior performance of the CNN model compared to the DNN model across all data balancing techniques. The CNN combined with ROS achieved the highest Precision, Recall and F1-Score, exceeding 99.20% for both Normal and Arrhythmia classes, indicating highly balanced and accurate classification. In contrast, the DNN model, while performing well, showed comparatively lower metrics, with Precision and Recall values generally in the mid-90s range. Among balancing techniques, ROS consistently provided the best results, closely followed by SMOTETomek and SMOTE, underscoring its effectiveness in addressing class imbalance. The DNN’s relatively lower scores suggest it is less effective at capturing the nuanced features of the PTBDB ECG signals, whereas the CNN excels due to its spatial feature extraction capabilities. Overall, these results confirm the robustness of the CNN model paired with ROS as the most effective framework for arrhythmia detection on the PTBDB dataset.

Table 8 Classification reports of DL models on PTBDB dataset.

The classification reports on the NSTDB dataset, shown in Table 9, highlight the outstanding performance of both CNN and DNN models across all data balancing techniques. The CNN model with ROS achieved near-perfect classification results, with Precision, Recall and F1-Score consistently above 99.97% for both Normal and Arrhythmia classes, demonstrating exceptional sensitivity and specificity. Similarly, the DNN model also performed remarkably well, closely matching the CNN’s metrics, particularly under ROS, where F1-Scores reached 99.95% for both classes. Other balancing methods such as ADASYN, SMOTE and SMOTETomek also delivered excellent results, with minimal variation between the models, reinforcing the robustness of the framework. These findings confirm that both DL architectures maintain highly reliable and accurate arrhythmia detection on this noisy, real-world external dataset, underscoring their strong generalization and adaptability in challenging clinical environments.

Table 9 Classification reports of DL models on NSTDB dataset.

ROC curve analysis

Figure 9 illustrates the ROC curves of the CNN model evaluated on three different datasets: MITDB, PTBDB and NSTDB. The ROC curves demonstrate excellent classification performance, with the CNN model achieving near-perfect discrimination ability across all datasets. Specifically, the AUC values are extremely high—99.99% for MITDB, 99.98% for PTBDB and a perfect 100.00% for NSTDB—indicating outstanding capability in distinguishing between arrhythmia and normal classes. The curves closely approach the top-left corner of each plot, reflecting very low false positive rates alongside high true positive rates, which confirms the model’s robustness and reliability in arrhythmia detection across diverse ECG datasets. This consistent performance further validates the effectiveness of the CNN architecture in handling varied data distributions and complexities inherent in these benchmark datasets.
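For reference, the AUC behind such curves can be computed directly from labels and scores. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative) and also builds the ROC points by sweeping the threshold. The function names and toy data are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def roc_points(y_true, scores):
    """(FPR, TPR) pairs from sweeping the threshold over each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy labels and scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.6, 0.7, 0.3]
auc = roc_auc(y_true, scores)
curve = roc_points(y_true, scores)
```

The area under the piecewise-linear curve through these points equals the pairwise AUC. A curve that hugs the top-left corner therefore corresponds to almost every arrhythmic segment receiving a higher score than almost every normal segment.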

Fig. 9 ROC curve of CNN model on arrhythmia datasets.

Figure 10 illustrates the class-wise ROC curves of the CNN model across three benchmark ECG datasets: MITDB, PTBDB and NSTDB. Each subfigure presents the model’s ability to distinguish between Normal and Arrhythmia classes, with corresponding AUC scores. The CNN model demonstrates exceptional discriminative power, achieving near-perfect AUCs: 99.99% for both classes in MITDB, 99.98% and 99.96% in PTBDB and a flawless 100.00% for both classes in NSTDB. The macro average AUCs further confirm the model’s robustness, indicating consistent performance across class distributions. These results validate the effectiveness of our ROS+CNN framework in accurately identifying arrhythmic patterns, reinforcing its suitability for clinical-grade ECG analysis.

Fig. 10 Class-wise ROC curve of CNN model on arrhythmia datasets.

XAI analysis

Figure 11 presents a comprehensive XAI analysis of the CNN model applied to the MITDB dataset, highlighting its interpretability across multiple dimensions. Subfigure (a) shows the FIA, which ranks the most influential ECG features contributing to the model’s predictions. Subfigures (b) and (c) display SHAP values for the top 10 features in the Arrhythmia and Normal classes, respectively, revealing whether each feature pushes the model’s output up or down. Subfigures (d) and (e) illustrate LIME visualizations, offering localized explanations for individual predictions in both classes. Together, these XAI techniques provide transparent, class-specific insights into the CNN’s decision-making process, enhancing clinical trust and validating the model’s reliability for real-world ECG interpretation.

Fig. 11 XAI of CNN model on MITDB.

Figure 12 showcases the XAI evaluation of the CNN model on the PTBDB dataset, offering a transparent view into its decision-making process. Subfigure (a) presents the FIA, where features such as Time_2, Time_33 and Time_34 emerge as dominant contributors to model predictions. Subfigures (b) and (c) illustrate SHAP-based interpretations for the Arrhythmia and Normal classes, respectively, highlighting how specific temporal features either enhance or suppress the likelihood of a given class. Subfigures (d) and (e) provide LIME-based local explanations, revealing the most influential features in individual predictions for both classes. Notably, features like Time_34 and Time_35 consistently appear across methods, reinforcing their diagnostic relevance. This multi-method XAI approach strengthens clinical trust by demystifying the CNN’s internal logic and validating its reliability for arrhythmia detection in PTBDB.

Fig. 12 XAI of CNN model on PTBDB.

Figure 13 presents the XAI evaluation of the CNN model on the NSTDB dataset, offering a multi-angle interpretability assessment. Subfigure (a) displays the FIA, where features such as Time_367, Time_369 and Time_365 emerge as the most influential in driving model predictions. Subfigures (b) and (c) show SHAP visualizations for the Arrhythmia and Normal classes, respectively, revealing how individual feature values (color-coded from low to high) impact the model’s output. Notably, Time_367 and Time_369 consistently exhibit strong influence across both classes. Subfigures (d) and (e) provide LIME-based local explanations, identifying specific feature intervals and thresholds that contribute most to class-specific predictions. For instance, conditions like \(I \le -3.54\) and \(-3.14 < \texttt{Time\_369} \le -1.28\) are pivotal in distinguishing arrhythmic from normal signals. This layered XAI approach enhances transparency and clinical trust, affirming the CNN’s reliability in interpreting ECG signals from NSTDB.

Fig. 13 XAI of CNN model on NSTDB.

Clinical implications

Our proposed explainable DL framework for arrhythmia detection holds significant clinical implications, addressing several critical needs in cardiovascular diagnostics. The remarkably high accuracy and robustness achieved across diverse datasets (MITDB, PTBDB and NSTDB) signify a substantial step towards more reliable and automated arrhythmia diagnosis. This enhanced diagnostic precision can lead to earlier and more accurate identification of life-threatening cardiac conditions, enabling timely medical intervention and potentially reducing mortality rates associated with cardiovascular diseases.

Crucially, the integration of XAI methodologies, including SHAP, LIME and FIA, distinguishes our framework from conventional black-box DL models. In a clinical setting, interpretability is paramount; clinicians need to understand not just what a model predicts, but why. Our XAI components provide transparent insights into the model’s decision-making process, highlighting the specific ECG features that contribute most to arrhythmia detection. This transparency fosters greater trust and confidence among healthcare professionals, empowering them to validate predictions, identify potential biases and use the AI-driven insights to complement their clinical judgment rather than blindly accepting automated outputs. Such interpretability is vital for the widespread adoption of AI in critical medical domains.

Discussion

This section interprets the experimental findings and compares our proposed method with recent state-of-the-art (SOA) methods for arrhythmia detection. The results highlight both the performance advantages and the unique contributions of our XAI-enabled DL pipeline.

Comparative analysis on MITDB dataset

Table 10 summarizes a comparison of our model against contemporary approaches evaluated on the MITDB. The comparison considers architectural choices, data balancing techniques and the explicit use of XAI methods.

Our proposed framework (RCX) achieves a remarkable accuracy of 99.74%, outperforming other recent works, including CNN + SEPcEnet (98.56%)18 and RoBERTa Transformer (98.81%)20. Even when compared to highly optimized designs such as ECGTransForm (99.35%)23 and DNN with residual blocks (99.51%)27, our framework demonstrates superior or comparable performance. This consistency underscores the effectiveness of our data-centric pipeline, where careful preprocessing, robust feature handling and targeted data balancing with ROS play critical roles in maximizing classification accuracy.

A key differentiator of our work lies in the explicit integration of XAI methods—SHAP, LIME and FIA. While most existing studies emphasize predictive performance, few incorporate interpretable mechanisms. As shown in Table 10, nearly all compared works lack explainability components, which limits their direct applicability in clinical environments. In contrast, our model not only achieves the highest reported accuracy but also provides transparent insights into the decision-making process. This interpretability is essential in healthcare, where clinicians must understand and validate algorithmic predictions before integrating them into diagnostic workflows.

Another important observation is the limited adoption of systematic data balancing strategies in prior works. Apart from CAT-Net25, which employed SMOTE-Tomek, and ECGTransForm23, which introduced CAL, most studies did not explicitly address class imbalance. This omission is critical, as ECG datasets often contain a heavy skew toward normal beats, making minority arrhythmia classes more challenging to detect. By employing ROS, our framework effectively alleviates this imbalance, improving both overall performance and minority-class sensitivity.

In summary, the proposed framework establishes a new benchmark on MITDB by combining three pillars of clinical AI: high predictive accuracy, robust data balancing and transparent explainability. This synergy not only advances state-of-the-art performance but also bridges the gap between high-performing black-box algorithms and real-world clinical trustworthiness.

Table 10 Comparative analysis of State-of-the-Art (SOA) works with our proposed work on MITDB.

Comparative analysis on PTBDB dataset

Table 11 presents a comparative analysis of our proposed framework with recent SOA methods applied to the PTBDB. This dataset has been extensively used to benchmark arrhythmia detection models, providing valuable insights into architectural innovations, data balancing strategies and explainability.

Our proposed framework (RCX) achieves an accuracy of 99.43%, establishing a new benchmark on PTBDB and positioning it among the most effective solutions reported to date. In comparison, Subhiyakto et al.35 reported 99.36% accuracy using CNN, Transformer and LSTM models in conjunction with advanced SMOTE-based resampling, while El-Ghaish et al.23 achieved 99.35% accuracy and a macro F1-score of 94.26% with ECGTransForm, which integrated multi-scale CNN, a bidirectional transformer and CAL. Other strong baselines include Bai et al.’s CBGM model (98.82%)36 and Mondal et al.’s lightweight CNN for dECG signals (97.59%)33. By contrast, Padmavathi et al.38 achieved 96.62% with hybrid CNN–RHNN and CNN–ResNet architectures, while Lee and Shin37 demonstrated cross-database generalization with a beat score map approach, achieving an F1-score of 92.67% on PTB-XL.

A defining strength of our framework is the explicit integration of XAI techniques—SHAP, LIME and FIA—into the diagnostic pipeline. As shown in Table 11, while prior works emphasize accuracy or employ class balancing strategies such as CAL23 or SMOTE variants35, none incorporate interpretability. This leaves most models as “black boxes,” limiting their clinical applicability. By contrast, our framework provides transparency alongside top-tier accuracy, enabling clinicians to understand the reasoning behind model predictions.

Equally important is the role of ROS in addressing class imbalance, a persistent challenge in PTBDB. Unlike studies that relied exclusively on architectural complexity, our results demonstrate that systematically addressing imbalance through ROS enhances both predictive accuracy and fairness across classes.

In summary, the proposed model not only outperforms existing SOA methods in terms of accuracy but also delivers two essential attributes absent in most prior works: robust class balancing and transparent interpretability. Together, these qualities significantly strengthen the framework’s potential for clinical deployment as a reliable, trustworthy and explainable diagnostic support system.

Table 11 Comparison analysis of State-of-the-Art (SOA) works with our proposed work on PTBDB.

Comparative analysis on NSTDB dataset

Table 12 presents a comparative analysis of our proposed framework against recent SOA models developed for arrhythmia detection on the NSTDB. This dataset is widely recognized for its noisy and stress-induced ECG signals, making it a robust benchmark for evaluating model generalization and resilience under challenging conditions.

Our proposed framework (RCX) achieves an accuracy of 99.98% on the NSTDB, setting a new benchmark that highlights its superior robustness. This performance surpasses other competitive approaches, including Kumari’s Superlet+VGG18+KNN (99.46%)40, Nurmaini et al.’s Stacked DAE+AE+DNN (99.34%)42 and Singh’s ACDAE+ECA (98.88%)39. Other methods, such as CNN–LSTM22 and FCN+Autoencoder41, achieved lower accuracies of 98.65% and 97.02%, respectively, while Lee’s LSTM–ANN+QRS37 reported a substantially lower accuracy of 93.47%. For fairness, all reported results correspond to NSTDB-based evaluations as presented in their respective studies.

A key distinguishing feature of our framework is the explicit integration of XAI techniques—SHAP, LIME and FIA. As summarized in Table 12, none of the comparative models incorporated explainability, leaving their predictive mechanisms largely opaque. In contrast, our framework combines state-of-the-art performance with interpretable outputs, fostering clinical trust and supporting real-world diagnostic adoption.

Another critical factor contributing to our model’s robustness is the application of ROS for data balancing. While some prior works adopted strategies such as SMOTE37, most did not explicitly address imbalance issues in NSTDB. Our systematic balancing approach, coupled with explainability, provides a dual advantage: it ensures equitable learning across arrhythmia classes and enhances transparency in the decision-making process.

In summary, the proposed framework not only achieves the highest reported accuracy on NSTDB but also addresses two critical requirements for medical AI—robustness under noisy conditions and interpretability of predictions—reinforcing its suitability for clinical deployment.

Table 12 Comparison analysis of State-of-the-Art (SOA) works with our proposed work on NSTDB.

Ablation study

To identify the most effective optimization strategy for arrhythmia detection, we conducted an extensive ablation study using multiple optimizers—Adagrad, Adam, Adamax and RMSprop—on the PTBDB dataset. As shown in Table 13, the Adam optimizer consistently outperformed others across all evaluation metrics, including Accuracy (99.43%), Precision (99.57%), F1-Score (99.43%) and AUC (99.98%) for the CNN model. It also achieved the lowest error rates (MAE: 0.92, RMSE: 6.56), indicating superior generalization and robustness.

Given its outstanding performance, we adopted the Adam optimizer for subsequent experiments on the other datasets (MITDB and NSTDB). This choice not only enhanced classification accuracy but also improved model stability and convergence speed. The results affirm that optimizer selection plays a critical role in fine-tuning DL models for ECG-based arrhythmia detection and that Adam offers a reliable balance between precision and computational efficiency.
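The evaluation metrics used in this ablation can be illustrated with a minimal sketch for the binary case. The function name `binary_metrics`, the 0.5 decision threshold and the toy labels are assumptions made for illustration only. The scaling of the MAE and RMSE values reported in Table 13 is not specified in this section, so the sketch returns unscaled errors computed on predicted probabilities, which is also an assumption.

```python
import math

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute common binary classification metrics from true labels
    (0/1) and predicted probabilities of the positive class."""
    # Threshold of 0.5 is an assumed default, not taken from the paper.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Error metrics on probabilities (assumed; unscaled, in [0, 1]).
    errors = [abs(t - p) for t, p in zip(y_true, y_prob)]
    mae = sum(errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    rmse = math.sqrt(mse)

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mae": mae, "mse": mse, "rmse": rmse}

# Hypothetical toy predictions, not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.1, 0.2, 0.6, 0.7, 0.3]
metrics = binary_metrics(y_true, y_prob)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Computing all metrics from the same set of predictions, as here, keeps the comparison between optimizers consistent, since each configuration is scored on identical inputs.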

Table 13 Performance analysis of different optimization techniques on PTBDB dataset.

Beyond optimization, we performed a further ablation study on four data balancing methods (ROS, SMOTE, ADASYN and SMOTETomek) to address class imbalance in the ECG datasets. The comparative results are presented in Tables 4, 5 and 6. Across all datasets, ROS consistently outperformed the other techniques, especially when paired with CNN models. For instance, the ROS+CNN configuration achieved peak accuracies of 99.74% (MITDB), 99.41% (PTBDB) and 99.95% (NSTDB), along with high F1-scores and AUC values. In contrast, DNN models showed relatively lower performance under identical balancing conditions.
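The idea behind ROS can be sketched in a few lines: minority-class samples are duplicated at random until every class matches the size of the largest class. The following is a minimal standard-library illustration, not the implementation used in this study; the function name `random_oversample`, the fixed seed and the toy data are assumptions.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random (with replacement)
    until every class has as many samples as the largest class."""
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility

    # Group samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(items) for items in by_class.values())

    out_x, out_y = [], []
    for y, items in by_class.items():
        # Draw the missing number of samples from the existing ones.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(y)

    # Shuffle so that samples are not grouped by class.
    order = list(range(len(out_x)))
    rng.shuffle(order)
    return [out_x[i] for i in order], [out_y[i] for i in order]

# Hypothetical toy data: six majority-class and two minority-class beats.
X = [[0.1 * i, 0.2 * i] for i in range(8)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because ROS only repeats existing samples, it is usually applied to the training split alone, so that duplicated beats cannot appear in both the training and test sets.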

Comparison analysis of GAN and ROS

The comparative analysis between GAN-based data balancing and our proposed ROS technique, each combined with CNN and DNN models on the PTBDB dataset, demonstrates the superior performance of the ROS approach. As detailed in Table 14, the ROS+CNN model outperforms its GAN+CNN counterpart across all evaluated metrics. ROS+CNN achieves an Accuracy of 99.43%, Precision of 99.57%, Recall (TPR) of 99.29% and AUC of 99.98%, compared with the GAN+CNN model’s Accuracy of 85.84% and AUC of 94.44%. Similarly, the ROS+DNN model improves markedly on GAN+DNN, with an Accuracy of 96.17% versus 85.56%. ROS+CNN also records substantially lower MAE, MSE and RMSE values, indicating more precise and reliable predictions. This evaluation indicates that, on the PTBDB dataset, our ROS-based data balancing strategy yields more robust and effective models than GAN-based augmentation, establishing ROS+CNN as the more effective solution for arrhythmia detection in this setting.

Table 14 Comparison analysis of GAN and ROS on DL models on PTBDB.

Figure 14 visually reinforces the quantitative findings and clearly illustrates the stark performance difference between GAN- and ROS-based data balancing techniques on the PTBDB dataset. The ROS+CNN model consistently achieves near-perfect metrics across Accuracy, Precision, Recall and AUC, which are represented by the significantly taller bars or peaks in the figure compared to the GAN+CNN model. In contrast, the GAN-based models show considerably lower performance and higher error rates, reflected by shorter bars and greater variability in the plotted metrics. This graphical comparison highlights how ROS enhances the CNN’s predictive ability and robustness, leading to superior arrhythmia detection outcomes relative to GAN augmentation. Overall, the figure complements the tabulated data by providing an intuitive and immediate visual confirmation of ROS’s clear advantage over GAN in this application.

Fig. 14 Performance analysis of GAN vs ROS on PTBDB.

Complexity analysis

To evaluate the computational efficiency of the proposed ROS+CNN framework for arrhythmia detection, we analyzed both training and prediction times across three benchmark ECG datasets: MITDB, NSTDB and PTBDB. As shown in Table 15, training time varied significantly depending on the dataset and model architecture. For instance, the CNN model required 399.249 seconds to train on the MTBDB dataset, while the DNN model took 531.811 seconds. Prediction times were notably shorter, ranging from 0.686 to 5.422 seconds across all configurations. These results highlight the scalability of the CNN-based approach, especially when paired with ROS, offering a practical balance between accuracy and computational cost for real-time clinical deployment.
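Training and prediction times of this kind can be measured with a simple wall-clock wrapper. The sketch below shows one way to do so. The helper `timed` and the stand-in functions `train_model` and `predict` are hypothetical placeholders for the actual CNN and DNN training and inference routines, which are not reproduced here.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-in "model" (assumption): a threshold halfway between class means.
def train_model(values, labels):
    mean0 = sum(v for v, y in zip(values, labels) if y == 0) / labels.count(0)
    mean1 = sum(v for v, y in zip(values, labels) if y == 1) / labels.count(1)
    return (mean0 + mean1) / 2

def predict(threshold, values):
    return [1 if v >= threshold else 0 for v in values]

# Hypothetical toy data, not taken from any of the ECG datasets.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]

threshold, train_seconds = timed(train_model, values, labels)
preds, predict_seconds = timed(predict, threshold, values)
print(f"train: {train_seconds:.6f} s, predict: {predict_seconds:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations. Absolute timings depend on hardware and batch size, so values are comparable only when measured on the same machine.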

Table 15 Computational time analysis of each dataset.

Conclusion

This study presents a novel and explainable DL framework for accurate and trustworthy detection of cardiac arrhythmias from ECG signals, addressing key challenges in automated cardiovascular diagnostics. By integrating advanced DL architectures, particularly a CNN paired with ROS within a multi-stage data processing pipeline, the proposed model achieved high accuracy across three benchmark datasets: MITDB (99.74%), PTBDB (99.43%) and NSTDB (99.98%). A central contribution of this work is the incorporation of XAI techniques, including SHAP, LIME and FIA, which provide transparent insights into the model’s decision-making process. This interpretability is crucial for fostering clinical trust and enabling the practical deployment of AI in real-world medical settings.

Limitations: Despite its strong performance and interpretability, the current framework does not incorporate feature fusion strategies or transformer-based architectures, which may offer additional improvements in representation learning and temporal modeling.

Future Work: Future research will focus on integrating feature fusion and transformer techniques to enhance model robustness and generalization. We also plan to explore real-time deployment on edge devices for continuous patient monitoring and investigate the fusion of multi-modal patient data (e.g., clinical records, imaging) to support more comprehensive and context-aware diagnostics.