Abstract
Multivariate biosignals such as Electroencephalography (EEG) and Electrocardiography (ECG) are widely used for understanding pathologies that affect brain and cardiovascular activity. However, effectively modeling such signals is challenging due to their complex temporal patterns, long-range dependencies, and dynamically evolving structures. Traditional models, such as recurrent neural networks, convolutional neural networks, and Transformers, struggle with long-term temporal modeling, scalability, and computational efficiency. In this work, we propose a novel model that integrates the Mamba architecture, channel attention, and dynamic graph learning for efficient and adaptive long-range temporal modeling of biosignals. Specifically, we address the biosignal modeling problem through: (i) long-range temporal modeling using parallel Mamba layers that process both time- and frequency-domain representations; (ii) a low-cost channel attention mechanism that identifies discriminative sensor channels with minimal computational overhead; and (iii) dynamic graph structure learning that adapts graph representations over time to capture evolving spatial relationships in biosignal data. We validate the proposed approach on three benchmark datasets: TUSZ (EEG-based epileptic seizure detection), DOD-H (EEG-based sleep stage classification), and ICBEB (ECG-based cardiac disease classification). The model achieves state-of-the-art performance on all three datasets, and ablation studies demonstrate the effectiveness of each proposed component.
Introduction
Multivariate biosignals, such as Electroencephalography (EEG) and Electrocardiography (ECG), play a pivotal role in both clinical and research domains by providing non-invasive, real-time insights into neural and cardiovascular dynamics 1,2. EEG signals, for instance, are widely used for diagnosing various neurological disorders including epilepsy, sleep disorders, coma, and brain death 1,3,4,5,6. More broadly, biosignals form the foundation of brain-computer interface (BCI) systems, cognitive neuroscience studies, and continuous physiological monitoring applications. Despite their importance, analyzing biosignals remains challenging due to their complex temporal patterns, evolving spatial dependencies among sensor channels, and vulnerability to noise and inter-subject variability 7.
A range of deep learning models—such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformer-based architectures—have been employed to address these challenges. However, each class of model presents limitations in capturing long-range temporal dependencies. RNNs are hindered by vanishing and exploding gradient issues that limit their memory over extended sequences 8,9. While variants like Hierarchically Gated RNNs (HGRNNs) introduce mechanisms for long-term memorization through dynamic gating and complex-valued recurrence, they incur substantial computational cost 10. CNNs are efficient at local feature extraction but require architectural augmentations, such as dilated convolutions 11 or Temporal Convolutional Networks (TCNs) 12, to expand their temporal receptive field—often at the expense of scalability. Transformer models, though highly effective in sequence modeling, suffer from quadratic attention complexity, which limits their applicability to long biosignal sequences. Approximations such as Linformer 13, Performer 14, Reformer 15, Longformer 16, and BigBird 17 attempt to alleviate this, but often with trade-offs in accuracy, generalizability, or architectural complexity.
Beyond sequence modeling, recent developments in self-supervised learning 18 and domain adaptation 19 have shown promise in improving the robustness and generalization of biosignal analysis models. Another line of work focuses on spectro-spatio-temporal representations 20,21,22, which extract features from the time domain (e.g., waveform morphology, RR intervals), frequency domain (e.g., power spectral density, differential entropy), and spatial domain (e.g., inter-channel correlations or fixed electrode graphs). While effective in controlled environments, these approaches often treat temporal, spectral, and spatial features as separate components, relying heavily on hand-crafted features and static architectures. This decoupling limits their ability to fully capture the complex, dynamic nature of biosignals.
Graph Neural Networks (GNNs) 23 have recently gained attention for their capacity to model relational and topological dependencies between biosignal channels. While static GNNs are effective in representing known spatial configurations, they are ill-suited for modeling the non-stationary and time-varying relationships typical in EEG and ECG signals 23. Dynamic Graph Structure Learning (GSL) offers a compelling alternative by enabling the graph topology to adapt over time. Complementarily, integrating both time-domain and frequency-domain features has been shown to enhance biosignal modeling. Time-domain metrics such as raw waveforms and differential entropy 24 capture transient dynamics, while frequency-domain features derived from Fast Fourier Transform (FFT) or Power Spectral Density (PSD) offer more stable, compact representations. However, many existing models fail to unify these domains effectively within a learnable and scalable framework.
Recently, Mamba 25, a selective state-space model, has emerged as a promising alternative for long-sequence modeling. Mamba achieves linear time complexity and parallel sequence processing by decoupling the trade-off between sequence length and computational cost, making it highly suitable for long biosignal sequences 26. Despite its advantages, Mamba has not yet been fully explored in the context of biosignal analysis.
To address these limitations, we propose a unified, end-to-end framework that holistically captures the temporal, spectral, and spatial complexity of multivariate biosignals. Our model introduces three key innovations. First, we leverage Mamba for efficient modeling of long-range temporal dependencies, enabling scalable and parallel processing of biosignal sequences. Second, we integrate a low-cost channel attention mechanism that adaptively emphasizes the most discriminative sensor inputs across both time and frequency domains. Third, we employ a dynamic graph structure learning module to capture evolving spatial relationships among sensor channels, allowing the model to adapt to the inherent non-stationarity of biosignals.
These components are tightly integrated into a cohesive architecture that overcomes the limitations of prior work by offering greater flexibility, interpretability, and computational efficiency. We evaluate our method across three benchmark datasets—TUSZ (EEG-based epileptic seizure detection), DOD-H (EEG-based sleep stage classification), and ICBEB (ECG-based cardiovascular disease classification)—and demonstrate its effectiveness and generalizability across diverse biosignal modalities.
Contributions
The key contributions of this work are as follows:
- Mamba-based joint time–frequency modeling: We present the first unified Mamba-based framework that integrates parallel time–frequency modeling, a lightweight channel attention layer, and dynamic graph learning, demonstrating its capability to model long-range temporal dependencies of EEG and ECG signals across both time and frequency domains. The joint encoding of temporal and spectral features enhances the temporal expressiveness and representation capacity of the model.
- Adaptive and efficient spatial modeling: A low-cost, generalizable channel attention mechanism is introduced to dynamically select salient biosignal (EEG and ECG) channels. This is coupled with a dynamic graph structure learning module that captures evolving inter-channel relationships over time, allowing adaptive spatial representation of non-stationary biosignal dynamics.
- Robust cross-domain validation and interpretability: The proposed framework is rigorously validated on three clinically relevant datasets spanning EEG and ECG modalities, achieving state-of-the-art performance. Ablation studies further quantify the individual contributions of the Mamba architecture, channel attention, and dynamic graph learning components, providing insights into their roles in performance gains.
The proposed framework combines Mamba and channel attention modules. The input consists of multiple feature channels, which are first processed by the Mamba module to extract high-level feature representations. The output of the Mamba module is then passed through a channel attention mechanism to emphasize the most informative channels. Next, a graph structure learning module captures relationships between features and constructs a graph representation. This graph representation is further processed by a Graph Neural Network layer. Finally, the output is passed to a classification head that predicts the class labels based on the refined features.
Background
Structured state space model
State-space models (SSMs) are widely used to represent the internal state of a system and predict its future states based on given inputs. These models are particularly effective in capturing temporal dependencies by mapping an input sequence to a latent state representation, which is then used to generate an output sequence.
At time t, an SSM processes an input sequence x(t) by mapping it to a hidden state h(t), which encapsulates the system’s underlying dynamics. Using this latent state, the model generates a predicted output sequence y(t), allowing it to model long-range dependencies and make future predictions. This process can be mathematically described as:

\[ h'(t) = \textbf{A}h(t) + \textbf{B}x(t), \qquad y(t) = \textbf{C}h(t) + \textbf{D}x(t), \]
where \(\textbf{A}, \textbf{B}, \textbf{C},\) and \(\textbf{D}\) are learnable parameters optimized through gradient descent. In practice, \(\textbf{D}\) is often omitted since the term \(\textbf{D}x(t)\) acts as a skip connection and can be easily incorporated without additional complexity.
Since real-world applications typically involve discrete-time inputs, discretization is necessary. A common approach is the zero-order hold method, which transforms matrices \(\textbf{A}\) and \(\textbf{B}\) into their discrete equivalents \(\overline{\textbf{A}}\) and \(\overline{\textbf{B}}\). A learnable step size parameter \(\Delta\), representing the resolution of the input, is introduced:

\[ \overline{\textbf{A}} = \exp (\Delta \textbf{A}), \qquad \overline{\textbf{B}} = (\Delta \textbf{A})^{-1}\left( \exp (\Delta \textbf{A}) - \textbf{I}\right) \Delta \textbf{B}. \]
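As a concrete illustration, the zero-order-hold update and the resulting discrete recurrence can be sketched in a few lines of NumPy. This is a minimal sketch assuming a diagonal state matrix (as in S4/Mamba); all parameter values are illustrative, not trained weights.

```python
import numpy as np

def zoh_discretize(A, B, delta):
    """A: (d_state,) diagonal of the state matrix; B: (d_state,); delta: scalar step."""
    A_bar = np.exp(delta * A)              # exp(ΔA) for a diagonal A
    B_bar = (A_bar - 1.0) / A * B          # (ΔA)^{-1}(exp(ΔA) - I) ΔB, elementwise
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Run the discrete recurrence h_t = Ā h_{t-1} + B̄ x_t, y_t = C · h_t."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(float(C @ h))
    return np.array(ys)

A = -np.linspace(1.0, 4.0, 4)              # stable (negative) diagonal state matrix
B = np.ones(4)
C = np.ones(4) / 4
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, x=np.sin(np.linspace(0.0, 6.28, 50)))
```

Because the continuous-time state matrix is stable (negative entries), the discretized \(\overline{\textbf{A}}\) has entries in (0, 1) and the recurrence decays old state rather than diverging.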
Building on this foundation, the Structured State Space for Sequences (S4) 27 was introduced as an efficient SSM capable of handling long-range dependencies.
Mamba
Mamba 25 extends the S4 model by introducing input-dependent adaptability, allowing the state-space parameters to vary dynamically based on the input sequence. This enhancement enables the model to selectively propagate or forget information as needed, improving its ability to handle long-range dependencies while maintaining computational efficiency. Mamba builds on the S4 framework while introducing a hardware-optimized, parallelizable recurrent processing mode. This design significantly improves speed and scalability, making it particularly suited for applications in natural language processing, audio processing, genomics, and biosignal analysis.
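The selectivity idea can be illustrated with a toy recurrence in which the discretization step itself depends on the current input, so the model can effectively skip a token (small \(\Delta\)) or absorb it (large \(\Delta\)). This is a sketch of the mechanism only, not Mamba's actual parallel scan kernel; the projection weight is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, T = 8, 32
A = -np.abs(rng.normal(size=d_state))       # stable diagonal state matrix
W_delta = rng.normal(scale=0.5)             # toy projection: input -> step size

x = rng.normal(size=T)
h = np.zeros(d_state)
hs = []
for x_t in x:
    delta = np.log1p(np.exp(W_delta * x_t)) # softplus keeps the step size positive
    A_bar = np.exp(delta * A)               # per-step ZOH discretization
    B_bar = (A_bar - 1.0) / A               # B = 1 for simplicity
    h = A_bar * h + B_bar * x_t             # input-dependent state update
    hs.append(h.copy())
hs = np.stack(hs)                           # (T, d_state) hidden-state trajectory
```

In Mamba the analogous input-dependent \(\Delta\), \(\textbf{B}\), and \(\textbf{C}\) are produced by learned projections and the scan is computed in parallel on hardware.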
Graph structure learning
Mamba and other state-space models learn channel representations independently and thus do not capture inter-channel correlations. We therefore design dedicated modules for this purpose, dividing the processing into two parts: channel attention learning and channel correlation learning.
Channel attention 12 is a powerful mechanism in deep learning that has been widely applied across various domains to improve model performance. It enhances feature representation by emphasizing informative channels while suppressing less relevant ones. It has been extensively used in computer vision, such as in Squeeze-and-Excitation (SE) blocks 28, combinations of channel and spatial attention 29, and self-attention-based non-local relationships 30. However, most of these modules require tuning hyperparameters, and their training increases the overall parameter count of the models, potentially hindering training speed.
We propose a new low-cost channel attention mechanism that reduces model complexity and computational overhead. For channel correlation learning, we dynamically learn graph structures over time and aggregate representations from multiple graphs.
Method
Problem setup
We represent the multivariate biosignals as a graph as follows: \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, \textbf{W})\), where the set of nodes \(\mathcal {V}\) corresponds to the sensors (channels), \(\mathcal {E}\) is the set of edges (i.e., the correlation amount between the channels), and \(\textbf{W}\) is the adjacency matrix. Since \(\mathcal {E}\) and \(\textbf{W}\) are unknown, they are learned by our model. Such a graph formulation of biosignals can support various node- and graph-level classification or regression tasks. In this work, we focus on graph classification, whereby the aim is to learn a function that maps an input biosignal \(\textbf{X}\) to a class label y:

\[ f: \textbf{X} \mapsto y, \quad y \in \{1, \ldots , C\}, \]
where \(C\) is the number of classes.
Model architecture
Let \(\textbf{X} \in \mathbb {R}^{N \times T \times M}\) denote a multivariate biosignal, where \(N\) is the number of sensors, \(T\) is the sequence length, and \(M\) is the input dimension of the signal (typically \(M = 1\)). Figure 1 illustrates the architecture of the proposed Mamba-based framework for biosignal analysis. The proposed architecture comprises five main components:
1. Stacked Mamba layers with a linear layer to model temporal dependencies within each sensor independently, transforming the raw signals \(\textbf{X} \in \mathbb {R}^{N \times T \times M}\) to a latent representation \(\textbf{H} \in \mathbb {R}^{N \times T \times D}\), where \(D\) is the embedding dimension;
2. A new low-cost channel attention layer that learns weights for each channel;
3. A graph correlation layer that captures dynamically evolving adjacency matrices, \(\textbf{W}^{(1)}, \ldots , \textbf{W}^{\left( n_d\right) }\);
4. GNN layers to model spatial dependencies among sensors, using the learned graph structures \(\textbf{W}^{(1)}, \ldots , \textbf{W}^{\left( n_d\right) }\) and node features \(\textbf{H}\). While our model supports any GNN layer type, we use graph attention networks in our experiments; and
5. A temporal pooling layer and a graph pooling layer to aggregate temporal and spatial representations, respectively, followed by a fully connected layer that generates predictions for each multivariate biosignal.
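The data flow through these five stages can be traced at the shape level with random tensors standing in for the learned modules. This is a sketch only: the dimensions, the sigmoid attention weights, and the one-hop "GNN step" are illustrative stand-ins, not the actual trained components.

```python
import numpy as np

N, T, M, D = 19, 1200, 1, 64    # sensors, time steps, input dim, embedding dim
n_d = 6                         # number of dynamic graphs (T divisible by n_d)

X = np.random.randn(N, T, M)
H = np.random.randn(N, T, D)                      # (1) per-channel Mamba embeddings
attn = 1 / (1 + np.exp(-np.random.randn(N)))      # (2) channel attention weights in (0, 1)
H = attn[:, None, None] * H
Ws = [np.abs(np.random.randn(N, N)) for _ in range(n_d)]          # (3) dynamic adjacency
Z = np.stack([W @ H[:, k * (T // n_d), :]                         # (4) toy one-hop GNN
              for k, W in enumerate(Ws)])                         #     per interval
g = Z.mean(axis=(0, 1))                           # (5) temporal + graph pooling -> (D,)
logits = np.random.randn(2, D) @ g                # fully connected head (2 classes)
```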
Next, we detail the design and integration of the stacked Mamba layers, channel attention modules, and graph correlation mechanisms.
Modeling with Mamba
Traditionally, multichannel biosignals are represented in either the time or frequency domain. To fully capture signal characteristics, we use a parallel Mamba structure to leverage the benefits of both representations, as shown in Fig. 2. The raw signal, represented in the time domain, is first normalized and then processed by a Mamba layer to extract temporal features. The frequency representation is obtained by computing the power spectral features, which are then processed by a separate Mamba layer. This parallel Mamba design ensures that both time- and frequency-domain features are effectively captured.
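For instance, the input to the frequency branch can be obtained with a simple per-channel periodogram; the helper below is an illustrative stand-in for the paper's power spectral feature extraction, not the authors' code.

```python
import numpy as np

def power_spectrum(x, fs):
    """x: (N, T) multichannel signal -> (N, T//2 + 1) one-sided power spectrum."""
    X = np.fft.rfft(x, axis=-1)
    return (np.abs(X) ** 2) / (fs * x.shape[-1])

fs = 200.0                                   # e.g., the TUSZ EEG sampling rate
t = np.arange(0, 2.0, 1 / fs)                # 2-second window, 400 samples
x = np.stack([np.sin(2 * np.pi * 10 * t),    # 10 Hz channel
              np.sin(2 * np.pi * 25 * t)])   # 25 Hz channel
psd = power_spectrum(x, fs)
freqs = np.fft.rfftfreq(x.shape[-1], 1 / fs)
# The dominant bin of each channel sits at its oscillation frequency.
```

The spectral sequence is much shorter and smoother than the raw waveform, which is what makes the frequency branch a compact, stable complement to the time branch.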
Time and frequency Mamba block: The Mamba block combines linear projections, sequence transformations, and nonlinear operations. It includes a structured state-space model for capturing long-range dependencies and a convolutional module for local pattern extraction. Nonlinearities (activations or element-wise multiplications) are applied throughout, and residual connections preserve input information.
Multichannel biosignals often exhibit long-range temporal correlations, which Mamba can effectively capture. However, directly applying Mamba to multivariate signals by projecting the \(N\) signal channels into a hidden dimension using a linear layer may be suboptimal, as it disregards the underlying graph structure of the data. To address this, we use stacked Mamba layers to embed signals from each channel (sensor) independently, producing an embedding \(\textbf{H} \in \mathbb {R}^{N \times T \times D}\) for each input signal \(\textbf{X}\).
Low-cost channel attention module
The low-cost channel attention module is designed to enhance feature representations in neural networks by emphasizing informative channels while maintaining minimal computational overhead. Unlike conventional attention mechanisms, such as self-attention in Transformers or Squeeze-and-Excitation (SE) modules, which involve numerous parameters and increased computational burden, our module offers a lightweight solution that achieves effective feature refinement with fewer learnable parameters.
The core idea is to identify the most discriminative channels and assign them higher importance weights. We formulate this process as a classification problem and demonstrate that, with a specific choice of loss function, the classifier can be solved in closed form, thereby enabling efficient training.
We define a binary classification problem where the objective is to distinguish a candidate channel vector \(\textbf{p} \in \mathbb {R}^d\) from a set of N other channel vectors \(\{\textbf{x}_i\}_{i=1}^N\), where each \(\textbf{x}_i \in \mathbb {R}^d\). We treat \(\textbf{p}\) as the single positive example (label \(y_p = +1\)) and all other channels \(\textbf{x}_i\) as negative examples (label \(y_i = -1\)). We train a linear classifier defined by:

\[ f(\textbf{x}) = \textbf{w}^\top \textbf{x} + b, \]
where \(\textbf{w} \in \mathbb {R}^d\) and \(b \in \mathbb {R}\) are the weight vector and bias term, respectively. Unlike Euclidean distance, which treats all components equally, this classification score incorporates feature relevance through the learned weights \(\textbf{w}\).
To train the classifier, we minimize the following regularized loss function:

\[ \min _{\textbf{w},\, b} \; L\left( +1, f(\textbf{p})\right) + \sum _{i=1}^{N} L\left( -1, f(\textbf{x}_i)\right) + \lambda \Vert \textbf{w}\Vert ^2, \]
where \(L(y, y')\) is a loss function, and \(\lambda > 0\) is a regularization hyperparameter. For the squared loss \(L(y, y') = (y - y')^2\), the optimal solution can be derived in closed form. Let

\[ \mu = \frac{1}{N} \sum _{i=1}^{N} \textbf{x}_i, \qquad \Sigma = \frac{1}{N} \sum _{i=1}^{N} (\textbf{x}_i - \mu )(\textbf{x}_i - \mu )^\top \]
be the mean and covariance matrix of the negative samples. The squared Mahalanobis distance is defined as:

\[ D_M^2(\textbf{p}) = (\textbf{p} - \mu )^\top \Sigma ^{-1} (\textbf{p} - \mu ). \]
For each channel \(c\), the Mamba encoder outputs \(H_c\in \mathbb {R}^{T\times D}\). We form the channel embedding \(\textbf{p}\in \mathbb {R}^{d}\) by global average pooling (over time and the parallel frequency branch) to a \(D\)-dimensional vector, followed by a linear projection to dimension \(d\).
The closed-form solutions for the optimal parameters and classification cost are:
Here, \(\mathcal {F}^*\) quantifies the discriminability of channel \(\textbf{p}\): lower values indicate better separability from other channels. Consequently, \(\kappa (\textbf{p})\) can be used to rank channel importance, with larger values implying higher discriminability.
The channel attention mechanism is implemented as a layer. For an input signal \(\textbf{X} \in \mathbb {R}^{C \times T}\) with C channels and T time steps, the output \(\widetilde{\textbf{X}}\) is computed from the vector of per-channel importance scores \(\kappa (\textbf{X}) \in \mathbb {R}^C\) as:

\[ \widetilde{\textbf{X}} = \sigma \left( \alpha \, \kappa (\textbf{X}) + \beta \right) \odot \textbf{X}, \]
where \(\alpha , \beta \in \mathbb {R}\) are learnable scaling and shifting parameters, \(\sigma (\cdot )\) denotes the sigmoid function, and \(\odot\) represents element-wise multiplication.
Channel correlation learning
Multivariate biosignals often lack known sensor configurations, and their inter-channel relationships may vary over time. To handle this, we introduce a channel correlation learning module that dynamically infers graph structures over fixed-length time intervals. Let the sequence length be T and define a time resolution r such that T is divisible by r. The number of dynamic graphs is then \(n_d = T / r\).
For the t-th interval, we compute a hybrid adjacency matrix \(\textbf{W}^{(t)}\) by combining geometry-based and feature-based graphs. The geometry-based graph \(\textbf{W}^{\text{geo}}\) is defined as:

\[ \textbf{W}^{\text{geo}}_{ij} = \exp \left( -\frac{\Vert \textbf{x}_i - \textbf{x}_j\Vert _1}{\gamma }\right) , \]
where \(\Vert \cdot \Vert _1\) denotes the Manhattan distance, \(\textbf{x}_i\) and \(\textbf{x}_j\) are geometric embeddings of channels i and j derived from electrode/lead coordinates (e.g., EEG 10–20 montage positions or ECG lead topology). \(\gamma > 0\) is a hyperparameter.
The feature-based correlation matrix \(\overline{\textbf{W}}^{(t)}\) is computed using normalized cross-correlation (NCC). Given two vectors \(\textbf{X}, \textbf{Y} \in \mathbb {R}^n\), the NCC is:

\[ \text{NCC}(\textbf{X}, \textbf{Y}) = \frac{\sum _{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum _{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum _{i=1}^{n} (y_i - \bar{y})^2}}, \]
where \(\bar{x} = \frac{1}{n} \sum _{i=1}^n x_i\) and \(\bar{y} = \frac{1}{n} \sum _{i=1}^n y_i\) are the means of \(\textbf{X}\) and \(\textbf{Y}\), respectively.
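The NCC of two equal-length channel segments is a direct translation of this definition and always lies in \([-1, 1]\); the helper below is a minimal illustration.

```python
import numpy as np

def ncc(x, y):
    """Normalized cross-correlation of two equal-length 1-D signals."""
    xc, yc = x - x.mean(), y - y.mean()                 # remove the means
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

t = np.linspace(0.0, 1.0, 200)
a = np.sin(2 * np.pi * 5 * t)
# ncc(a, a) is 1, ncc(a, -a) is -1, and NCC is invariant to scaling/offset.
```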
The final adjacency matrix is given by:

\[ \textbf{W}^{(t)} = \lambda \, \textbf{W}^{\text{geo}} + (1 - \lambda )\, \overline{\textbf{W}}^{(t)}, \]
where \(\lambda \in [0, 1)\) is a weighting hyperparameter. To improve computational efficiency and enforce graph sparsity, we prune edges with weights below a threshold \(\kappa\):

\[ \textbf{W}^{(t)}_{ij} \leftarrow \begin{cases} \textbf{W}^{(t)}_{ij} & \text{if } \textbf{W}^{(t)}_{ij} \ge \kappa , \\ 0 & \text{otherwise}. \end{cases} \]
Since signal correlation is symmetric, the resulting graphs are treated as undirected.
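One interval's hybrid adjacency construction can be sketched end to end as follows; the mixing weight `lam`, kernel width `gamma`, pruning threshold `kappa`, and the use of absolute NCC values are illustrative choices, not the paper's tuned settings.

```python
import numpy as np

def hybrid_adjacency(coords, seg, lam=0.4, gamma=1.0, kappa=0.2):
    """coords: (N, 3) electrode positions; seg: (N, L) signal segment for one interval."""
    # Geometry-based graph: exp(-||x_i - x_j||_1 / gamma)
    d1 = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)
    W_geo = np.exp(-d1 / gamma)
    # Feature-based graph: pairwise normalized cross-correlation (magnitude)
    Z = seg - seg.mean(axis=1, keepdims=True)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    W_feat = np.abs(Z @ Z.T)
    # Weighted mix, then prune weak edges for sparsity
    W = lam * W_geo + (1 - lam) * W_feat
    W[W < kappa] = 0.0
    return (W + W.T) / 2            # correlation is symmetric: undirected graph

rng = np.random.default_rng(1)
W = hybrid_adjacency(rng.normal(size=(19, 3)), rng.normal(size=(19, 200)))
```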
Experiment
Description of the datasets
TUSZ dataset for EEG-based epileptic seizure detection: We first assess the proposed model on EEG-based seizure detection using the publicly available Temple University Hospital Seizure Detection Corpus (TUSZ) 31. We follow the experimental setup from 32, performing seizure detection on 60-second EEG clips. The data includes 19 EEG sensors sampled at 200 Hz, resulting in a sequence length of 12,000 time steps per clip. The task is a binary classification of whether a 60-second EEG clip contains a seizure. In the dynamic graph correlation learning layer, the resolution is set to 10 seconds (2,000 time steps), inspired by the practice of trained EEG readers. Consistent with prior studies and benchmarks, we evaluate the model’s performance using AUROC, F1-score, area under the precision-recall curve (AUPRC), sensitivity, and specificity, with AUROC as the primary classification metric. The ROC curve illustrates the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across classification thresholds; the AUROC summarizes this curve into a single scalar, providing a threshold-independent measure of the model’s performance. Likewise, the AUPRC is the area under the PR curve, which shows the trade-off between precision and recall across decision thresholds.
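The threshold-free interpretation of AUROC (the probability that a randomly chosen positive outscores a randomly chosen negative, i.e., the Mann–Whitney U statistic) can be computed directly, which is handy for sanity-checking library outputs; this helper is illustrative, not part of the evaluation code.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via pairwise comparisons; ties count as half a win."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

labels = np.array([1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])
# One negative (0.7) outranks one positive (0.4): 11 of 12 pairs are ordered correctly.
```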
DOD-H dataset for EEG-based sleep stage classification: The DOD-H dataset includes EEG recordings from healthy participants across multiple nights of sleep 33. Each subject’s data corresponds to a full-night recording, with sleep stage labels provided for 30-second epochs. Each epoch is categorized into one of the following stages: Wake, N1, N2, N3, or REM. The EEG is sampled at 250 Hz. Following 34, we use macro-F1 score and Cohen’s Kappa as the evaluation metrics.
ICBEB dataset for ECG-based cardiac disease classification: The ICBEB dataset 34 consists of 6,877 12-lead ECG recordings sampled at 100 Hz, with durations ranging from 6 to 60 seconds, resulting in sequence lengths between 600 and 6,000 time steps; shorter sequences are zero-padded during training. The dataset covers nine classes: one normal class and eight abnormal classes representing various diseases. Each ECG record is annotated with up to three labels chosen from these nine classes; taking the union of all labels assigned to each record makes this a multi-label classification task. Following prior work 35,36, we use macro-F1, macro-F2, macro-G2, and macro-AUROC as the evaluation metrics.
Data splitting strategy
We split each of the three datasets into train, validation and test sets as follows:
-
Epileptic Seizure Detection (TUH EEG): In line with prior work 32, we randomly split the official TUSZ training set into training and validation subsets using a 90/10 patient-level split for model training and hyperparameter tuning, respectively. For final evaluation, we used the official TUSZ test set, excluding five patients that overlapped with the training set to avoid data leakage. As a result, the training, validation, and test sets comprise mutually exclusive patient groups.
-
Sleep Stage Classification (DOD-H EEG): We adopt the same subject-wise 60/20/20 split as prior works such as SimpleSleepNet 37 and RobustSleepNet 34. Each split uses mutually exclusive subjects to avoid data overlap.
-
Cardiovascular Disease Classification (ICBEB ECG): Since the official test set is not publicly available, we replicate the 10-fold stratified split described in 38. Folds 1–8 are used for training, fold 9 for validation, and fold 10 for testing. This setup is consistent with baseline models on the ICBEB dataset.
Experiment setup
Each experiment was conducted with three runs using different random seeds, and we report the mean and standard deviation of the results. The regularization weight \(\lambda\) was selected based on validation performance using the wandb sweep hyperparameter tuning tool, with a grid search over the set \(\{0.001, 0.01, 0.1, 1.0\}\). We selected \(\lambda = 0.01\) for the seizure detection and sleep stage classification tasks and \(\lambda = 0.1\) for ECG classification.
Experiment result
EEG-based epileptic seizure detection: Table 1 compares the proposed Mamba-based model with existing approaches on the TUSZ dataset for seizure detection. Our model achieves the highest performance across all evaluation metrics, including AUROC, F1-Score, AUPRC, sensitivity, and specificity, thereby establishing a new state-of-the-art.
It is worth noting that certain baselines (e.g., GraphS4MER 39) report relatively high F1-Scores despite lower AUPRC values. This apparent discrepancy arises from the complementary nature of these metrics and the strong class imbalance present in seizure detection: AUPRC summarizes precision–recall performance over all decision thresholds and is highly sensitive to the low prevalence of seizure events, leading to lower values. By contrast, F1-Score is computed at a single threshold (selected on the validation set to maximize F1) and therefore may remain relatively high if precision and recall are balanced at that particular operating point, even when the overall precision–recall trade-off is weaker (Table 2).
For consistency and fairness, we adopt a unified evaluation protocol: (i) thresholds for F1, sensitivity, and specificity are selected using the validation set, following prior work 32,39, and (ii) AUROC and AUPRC are computed directly from the continuous model scores without thresholding, providing threshold-independent measures of ranking quality. This ensures comparability across methods and highlights the robustness of our approach, which yields improvements under both threshold-dependent and threshold-free metrics.
ECG-based cardiovascular disease classification results on ICBEB dataset: Table 3 provides ECG classification results on the ICBEB dataset. As can be seen, our model achieves consistent and superior results across all metrics, showcasing its general robustness. The improvements in Macro G2-Score highlight the model’s ability to accurately classify underrepresented classes, which are often overlooked in imbalanced datasets. For further analysis, classification results for the individual ECG classes are shown in Table 4. The model shows consistent improvements across all 9 ECG classes. Certain classes, such as LBBB (Left Bundle Branch Block) and STE (ST-Elevation), are known to suffer from data imbalance, with fewer samples available; our model still achieves high performance in these imbalanced cases (Table 5).
EEG-based sleep stage classification using DOD-H dataset: The result on DOD-H dataset is given in Table 6. Our model achieves the highest performance on the DOD-H sleep staging classification task, with a Macro-F1 of 0.830 and Kappa of 0.814. Compared to prior methods such as GraphS4MER, it provides over 5% improvement on both metrics. This demonstrates the effectiveness of our temporal-frequency modeling and efficient channel attention in capturing complex sleep dynamics.
Training overhead of proposed low-cost channel attention module
We now discuss the training overhead of the proposed low-cost channel attention module, providing a detailed comparison of the number of added parameters and the training time per epoch against existing alternatives. The reported values are averaged over the three tasks. As shown in Table 2, our method requires significantly fewer parameters and shorter training time per epoch than related modules such as ECAnet 44, CBAM 29, and SKNet 45, demonstrating its computational efficiency.
Computational complexity for \(\Sigma\) and \(\Sigma ^{-1}\): In a mini-batch with \(B\) samples and \(N\) channels each, stack embeddings into \(P\in \mathbb {R}^{(BN)\times d}\). The shared covariance \(\Sigma =\frac{1}{BN-1}(P-\mu )^\top (P-\mu )\) incurs \(O(BN\,d^2)\) time and \(O(d^2)\) memory. Computing \(\Sigma ^{-1}\) (e.g., via Cholesky) costs \(O(d^3)\) time and \(O(d^2)\) memory per step. These costs are independent of \(T\) and \(D\); only covariance formation scales with \(BN\), while the inversion depends solely on \(d\). We measure the per-step wall-clock time attributable to \(\Sigma /\Sigma ^{-1}\) on all three datasets (see Table 8).
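These costs can be made concrete in NumPy: one covariance formation per mini-batch, one Cholesky factorization per step, and triangular solves in place of an explicit \(\Sigma ^{-1}\). Shapes and the small ridge term are illustrative.

```python
import numpy as np

B, N, d = 32, 19, 16                               # batch, channels, embedding dim
P = np.random.default_rng(0).normal(size=(B * N, d))   # all channel embeddings stacked
mu = P.mean(axis=0)
Sigma = (P - mu).T @ (P - mu) / (B * N - 1)        # (d, d), cost O(BN d^2)
L = np.linalg.cholesky(Sigma + 1e-6 * np.eye(d))   # O(d^3), reused for every channel
# Solve Sigma z = v with two triangular solves instead of forming Sigma^{-1}.
v = np.ones(d)
z = np.linalg.solve(L.T, np.linalg.solve(L, v))
```

Because \(d\) is small relative to \(T\) and \(D\), the \(O(d^3)\) factorization is negligible next to the sequence-model cost, which matches the overhead figures reported in Table 8.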
Ablation study
We conduct ablation studies to investigate the contribution of two key components in our model. The results are shown in Tables 5 and 7. First, to evaluate the role of the low-cost channel attention layer, we replace it with several representative channel attention modules, including ECA-Net 44, SE-Net 51, CBAM 29, and GC 52. While these alternatives are well-established and effective, they introduce higher computational and parameter overhead. Our results show that the proposed low-cost attention mechanism achieves competitive channel selection with substantially fewer parameters. Second, we examine the effect of combining temporal and frequency Mamba layers. We compare three configurations: using only the frequency Mamba layer, using only the time Mamba layer, and removing the Mamba layer entirely. Both the time-only and frequency-only variants outperform the Mamba-free baseline, demonstrating that each component independently enhances temporal modeling. Furthermore, integrating both layers yields the best performance, highlighting the complementary strengths of time and frequency representations and underscoring the value of capturing multi-scale temporal dependencies.
Further, in order to study the interaction between dynamic graph learning and channel attention modules, we conduct an additional set of experiments in which we selectively enable or disable the dynamic graph learning and channel attention modules in combination. The results reveal that the two modules are complementary rather than conflicting. Specifically, the channel attention mechanism operates at the individual channel level to emphasize discriminative signals based on temporal and spectral properties, whereas the dynamic graph module captures inter-channel relationships that evolve over time. Because the attention mechanism enhances the quality of individual node (channel) features before they are processed by the graph structure, it can in fact improve the quality of the learned graph representations. In particular, we observe that enabling channel attention improves the signal-to-noise ratio of node features, leading to more informative graph edges and better performance in downstream tasks. Conversely, when dynamic graph learning is applied without channel attention, the model is more susceptible to spurious correlations caused by noisy or irrelevant channels.
Fig. 3 Panels 1(a–e): Mean EEG adjacency matrices computed from correctly classified test segments for (a) focal seizures, (b) generalized seizures, and (c) non-seizure; difference maps (d) focal minus non-seizure and (e) generalized minus non-seizure. Panels 2(a–e): Mean PSG adjacency matrices for (a) WAKE, (b) N1, and (c) N3; difference maps (d) WAKE–N1 and (e) WAKE–N3. Panels 3(a–e): Mean ECG adjacency matrices for (a) normal rhythm, (b) first-degree atrioventricular block (I-AVB), and (c) left bundle branch block (LBBB); difference maps (d) I-AVB–normal and (e) LBBB–normal.
Model interpretation
We evaluated whether the learned graphs are meaningful by visualizing the average adjacency matrices for EEG, PSG, and ECG signals from correctly predicted test samples as in Fig. 3. These adjacency matrices are grouped according to seizure classes, sleep stages, and ECG classes, respectively. To quantify differences between any two mean adjacency matrices, we calculated the mean and standard deviation of their absolute differences.
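This grouping-and-differencing step can be sketched as follows; the function names and array shapes are illustrative assumptions, not the authors' implementation, but the computation (class-conditional means, difference maps, and mean/standard deviation of absolute differences) matches the procedure described above.

```python
import numpy as np

def class_mean_adjacency(adjs, labels):
    """Mean adjacency matrix per class from correctly classified samples.

    adjs   : (n_samples, n_channels, n_channels) learned adjacency matrices
    labels : (n_samples,) predicted-and-correct class label per sample
    """
    return {c: adjs[labels == c].mean(axis=0) for c in np.unique(labels)}

def difference_map(mean_a, mean_b):
    """Difference map between two class means, with summary statistics
    (mean and standard deviation of the absolute edge-wise differences)."""
    diff = mean_a - mean_b
    return diff, np.abs(diff).mean(), np.abs(diff).std()
```

For the EEG task, for example, `difference_map` applied to the focal-seizure and non-seizure class means yields the "focal minus non-seizure" panel and its summary statistics.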
For EEG (First Row): The Generalized Seizures adjacency matrix shows higher overall connectivity (darker red blocks) than Focal Seizures and Non-Seizure. This suggests more widespread synchronization among channels during generalized seizures, consistent with literature indicating that generalized seizures often exhibit globally synchronized brain activity. Comparing the two difference maps, Focal Seizures minus Non-Seizure and Generalized Seizures minus Non-Seizure, the latter shows stronger positive regions than the former. This implies that generalized seizures deviate more from the non-seizure state in terms of overall connectivity, whereas focal seizures exhibit less pronounced global changes. These findings align with the established clinical understanding that generalized seizures typically involve more widespread, synchronized neuronal activity compared to focal seizures.
For PSG (Middle Row): As sleep deepens from Wake to N1 and then to N2, the connectivity patterns evolve. The difference map between Wake and N2 exhibits more pronounced variations compared to the difference between Wake and N1, reflecting the greater physiological changes associated with deeper sleep stages.
For ECG (Bottom Row): The I-AVB matrix closely resembles the normal matrix, indicating that first-degree AV block is associated with only subtle conduction changes. In contrast, the LBBB matrix shows more significant deviations, particularly in the precordial leads, which is consistent with the pronounced conduction abnormalities observed in LBBB. The corresponding difference maps further highlight these distinctions: the differences for I-AVB versus normal are relatively minor, while those for LBBB versus normal are substantially larger.
Statistical analysis of class-specific connectivity in EEG and ECG graphs
To further analyze differences between class-specific graphs statistically, we performed edge-wise two-sample t-tests on the adjacency weights across all test samples for each pair of classes (e.g., Focal vs. Non-Seizure). Each edge (i, j) was tested for statistical significance between the two groups at significance level \(\alpha = 0.05\) (Table 8).
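The edge-wise testing procedure can be sketched with scipy; the helper name and the reported fraction of significant upper-triangle edges (analogous to the percentages quoted below) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

def edgewise_ttests(adjs_a, adjs_b, alpha=0.05):
    """Edge-wise two-sample t-tests between two groups of adjacency matrices.

    adjs_a : (n_a, C, C) adjacency weights for the samples of class A
    adjs_b : (n_b, C, C) adjacency weights for the samples of class B
    Returns the (C, C) p-value matrix and the fraction of significant
    edges over the upper triangle (excluding the diagonal).
    """
    _, pvals = ttest_ind(adjs_a, adjs_b, axis=0)   # one test per edge (i, j)
    iu = np.triu_indices(pvals.shape[0], k=1)      # unique undirected edges
    frac_sig = float((pvals[iu] < alpha).mean())
    return pvals, frac_sig
```

Sorting the upper-triangle p-values then recovers the "top three most significant edges" reported per group pair in Table 9.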
The results obtained for the three downstream tasks (corresponding to the three datasets) are summarized in Table 9. Specifically, Table 9 provides information about the top three statistically significant connectivity edges identified between each group pair, along with their corresponding p-values. Below, we go over the main findings for the three datasets, one by one.
EEG-based seizure detection: Focal vs. Non-Seizure and Generalized vs. Non-Seizure both exhibit moderate levels of significant connectivity differences (17.6% and 23.8%, respectively). Notably, the most significant edge in the focal comparison is between P4 and P3 (\(p = 0.012\)), indicating localized alterations in posterior-parietal connectivity. In contrast, the generalized group highlights frontal and temporal connections such as T3–F7 and T4–F8, consistent with the broader network disruptions expected in generalized seizures.
EEG-based sleep stage classification: The comparisons Wake vs. N1 and Wake vs. N2 yield relatively higher percentages of significant edges (29.2% and 26.9%, respectively), indicating more widespread connectivity changes during the transition from wakefulness to sleep. The most prominent differences involve frontal-posterior and frontal-central interactions (e.g., FP1_O1–FP1_M2 and EOG1–EOG2), which reflect global shifts in brain activity during sleep onset.
ECG-based cardiac disease diagnosis: LBBB vs. Normal shows the highest percentage of significant edges (64.5%), with very low p-values for all top edges (e.g., V1–V2, \(p = 0.011\)), indicating robust and widespread connectivity alterations across ECG leads. I-AVB vs. Normal also displays substantial differences (55.6%), particularly involving limb and precordial leads (e.g., II–V2, aVF–V2), likely reflecting delayed atrioventricular conduction. All reported p-values are below 0.05, with the majority in the range of 0.011 to 0.027, meeting standard thresholds for statistical significance. These findings demonstrate the utility of connectivity-based features for distinguishing between diverse neurological and physiological conditions, with the strongest effects observed in cardiac abnormalities and sleep-stage transitions.
In summary, the adjacency matrices and their difference maps reveal clinically meaningful patterns across EEG, PSG, and ECG signals. These observations support known distinctions in neuronal synchronization for seizure types, the progression of sleep stages, and the morphological differences in cardiac conduction abnormalities.
Conclusion
In this paper, we introduced a unified framework for multivariate biosignal modeling that integrates long-range temporal modeling via Mamba, low-cost channel attention for efficient feature selection, and dynamic graph structure learning for adaptive spatial representation. Through extensive experimentation on EEG, ECG, and sleep stage classification tasks, our method consistently achieved state-of-the-art performance, outperforming strong baselines and demonstrating its ability to handle complex spatio-temporal dependencies and imbalanced datasets. The results confirm that combining time-frequency representations with dynamically learned graphs enables more accurate and robust predictions. Importantly, the ablation studies demonstrate the significance of each model component in driving overall performance. This work sets a foundation for future research into scalable, efficient, and interpretable biosignal analysis, with broad potential for real-world applications in clinical diagnostics and neuroscience research.
Although our framework is designed to learn dynamic spatial relationships across multiple sensors, it remains effective in few-sensor settings by leveraging the temporal modeling capabilities of Mamba layers and the channel attention mechanism. In such cases, the graph learning module simplifies to capture intra-channel dependencies or uses a degenerate graph structure. Additionally, prior knowledge—such as anatomical priors or cross-subject statistics—can be incorporated to compensate for limited spatial information. While spatial modeling benefits are reduced, the model maintains strong performance in low-channel scenarios.
As for future work, while the proposed model achieves strong performance across diverse benchmarks, challenges remain in computational efficiency, generalizability, and clinical readiness. Future efforts will focus on optimizing deployment through robust handling of label noise, distribution shift, and post-deployment drift, in line with recent systems-oriented ML pipelines for healthcare 54,55. We will further extend validation to noisier, heterogeneous, and multi-center datasets, and explore multi-task and multi-modal regimes (e.g., EEG+ECG, EEG+PPG) with domain generalization and subject-specific adaptation. These directions aim to enhance interpretability, scalability, and real-world clinical applicability.
Data availability
The three benchmark datasets: TUSZ (EEG-based seizure detection), DOD-H (sleep stage classification), and ICBEB (ECG classification) utilized in this research study are public datasets and are available online.
References
Yu, X., Wang, L. & Zhang, Y. A systematic review on machine learning and deep learning techniques in the effective diagnosis of Alzheimer’s disease. Brain Inform. 9, 20 (2022).
Yousuf, A. et al. Inferior myocardial infarction detection from lead ii of ecg: a gramian angular field-based 2d-cnn approach. IEEE Sensors Lett. (2024).
Aviles, M., Sánchez-Reyes, L. M., Álvarez-Alvarado, J. M. & Rodríguez-Reséndiz, J. Machine and deep learning trends in EEG-based detection and diagnosis of Alzheimer’s disease: A systematic review. Eng 5, 1464–1484 (2024).
Li, B. & Cao, J. Classification of coma/brain-death EEG dataset based on one-dimensional convolutional neural network. Cognit. Neurodyn. 18, 961–972 (2024).
Aellen, F. M. et al. Auditory stimulation and deep learning predict awakening from coma after cardiac arrest. Brain 146, 778–788 (2023).
Sadoun, M. S. N., Rahman, M. M. U., Al-Naffouri, T. & Laleg-Kirati, T.-M. Eeg epileptic data classification using the schrodinger operator’s spectrum. In 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 1–6 (IEEE, 2023).
Jiahao, H., Rahman, M. M. U., Al-Naffouri, T. & Laleg-Kirati, T.-M. Uncertainty estimation and model calibration in eeg signal classification for epileptic seizures detection. In 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 1–5 (IEEE, 2024).
Orvieto, A. et al. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, 26670–26698 (PMLR, 2023).
Lindemann, B., Müller, T., Vietz, H., Jazdi, N. & Weyrich, M. A survey on long short-term memory networks for time series prediction. Procedia Cirp 99, 650–655 (2021).
Chen, H., Rossi, R. A., Mahadik, K., Kim, S. & Eldardiry, H. Hypergraph neural networks for time-series forecasting. In 2023 IEEE International Conference on Big Data (BigData), 1076–1080 (IEEE, 2023).
Yu, F. Multi-scale context aggregation by dilated convolutions. (2015).
Guo, M.-H. et al. Attention mechanisms in computer vision: A survey. Comput. Visual Med. 8, 331–368 (2022).
Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. Linformer: Self-attention with linear complexity (2020).
Choromanski, K. et al. Rethinking attention with performers (2020).
Kitaev, N., Kaiser, L. & Levskaya, A. Reformer: The efficient transformer. In Proceedings of the International Conference on Learning Representations (ICLR) (2020).
Beltagy, I., Peters, M. E. & Cohan, A. Longformer: The long-document transformer (2020).
Zaheer, M. et al. Big bird: Transformers for longer sequences (2020).
Weng, W. et al. Self-supervised learning for electroencephalogram: A systematic survey. ACM Comput. Surv. 57(12), 1–38 (2024).
Zhong, X.-C. et al. A deep domain adaptation framework with correlation alignment for EEG-based motor imagery classification. Comput. Biol. Med. 163, 107235 (2023).
Lamoš, M. et al. Spatial-temporal-spectral EEG patterns of bold functional network connectivity dynamics. J. Neural Eng. 15, 036025 (2018).
Jiang, Y., Chen, N. & Jin, J. Detecting the locus of auditory attention based on the spectro-spatial-temporal analysis of EEG. J. Neural Eng. 19, 056035 (2022).
Zhao, Z., Särkkä, S. & Rad, A. B. Spectro-temporal ecg analysis for atrial fibrillation detection. In 2018 IEEE 28Th international workshop on machine learning for signal processing (MLSP), 1–6 (IEEE, 2018).
Klepl, D., Wu, M. & He, F. Graph neural network-based EEG classification: A survey. IEEE Trans. Neural Syst. Rehabil. Eng. 32, 493–503 (2024).
Zheng, W.-L. & Lu, B.-L. Investigating EEG-based functional connectivity patterns for emotion recognition with directed graph representations. IEEE Trans. Affect. Comput. 7, 113–123 (2015).
Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces (2023).
Hu, J., Rahman, M. M. U., Alnaffouri, T. & Kirati, T.-M. L. Mamba-CAM-Sleep: A mamba-based channel attention model for sleep staging classification. In 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (IEEE, 2025).
Gu, A., Goel, K. & Ré, C. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations (ICLR) (2022).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7132–7141 (2018).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), 3–19 (2018).
Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7794–7803 (2018).
Shah, V. et al. The temple university hospital EEG corpus (tueg). Front. Neurosci. 12, 735. https://doi.org/10.3389/fnins.2018.00735 (2018).
Tang, S. et al. Self-supervised graph neural networks for improved electroencephalographic seizure analysis (2021).
Guillot, A., Sauvet, F., During, E. H. & Thorey, V. Dreem open datasets: Multi-scored sleep datasets to compare human and automated sleep staging. IEEE Trans. Neural Syst. Rehabil. Eng. 28, 1955–1965. https://doi.org/10.1109/TNSRE.2020.3011181 (2020).
Guillot, A. & Thorey, V. Robustsleepnet: Transfer learning for automated sleep staging at scale. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1245–1249 (IEEE, 2021).
Liu, F. et al. An open access database for evaluating the algorithms of electrocardiogram rhythm and morphology abnormality detection. J. Med. Imaging Health Inform. 8, 1368–1373 (2018).
Xu, D., Ruan, C., Korpeoglu, E., Kumar, S. & Achan, K. Inductive representation learning on temporal graphs (2020).
Guillot, A., Thorey, V. & Vivar, G. Simplesleepnet: A simple deep learning model for sleep stage classification with raw single-channel EEG. Biomed. Signal Process. Control 58, 101871 (2020).
Strothoff, N., Wannenmacher, T. & Ballester, P. Deep learning for ecg classification: An evaluation framework and analysis of arrhythmia characteristics. In Proceedings of the 1st China Physiological Signal Challenge on ECG Classification (ICBEB 2021) (2021).
Tang, S. et al. Modeling multivariate biosignals with graph neural networks and structured state space models. In Conference on Health, Inference, and Learning, 50–71 (PMLR, 2023).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Saab, K. Dense-cnn: A densely connected convolutional neural network for multi-channel biomedical signals. IEEE J. Biomed. Health Inform. 24, 232–241 (2020).
Ahmedt-Aristizabal, D. A convolutional neural network for seizure detection from EEG signals. Comput. Biol. Med. 120, 103753 (2020).
Tang, O. A. Learning temporal dynamics in multivariate biosignals: A deep learning approach with graph neural networks. IEEE Trans. Neural Syst. Rehabil. Eng. 30, 2456–2465 (2022).
Wang, Q. et al. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11531–11539 (2020).
Li, X., Wang, W., Hu, X. & Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 510–519 (2019).
Ismail Fawaz, H., Forestier, G., Weber, J. & Idoumghar, L. Inceptiontime: Finding alexnet for time series classification (2020).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
Wang, Z., Yan, W. & Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In Proceedings of the International Joint Conference on Neural Networks (IJCNN) (2017).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Strodthoff, J., Schmidt, A. & Fingscheidt, T. Waveletnn: A neural network for wavelet transform (2021).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7132–7141 (2018).
Cao, Y., Xu, J., Lin, S., Wei, F. & Hu, H. Global context networks. IEEE Trans. Pattern Anal. Mach. Intell. 45(6), 6881–6895 (2020).
Supratak, A., Dong, H., Wu, C. & Guo, Y. Deepsleepnet: A model for automatic sleep stage scoring based on raw single-channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 25, 1998–2008 (2017).
Zeng, Z. et al. Mitigating world biases: A multimodal multi-view debiasing framework for fake news video detection. In Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), https://doi.org/10.1145/3664647.3681673 (Association for Computing Machinery, New York, NY, USA, 2024).
Li, G. et al. Discovering consensus regions for interpretable identification of RNA n6-methyladenosine modification sites via graph contrastive clustering. IEEE J. Biomed. Health Inform. 28, 2362–2372. https://doi.org/10.1109/JBHI.2024.3357979 (2024).
Funding
This work is supported in part by King Abdullah University of Science and Technology (KAUST) with the Base Research Fund (BAS/1/1627-01-01), KAUST Center of Excellence for Smart Health (KCSH), under award number 5932, and the National Institute for Research in Digital Science and Technology (INRIA).
Author information
Contributions
H. J. and T. M. L. K. conceived the idea. H. J. implemented the Mamba deep learning model for biosignal analysis, and generated the results. All authors wrote the main manuscript text. H. J. prepared the figures and block diagrams. All authors reviewed the manuscript. T. M. L. K. and M.M.R supervised the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Hu, J., Rahman, M.M.U. & Laleg-Kirati, TM. Adaptive long-range modeling of EEG and ECG with Mamba and dynamic graph learning. Sci Rep 15, 38762 (2025). https://doi.org/10.1038/s41598-025-22684-x