Introduction

Multivariate biosignals, such as Electroencephalography (EEG) and Electrocardiography (ECG), play a pivotal role in both clinical and research domains by providing non-invasive, real-time insights into neural and cardiovascular dynamics 1,2. EEG signals, for instance, are widely used for diagnosing various neurological disorders including epilepsy, sleep disorders, coma, and brain death 1,3,4,5,6. More broadly, biosignals form the foundation of brain-computer interface (BCI) systems, cognitive neuroscience studies, and continuous physiological monitoring applications. Despite their importance, analyzing biosignals remains challenging due to their complex temporal patterns, evolving spatial dependencies among sensor channels, and vulnerability to noise and inter-subject variability 7.

A range of deep learning models—such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformer-based architectures—have been employed to address these challenges. However, each class of model presents limitations in capturing long-range temporal dependencies. RNNs are hindered by vanishing and exploding gradient issues that limit their memory over extended sequences 8,9. While variants like Hierarchically Gated RNNs (HGRNNs) introduce mechanisms for long-term memorization through dynamic gating and complex-valued recurrence, they incur substantial computational cost 10. CNNs are efficient at local feature extraction but require architectural augmentations, such as dilated convolutions 11 or Temporal Convolutional Networks (TCNs) 12, to expand their temporal receptive field—often at the expense of scalability. Transformer models, though highly effective in sequence modeling, suffer from quadratic attention complexity, which limits their applicability to long biosignal sequences. Approximations such as Linformer 13, Performer 14, Reformer 15, Longformer 16, and BigBird 17 attempt to alleviate this, but often with trade-offs in accuracy, generalizability, or architectural complexity.

Beyond sequence modeling, recent developments in self-supervised learning 18 and domain adaptation 19 have shown promise in improving the robustness and generalization of biosignal analysis models. Another line of work focuses on spectro-spatio-temporal representations 20,21,22, which extract features from the time domain (e.g., waveform morphology, RR intervals), frequency domain (e.g., power spectral density, differential entropy), and spatial domain (e.g., inter-channel correlations or fixed electrode graphs). While effective in controlled environments, these approaches often treat temporal, spectral, and spatial features as separate components, relying heavily on hand-crafted features and static architectures. This decoupling limits their ability to fully capture the complex, dynamic nature of biosignals.

Graph Neural Networks (GNNs) 23 have recently gained attention for their capacity to model relational and topological dependencies between biosignal channels. While static GNNs are effective in representing known spatial configurations, they are ill-suited for modeling the non-stationary and time-varying relationships typical in EEG and ECG signals 23. Dynamic Graph Structure Learning (GSL) offers a compelling alternative by enabling the graph topology to adapt over time. Complementarily, integrating both time-domain and frequency-domain features has been shown to enhance biosignal modeling. Time-domain metrics such as raw waveforms and differential entropy 24 capture transient dynamics, while frequency-domain features derived from Fast Fourier Transform (FFT) or Power Spectral Density (PSD) offer more stable, compact representations. However, many existing models fail to unify these domains effectively within a learnable and scalable framework.

Recently, Mamba 25, a selective state-space model, has emerged as a promising alternative for long-sequence modeling. Mamba achieves linear time complexity and parallel sequence processing by decoupling the trade-off between sequence length and computational cost, making it highly suitable for long biosignal sequences 26. Despite its advantages, Mamba has not yet been fully explored in the context of biosignal analysis.

To address these limitations, we propose a unified, end-to-end framework that holistically captures the temporal, spectral, and spatial complexity of multivariate biosignals. Our model introduces three key innovations. First, we leverage Mamba for efficient modeling of long-range temporal dependencies, enabling scalable and parallel processing of biosignal sequences. Second, we integrate a low-cost channel attention mechanism that adaptively emphasizes the most discriminative sensor inputs across both time and frequency domains. Third, we employ a dynamic graph structure learning module to capture evolving spatial relationships among sensor channels, allowing the model to adapt to the inherent non-stationarity of biosignals.

These components are tightly integrated into a cohesive architecture that overcomes the limitations of prior work by offering greater flexibility, interpretability, and computational efficiency. We evaluate our method across three benchmark datasets—TUSZ (EEG-based epileptic seizure detection), DOD-H (EEG-based sleep stage classification), and ICBEB (ECG-based cardiovascular disease classification)—and demonstrate its effectiveness and generalizability across diverse biosignal modalities.

Contributions

The key contributions of this work are as follows:

  • Mamba-based joint time-frequency modeling: We present the first unified Mamba-based framework that integrates parallel time–frequency modeling, a lightweight channel-attention layer and dynamic graph learning, demonstrating its capability to model long-range temporal dependencies in both EEG and ECG signals in both time and frequency domains. The joint encoding of temporal features and spectral features of EEG and ECG signals enhances the temporal expressiveness and representation capacity of the model.

  • Adaptive and efficient spatial modeling: A low-cost, generalizable channel attention mechanism is introduced to dynamically select salient biosignal (EEG and ECG) channels. This is coupled with a dynamic graph structure learning module that captures evolving inter-channel relationships over time, allowing adaptive spatial representation of non-stationary biosignal dynamics.

  • Robust cross-domain validation and interpretability: The proposed framework is rigorously validated on three clinically relevant datasets spanning EEG and ECG modalities, achieving state-of-the-art performance. Ablation studies further quantify the individual contributions of the Mamba architecture, channel attention, and dynamic graph learning components, providing insights into their roles in performance gains.

Fig. 1

The proposed framework combines Mamba and channel attention modules. The input consists of multiple feature channels, which are first processed by the Mamba module to extract high-level feature representations. The output of the Mamba module is then passed through a channel attention mechanism to emphasize the most informative channels. Next, a graph structure learning module captures relationships between features and constructs a graph representation. This graph representation is further processed by a Graph Neural Network layer. Finally, the output is passed to a classification head that predicts the class labels based on the refined features.

Background

Structured state space model

State-space models (SSMs) are widely used to represent the internal state of a system and predict its future states based on given inputs. These models are particularly effective in capturing temporal dependencies by mapping an input sequence to a latent state representation, which is then used to generate an output sequence.

At time t, an SSM processes an input sequence x(t) by mapping it to a hidden state h(t), which encapsulates the system’s underlying dynamics. Using this latent state, the model generates a predicted output sequence y(t), allowing it to model long-range dependencies and make future predictions. This process can be mathematically described as:

$$\begin{aligned}\textbf{h}^{\prime }(\textrm{t})=\textbf{A h}(\textrm{t})+\textbf{B x}(\textrm{t}) \nonumber \\ \textbf{y}(\textrm{t})=\textbf{C h}(\textrm{t})+\textbf{D x}(\textrm{t}) \end{aligned}$$
(1)

where \(\textbf{A}, \textbf{B}, \textbf{C},\) and \(\textbf{D}\) are learnable parameters optimized through gradient descent. In practice, \(\textbf{D}\) is often omitted since the term \(\textbf{D}x(t)\) acts as a skip connection and can be easily incorporated without additional complexity.

Since real-world applications typically involve discrete-time inputs, discretization is necessary. A common approach is applying the zero-order hold method, which transforms matrices \(\textbf{A}\) and \(\textbf{B}\) into their discrete equivalents \(\overline{\textbf{A}}\) and \(\overline{\textbf{B}}\). A learnable step size parameter \(\Delta\) is introduced, representing the resolution of the input:

$$\begin{aligned}\overline{\textbf{A}} & =\exp (\Delta \textbf{A}) \nonumber \\ \overline{\textbf{B}} & =(\Delta \textbf{A})^{-1}(\exp (\Delta \textbf{A}) - I) \cdot \Delta \textbf{B} \end{aligned}$$
(2)
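For a diagonal state matrix, as used in S4-style models, the discretization in Eq. (2) reduces to elementwise operations. The following NumPy sketch is illustrative only; the function names and the diagonal-\(\textbf{A}\) assumption are ours, not taken from any reference implementation:

```python
import numpy as np

def discretize_zoh(A_diag, B, delta):
    """Zero-order-hold discretization of a diagonal SSM (Eq. 2).

    A_diag : (d,) diagonal of the state matrix A (assumed diagonal, as in S4).
    B      : (d,) input matrix.
    delta  : scalar step size.
    """
    dA = delta * A_diag
    A_bar = np.exp(dA)                         # exp(ΔA), elementwise for diagonal A
    B_bar = (A_bar - 1.0) / dA * (delta * B)   # (ΔA)^{-1}(exp(ΔA) - I) · ΔB
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Run the discrete recurrence h_k = Ā h_{k-1} + B̄ x_k, y_k = C h_k."""
    h = np.zeros_like(A_bar)
    ys = []
    for xk in x:
        h = A_bar * h + B_bar * xk
        ys.append(C @ h)
    return np.array(ys)
```

For instance, with a scalar state \(A = -1\) the recurrence decays toward a steady state at a rate set by \(\Delta\), mirroring the continuous dynamics of Eq. (1).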

Building on this foundation, the Structured State Space for Sequences (S4) 27 was introduced as an efficient SSM capable of handling long-range dependencies.

Mamba

Mamba 25 extends the S4 model by introducing input-dependent adaptability, allowing the state-space parameters to vary dynamically based on the input sequence. This enhancement enables the model to selectively propagate or forget information as needed, improving its ability to handle long-range dependencies while maintaining computational efficiency. Mamba builds on the S4 framework while introducing a hardware-optimized, parallelizable recurrent processing mode. This design significantly improves speed and scalability, making it particularly suited for applications in natural language processing, audio processing, genomics, and biosignal analysis.

Graph structure learning

Mamba and other state-space models learn channel representations independently and therefore fail to capture inter-channel correlations. We address this with two complementary modules: channel attention learning and channel correlation learning.

Channel attention 12 is a powerful mechanism in deep learning that has been widely applied across various domains to improve model performance. It enhances feature representation by emphasizing informative channels while suppressing less relevant ones, and has been used extensively in computer vision, for example in Squeeze-and-Excitation (SE) blocks 28, combined channel and spatial attention 29, and self-attention-based non-local modules 30. However, most of these modules require hyperparameter tuning, and they increase the overall parameter count of the model, potentially slowing training.

We propose a new low-cost channel attention mechanism that reduces model complexity and computational overhead. For channel correlation learning, we dynamically learn graph structures over time and aggregate representations from multiple graphs.

Method

Problem setup

We represent the multivariate biosignals as a graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, \textbf{W})\), where the set of nodes \(\mathcal {V}\) corresponds to the sensors (channels), \(\mathcal {E}\) is the set of edges (i.e., the correlations between channels), and \(\textbf{W}\) is the adjacency matrix. Since \(\mathcal {E}\) and \(\textbf{W}\) are unknown, they are learned by our model. Such a graph formulation of biosignals can support various node- and graph-level classification or regression tasks. In this work, we focus on graph classification, where the aim is to learn a function that maps an input biosignal \(\textbf{X}\) to a class label y:

$$\begin{aligned} f: \textbf{X} \rightarrow y, \quad y \in \{1, 2, \dots , C\} \end{aligned}$$

where \(C\) is the number of classes.

Model architecture

Let \(\textbf{X} \in \mathbb {R}^{N \times T \times M}\) denote a multivariate biosignal, where \(N\) is the number of sensors, \(T\) is the sequence length, and \(M\) is the input dimension of the signal (typically \(M = 1\)). Figure 1 illustrates the architecture of the proposed Mamba-based framework for biosignal analysis. The proposed architecture comprises five main components:

  1. Stacked Mamba layers with a linear layer to model temporal dependencies within each sensor independently, transforming the raw signals \(\textbf{X} \in \mathbb {R}^{N \times T \times M}\) to a latent representation \(\textbf{H} \in \mathbb {R}^{N \times T \times D}\), where \(D\) is the embedding dimension;

  2. A new low-cost channel attention layer that learns weights for each channel;

  3. A graph correlation layer that captures dynamically evolving adjacency matrices, \(\textbf{W}^{(1)}, \ldots , \textbf{W}^{\left( n_d\right) }\);

  4. GNN layers to model spatial dependencies among sensors, using the learned graph structures \(\textbf{W}^{(1)}, \ldots , \textbf{W}^{\left( n_d\right) }\) and node features \(\textbf{H}\). While our model supports any GNN layer type, we use graph attention networks in our experiments; and

  5. A temporal pooling layer and a graph pooling layer to aggregate temporal and spatial representations, respectively, followed by a fully connected layer that generates predictions for each multivariate biosignal.
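At the shape level, the five components above can be sketched as follows. The random projections, the scalar sigmoid gates, and the single correlation-based message-passing step are illustrative placeholders for the actual Mamba, attention, and GAT layers; none of the names or dimensions below come from the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, M, D, n_classes = 19, 120, 1, 16, 2   # sensors, length, input dim, embed dim, classes

X = rng.standard_normal((N, T, M))

# 1) Per-channel temporal encoder (stand-in for stacked Mamba layers): X -> H in R^{N x T x D}
W_embed = rng.standard_normal((M, D)) / np.sqrt(M)
H = X @ W_embed

# 2) Channel attention: one scalar gate per sensor (stand-in for the closed-form module)
scores = rng.standard_normal(N)
gate = 1.0 / (1.0 + np.exp(-scores))        # sigmoid
H = H * gate[:, None, None]

# 3) Graph correlation: adjacency from channel-feature correlations
F = H.mean(axis=1)                          # N x D node features (temporal pooling)
A = np.corrcoef(F)                          # N x N adjacency

# 4) One message-passing step (stand-in for a GNN/GAT layer)
A_norm = A / np.abs(A).sum(axis=1, keepdims=True)
F = np.tanh(A_norm @ F)

# 5) Graph pooling + linear classification head
z = F.mean(axis=0)                          # D
W_out = rng.standard_normal((D, n_classes))
logits = z @ W_out                          # one score per class
```

The point of the sketch is only the tensor flow \(N \times T \times M \rightarrow N \times T \times D \rightarrow N \times N\) graphs \(\rightarrow\) class logits; each placeholder is replaced by the corresponding learned module in the actual architecture.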

Next, we detail the design and integration of the stacked Mamba layers, channel attention modules, and graph correlation mechanisms.

Modeling with Mamba

Traditionally, multichannel biosignals are represented in either the time or frequency domain. To fully capture signal characteristics, we use a parallel Mamba structure to leverage the benefits of both representations, as shown in Fig. 2. The raw signal, represented in the time domain, is first normalized and then processed by a Mamba layer to extract temporal features. The frequency representation is obtained by computing the power spectral features, which are then processed by a separate Mamba layer. This parallel Mamba design ensures that both time- and frequency-domain features are effectively captured.
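As a concrete example of the frequency-branch input, a per-channel power spectrum can be computed with a simple periodogram. The exact estimator used in the model (e.g., Welch averaging or band aggregation) is not specified here, so this sketch is only illustrative:

```python
import numpy as np

def psd_features(x, fs):
    """One-sided periodogram of a 1-D signal.

    A minimal stand-in for the frequency-branch input; the model's actual
    PSD estimator (e.g., Welch averaging, band aggregation) may differ.
    """
    x = x - x.mean()
    n = len(x)
    spec = np.fft.rfft(x)
    psd = (np.abs(spec) ** 2) / (fs * n)    # one-sided periodogram
    psd[1:-1] *= 2                          # fold in negative frequencies
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

# Example: a 10 Hz sinusoid sampled at 200 Hz peaks at the 10 Hz bin.
fs = 200
t = np.arange(2 * fs) / fs
freqs, psd = psd_features(np.sin(2 * np.pi * 10 * t), fs)
peak = freqs[np.argmax(psd)]
```

Such spectral vectors are more compact and stationary than the raw waveform, which is why the frequency branch complements the time-domain Mamba layer.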

Fig. 2

Time and frequency Mamba block: The Mamba block combines linear projections, sequence transformations, and nonlinear operations. It includes a structured state-space model for capturing long-range dependencies and a convolutional module for local pattern extraction. Nonlinearities (activations or element-wise multiplications) are applied throughout, and residual connections preserve input information.

Multichannel biosignals often exhibit long-range temporal correlations, which Mamba can effectively capture. However, directly applying Mamba to multivariate signals by projecting the \(N\) signal channels into a hidden dimension using a linear layer may be suboptimal, as it disregards the underlying graph structure of the data. To address this, we use stacked Mamba layers to embed signals from each channel (sensor) independently, producing an embedding \(\textbf{H} \in \mathbb {R}^{N \times T \times D}\) for each input signal \(\textbf{X}\).

Low-cost channel attention module

The low-cost channel attention module is designed to enhance feature representations in neural networks by emphasizing informative channels while maintaining minimal computational overhead. Unlike conventional attention mechanisms, such as self-attention in Transformers or Squeeze-and-Excitation (SE) modules, which involve numerous parameters and increased computational burden, our module offers a lightweight solution that achieves effective feature refinement with fewer learnable parameters.

The core idea is to identify the most discriminative channels and assign them higher importance weights. We formulate this process as a classification problem and demonstrate that, with a specific choice of loss function, the classifier can be solved in closed form, thereby enabling efficient training.

We define a binary classification problem where the objective is to distinguish a candidate channel vector \(\textbf{p} \in \mathbb {R}^d\) from a set of N other channel vectors \(\{\textbf{x}_i\}_{i=1}^N\), where each \(\textbf{x}_i \in \mathbb {R}^d\). We treat \(\textbf{p}\) as the single positive example (label \(y_p = +1\)) and all other channels \(\textbf{x}_i\) as negative examples (label \(y_i = -1\)). We train a linear classifier defined by:

$$\begin{aligned} f(\textbf{x}) = \textbf{w}^{\top } \textbf{x} + b, \end{aligned}$$
(3)

where \(\textbf{w} \in \mathbb {R}^d\) and \(b \in \mathbb {R}\) are the weight vector and bias term, respectively. Unlike Euclidean distance, which treats all components equally, this classification score incorporates feature relevance through the learned weights \(\textbf{w}\).

To train the classifier, we minimize the following regularized loss function:

$$\begin{aligned} \mathcal {F}(\textbf{w}, b) = L(1, \textbf{w}^\top \textbf{p} + b) + \frac{1}{N} \sum _{i=1}^N L(-1, \textbf{w}^\top \textbf{x}_i + b) + \lambda \Vert \textbf{w} \Vert ^2, \end{aligned}$$
(4)

where \(L(y, y')\) is a loss function, and \(\lambda > 0\) is a regularization hyperparameter. For the squared loss \(L(y, y') = (y - y')^2\), the optimal solution can be derived in closed form. Let

$$\begin{aligned} \varvec{\mu }= \frac{1}{N} \sum _{i=1}^N \textbf{x}_i, \end{aligned}$$
(5)
$$\begin{aligned} \varvec{\Sigma }= \frac{1}{N} \sum _{i=1}^N (\textbf{x}_i - \varvec{\mu })(\textbf{x}_i - \varvec{\mu })^\top , \end{aligned}$$
(6)

be the mean and covariance matrix of the negative samples. The squared Mahalanobis distance is defined as:

$$\begin{aligned} \kappa (\textbf{p}) = (\textbf{p} - \varvec{\mu })^\top \varvec{\Sigma }^{-1} (\textbf{p} - \varvec{\mu }). \end{aligned}$$
(7)

For each channel \(c\), the Mamba encoder outputs \(H_c\in \mathbb {R}^{T\times D}\). We form the channel embedding \(\textbf{p}\in \mathbb {R}^{d}\) by global average pooling (over time and the parallel frequency branch) to a \(D\)-dimensional vector, followed by a linear projection to dimension \(d\).

The closed-form solutions for the optimal parameters and classification cost are:

$$\begin{aligned} \textbf{w}^*= \frac{2}{2 + 2\lambda + \kappa (\textbf{p})} \varvec{\Sigma }^{-1}(\textbf{p} - \varvec{\mu }), \end{aligned}$$
(8)
$$\begin{aligned} b^*= -\frac{1}{2}(\textbf{p} + \varvec{\mu })^\top \textbf{w}^*, \end{aligned}$$
(9)
$$\begin{aligned} \mathcal {F}^*= \frac{1}{2 + 2\lambda + \kappa (\textbf{p})}. \end{aligned}$$
(10)

Here, \(\mathcal {F}^*\) quantifies the discriminability of channel \(\textbf{p}\): lower values indicate better separability from other channels. Consequently, \(\kappa (\textbf{p})\) can be used to rank channel importance, with larger values implying higher discriminability.

The channel attention mechanism is implemented as a layer. For an input signal \(\textbf{X} \in \mathbb {R}^{C \times T}\) with C channels and T time steps, the output \(\widetilde{\textbf{X}}\) is computed as:

$$\begin{aligned} \textbf{E}= \alpha \kappa (\textbf{p}) + \beta , \end{aligned}$$
(11)
$$\begin{aligned} \widetilde{\textbf{X}}= \sigma (\textbf{E}) \odot \textbf{X}, \end{aligned}$$
(12)

where \(\alpha , \beta \in \mathbb {R}\) are learnable scaling and shifting parameters, \(\sigma (\cdot )\) denotes the sigmoid function, and \(\odot\) represents element-wise multiplication.
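Eqs. (5)-(12) can be sketched end-to-end as follows. The per-channel leave-one-out construction of negatives and the small ridge term added to \(\varvec{\Sigma }\) for invertibility are our assumptions, not details given in the text:

```python
import numpy as np

def channel_scores(P, lam=0.01, eps=1e-6):
    """Closed-form channel discriminability (Eqs. 5-10).

    P : (C, d) matrix of channel embeddings; each row is one channel's p.
    For each channel, all remaining channels serve as negatives.
    Returns kappa (Mahalanobis scores) and F_star (classification costs).
    """
    C, d = P.shape
    kappa = np.empty(C)
    for c in range(C):
        neg = np.delete(P, c, axis=0)
        mu = neg.mean(axis=0)                           # Eq. (5)
        Z = neg - mu
        Sigma = Z.T @ Z / len(neg) + eps * np.eye(d)    # Eq. (6), ridge-regularized
        diff = P[c] - mu
        kappa[c] = diff @ np.linalg.solve(Sigma, diff)  # Eq. (7)
    F_star = 1.0 / (2 + 2 * lam + kappa)                # Eq. (10)
    return kappa, F_star

def channel_attention(X, kappa, alpha=1.0, beta=0.0):
    """Gate each channel of X (C x T) by sigmoid(alpha * kappa + beta), Eqs. (11)-(12)."""
    E = alpha * kappa + beta
    return (1.0 / (1.0 + np.exp(-E)))[:, None] * X
```

Note the consistency with the text: since \(\kappa (\textbf{p}) \ge 0\), the cost \(\mathcal {F}^*\) is bounded above by \(1/(2 + 2\lambda )\), and larger \(\kappa\) (more discriminative channels) yields lower cost and a larger attention gate.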

Channel correlation learning

Multivariate biosignals often lack known sensor configurations, and their inter-channel relationships may vary over time. To handle this, we introduce a channel correlation learning module that dynamically infers graph structures over fixed-length time intervals. Let the sequence length be T and define a time resolution t such that T is divisible by t. The number of dynamic graphs is then \(n_d = T / t\).

For the t-th interval, we compute a hybrid adjacency matrix \(\textbf{W}^{(t)}\) by combining geometry-based and feature-based graphs. The geometry-based graph is defined as:

$$\begin{aligned} {[}\textbf{W}_{\text {geo}}^{(t)}]_{ij} = \exp (-\gamma \Vert \textbf{x}_i - \textbf{x}_j\Vert _1), \end{aligned}$$
(13)

where \(\Vert \cdot \Vert _1\) denotes the Manhattan distance, \(\textbf{x}_i\) and \(\textbf{x}_j\) are geometric embeddings of channels i and j derived from electrode/lead coordinates (e.g., EEG 10–20 montage positions or ECG lead topology). \(\gamma > 0\) is a hyperparameter.

The feature-based correlation matrix \(\overline{\textbf{W}}^{(t)}\) is computed using normalized cross-correlation (NCC). Given two vectors \(\textbf{X}, \textbf{Y} \in \mathbb {R}^n\), the NCC is:

$$\begin{aligned} \text {NCC}(\textbf{X}, \textbf{Y}) = \frac{\sum _{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum _{i=1}^n (x_i - \bar{x})^2} \cdot \sqrt{\sum _{i=1}^n (y_i - \bar{y})^2}}, \end{aligned}$$
(14)

where \(\bar{x} = \frac{1}{n} \sum _{i=1}^n x_i\) and \(\bar{y} = \frac{1}{n} \sum _{i=1}^n y_i\) are the means of \(\textbf{X}\) and \(\textbf{Y}\), respectively.

The final adjacency matrix is given by:

$$\begin{aligned} \textbf{W}^{(t)} = \lambda \textbf{W}_{\text {geo}}^{(t)} + (1 - \lambda ) \overline{\textbf{W}}^{(t)}, \end{aligned}$$
(15)

where \(\lambda \in [0, 1)\) is a weighting hyperparameter. To improve computational efficiency and enforce graph sparsity, we prune edges with weights below a threshold \(\kappa\):

$$\begin{aligned} {[}\textbf{W}^{(t)}]_{ij} = 0 \quad \text {if} \quad [\textbf{W}^{(t)}]_{ij} < \kappa . \end{aligned}$$
(16)

Since signal correlation is symmetric, the resulting graphs are treated as undirected.
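A minimal sketch of the per-interval graph construction (Eqs. (13)-(16)), with random features and coordinates standing in for the learned channel representations and the real electrode geometry:

```python
import numpy as np

def ncc_matrix(H):
    """Normalized cross-correlation between channel feature vectors (Eq. 14)."""
    Z = H - H.mean(axis=1, keepdims=True)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12
    return Z @ Z.T

def hybrid_adjacency(H, coords, gamma=1.0, lam=0.5, kappa=0.1):
    """Hybrid adjacency for one interval (Eqs. 13, 15, 16).

    H      : (N, f) per-channel features for the interval.
    coords : (N, k) geometric embeddings (e.g., electrode positions).
    """
    # Geometry-based graph: exp(-gamma * Manhattan distance), Eq. (13)
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)
    W_geo = np.exp(-gamma * dist)
    W = lam * W_geo + (1 - lam) * ncc_matrix(H)      # Eq. (15)
    W[W < kappa] = 0.0                               # Eq. (16): prune weak edges
    return (W + W.T) / 2                             # undirected graph

# One graph per interval: split a length-T sequence into n_d = T // t chunks.
rng = np.random.default_rng(0)
T, t, N = 120, 30, 6
X = rng.standard_normal((N, T))
coords = rng.standard_normal((N, 2))
graphs = [hybrid_adjacency(X[:, k * t:(k + 1) * t], coords) for k in range(T // t)]
```

Because both \(\textbf{W}_{\text {geo}}^{(t)}\) and the NCC matrix are symmetric, the pruned hybrid graph remains symmetric, matching the undirected-graph assumption above.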

Experiment

Description of the datasets

TUSZ dataset for EEG-based epileptic seizure detection: We first assess the proposed model on EEG-based seizure detection using the publicly available Temple University Hospital Seizure Detection Corpus (TUSZ) 31. We follow the experimental setup from 32, performing seizure detection on 60-second EEG clips. The data includes 19 EEG sensors sampled at 200 Hz, resulting in a sequence length of 12,000 time steps per clip. The task is binary classification: determining whether a 60-second EEG clip contains a seizure. In the dynamic graph correlation learning layer, the resolution is set to 10 seconds (2,000 time steps), inspired by the practice of trained EEG readers. Consistent with prior studies and benchmarks, we evaluate the model’s performance using AUROC, F1-score, area under the precision-recall curve (AUPRC), sensitivity, and specificity, with AUROC as the primary classification metric. The ROC curve illustrates the trade-off between the true positive rate (TPR) and the false positive rate (FPR) across classification thresholds; the AUROC summarizes this curve into a single scalar, providing a threshold-independent measure of the model’s performance. The AUPRC is the area under the precision-recall (PR) curve, which shows the trade-off between precision and recall across decision thresholds.
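The threshold-independence of AUROC can be made concrete through its rank-statistic form (equivalent to the Mann-Whitney U statistic): the probability that a randomly chosen positive is scored above a randomly chosen negative. This illustrative implementation ignores tied scores:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney rank statistic (no thresholding).

    Equals the probability that a random positive outranks a random
    negative; ties in `scores` are not handled in this sketch.
    """
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # ranks 1..n
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Because only the ordering of scores matters, the value is unchanged by any monotone rescaling of the model outputs, unlike F1, sensitivity, or specificity, which all depend on the chosen decision threshold.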

DOD-H dataset for EEG-based sleep stage classification: The DOD-H dataset includes EEG recordings from healthy participants across multiple nights of sleep 33. Each subject’s data corresponds to a full-night recording, with sleep stage labels provided for 30-second epochs. Each epoch is categorized into one of five stages: Wake, N1, N2, N3, or REM. The EEG is sampled at 250 Hz. Following 34, we use macro-F1 score and Cohen’s Kappa as the evaluation metrics.

ICBEB dataset for ECG-based cardiac disease classification 34: The dataset consists of 6,877 12-lead ECG recordings (12 channels, sampled at 100 Hz) with durations ranging from 6 to 60 seconds, resulting in sequence lengths between 600 and 6,000 time steps; during training, shorter sequences are padded with zeros. The nine classes comprise one normal class and eight abnormal classes representing various diseases. Each record is annotated with up to three labels chosen from this set; to create a multi-label dataset, we take the union of all labels assigned to each record, making this a multi-label classification task. Following prior work 35,36, we use macro-F1, macro-F2, macro-G2, and macro-AUROC as the evaluation metrics.

Data splitting strategy

We split each of the three datasets into train, validation and test sets as follows:

  • Epileptic Seizure Detection (TUH EEG): In line with prior work 32, we randomly split the official TUSZ training set into training and validation subsets using a 90/10 patient-level split for model training and hyperparameter tuning, respectively. For final evaluation, we used the official TUSZ test set, excluding five patients that overlapped with the training set to avoid data leakage. As a result, the training, validation, and test sets comprise mutually exclusive patient groups.

  • Sleep Stage Classification (DOD-H EEG): We adopt the same subject-wise 60/20/20 split as prior works such as SimpleSleepNet 37 and RobustSleepNet 34. Each split uses mutually exclusive subjects to avoid data overlap.

  • Cardiovascular Disease Classification (ICBEB ECG): Since the official test set is not publicly available, we replicate the 10-fold stratified split described in 38. Folds 1–8 are used for training, fold 9 for validation, and fold 10 for testing. This setup is consistent with baseline models on the ICBEB dataset.

Experiment setup

Each experiment was conducted with three runs using different random seeds, and we report the mean and standard deviation of the results. The regularization weight \(\lambda\) was selected based on validation performance using the wandb sweep hyperparameter tuning tool, with a grid search over the set \(\{0.001, 0.01, 0.1, 1.0\}\). We selected \(\lambda = 0.01\) for the seizure detection and sleep staging tasks and \(\lambda = 0.1\) for ECG classification.

Experiment result

EEG-based epileptic seizure detection: Table 1 compares the proposed Mamba-based model with existing approaches on the TUSZ dataset for seizure detection. Our model achieves the highest performance across all evaluation metrics, including AUROC, F1-Score, AUPRC, sensitivity, and specificity, thereby establishing a new state-of-the-art.

It is worth noting that certain baselines (e.g., GraphS4MER  39) report relatively high F1-Scores despite lower AUPRC values. This apparent discrepancy arises from the complementary nature of these metrics and the strong class imbalance present in seizure detection: AUPRC summarizes precision–recall performance over all decision thresholds and is highly sensitive to the low prevalence of seizure events, leading to lower values. By contrast, F1-Score is computed at a single threshold (selected on the validation set to maximize F1) and therefore may remain relatively high if precision and recall are balanced at that particular operating point, even when the overall precision–recall trade-off is weaker (Table 2).

For consistency and fairness, we adopt a unified evaluation protocol: (i) thresholds for F1, sensitivity, and specificity are selected using the validation set, following prior work 32,39, and (ii) AUROC and AUPRC are computed directly from the continuous model scores without thresholding, providing threshold-independent measures of ranking quality. This ensures comparability across methods and highlights the robustness of our approach, which yields improvements under both threshold-dependent and threshold-free metrics.

Table 1 Results of seizure detection on TUSZ dataset.
Table 2 Computational cost comparison of channel attention modules.
Table 3 Results of ECG classification on ICBEB dataset.
Table 4 Model performance (AUROC) for individual ECG classes.
Table 5 Ablation study for EEG-based seizure detection on TUSZ dataset.
Table 6 Results on sleep staging classification on DOD-H dataset.
Table 7 Ablation Study on DOD-H and ICBEB datasets.

ECG-based cardiovascular disease classification results on ICBEB dataset: Table 3 provides ECG classification results on the ICBEB dataset. As can be seen, our model achieves consistent and superior results across all metrics, showcasing its general robustness. The improvements in macro-G2 score highlight the ability of the model to accurately classify underrepresented classes, which are often overlooked in imbalanced datasets. For further analysis, classification results for individual ECG classes are shown in Table 4. The model shows consistent improvements across all nine ECG classes. Certain classes, such as LBBB (left bundle branch block) and STE (ST-elevation), are known to exhibit data imbalance, with fewer samples available; our model nevertheless maintains high performance on these classes (Table 5).

EEG-based sleep stage classification using DOD-H dataset: The result on DOD-H dataset is given in Table 6. Our model achieves the highest performance on the DOD-H sleep staging classification task, with a Macro-F1 of 0.830 and Kappa of 0.814. Compared to prior methods such as GraphS4MER, it provides over 5% improvement on both metrics. This demonstrates the effectiveness of our temporal-frequency modeling and efficient channel attention in capturing complex sleep dynamics.

Training overhead of proposed low-cost channel attention module

We now discuss the training overhead of the proposed low-cost channel attention module by providing a detailed comparison of the number of added parameters and the training time per epoch against existing alternatives. The reported values are averaged over the three tasks. As shown in Table 2, our method requires significantly fewer parameters and results in a shorter training time per epoch than related works such as ECAnet 44, CBAM 29, and SKNet 45, demonstrating its computational efficiency.

Computational complexity for \(\Sigma\) and \(\Sigma ^{-1}\): In a mini-batch with \(B\) samples and \(N\) channels each, stack embeddings into \(P\in \mathbb {R}^{(BN)\times d}\). The shared covariance \(\Sigma =\frac{1}{BN-1}(P-\mu )^\top (P-\mu )\) incurs \(O(BN\,d^2)\) time and \(O(d^2)\) memory. Computing \(\Sigma ^{-1}\) (e.g., via Cholesky) costs \(O(d^3)\) time and \(O(d^2)\) memory per step. These costs are independent of \(T\) and \(D\); only covariance formation scales with \(BN\), while the inversion depends solely on \(d\). We measure the per-step wall-clock time attributable to \(\Sigma /\Sigma ^{-1}\) on all three datasets (see Table 8).

Table 8 Overhead due to covariance formation \(\Sigma\) and its inversion \(\Sigma ^{-1}\) (Eq. (7)).
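The covariance formation and Cholesky-based inversion described above can be sketched as follows; the ridge term added for numerical stability is our assumption:

```python
import numpy as np

def shared_covariance_inverse(P, eps=1e-6):
    """Shared covariance over stacked channel embeddings and its inverse.

    P : (B*N, d) stacked embeddings. Forming the covariance costs
    O(B*N*d^2) time and O(d^2) memory; the Cholesky-based inversion
    costs O(d^3), independent of the sequence length T and embed dim D.
    """
    mu = P.mean(axis=0)
    Z = P - mu
    Sigma = Z.T @ Z / (len(P) - 1) + eps * np.eye(P.shape[1])
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T
    L_inv = np.linalg.solve(L, np.eye(len(L)))  # invert the triangular factor
    return Sigma, L_inv.T @ L_inv               # Sigma^{-1} = L^{-T} L^{-1}
```

Since \(d\) is a small, fixed embedding dimension, the \(O(d^3)\) inversion is a constant per-step cost, which is why the measured overhead in Table 8 stays small regardless of sequence length.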

Ablation study

We conduct ablation studies to investigate the contribution of two key components in our model. The results are shown in Tables 5 and 7. First, to evaluate the role of the low-cost channel attention layer, we replace it with several representative channel attention modules, including ECA-Net 44, SE-Net 51, CBAM 29, and GC 52. While these alternatives are well-established and effective, they introduce higher computational and parameter overhead. Our results show that the proposed low-cost attention mechanism achieves competitive channel selection with substantially fewer parameters. Second, we examine the effect of combining temporal and frequency Mamba layers. We compare three configurations: using only the frequency Mamba layer, using only the time Mamba layer, and removing the Mamba layer entirely. Both the time-only and frequency-only variants outperform the Mamba-free baseline, demonstrating that each component independently enhances temporal modeling. Furthermore, integrating both layers yields the best performance, highlighting the complementary strengths of time and frequency representations and underscoring the value of capturing multi-scale temporal dependencies.

Further, in order to study the interaction between dynamic graph learning and channel attention modules, we conduct an additional set of experiments in which we selectively enable or disable the dynamic graph learning and channel attention modules in combination. The results reveal that the two modules are complementary rather than conflicting. Specifically, the channel attention mechanism operates at the individual channel level to emphasize discriminative signals based on temporal and spectral properties, whereas the dynamic graph module captures inter-channel relationships that evolve over time. Because the attention mechanism enhances the quality of individual node (channel) features before they are processed by the graph structure, it can in fact improve the quality of the learned graph representations. In particular, we observe that enabling channel attention improves the signal-to-noise ratio of node features, leading to more informative graph edges and better performance in downstream tasks. Conversely, when dynamic graph learning is applied without channel attention, the model is more susceptible to spurious correlations caused by noisy or irrelevant channels.

Fig. 3

Panels 1(a–e) Mean EEG adjacency matrices computed from correctly classified test segments for (a) focal seizures, (b) generalized seizures, and (c) non-seizure; difference maps (d) focal minus non-seizure and (e) generalized minus non-seizure. Panels 2(a–e) Mean PSG adjacency matrices for (a) WAKE, (b) N1, and (c) N3; difference maps (d) WAKE–N1 and (e) WAKE–N3. Panels 3(a–e) Mean ECG adjacency matrices for (a) normal rhythm, (b) first-degree atrioventricular block (I-AVB), and (c) left bundle branch block (LBBB); difference maps (d) I-AVB–normal and (e) LBBB–normal.

Model interpretation

We evaluated whether the learned graphs are meaningful by visualizing the average adjacency matrices for EEG, PSG, and ECG signals from correctly predicted test samples as in Fig. 3. These adjacency matrices are grouped according to seizure classes, sleep stages, and ECG classes, respectively. To quantify differences between any two mean adjacency matrices, we calculated the mean and standard deviation of their absolute differences.
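The mean adjacency matrices and difference maps described above reduce to simple per-class averaging. The following sketch shows one way to compute them, under the assumption that the learned graphs are available as a stack of per-sample adjacency matrices; `class_difference_map` is a hypothetical helper, not part of the released code.

```python
import numpy as np

def class_difference_map(adjs, labels, cls_a, cls_b):
    """Mean adjacency per class and their difference map.

    adjs:   array (S, C, C) of learned adjacency matrices over S
            correctly predicted test samples with C channels.
    labels: length-S array of class labels.
    Returns the difference map mean_A - mean_B together with the
    mean and standard deviation of its absolute entries.
    """
    adjs = np.asarray(adjs)
    labels = np.asarray(labels)
    mean_a = adjs[labels == cls_a].mean(axis=0)
    mean_b = adjs[labels == cls_b].mean(axis=0)
    diff = mean_a - mean_b
    abs_diff = np.abs(diff)
    return diff, abs_diff.mean(), abs_diff.std()
```

The returned mean/std of the absolute differences correspond to the scalar summaries used to compare any two class-specific graphs.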

For EEG (First Row): The Generalized Seizures adjacency matrix shows higher overall connectivity (darker red blocks) than Focal Seizures and Non-Seizure. This suggests more widespread synchronization among channels during generalized seizures, consistent with literature indicating that generalized seizures often exhibit globally synchronized brain activity. Comparing the two difference maps, the map for Generalized Seizures versus Non-Seizure shows stronger positive regions than that for Focal Seizures versus Non-Seizure. This implies that generalized seizures deviate more from the non-seizure state in terms of overall connectivity, whereas focal seizures exhibit less pronounced global changes. These findings align with the established clinical understanding that generalized seizures typically involve more widespread, synchronized neuronal activity than focal seizures.

For PSG (Middle Row): As sleep deepens from Wake to N1 and then to N2, the connectivity patterns evolve. The difference map between Wake and N2 exhibits more pronounced variations compared to the difference between Wake and N1, reflecting the greater physiological changes associated with deeper sleep stages.

For ECG (Bottom Row): The I-AVB matrix closely resembles the normal matrix, indicating that first-degree AV block is associated with only subtle conduction changes. In contrast, the LBBB matrix shows more significant deviations, particularly in the precordial leads, which is consistent with the pronounced conduction abnormalities observed in LBBB. The corresponding difference maps further highlight these distinctions: the differences for I-AVB versus normal are relatively minor, while those for LBBB versus normal are substantially larger.

Statistical analysis of class-specific connectivity in EEG and ECG graphs

To quantitatively assess the differences between class-specific graphs, we performed the following statistical analysis: for each pair of classes (e.g., Focal vs. Non-Seizure), we computed edge-wise two-sample t-tests on the adjacency weights across all test samples. Each edge (i, j) was tested for a statistically significant difference between the two groups at significance level \(\alpha = 0.05\) (Table 9).
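The edge-wise testing procedure can be sketched with SciPy as follows. This is an illustrative reconstruction, not the paper's code: `edgewise_ttests` is a hypothetical helper, edges are reported as channel indices rather than electrode names, and the top-3 ranking considers upper-triangular edges (i < j).

```python
import numpy as np
from scipy import stats

def edgewise_ttests(adjs_a, adjs_b, alpha=0.05):
    """Edge-wise two-sample t-tests between two groups of graphs.

    adjs_a, adjs_b: arrays (S_a, C, C) and (S_b, C, C) of adjacency
    weights across test samples for the two classes. Returns the
    p-value matrix, the fraction of significant edges at level
    alpha, and the three most significant upper-triangular edges.
    """
    _, p = stats.ttest_ind(adjs_a, adjs_b, axis=0)  # per-edge t-tests
    sig_fraction = float((p < alpha).mean())
    # Rank edges (i < j) by ascending p-value and keep the top 3.
    C = p.shape[0]
    iu = np.triu_indices(C, k=1)
    order = np.argsort(p[iu])[:3]
    top3 = [(int(iu[0][k]), int(iu[1][k]), float(p[iu][k])) for k in order]
    return p, sig_fraction, top3
```

The percentage of significant edges and the top-3 pairs returned here correspond to the SSE summaries reported in Table 9.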

The results obtained for the three downstream tasks (corresponding to the three datasets) are summarized in Table 9. Specifically, Table 9 provides information about the top three statistically significant connectivity edges identified between each group pair, along with their corresponding p-values. Below, we go over the main findings for the three datasets, one by one.

EEG-based seizure detection: Focal vs. Non-Seizure and Generalized vs. Non-Seizure both exhibit moderate levels of significant connectivity differences (17.6% and 23.8%, respectively). Notably, the most significant edge in the focal comparison is between P4 and P3 (\(p = 0.012\)), indicating localized alterations in posterior-parietal connectivity. In contrast, the generalized group highlights frontal and temporal connections such as T3–F7 and T4–F8, consistent with the broader network disruptions expected in generalized seizures.

EEG-based sleep stage classification: The comparisons Wake vs. N1 and Wake vs. N2 yield relatively higher percentages of significant edges (29.2% and 26.9%, respectively), indicating more widespread connectivity changes during the transition from wakefulness to sleep. The most prominent differences involve frontal-posterior and frontal-central interactions (e.g., FP1_O1–FP1_M2 and EOG1–EOG2), which reflect global shifts in brain activity during sleep onset.

ECG-based cardiac disease diagnosis: LBBB vs. Normal shows the highest percentage of significant edges (64.5%), with very low p-values for all top edges (e.g., V1–V2, \(p = 0.011\)), indicating robust and widespread connectivity alterations across ECG leads. I-AVB vs. Normal also displays substantial differences (55.6%), particularly involving limb and precordial leads (e.g., II–V2, aVF–V2), likely reflecting delayed atrioventricular conduction. All reported p-values are below 0.05, with the majority in the range of 0.011 to 0.027, meeting standard thresholds for statistical significance. These findings demonstrate the utility of connectivity-based features for distinguishing between diverse neurological and physiological conditions, with the strongest effects observed in cardiac abnormalities and sleep-stage transitions.

Table 9 Summary of group-wise connectivity analysis: percentage of Statistically Significant Edges (SSE) and top 3 SSE pairs (Node i, Node j, p-value).

In summary, the adjacency matrices and their difference maps reveal clinically meaningful patterns across EEG, PSG, and ECG signals. These observations support known distinctions in neuronal synchronization for seizure types, the progression of sleep stages, and the morphological differences in cardiac conduction abnormalities.

Conclusion

In this paper, we introduced a unified framework for multivariate biosignal modeling that integrates long-range temporal modeling via Mamba, low-cost channel attention for efficient feature selection, and dynamic graph structure learning for adaptive spatial representation. Through extensive experimentation on EEG, ECG, and sleep stage classification tasks, our method consistently achieved state-of-the-art performance, outperforming strong baselines and demonstrating its ability to handle complex spatio-temporal dependencies and imbalanced datasets. The results confirm that combining time-frequency representations with dynamically learned graphs enables more accurate and robust predictions. Importantly, the ablation studies demonstrate the significance of each model component in driving overall performance. This work sets a foundation for future research into scalable, efficient, and interpretable biosignal analysis, with broad potential for real-world applications in clinical diagnostics and neuroscience research.

Although our framework is designed to learn dynamic spatial relationships across multiple sensors, it remains effective in few-sensor settings by leveraging the temporal modeling capabilities of Mamba layers and the channel attention mechanism. In such cases, the graph learning module simplifies to capture intra-channel dependencies or uses a degenerate graph structure. Additionally, prior knowledge—such as anatomical priors or cross-subject statistics—can be incorporated to compensate for limited spatial information. While spatial modeling benefits are reduced, the model maintains strong performance in low-channel scenarios.

Regarding future work, while the proposed model achieves strong performance across diverse benchmarks, challenges remain in computational efficiency, generalizability, and clinical readiness. Future efforts will focus on optimizing deployment through robust handling of label noise, distribution shift, and post-deployment drift, in line with recent systems-oriented ML pipelines for healthcare 54,55. We will further extend validation to noisier, heterogeneous, and multi-center datasets, and explore multi-task and multi-modal regimes (e.g., EEG+ECG, EEG+PPG) with domain generalization and subject-specific adaptation. These directions aim to enhance interpretability, scalability, and real-world clinical applicability.