Introduction

Motivations

Mental workload (MWL) indicates the cognitive effort needed to complete a task and is a vital factor in neuroergonomics, cognitive neuroscience, and human–machine interaction. Accurate estimation of MWL allows adaptive systems to enhance safety and performance in high-stakes environments like driving, air-traffic management, medical monitoring, and intelligent tutoring, settings where overload and underload can impair decision-making and raise risks1,2. EEG is beneficial for MWL monitoring because it offers a direct, time-sensitive measurement of brain activity. Nonetheless, classifying MWL using EEG remains difficult due to issues such as strong nonstationarity, nonlinear dynamics, and inter-individual variability3. Moreover, MWL signatures often span multiple frequency bands and channels, necessitating methods that can analyze both spectro-temporal patterns and dependencies across channels.

A key technical challenge in MWL classification systems is the quality of the signal representation they rely on. Traditional spectral techniques, such as FFT/PSD, summarize frequency information globally but often overlook transient or rapidly changing workload effects. Even advanced time–frequency methods such as the STFT and CWT are limited by a fundamental trade-off between time and frequency resolution, often producing blurred representations due to energy smearing. This hampers the ability to distinguish workload states clearly4,5 and leads downstream models to rely on poorly localized features, which can degrade their robustness and ability to generalize.

A second limitation relates to the common way multichannel EEG is modeled. Many existing methods rely on handcrafted features or analyze channels separately, thereby neglecting the inter-channel relationships that indicate functional coupling and distributed neural activity. Overlooking these spatial dependencies can lead to features that are less consistent across different subjects or experimental conditions6. Although deep learning, especially CNN-based methods, has advanced automatic feature extraction from EEG3, typical CNNs often treat EEG inputs as static 2D patterns and may not adequately highlight the most relevant time–frequency regions unless specific mechanisms are incorporated7,8,9,10. Therefore, there is a clear need for a framework that (i) offers a sharper, more informative time–frequency representation for nonstationary EEG, (ii) maintains the multichannel structure rather than analyzing channels separately, and (iii) adaptively concentrates the learning on the most discriminative spectro-temporal features.

Motivated by these gaps, this study introduces a multi-stage MWL classification framework. It combines the multivariate synchrosqueezing transform (MSST)11, which provides detailed multichannel time–frequency analysis, with a CNN augmented by a time–frequency attention module (CNN-TFAN) to improve feature learning. Semi-supervised discriminant analysis (SDA) and a support vector machine (SVM) classifier then support reliable decision-making. This approach directly addresses smeared time–frequency representations and unstructured multichannel modeling, aiming to enhance resolution, interpretability, and classification stability on standard MWL benchmarks.

Related works

With this motivation, the following section reviews (i) EEG-based MWL assessment studies, (ii) time–frequency and decomposition-based representations for cognitive-state classification, and (iii) deep and attention-based architectures designed to improve robustness and generalization in multichannel EEG learning.

In recent years, EEG signals have become increasingly crucial for measuring cognitive load in human-computer interaction, intelligent education, and mental performance tracking. Early research indicates that variations in the brain’s electrical activity can reliably reflect a person’s cognitive load in both real-world and laboratory environments. For example, real-time EEG monitoring of remote maintenance operators showed that changes in power in specific frequency bands can effectively indicate mental workload12. Studies with n-back tasks have also demonstrated a clear link between cognitive difficulty and EEG responses13. Additionally, examining the role of cognitive load in creating intelligent tutoring systems underscores the need for accurate, real-time assessment of this metric within advanced learning frameworks14,15.

To enhance the accuracy of cognitive load measurement, special focus has been given to processing and extracting relevant features from EEG data. Using adaptive and fixed wavelet transforms alongside multi-domain optimization has notably improved the distinction between different cognitive load levels16,17. The redundant adaptive discrete wavelet transform (RADWT) has proven effective in increasing frequency resolution and detection precision18. Other techniques include variational mode decomposition (VMD)-based spectral analysis combined with feature selection via the LightGBM algorithm19, singular spectrum analysis (SSA) and circulant SSA paired with metaheuristic algorithms20, and traditional wavelet methods for assessing cognitive load21. Moreover, comprehensive analytical reviews have compared various EEG preprocessing techniques for cognitive load detection22, highlighting the importance of initial processing quality in achieving better classification results. Signal decomposition methods were also widely used for EEG classification, especially in motor imagery (MI) BCIs23,24, where multivariate VMD/EWT frameworks have reported strong accuracy by separating task-relevant oscillatory components. However, mental workload EEG often exhibits subtler, more nonstationary patterns distributed across bands, making direct comparisons with MI-focused results nontrivial. Methodologically, decomposition approaches explicitly split signals into a limited set of modes, requiring mode selection.

Alongside progress in feature extraction, machine learning and deep learning have become key approaches for modeling cognitive load. Early on, statistical models like the hidden Markov model (HMM) were used to develop generalizable pipelines25. Later, more sophisticated models such as bidirectional long short-term memory (BLSTM)-long short-term memory (LSTM) networks, combined with evolutionary algorithms26, hybrid CNN-LSTM models27, and brain connectivity-based models1 were adopted. Multi-task deep networks, for example, EEGMeNet, showed strong results in joint learning and classification stability28. Additionally, unsupervised clustering methods29 and transfer learning approaches, especially for cross-session and cross-subject applications, have been introduced to better simulate real-world conditions2,30. Some studies have integrated EEG with other data, such as eye-tracking, which has led to enhanced classification accuracy31.

Research in this area often uses time-frequency representations and attention mechanisms to isolate critical signal information. For instance, the attention-based recurrent fuzzy network (ARFN) model achieved high accuracy in classifying cognitive load by utilizing fuzzy recurrent attention32. Transformer-based models such as MST-Net have demonstrated strong performance by leveraging multi-scale time-frequency features33. Applying BLSTM on these features has also improved the detection of load levels on an individual basis34. Furthermore, IoT-driven hybrid models35 and semi-supervised methods focusing on time-frequency analysis have been developed for real-time cognitive load assessment. Channel-wise feature optimization and spatial pattern techniques have enhanced model generalizability36,37. Additionally, functional brain connectivity analysis has facilitated multi-class classification of cognitive load38. In addition to these efforts, new methods have explored the theoretical aspects of cognitive load and how to incorporate them into educational system design15,39. The integration of AI-based tools with technologies such as functional near-infrared spectroscopy (fNIRS) has expanded opportunities to monitor and analyze cognitive load40. Furthermore, there is a focus on improving the reproducibility of results and developing clear evaluation standards within the EEG and cognitive load fields, which continue to pose significant challenges2.

Recent studies have shown that advanced information-theoretic features and attention-based deep learning models can significantly improve EEG analysis of cognitive and emotional states41. In particular, mutual information–based features have been effective at distinguishing complex mental conditions. For example, a framework combining normalized mutual information with a self-optimized Gaussian kernel radial-basis-function extreme learning machine was developed to decode different states, yielding strong results through adaptive hyperparameter tuning and meaningful feature extraction. These findings highlight the importance of capturing nonlinear relationships in EEG signals to improve classification accuracy. Beyond feature-driven methods, graph-based deep learning models are increasingly popular for explicitly modeling connections between EEG channels. Graph attention convolutional neural networks were used in42 to detect driver fatigue, where mutual information was used to construct connectivity-aware graphs that guide attention mechanisms in an end-to-end system. These approaches emphasize the importance of spatial dependencies and adaptive attention in representing distributed brain activity across channels. Additionally, reviews of EEG-based emotion and cognitive state detection highlight key challenges, including non-stationarity, feature robustness, and attention modeling43. Overall, there is a clear trend toward integrating advanced time–frequency analysis, attention mechanisms, and deep learning architectures to make EEG systems more accurate, generalizable, and easier to interpret.

Recent progress in EEG-based brain–computer interface (BCI) research has emphasized that achieving reliable subject-independent generalization remains a key challenge. Large-cohort studies, such as those by Sadiq et al.44 and Yu et al.45, reveal that effective BCI systems must handle significant inter-subject variability, nonstationarity, and differences in neural patterns among users. These studies indicate that scalable and adaptable BCI frameworks rely on thoughtfully designed signal representations and learning architectures, especially when tested on datasets involving many subjects using subject-independent protocols. While these studies offer valuable insights, it is essential to recognize that classifying mental workload differs from motor and mental imagery paradigms, both in experimental setup and dataset availability. Unlike motor imagery research—where extensive datasets with 60 or more participants are available—public mental workload datasets with standardized protocols are scarce. Consequently, datasets like STEW and MAT have become standard benchmarks in this area and are frequently cited in recent research to facilitate fair comparisons and reproducibility.

Contributions

This paper outlines a framework for classifying mental workload using EEG signals. It combines multivariate time-frequency analysis with attention-based deep learning, offering several advancements over earlier methods. The key contributions of this work are as follows:

  • Development of a multi-stage framework for MWL classification: A comprehensive method was devised, consisting of four key phases: signal preprocessing, time-frequency analysis via the MSST, deep feature extraction with a CNN integrated with a TFAN module, and ultimately, optimized joint dimensionality reduction and classification. This approach improves the model’s accuracy and robustness across diverse data scenarios.

  • Application and extension of the MSST for EEG signals of MWL: Unlike traditional methods such as the STFT, CWT, or single-channel SST, this study introduces a multivariate SST to simultaneously model spatial dependencies across EEG channels. This approach improves the clarity of the time-frequency representation and provides a deeper insight into inter-regional brain dynamics under various mental workload conditions.

  • Design of a time-frequency attention network (TFAN) module: A dual-branch attention module (temporal–spectral dual attention) is introduced, allowing the network to adaptively concentrate on the most important regions in time and frequency domains. This module comprises temporal and spectral attention branches, each detecting key patterns with distinct convolutional filters. Incorporating this module into the CNN architecture has significantly enhanced classification accuracy compared to models lacking attention mechanisms.

  • Optimization of the dimensionality reduction and classification: This study employs semi-supervised discriminant analysis (SDA) for reducing feature dimensions. Bayesian optimization was used to fine-tune the method’s parameters to maximize class separability. Ultimately, a support vector machine (SVM) served as the main classifier, showing the highest accuracy among the compared traditional algorithms.

  • Comprehensive evaluation on two public datasets (STEW and MAT): The proposed model was tested on two validated EEG datasets. It achieved 97.1% accuracy on the STEW dataset and 98.6% on the MAT dataset. These results significantly outperform traditional time-frequency methods and deep learning models lacking attention mechanisms.

The rest of this paper is organized as follows. Section II describes the publicly used datasets in this research. The proposed method is explained in detail in Section III. Section IV contains the results obtained during performance analysis, and Section V concludes this paper.

Dataset

In this paper, two publicly available multivariate EEG datasets collected for cognitive load analysis are utilized. The MAT and STEW datasets are explained below.

STEW dataset

The open-access simultaneous task EEG workload (STEW) dataset is a valuable resource for studying multitasking workload and analyzing brain activity during different cognitive tasks46. Researchers can utilize this dataset to develop and evaluate algorithms and models for classifying and predicting mental workload. The multitasking workload experiment utilized the SIMKAP multitasking test, in which participants identified and marked matching items on two panels while simultaneously answering auditory questions of various types, such as arithmetic, comparison, or data retrieval. The experiment comprised two phases: in the first, participants remained inactive for 2.5 min, representing a “low” mental workload; in the second, they completed the SIMKAP test for 2.5 min and were monitored for brain activity, representing a “high” mental workload46.

This dataset includes multivariate EEG signals from 48 male subjects. The signals were recorded at a sampling rate of 128 Hz with 16-bit A/D resolution using the Emotiv EPOC EEG headset. It features 14 channels: AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4, aligned with the 10–20 international system. Figure 1 presents some recordings from the STEW dataset46.

Fig. 1

Recorded EEG signals from the STEW dataset in (a) low and (b) high mental workload scenarios.

MAT dataset

The National Technical University of Ukraine supplied the Mental Arithmetic Tasks (MAT) dataset, which focuses on arithmetic tasks involving the consecutive subtraction of two numbers. Researchers used this dataset to study brain activity across various cognitive neuroscience functions. This dataset features EEG recordings from 36 students aged 18 to 26. The EEG data were collected using 23 electrodes positioned across the scalp according to the 10–20 system17. Each recording contains artifact-free EEG segments. Resting state segments last 3 min, whereas mental counting segments are 1 min long47,48.

Proposed method

Here, we outline the proposed method for classifying mental workload. As shown in Fig. 2, the proposed method classifies the multivariate EEG signals in four steps: (1) preprocessing, (2) time-frequency analysis, (3) extracting deep features, and (4) joint feature reduction and classification. In the following, each step will be explained in detail.

Fig. 2

The general steps of the proposed method for mental workload classification.

Preprocessing

In the preprocessing step, several filters are employed to reduce artifacts and produce clean signals. For the STEW dataset, a high-pass filter with a cutoff frequency of 1 Hz removed low-frequency drift from the signals. Also, data averaging was implemented to reduce artifacts via subspace reconstruction and to facilitate re-referencing46. For the MAT dataset, high-pass, low-pass, and power-line notch filters with cutoff frequencies of 0.5 Hz, 45 Hz, and 50 Hz, respectively, were applied to minimize artifacts19.
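The filtering described above can be sketched with SciPy (an assumed toolchain; the source does not name one). The filter orders, the zero-phase `filtfilt` application, the notch quality factor, and the MAT sampling rate of 500 Hz are all illustrative assumptions; only the cutoff frequencies come from the text.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

def preprocess_stew(x, fs=128.0):
    """1 Hz high-pass for the STEW recordings (order 4 assumed)."""
    b, a = butter(4, 1.0, btype="highpass", fs=fs)
    return filtfilt(b, a, x, axis=-1)

def preprocess_mat(x, fs=500.0):
    """0.5 Hz high-pass, 45 Hz low-pass, 50 Hz notch for the MAT recordings."""
    b_hp, a_hp = butter(4, 0.5, btype="highpass", fs=fs)
    b_lp, a_lp = butter(4, 45.0, btype="lowpass", fs=fs)
    b_n, a_n = iirnotch(50.0, Q=30.0, fs=fs)       # power-line notch
    y = filtfilt(b_hp, a_hp, x, axis=-1)           # remove slow drift
    y = filtfilt(b_lp, a_lp, y, axis=-1)           # remove high-frequency noise
    return filtfilt(b_n, a_n, y, axis=-1)          # suppress 50 Hz interference
```

Applying `filtfilt` forward and backward avoids phase distortion, which matters for the subsequent time-frequency analysis.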

Beyond traditional filtering techniques, hybrid EEG denoising approaches, such as multiscale PCA (MSPCA)7,23,24,49, which integrate wavelet multiresolution analysis with PCA-based component selection, are widely employed to reduce multiscale artifacts while retaining task-related activity. MSPCA preprocessing has been shown to enhance robustness across various EEG classification systems, especially in motor imagery and clinical EEG contexts. In this study, we adhere to the standard preprocessing protocols specified for the STEW and MAT benchmarks to ensure consistency and facilitate a fair comparison with previous research.

Time-frequency representation

Linear projection-based time-frequency algorithms, such as the STFT and CWT, face challenges when analyzing nonlinear and nonstationary signals. Reassignment techniques address this by sharpening the localization of the time-frequency representation (TFR). The synchrosqueezing transform (SST)50, a post-processing technique applied to the continuous wavelet transform, is one such method: it reallocates the CWT coefficients along the instantaneous-frequency trajectories of the modulated oscillations, concentrating their energy and producing a highly localized TFR for nonlinear and nonstationary signals. This overcomes the resolution limitations of traditional Fourier analysis.

The multivariate SST (MSST) extends the SST to identify common oscillatory patterns across multiple data channels. It computes joint instantaneous frequencies and amplitudes for multivariate data, yielding a concise time-frequency representation of multichannel signals and insight into the relationships among them. Because it highlights oscillatory features shared across channels, the MSST is well suited to multivariate time series with evolving oscillatory content, such as multichannel biomedical recordings (e.g., EEG or ECG), where the channels represent related physiological measurements and their temporal dynamics and interrelations are of interest. This ability makes it especially valuable for EEG analysis11.

The continuous wavelet transform \(\:X\left(a,b\right)\) of a nonlinear and nonstationary signal \(\:x\left(t\right)\) is defined as11:

$$\:X\left(a,b\right)=\int\:{a}^{-0.5}x\left(t\right)\psi\:\left(\frac{t-b}{a}\right)dt$$
(1)

where \(\:\psi\:\left(t\right)\) is the mother wavelet. The scale factor \(\:a\) dilates the mother wavelet, shifting its centre frequency and changing its bandwidth. For the set of wavelet coefficients \(\:X\left(a,b\right)\), the SST with the frequency resolution \(\:{\Delta\:}\omega\:\), \(\:S\left({\omega\:}_{l},b\right)\), is defined as11:

$$\:S\left({\omega\:}_{l},b\right)=\sum\:_{{a}_{k}:\left|{\omega\:}_{x}\left({a}_{k},b\right)-{\omega\:}_{l}\right|\le\:{\Delta\:}\omega\:/2}X\left({a}_{k},b\right){a}_{k}^{-1.5}{\Delta\:}{a}_{k}$$
(2)

where \(\:{a}_{k}\) denotes the discrete wavelet scales and \(\:{\omega\:}_{x}\left({a}_{k},b\right)\) is the instantaneous frequency estimated from the phase of \(\:X\left({a}_{k},b\right)\).
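Eq. (1) can be made concrete with a direct NumPy discretization, assuming a complex Morlet mother wavelet (the text does not fix \(\:\psi\:\); the conjugate of the wavelet is used, as is conventional for complex wavelets). In this sketch, the scale of maximum energy tracks the frequency of a pure test tone.

```python
import numpy as np

def cwt_morlet(x, fs, scales, w0=6.0):
    """Direct discretization of Eq. (1): X(a,b) = a^{-1/2} * sum_t x(t) psi*((t-b)/a) dt."""
    n = len(x)
    t = np.arange(n) / fs
    X = np.zeros((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        tau = (t - t[n // 2]) / a                          # centred, dilated argument
        psi = np.pi ** -0.25 * np.exp(1j * w0 * tau - tau ** 2 / 2.0)
        # cross-correlation realizes the time shift b; a^{-1/2} is the
        # L2 normalization; 1/fs approximates dt in the integral
        X[i] = np.correlate(x, psi, mode="same") * a ** -0.5 / fs
    return X
```

For a Morlet wavelet with centre frequency \(\:w_0\), the scale corresponding to an analysis frequency \(f\) is approximately \(a = w_0/(2\pi f)\), so a tone at 8 Hz produces maximal energy near that scale.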

For a multivariate signal \(\:{x}_{N}\left(t\right)\) with \(\:N\) channels, let \(\:{S}_{n}\left({\omega\:}_{l},b\right)\), \(\:n=1,\dots\:,N\), denote the SST coefficients of each channel, normalised with the constant \(\:{R}_{\psi\:}\). Given a set of oscillatory scales \(\:\left\{{\omega\:}_{k}\right\},\:k=1,\dots\:,K\), obtained using a multivariate extension of a method proposed in11, the instantaneous frequency \(\:{{\Omega\:}}_{k}^{n}\left(b\right)\) for each frequency band \(\:k\) is given by11:

$$\:{{\Omega\:}}_{k}^{n}\left(b\right)=\frac{{\sum\:}_{\omega\:\in\:{\omega\:}_{k}}{\left|{S}_{n}\left(\omega\:,b\right)\right|}^{2}\omega\:}{{\sum\:}_{\omega\:\in\:{\omega\:}_{k}}{\left|{S}_{n}\left(\omega\:,b\right)\right|}^{2}}$$
(3)

Also, the instantaneous amplitude \(\:{A}_{k}^{n}\left(b\right)\) for each frequency band is calculated as11:

$$\:{A}_{k}^{n}\left(b\right)=\sqrt{\sum\:_{\omega\:\in\:{\omega\:}_{k}}{\left|{S}_{n}\left(\omega\:,b\right)\right|}^{2}}$$
(4)

To estimate the multivariate instantaneous frequency for a given frequency band \(\:k\), the instantaneous frequencies across the \(\:N\) channels are combined using the joint instantaneous frequency. As a result, the multivariate instantaneous frequency band \(\:{{\Omega\:}}_{k}^{multi}\left(b\right)\) is given by11:

$$\:{{\Omega\:}}_{k}^{multi}\left(b\right)=\frac{{\sum\:}_{n=1}^{N}{\left({A}_{k}^{n}\left(b\right)\right)}^{2}{{\Omega\:}}_{k}^{n}\left(b\right)}{{\sum\:}_{n=1}^{N}{\left({A}_{k}^{n}\left(b\right)\right)}^{2}}$$
(5)

Also, the instantaneous amplitude \(\:{\text{A}}_{k}^{multi}\left(b\right)\) for each frequency band is obtained as11:

$$\:{\text{A}}_{k}^{multi}\left(b\right)=\sqrt{{\sum\:}_{n=1}^{N}{\left({A}_{k}^{n}\left(b\right)\right)}^{2}}$$
(6)

After determining the joint instantaneous amplitude and frequency for each frequency band, the multivariate TFR, \(\:{\mathbf{T}}_{k}^{multi}\left(\omega\:,b\right)\), for each oscillatory scale \(\:k,\:k=1,\dots\:,K\), is calculated as11:

$$\:{\mathbf{T}}_{k}^{multi}\left(\omega\:,b\right)={\text{A}}_{k}^{multi}\left(b\right)\delta\:\left(\omega\:-{{\Omega\:}}_{k}^{multi}\left(b\right)\right)$$
(7)

where \(\:\delta\:\left(.\right)\) is the Dirac delta function. For a multivariate signal \(\:{x}_{N}\left(t\right)\) with \(\:N\) channels, the MSST can be summarised as follows:

  • The SST is applied channel-wise to obtain the coefficients \(\:{S}_{n}\left({\omega\:}_{l},b\right)\).

  • A set of partitions along the frequency axis of the time-frequency domain is determined, and the instantaneous frequency \(\:{{\Omega\:}}_{k}^{n}\left(b\right)\) and amplitude \(\:{A}_{k}^{n}\left(b\right)\) are calculated for each frequency band \(\:k\).

  • The multivariate instantaneous frequency \(\:{{\Omega\:}}_{k}^{multi}\left(b\right)\) and amplitude \(\:{\text{A}}_{k}^{multi}\left(b\right)\) are calculated.

  • The multivariate synchrosqueezed coefficients, \(\:{\mathbf{T}}_{k}^{multi}\left(\omega\:,b\right)\), are calculated.
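Given per-channel SST magnitudes, the per-band quantities of Eqs. (3)–(6) reduce to a few array operations. The following NumPy sketch is illustrative (the function name and the channels × frequencies × time array convention are assumptions, not the authors' code):

```python
import numpy as np

def joint_if_amplitude(S_mag, freqs, band):
    """Eqs. (3)-(6): per-channel and joint instantaneous frequency/amplitude
    for one frequency band. S_mag = |S_n(w, b)|, shape (N, F, B)."""
    idx = (freqs >= band[0]) & (freqs < band[1])
    P = S_mag[:, idx, :] ** 2                              # per-channel band energy
    w = freqs[idx][None, :, None]
    omega_n = (P * w).sum(axis=1) / P.sum(axis=1)          # Eq. (3), shape (N, B)
    A_n = np.sqrt(P.sum(axis=1))                           # Eq. (4), shape (N, B)
    # Eq. (5): amplitude-squared weighted combination across channels
    omega_multi = (A_n ** 2 * omega_n).sum(axis=0) / (A_n ** 2).sum(axis=0)
    A_multi = np.sqrt((A_n ** 2).sum(axis=0))              # Eq. (6)
    return omega_multi, A_multi
```

For example, with one channel carrying an 8 Hz tone of amplitude 2 and another a 10 Hz tone of amplitude 1, the joint instantaneous frequency is the energy-weighted mean (4·8 + 1·10)/5 = 8.4 Hz.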

Deep feature extraction

Time-frequency attention module

The structure of the utilized time-frequency attention (TFAN) module is shown in Fig. 3. To enable the model to dynamically focus on the most salient regions of the time-frequency representation, we employ a dual-path attention mechanism in which a time-attention branch and a frequency-attention branch concurrently process the input features. The time-attention branch utilizes convolutional kernels elongated along the frequency axis, 3 × 3 and 3 × 5, and average pooling across the time dimension, PoolingT, to generate a vector of weights identifying the importance of each frequency band. Symmetrically, the frequency-attention branch uses kernels elongated along the time axis, 3 × 3 and 5 × 3, and average pooling across the frequency dimension, PoolingF, to determine the significance of each time step. These two attention vectors are then multiplied to form a comprehensive 2D time-frequency attention map, which is applied to the original input via element-wise multiplication, effectively re-weighting the features to amplify key information and suppress noise. Finally, the attended feature map is concatenated with the original input and passed through a final convolution, producing a refined output focused on the most critical spectro-temporal information.

The goal of the time-attention branch is to determine which frequency bands are most important, regardless of the specific time point. The input first passes through two parallel convolutional layers. The 3 × 5 kernel is wider than it is tall; this shape is effective at capturing features across different frequencies at a specific moment in time. The 3 × 3 kernel captures more general local features. Each is followed by batch normalization (BN) and a rectified linear unit (ReLU) activation function (BN-ReLU). The outputs of the two convolutional blocks are concatenated along the channel dimension, and a 1 × 1 convolution then efficiently reduces the number of channels, creating a compact representation of the combined features. The key step is pooling across the time dimension: this collapses the time axis, yielding a single vector in which each element corresponds to a frequency band and represents its overall importance. The vector is passed through another 1 × 1 convolution and then a sigmoid function, which squashes the values between 0 and 1 to produce the final attention weights. A value near 1 indicates the corresponding frequency band is critical, while a value near 0 indicates it is less important. This procedure can be described mathematically as follows:

$$\:{X}_{time}=Concat\left({Conv}_{3\times\:3}\left(X\right),{Conv}_{3\times\:5}\left(X\right)\right)$$
(8)
$$\:{X}_{time,pool}={Pool}_{\left(1,none\right)}\left(BNReLU\left({Conv}_{1\times\:1}\left({X}_{time}\right)\right)\right)$$
(9)
$$\:{A}_{time,pool}=\sigma\:\left({Conv}_{1\times\:1}\left({X}_{time,pool}\right)\right)$$
(10)

Symmetrically, the goal of the frequency-attention branch is to determine which time steps are most important, regardless of the specific frequency. The input passes through two parallel convolutional layers (Conv., 3 × 3 and Conv., 5 × 3). Here, the 5 × 3 kernel is taller than it is wide; this shape is effective at capturing features and patterns across different time steps within a specific frequency band. As in the time branch, the outputs are concatenated and then reduced using a 1 × 1 convolution, after which pooling is applied across the frequency dimension. This collapses the frequency axis, resulting in a single vector where each element corresponds to a time step and represents its overall importance across all frequencies. A sigmoid function again produces the final attention weights (between 0 and 1) for each time step. Equations analogous to (8)–(10) hold for the frequency-attention branch, with \(\:{A}_{frequency,pool}\) as its output.

Fig. 3

The structure of the utilized time-frequency attention (TFAN) module.

To apply the attention, the time-attention vector, indicating which frequencies matter, and the frequency-attention vector, indicating which time steps matter, are multiplied together to create a 2D time-frequency attention map. This map assigns a weight to each point in the original input, indicating its combined importance, and is multiplied element-wise with the original time-frequency representation. This step re-weights the input: important time-frequency points are amplified, while unimportant ones are suppressed. The re-weighted (attended) input is then concatenated with the original input, a residual-style connection that ensures the model does not lose original information while learning the attention. A final 1 × 1 convolution processes the concatenated data to produce the final, refined output: a feature map enhanced to focus on the most relevant parts of the original signal.

$$\:{X}_{TFAN}={Conv}_{1\times\:1}\left(Concat\left(X\odot{A}_{time,pool}\odot{A}_{frequency,pool},X\right)\right)$$
(11)
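A deliberately simplified NumPy sketch of this re-weighting logic follows, omitting the convolutional feature extractors and the final 1 × 1 convolution (so the sigmoid is applied directly to pooled values); only the pooling, sigmoid weighting, attention-map product, and residual-style concatenation are modelled, and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tfan_attention(X):
    """Simplified sketch of the TFAN re-weighting. X: (T, F) time-frequency map."""
    # time-attention branch: pool over time -> one weight per frequency band
    a_freq = sigmoid(X.mean(axis=0))                 # shape (F,)
    # frequency-attention branch: pool over frequency -> one weight per time step
    a_time = sigmoid(X.mean(axis=1))                 # shape (T,)
    attn_map = np.outer(a_time, a_freq)              # 2D time-frequency attention map
    attended = X * attn_map                          # element-wise re-weighting
    # residual-style concatenation with the original input along a channel axis
    return np.stack([attended, X], axis=0)
```

Because every attention weight lies in (0, 1), the attended map never amplifies a point beyond its original magnitude in this simplified form; the learned convolutions in the full module provide the trainable emphasis.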

CNN-TFAN architecture

The architecture of the proposed CNN with time-frequency attention (CNN-TFAN) for deep feature extraction is illustrated in Fig. 4. This architecture consists of three sequential processing blocks that hierarchically refine and extract features. Each block comprises three main components. A 2D convolutional layer identifies local patterns in the input feature maps using 3 × 3 kernels. A time-frequency attention (TFAN) module positioned immediately after the convolutional layer adaptively focuses on salient information in both the time and frequency dimensions by re-weighting different regions of the feature map. A max-pooling layer with a 2 × 2 window reduces the spatial dimensions of the feature maps; this decreases computational complexity and helps the model achieve spatial invariance, making it less sensitive to minor shifts in patterns. The input time-frequency map is passed sequentially through these three blocks. The number of filters in the convolutional layers progressively decreases throughout the network: 32 in the first block, 16 in the second, and 8 in the final block. This design creates an information bottleneck, compelling the network to learn a compact and efficient representation of the data. The output of the final block is a set of deep features constituting a high-level, compressed representation of the key information in the input signal; the feature vector obtained from the flatten layer is subsequently used for the final classification task.
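The resulting feature dimensionality can be traced with simple shape bookkeeping, assuming "same" convolution padding so that only the 2 × 2 max-pooling changes the spatial size (the padding scheme is an assumption; it is not stated in the text, and the function name is illustrative):

```python
def cnn_tfan_feature_size(h, w, filters=(32, 16, 8)):
    """Trace feature-map shapes through the three Conv(3x3) + TFAN + MaxPool(2x2)
    blocks and return the flattened deep-feature length."""
    shapes = []
    for f in filters:
        h, w = h // 2, w // 2            # 2x2 max-pooling halves both axes
        shapes.append((f, h, w))         # (channels, height, width) after the block
    return shapes, filters[-1] * h * w   # flatten-layer feature length
```

For a 64 × 64 input map this yields block shapes (32, 32, 32), (16, 16, 16), and (8, 8, 8), so the flatten layer produces 8 · 8 · 8 = 512 deep features.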

Fig. 4

The architecture of the proposed CNN-TFAN for deep feature extraction.

Optimized feature reduction and classification

Feature reduction

Semi-supervised discriminant analysis (SDA) builds on linear discriminant analysis (LDA), its fully supervised, linear counterpart, which considers only labeled samples. The LDA objective function is as follows51:

$$\:{\varvec{a}}_{opt}=\underset{\varvec{a}}{\text{a}\text{r}\text{g}\text{m}\text{a}\text{x}}\frac{{\varvec{a}}^{T}{\mathbf{S}}_{\varvec{b}}\varvec{a}}{{\varvec{a}}^{T}{\mathbf{S}}_{\varvec{w}}\varvec{a}}$$
(12)

where \(\:{\mathbf{S}}_{\varvec{w}}\) and \(\:{\mathbf{S}}_{\varvec{b}}\) denote the intra- and inter-class scatter matrices, respectively, and are computed as51:

$$\:{\mathbf{S}}_{\varvec{b}}={\sum\:}_{k=1}^{{n}_{c}}{n}^{\left(k\right)}\left({\mu\:}^{\left(k\right)}-\varvec{\mu\:}\right){\left({\mu\:}^{\left(k\right)}-\varvec{\mu\:}\right)}^{T}\:$$
(13)
$$\:{\mathbf{S}}_{\varvec{w}}={\sum\:}_{k=1}^{{n}_{c}}{\sum\:}_{i=1}^{{n}^{\left(k\right)}}\left({x}_{i}^{\left(k\right)}-{\mu\:}^{\left(k\right)}\right){\left({x}_{i}^{\left(k\right)}-{\mu\:}^{\left(k\right)}\right)}^{T}$$
(14)

where \(\:{n}^{\left(k\right)}\) denotes the number of training samples of class \(\:{\mathcal{C}}_{k}\), and \(\:\varvec{\mu\:}\) and \(\:{\mu\:}^{\left(k\right)}\) are the overall sample mean vector and the mean vector of class \(\:{\mathcal{C}}_{k}\), respectively. Also, \(\:{x}_{i}^{\left(k\right)}\) is sample \(\:i\) of class \(\:{\mathcal{C}}_{k}\). The total scatter matrix is defined as \(\:{\mathbf{S}}_{\varvec{t}}={\sum\:}_{i=1}^{N}\left({x}_{i}-\varvec{\mu\:}\right){\left({x}_{i}-\varvec{\mu\:}\right)}^{T}\), hence \(\:{\mathbf{S}}_{\varvec{t}}={\mathbf{S}}_{\varvec{b}}+{\mathbf{S}}_{\varvec{w}}\) and the objective function can equivalently be written as51:

$$\:{\varvec{a}}_{opt}=\underset{\varvec{a}}{\text{a}\text{r}\text{g}\text{m}\text{a}\text{x}}\frac{{\varvec{a}}^{T}{\mathbf{S}}_{\varvec{b}}\varvec{a}}{{\varvec{a}}^{T}{\mathbf{S}}_{\varvec{t}}\varvec{a}}$$
(15)
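The scatter matrices of Eqs. (13)–(14) and the decomposition \(\:{\mathbf{S}}_{\varvec{t}}={\mathbf{S}}_{\varvec{b}}+{\mathbf{S}}_{\varvec{w}}\) can be checked numerically. A minimal NumPy sketch, with samples stored as rows:

```python
import numpy as np

def scatter_matrices(X, y):
    """Compute S_b, S_w, S_t per Eqs. (13)-(14); X holds samples as rows."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        diff = (mu_k - mu)[:, None]
        S_b += len(Xk) * diff @ diff.T   # Eq. (13): n^(k) (mu^k - mu)(mu^k - mu)^T
        C = Xk - mu_k
        S_w += C.T @ C                   # Eq. (14): within-class scatter
    C = X - mu
    S_t = C.T @ C                        # total scatter
    return S_b, S_w, S_t
```

For any labeled dataset, `np.allclose(S_t, S_b + S_w)` holds, which is the identity used to pass from Eq. (12) to Eq. (15).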

When training samples are insufficient, overfitting can occur, and regularizers are commonly employed to prevent it. The optimization problem in such cases is defined as follows51:

$$\:\text{m}\text{a}\text{x}\frac{{\varvec{a}}^{T}{\mathbf{S}}_{\varvec{b}}\varvec{a}}{{\varvec{a}}^{T}{\mathbf{S}}_{\varvec{t}}\varvec{a}+\varvec{\alpha\:}J\left(\varvec{a}\right)}$$
(16)

The regularization coefficient \(\:\varvec{\alpha\:}\) controls the balance between the model’s complexity and the empirical loss, and \(\:J\left(\varvec{a}\right)\) denotes the learning complexity of the hypothesis family. Considering a natural regularizer, we have51:

$$\:J\left(\varvec{a}\right)={\sum\:}_{ij}{\left({\varvec{a}}^{T}{x}_{i}-{\varvec{a}}^{T}{x}_{j}\right)}^{2}{\mathbf{S}}_{ij}=2{\sum\:}_{i}{\varvec{a}}^{T}{x}_{i}{D}_{ii}{x}_{i}^{T}\varvec{a}-2{\sum\:}_{ij}{\varvec{a}}^{T}{x}_{i}{\mathbf{S}}_{ij}{x}_{j}^{T}\varvec{a}=2{\varvec{a}}^{T}\mathbf{X}\left(\mathbf{D}-\mathbf{S}\right){\mathbf{X}}^{T}\varvec{a}=2{\varvec{a}}^{T}\mathbf{X}\mathbf{L}{\mathbf{X}}^{T}\varvec{a}$$
(17)

Denoting by \(\:{N}_{p}\left({x}_{i}\right)\) the set of \(\:p\) nearest neighbors of \(\:{x}_{i}\), the weight matrix \(\:\mathbf{S}\) is defined as51:

$$\:{S}_{ij}=\left\{\begin{array}{cc}1,&\:\text{if}\:{x}_{i}\in\:{N}_{p}\left({x}_{j}\right)\:\text{or}\:{x}_{j}\in\:{N}_{p}\left({x}_{i}\right)\\\:0,&\:\text{otherwise}\end{array}\right.$$
(18)

The diagonal matrix D is defined as \(\:{D}_{ii}={\sum\:}_{j}{S}_{ij}\). Also, the Laplacian matrix is defined as \(\:\mathbf{L}=\mathbf{D}-\mathbf{S}\). Hence, the objective function of SDA can be formulated as51:

$$\:\underset{\varvec{a}}{\text{m}\text{a}\text{x}}\frac{{\varvec{a}}^{T}{\mathbf{S}}_{\varvec{b}}\varvec{a}}{{\varvec{a}}^{T}\left({\mathbf{S}}_{\varvec{t}}+\varvec{\alpha\:}\mathbf{X}\mathbf{L}{\mathbf{X}}^{T}\right)\varvec{a}}$$
(19)
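The graph-regularizer identity of Eq. (17), \(\:J\left(\varvec{a}\right)=2{\varvec{a}}^{T}\mathbf{X}\mathbf{L}{\mathbf{X}}^{T}\varvec{a}\), holds for any symmetric weight matrix and can be verified directly. A small NumPy check, with samples stored as columns of X to match the equations:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 10))          # 4 features, 10 samples (as columns)
a = rng.normal(size=4)                # an arbitrary projection vector

# symmetric 0/1 neighbor weights S (cf. Eq. 18); random here for the check
S = rng.integers(0, 2, size=(10, 10))
S = np.triu(S, 1)
S = S + S.T

D = np.diag(S.sum(axis=1))            # D_ii = sum_j S_ij
L = D - S                             # graph Laplacian L = D - S

proj = a @ X                          # a^T x_i for every sample i
lhs = sum(S[i, j] * (proj[i] - proj[j]) ** 2
          for i in range(10) for j in range(10))
rhs = 2 * a @ X @ L @ X.T @ a         # 2 a^T X L X^T a
assert np.isclose(lhs, rhs)
```

This confirms the middle step of Eq. (17): the pairwise smoothness penalty collapses to a quadratic form in the Laplacian.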

The objective function is maximized by the projective vector \(\:\varvec{a}\) given by the eigenvector associated with the largest eigenvalue of the generalized eigenvalue problem51:

$$\:{\mathbf{S}}_{\varvec{b}}\varvec{a}=\lambda\:\left({\mathbf{S}}_{\varvec{t}}+\varvec{\alpha\:}\mathbf{X}\mathbf{L}{\mathbf{X}}^{T}\right)\varvec{a}$$
(20)

Considering \(\:\mathbf{A}=\left[{\varvec{a}}_{1},{\varvec{a}}_{2},\dots\:,{\varvec{a}}_{\varvec{n}\varvec{z}}\right]\), where \(\:\varvec{n}\varvec{z}\) is the number of non-zero eigenvalues, the samples are embedded as51:

$$\:\varvec{x}\to\:\varvec{z}={\mathbf{A}}^{\varvec{T}}\varvec{x}$$
(21)
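A minimal sketch of this embedding step: solve the generalized eigenvalue problem of Eq. (20) and project the samples as in Eq. (21). Here `M` stands for \(\:{\mathbf{S}}_{\varvec{t}}+\varvec{\alpha\:}\mathbf{X}\mathbf{L}{\mathbf{X}}^{T}\), and the small ridge term is an implementation detail added here for numerical stability, not part of the formulation:

```python
import numpy as np

def sda_embed(X, S_b, M, n_components):
    """Solve S_b a = lambda M a (Eq. 20) and embed z = A^T x (Eq. 21).

    X holds samples as columns; M = S_t + alpha * X L X^T.
    """
    # reduce the generalized problem to a standard one via M^{-1} S_b;
    # a tiny ridge keeps M invertible in degenerate cases (assumption)
    M_reg = M + 1e-8 * np.eye(M.shape[0])
    w, V = np.linalg.eig(np.linalg.solve(M_reg, S_b))
    order = np.argsort(-w.real)
    A = V[:, order[:n_components]].real   # eigenvectors of the largest eigenvalues
    return A.T @ X                        # Eq. (21): z = A^T x
```

With \(\:\varvec{\alpha\:}=0\) (i.e., `M` equal to the total scatter matrix) this reduces to the regularized LDA projection of Eq. (15).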

The performance of SDA depends on the regularization parameter. This paper uses Bayesian optimization to identify the value of this parameter that yields the highest classification accuracy.

Classification

This paper assessed and reported the accuracy of various classifiers. The most commonly used classifiers, including SVM52, kNN53, decision tree54, and random forest55, are evaluated individually. Their performance varies with specific hyperparameters, so an optimization process was used to determine the optimal values. Table 1 outlines these hyperparameters and the optimizer used for their fine-tuning. It should be noted that for the decision tree, the split criterion is Gini’s index, and height-balancing is achieved when the heights of the subtrees differ by no more than one.

Table 1 The hyperparameters of classifiers.

Bayesian optimization

Bayesian optimization is a powerful approach for optimizing functions that are expensive to evaluate, particularly in machine learning and hyperparameter tuning. Unlike conventional techniques relying on exhaustive search or gradients, it builds a probabilistic model to guide the search process more efficiently. This method is beneficial when the objective function is nonconvex, costly to compute, or lacks analytical gradients. The Bayesian optimization process begins with evaluating an initial set of randomly chosen points on the objective function. A Gaussian Process is then fitted to model this function. An acquisition function guides the selection of the following sampling point, after which the objective function is evaluated. The model is updated with the new data, and the cycle repeats until the process converges or the evaluation budget is exhausted. This iterative approach allows Bayesian optimization to efficiently identify the best solution with fewer evaluations than brute-force methods. Its main advantage is its sample efficiency, making it suitable for expensive functions. It can also handle noisy and black-box functions, even when gradient information is unavailable or unreliable. By balancing exploration and exploitation through modeling uncertainty, it effectively optimizes complex search spaces56.
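The loop described above can be illustrated with a deliberately simplified sketch: a Gaussian-process surrogate with an RBF kernel and a lower-confidence-bound acquisition, minimizing a toy 1-D objective over a grid. The kernel, acquisition function, and hyperparameters here are illustrative choices, not those used in this work:

```python
import numpy as np

def bayes_opt(f, grid, n_init=3, n_iter=15, ls=0.5, kappa=2.0, seed=0):
    """Minimal Bayesian optimization sketch: GP surrogate + lower confidence
    bound (LCB) acquisition, minimizing f over a 1-D grid of candidates."""
    rng = np.random.default_rng(seed)
    # RBF kernel with unit prior variance (illustrative choice)
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)
    xs = list(rng.choice(grid, size=n_init, replace=False))  # initial design
    ys = [f(x) for x in xs]
    for _ in range(n_iter):
        X, y = np.array(xs), np.array(ys)
        K = k(X, X) + 1e-8 * np.eye(len(X))      # jitter for stability
        Kinv_y = np.linalg.solve(K, y)
        Ks = k(grid, X)
        mu = Ks @ Kinv_y                          # GP posterior mean
        var = 1.0 - np.einsum('ij,ij->i', Ks, np.linalg.solve(K, Ks.T).T)
        lcb = mu - kappa * np.sqrt(np.clip(var, 0.0, None))
        x_next = grid[np.argmin(lcb)]             # acquisition: minimize LCB
        if x_next in xs:                          # avoid resampling a point
            x_next = grid[rng.integers(len(grid))]
        xs.append(x_next)
        ys.append(f(x_next))
    best = int(np.argmin(ys))
    return xs[best], ys[best]
```

Running this on a simple quadratic objective over [0, 5] locates the minimum near x = 2 within a handful of evaluations, illustrating the sample efficiency discussed above.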

Results

Here, we present results demonstrating the effectiveness of the proposed schemes for classifying mental workload from EEG signals.

Simulation setup and performance metrics

The presented method is implemented and tested on a system with an Intel Core i7 CPU and 32 GB of RAM. For both datasets, the recordings of 80% of the subjects are used for training, and the remaining 20% are used to test the trained model. The parameters used in the tuning process for the CNN-TFAN are given in Table 2.

Table 2 Parameters used for tuning the CNNs.

The sensitivity, precision, specificity, and accuracy are used to evaluate the performance of the proposed method. These metrics are defined as follows19.

$$\:Sens.=\frac{TP}{TP+FN}$$
(22)
$$\:Prec.=\frac{TP}{TP+FP}$$
(23)
$$\:Spec.=\frac{TN}{TN+FP}$$
(24)
$$\:Acc.=\frac{TN+TP}{TN+TP+FN+FP}$$
(25)

where TP, TN, FP, and FN respectively denote the true positive, true negative, false positive, and false negative. The ‘no task’ or ‘rest’ class is the negative class, and the positive class denotes the ‘task’ class.
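Eqs. (22)–(25) map directly onto a small helper. The counts in the usage example below are hypothetical, scaled from the STEW rates reported for the proposed model (98.4% sensitivity, 95.8% specificity), and reproduce its 97.1% accuracy:

```python
def workload_metrics(tp, tn, fp, fn):
    """Sensitivity, precision, specificity, accuracy per Eqs. (22)-(25)."""
    return {
        "sensitivity": tp / (tp + fn),              # Eq. (22)
        "precision":   tp / (tp + fp),              # Eq. (23)
        "specificity": tn / (tn + fp),              # Eq. (24)
        "accuracy":    (tn + tp) / (tn + tp + fn + fp),  # Eq. (25)
    }

# hypothetical counts for a balanced 2000-sample test set
m = workload_metrics(tp=984, tn=958, fp=42, fn=16)
```

Here `m["sensitivity"]` is 0.984, `m["specificity"]` is 0.958, `m["precision"]` is about 0.959, and `m["accuracy"]` is 0.971.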

Performance analysis

Table 3 compares the performance of four classifiers, including kNN, SVM, decision tree, and random forest, on the STEW and MAT datasets. The results show that SVM outperformed the other models significantly on both datasets, achieving 97.1% accuracy on STEW and 98.6% on MAT. The overall ranking of the classifiers was consistent across both datasets, with random forest and kNN following SVM. The decision tree consistently performed the weakest, with accuracies of 94.1% and 95.5% for STEW and MAT, respectively. These results identify SVM as the most effective and suitable classifier for the proposed framework in mental workload classification.

Table 3 Classification accuracy of different classifiers. The terms “wo” and “woo” respectively denote “with optimization” and “without optimization”.

SVM’s exceptional performance mainly stems from two key attributes. First, it seeks a hyperplane that not only separates the classes but also maximizes the margin, the distance between the hyperplane and the nearest data points of each class (the support vectors). This margin maximization enhances generalization and reduces overfitting. Second, when EEG signals are processed through the SST transform and a deep network, their features often exhibit complex nonlinear relationships. The kernel trick allows SVM to map these features into a higher-dimensional space where linear separation is feasible, making it particularly effective for modeling intricate decision boundaries.

The second-ranked model, random forest, is an ensemble made up of many decision trees. By aggregating the outputs of these trees, each trained on a different subset of data and features, it effectively addresses the overfitting issue common to single decision trees. Nevertheless, its decision boundaries are a blend of axis-aligned boundaries from its individual trees, which may be less effective for certain data types than the hyperplane optimized by an SVM. The lower performance of the kNN and decision tree classifiers is understandable. Decision trees tend to overfit the training data and rely on simple, axis-aligned boundaries that cannot effectively separate classes with complex relationships. Similarly, kNN struggles in high-dimensional spaces due to the curse of dimensionality, where distances between samples become less meaningful; this reduces its ability to distinguish between samples, leading to decreased performance.

The classification accuracy comparison clearly demonstrates that the optimization step significantly improves all classifiers across both datasets. Specifically, optimization boosts accuracy by 3.1% to 4.1%, underscoring its crucial role. These results strongly suggest that the optimization process is a vital part of the proposed framework, essential for maximizing the models’ discriminant capacity and achieving optimal outcomes.

A comprehensive evaluation of the proposed classifier on the STEW dataset was conducted using the confusion matrix shown in Table 4. The analysis of the matrix indicates a well-balanced ability to distinguish between the two classes, correctly identifying 98.4% of true positives and 95.8% of true negatives. Error rates were minimal, with a false-positive rate of 4.2% and a false-negative rate of 1.6%. Key metrics support these results; a high sensitivity of 98.4% highlights excellent positive detection, while a specificity of 95.8% confirms accurate negative classification. The precision of 95.9% further validates the robustness of positive predictions. Overall, the high accuracy, balanced sensitivity and specificity, and low error rates demonstrate that the proposed model is a reliable and effective classifier for this task.

Table 4 Confusion matrix for the STEW dataset.

The model’s performance on the MAT dataset, as shown in Table 5, not only confirms the high performance observed on the STEW dataset but also shows significant improvements across all metrics. Specifically, the false positive rate has decreased from 4.2% to 1.9%, and the false negative rate has decreased from 1.6% to 0.9%. This reduction in errors has directly led to improvements in key metrics, including specificity from 95.8% to 98.1% and precision from 95.9% to 98.1%, while the model’s very high sensitivity has also increased to 99.1%. Overall, these results show that the model achieved more accurate and balanced classification on the MAT dataset and further demonstrated its reliability and robustness by significantly reducing errors of both types.

Table 5 Confusion matrix for the MAT dataset.

The accuracy of frequency bands

To assess how various EEG frequency bands contribute to mental workload classification, the model’s accuracy was calculated separately for the delta, theta, alpha, beta, and gamma bands, as well as for the entire signal across the STEW and MAT datasets. As shown in Fig. 5, the highest classification accuracy in both datasets was achieved using the full signal. This indicates that features from different frequency bands provide complementary and valuable information, and combining them is crucial for optimal performance.

The alpha band emerged as the most effective and informative, achieving the highest accuracy among the individual bands. Conversely, the gamma band offered the least relevant information for classification, resulting in the lowest accuracy on both datasets. The performance hierarchy, from strongest to weakest, is consistent across both datasets: alpha, beta, delta, theta, and gamma. Additionally, the analysis shows that the model performs better across all conditions on the MAT dataset than on the STEW dataset, demonstrating the robustness of the proposed model.

Fig. 5
figure 5

The contributions of different frequency bands of the EEG signal to the MWL classification.

The effect of deep network architecture

To identify the most effective architecture for the CNN-TFAN network and assess the impact of the TFAN module, the model’s accuracy was evaluated across different numbers of convolutional blocks. The findings, shown in Table 6, indicate that increasing the network depth by adding more blocks enhances classification accuracy. Specifically, in both the STEW and MAT datasets, the three-block architecture consistently outperforms the one- and two-block models. The beneficial influence of the TFAN module is also evident, as its inclusion in all structures significantly boosts accuracy. For example, in the three-block setup, applying the TFAN module increased accuracy from 94.5% to 97.1% on the STEW dataset and from 95.2% to 98.6% on the MAT dataset. These results demonstrate the TFAN module’s effectiveness in discriminant feature extraction and confirm that a deeper architecture with three convolutional blocks provides superior performance for this task.

Table 6 The effect of the number of convolutional blocks on the accuracy of mental workload classification.

The effect of time-frequency analysis

To determine the best feature extraction technique, we evaluated the performance of ten different time-frequency analysis methods, with full results shown in Table 7. This comparison reveals a distinct performance ranking. Traditional methods such as the short-time Fourier transform (STFT), spectrograms, and the continuous wavelet transform (CWT) achieved classification accuracies between 88% and 92%, whereas SST-based methods performed notably better, surpassing 93% accuracy. Ultimately, MSST was identified as the top performer, attaining the highest accuracy on both datasets.

The apparent advantage of SST-based methods comes from their direct approach to addressing the limitations of traditional techniques. Classic methods like STFT and CWT face an inherent trade-off between time and frequency resolution due to the uncertainty principle, leading to energy smearing and a blurry signal representation. The synchrosqueezing technique, a robust post-processing method, sharpens this blurry representation by reallocating energy in the time-frequency plane to reflect accurate instantaneous frequencies. Univariate versions, such as WSST, process each EEG channel individually. In contrast, MSST’s strength lies in its multivariate approach: it reassigns energy jointly across all channels, allowing it to model and extract the brain network’s interdependencies and spatiotemporal dynamics. This information, often overlooked by univariate methods, is vital for accurate classification of complex brain activity.

Table 7 Performance comparison between different time-frequency methods.

The effect of feature reduction

To better understand how feature reduction impacts classification results, we tested several dimensionality reduction techniques on deep features from the CNN-TFAN network. Specifically, we examined PCA, LDA, KPCA (RBF), UMAP, and t-SNE, and compared them with SDA. Using the same SVM classifier and evaluation protocol, the accuracy on the STEW and MAT datasets is shown in Table 8. LDA outperforms PCA, improving accuracy from 95.4% to 96.1% on STEW and from 97.1% to 97.6% on MAT, because PCA only considers variance and ignores class labels, whereas LDA explicitly separates classes after projection. However, LDA does not match SDA, which additionally exploits the data’s geometric structure beyond the labels. Nonlinear methods such as KPCA with an RBF kernel score 95.9% on STEW and 97.3% on MAT, showing some benefit but sensitivity to the kernel setup. Among the remaining baselines, UMAP performs best, with 96.2% on STEW and 97.7% on MAT, likely because it preserves local neighborhood relationships. t-SNE, which is designed for visualization, performs worse for classification, with 95.6% on STEW and 96.5% on MAT. Overall, SDA consistently produces the most effective reduced feature space, achieving 97.1% on STEW and 98.6% on MAT, because it preserves class separation by maintaining discriminative structure while leveraging the data’s geometry through graph-based regularization.

Table 8 Performance comparison between different feature reduction methods.

Comparison with other works

To validate the proposed model’s effectiveness, a thorough performance comparison was performed against several recent state-of-the-art MWL classification methods. Details of this comparison are shown in Table 9. The proposed approach, integrating MSST, CNN, TFAN, SDA, and SVM, outperforms other methods on both the MAT and STEW datasets.

Sharma et al.57 reported an accuracy of 94% using an SWT and Optimized-KNN approach. This was improved by Yedukondalu and Sharma58, who achieved 95.28% with the Ci-SSA and kNN methods. Baygin et al.59 reached 96.42% using a pooling function with SVM, while Yedukondalu and Sharma60 attained 96.88% by combining Ci-SSA and BHHO with kNN. More recently, Jain et al.19 achieved 97.22% accuracy with VMD and a LightGBM classifier. In contrast, the model proposed here surpasses all of these methods, achieving the highest accuracy of 98.6% on the MAT dataset.

Similarly, on the STEW dataset, the proposed model outperforms existing techniques. Previous studies by Zhu et al.61 and Safari et al.1 reported accuracies of 89.6% and 89.53%, respectively, using graph features and effective connectivity with SVM classifiers. More recent methods have achieved higher accuracy, such as the VMD-based method with the LightGBM model19, which reached 95.51%, and the method proposed in60, which achieved 96.88%. The proposed model exceeds these results, attaining an accuracy of 97.1%. These findings consistently show that the proposed framework, i.e., MSST + CNN + TFAN + SDA + SVM, provides a more robust and accurate solution for MWL classification.

Table 9 Performance comparison with other works in MWL classification.

Statistical analysis

To assess the strength and reliability of our proposed framework, we conducted a statistical analysis of the classification results on both the STEW and MAT datasets. Since EEG-based deep learning models can be sensitive to how data is split and initialized, we repeated the entire training and evaluation process multiple times with different random seeds, keeping the same subject-wise data split. For each run, we recorded the classification accuracy, which was then used for further statistical analysis.

The proposed method, which combines MSST, CNN-TFAN, SDA, and SVM, achieved an average classification accuracy of 97.1% ± 0.5% on the STEW dataset and 98.6% ± 0.3% on the MAT dataset. The small standard deviations indicate that the model’s performance is consistent across runs and is not heavily influenced by initialization or data partitioning. To determine whether these improvements are statistically meaningful, the proposed approach was compared to two strong alternatives: (i) the best existing time–frequency method, based on WSST, and (ii) the same CNN architecture without the TFAN module. Since the data did not fully satisfy the normality assumption, we used the Wilcoxon signed-rank test, a nonparametric method, for these comparisons. The results showed that the proposed method significantly outperformed both baselines on both datasets, with p-values < 0.01 in all cases, indicating that the improvements are unlikely to be due to chance. Effect size analysis using the rank-biserial correlation showed large effects (above 0.6), confirming that these gains are both statistically and practically important. To quantify the uncertainty, we calculated 95% confidence intervals for the average accuracy: [96.0%, 98.2%] for STEW and [97.8%, 99.3%] for MAT. These intervals further support the consistency of the framework across repeated runs. Overall, this analysis demonstrates that the improvements brought by the MSST–CNN-TFAN approach are reliable, statistically significant, and reproducible, strengthening the credibility of the results.
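The rank-biserial effect size used above can be obtained from the signed-rank sums of the paired accuracy differences. A minimal sketch (ties and zero differences are deliberately not handled):

```python
import numpy as np

def signed_rank_effect(acc_a, acc_b):
    """Rank-biserial effect size from paired differences:
    r = (W+ - W-) / (n(n+1)/2), where W+/W- are the rank sums of the
    positive/negative differences in the Wilcoxon signed-rank test.
    Minimal sketch: assumes no ties and no zero differences.
    """
    d = np.asarray(acc_a, float) - np.asarray(acc_b, float)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1   # ranks of |d|, 1..n
    w_pos = ranks[d > 0].sum()                      # W+
    w_neg = ranks[d < 0].sum()                      # W-
    n = len(d)
    return (w_pos - w_neg) / (n * (n + 1) / 2)
```

A value of 1.0 means the proposed method won every paired run; values above 0.6, as observed here, indicate a large, consistent advantage.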

Conclusion

This paper introduced a new framework for MWL classification using EEG signals. It employs MSST to generate precise time-frequency representations that incorporate spatial dependencies across EEG channels. Deep features are then extracted using a novel CNN architecture with an integrated time-frequency attention (TFAN) module, which enables the network to focus on key regions of the representation and extract more relevant information. Using SDA for semi-supervised feature reduction and SVM for classification, the model achieved accuracies of 97.1% on the STEW dataset and 98.6% on the MAT dataset. The results demonstrate that this comprehensive framework surpasses previous methods, with each component, particularly MSST and TFAN, effectively enhancing the model’s accuracy and robustness.

The next step is to evaluate the MSST–CNN-TFAN framework under more realistic, subject-independent protocols, such as leave-one-subject-out or cross-session validation, and, where possible, on larger datasets. This will provide a clearer picture of how well the system generalizes. To make the model more robust to differences between subjects and sessions, future work could incorporate domain adaptation and transfer learning techniques, such as fine-tuning pre-trained CNNs on EEG time–frequency data or using feature alignment methods to reduce distribution gaps. Since signal quality affects EEG modeling, exploring advanced preprocessing methods, such as MSPCA/wavelet–PCA denoising, ICA artifact removal, and automated artifact detection, could improve reliability, especially in noisier real-world settings. On the modeling front, the current two-step approach (SDA + SVM) might evolve into a more integrated learning framework that incorporates supervised contrastive learning, metric learning, or transformer-based classifiers, while preserving the interpretability of time–frequency attention. Improving interpretability further, by visualizing attention maps over time and frequency and analyzing the importance of specific channels and frequencies, can help connect model decisions to neurophysiological patterns of workload. Finally, for practical deployment, future research should aim to enhance computational efficiency through lightweight attention modules, pruning, quantization, and streaming inference, making real-time mental workload monitoring in human–machine interaction more feasible.