Introduction

Sex-related biases in medical and mental health research and practice demand urgent attention. This imperative is underscored by the US Food and Drug Administration’s suspension of ten prescription drugs, eight of which presented disproportionately higher risks for women. A pervasive bias favouring male subjects across various research stages contributes substantially to this problem1. This pattern highlights the critical need for sex-specific considerations in drug development, from preclinical studies to clinical trials and therapeutic applications. The implications of this bias extend beyond pharmacology, affecting diverse areas of medical research. For instance, biological sex has been identified as a crucial biological variable in dementia research2. Similarly, in the field of neurodevelopmental disorders, the diagnosis of Autism Spectrum Disorder is reported to be up to four times more frequent in males than in females, raising questions about potential sex-based differences in presentation or diagnostic criteria3. These examples highlight the importance of recognizing biological sex as a fundamental biological variable in primary and preclinical research. Therefore, identifying sex-related biases is essential for ensuring the accuracy, reproducibility, and clinical relevance of research outcomes, ultimately leading to more effective and equitable diagnostic and treatment strategies.

Understanding the complex interplay between brain function and biological sex is fundamental to advancing our knowledge of mental health4. EEG signals capture the brain’s electrical activity and provide a unique window into sex-related neural patterns. When combined with large-scale datasets, machine learning techniques offer a powerful approach to deciphering these intricate neurological phenomena. This research holds significant promise for personalized medicine and the development of tailored mental health interventions. By leveraging EEG analysis and machine learning algorithms on extensive datasets, we may enhance our ability to detect, diagnose, and treat various neurological and psychiatric disorders at earlier stages5.

The role of patient biological sex in neurological disorders remains an understudied area, with sex-based differences often relegated to the status of confounding variables rather than potential drivers of pathophysiology6,7. However, recent investigations have begun to challenge this paradigm. For instance, one study8 reported that event-related potential markers in Autism Spectrum Disorder may not be confounded by biological sex differences, suggesting a more nuanced relationship. Conversely, another study9 demonstrated sex-specific variations in functional brain connectivity patterns among individuals with depression. These contrasting findings underscore the complexity of sex-based influences on neurological conditions. Our study aims to further elucidate this relationship by investigating whether patient biological sex significantly impacts the detection and characterization of neurological pathologies.

While previous studies in the field primarily focused on differences in brain size and static features10,11,12, a comprehensive review argued that most structural differences are minor13 when normalized to whole brain volume14. To address this gap, we propose leveraging EEG, which provides insights into brain dynamics and activity patterns. As EEG has a better temporal resolution than functional magnetic resonance imaging (fMRI), it can complement fMRI findings on biological sex differences15. However, a significant challenge in utilizing EEG data is its intrinsic noise. Moreover, despite the promising potential of brain imaging and machine learning in mental health research to classify sex-specific markers, a significant hurdle arises from the small and limited datasets often used in these studies. For example, one study16 evaluated deep learning classifiers on a small number of participants with Major Depressive Disorder (MDD), and others17,18 applied them to a mid-size dataset of healthy participants (see Table 1 for a comparison). Relying on insufficient sample sizes can lead to incomplete and biased conclusions13, hampering the generalizability and reliability of findings. This issue is particularly critical in understanding the intricate connections between brain function and mental health, where individual variations and complexities necessitate comprehensive datasets.

Additionally, machine learning algorithms applied to EEG data face significant challenges that can impede their generalization performance. Two primary issues are distribution shifts and artifacts in EEG signals. Distribution shift occurs when the test data’s statistical properties diverge from the training data’s, potentially compromising the model’s accuracy on unseen samples19. This phenomenon is particularly relevant in EEG analysis, where inter-subject and inter-session variability can lead to substantial differences between training and deployment conditions20. Concurrently, EEG signals are susceptible to various artifacts and noise sources that can distort the underlying neural activity. These unwanted variations may arise from diverse factors, including suboptimal electrode placement, cardiac electrical activity, muscle contractions, ocular movements, and environmental electromagnetic interference21. Such artifacts can significantly influence model performance by introducing spurious patterns or obscuring relevant neurophysiological features17. These artifacts and noise sources pose a considerable challenge for machine learning models in extracting meaningful features and patterns from EEG data. Consequently, these factors can substantially impact the quality and reliability of the derived insights, necessitating robust preprocessing techniques and adaptive learning algorithms to mitigate their effects22. Addressing these challenges is crucial for advancing the field of EEG-based machine learning, as it would enhance the generalizability and reliability of models across diverse experimental settings and subject populations.

To overcome the limitations inherent in structural brain imaging and small sample sizes, we propose leveraging large-scale datasets to enhance the signal-to-noise ratio, thereby improving the reliability and accuracy of findings. By integrating EEG data into our analysis, we aim to elucidate the brain’s dynamic processes and associations with biological sex differences and behavioural outcomes. In this study, we employed machine learning algorithms on varying-sized datasets, encompassing both Non-Pathological and Pathological populations. Additionally, we investigated classifier performance under distribution shifts on unseen data to study model generalization and robustness. Ultimately, we explored features that were important for the models of biological sex and pathology detection in different subgroups, contributing to more robust and applicable insights for targeted and personalized mental health interventions.

Table 1 A comparison of previous studies on EEG biological sex detection. The table shows the name of the study, the dataset used, the number of participants and recordings in the dataset and in (train, test) splits, participants’ conditions, and the data availability.

Recent advances in neuro-AI have begun to shift toward the development of general-purpose foundation models trained on large-scale EEG datasets23,24. These models aim to capture transferable neural representations that can support a range of downstream tasks, from cognitive state decoding to clinical prediction. However, as these models grow in size and complexity, they inherit the limitations and biases of the data they are trained on. In the context of medical AI, this raises urgent concerns about the representativeness and fairness of training datasets, particularly when key demographic attributes (such as biological sex) are imbalanced or under-annotated25. EEG datasets often suffer from such imbalances, yet their impact remains underexplored at scale26. Our study addresses this gap by systematically analyzing sex detectability and its influence on downstream EEG-based classification tasks across three diverse datasets. By doing so, we contribute to the foundational understanding necessary for developing robust, fair, and clinically meaningful neuro-AI models in the age of large-scale, multimodal brain modeling.

In more detail, we investigate sex-specific patterns in EEG signals using Artificial Neural Network (ANN) approaches, specifically the widely used Convolutional Neural Networks (CNNs), to understand their implications for neurological diagnostics. We evaluate biological Sex Detectability (SD) across three large-scale datasets: the Temple University Hospital EEG Corpus (TUEG), the Temple University Hospital Abnormal EEG Corpus (TUAB), and the NUST-MH-TUKL EEG (NMT) dataset, encompassing both healthy participants and those with various pathological conditions. Our methodology evaluates model performance using Balanced Accuracy (BAcc) to account for dataset imbalances. To interpret the neural network’s decision-making process, we utilize Amplitude Gradients Analysis (AGA) for feature visualization across different frequency bands. The following sections present our experimental results, demonstrating biological sex detectability in EEG signals, followed by investigations into the impact of biological sex imbalances on pathology detection, feature importance analysis, and comprehensive discussions of our findings. Detailed methodological approaches are provided in the subsequent Materials and Methods section.

Results

One of the objectives of this study is to investigate whether the sex of the subjects is detectable from their scalp EEG recordings, a form of functional brain imaging. This question is relevant for understanding the sex-specific differences in brain activity and their implications for the diagnosis and treatment of various neurological and psychiatric disorders. It is also essential for evaluating the potential biases and limitations of machine learning classifiers trained on EEG data.

This section presents our comprehensive analysis of sex-specific patterns in EEG signals across multiple datasets and experimental conditions. We begin by examining the fundamental question of biological sex detectability in EEG signals across different populations, followed by an evaluation of model generalization capabilities on unseen data through zero-shot performance assessments. Subsequently, we investigate the potential impact of biological sex imbalances on pathology detection accuracy, and conclude with a detailed feature importance analysis to understand the neural mechanisms underlying our findings.

Biological sex detectability in EEG

To explore the detectability of biological sex in EEG signals, we conducted experiments using diverse datasets, including the TUEG dataset, currently the most extensive open-source corpus of EEG data. Our analysis encompassed both Normal and Abnormal populations. For classification, we employed a simple yet effective shallow CNN, known for achieving competitive accuracy in predicting pathology from EEG data, as demonstrated in previous studies26,27.

Fig. 1

Biological sex classification performance across three populations using EEG signals: (A) Normal, (B) Abnormal, and (C) All participants. Bars represent balanced accuracy (BAcc), and error bars indicate the standard error across ten random seeds. The TUEG dataset does not include pathology labels; therefore, results are only available for the combined population (C). Across datasets, BAcc ranged from 65% to 80%, with slightly higher performance observed in the Normal population compared to the Abnormal population. A notable drop in accuracy is observed when models are tested out-of-distribution, reflecting the challenge of generalizing across heterogeneous EEG datasets.

The outcomes of our experiments are summarized in Figure 1 and Table 2, showcasing the BAcc of the CNN classifier across each dataset. The figure illustrates BAcc when training and testing the model on the same dataset, representing an in-distribution scenario. Additionally, it displays the BAcc of a model trained on one dataset and tested on another, reflecting out-of-distribution performance. The results suggest that subjects’ biological sex is discernible from their EEG recordings in all populations, with BAcc ranging from \(65\%\) to \(80\%\). Furthermore, the findings indicate slightly superior biological sex detection performance for the Normal population compared to the Abnormal population. Notably, the TUEG dataset lacks pathology labels (Normal/Abnormal); therefore, separate results for Normal and Abnormal participants are not presented for it. Importantly, the results also show a clear drop in performance under distribution shift (i.e., when the model is evaluated on a dataset different from the one it was trained on), highlighting the challenge of generalizing sex-specific EEG features across heterogeneous data sources.
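To make the reported metric concrete, the following is a minimal sketch of how balanced accuracy and a mean with standard error over repeated runs can be computed. It is an illustration under stated assumptions, not the evaluation code used in the study: the function names `balanced_accuracy` and `mean_and_sem`, the labels, and the toy predictions are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; chance level is 0.5 for two classes."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return mean(recalls)


def mean_and_sem(scores):
    """Mean and standard error of the mean over repeated runs (e.g. seeds)."""
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Hypothetical imbalanced toy set: 6 male and 3 female recordings.
y_true = ["M"] * 6 + ["F"] * 3
# A classifier biased towards "M": only one of three female samples is correct.
y_pred = ["M"] * 6 + ["M", "M", "F"]

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc = balanced_accuracy(y_true, y_pred)
print(f"accuracy = {plain_acc:.3f}, balanced accuracy = {bacc:.3f}")
```

In this toy case plain accuracy looks favourable because the majority class dominates, whereas balanced accuracy weights both classes equally, which is why it suits sex-imbalanced datasets.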

Table 2 Comparison of BAcc between previous work on the TUAB dataset and ours. Values show mean±SD over 10 randomly initialized models.

Table 2 compares our models with a previous study on biological sex detection in the TUAB dataset. The TUAB dataset, which has a moderate sample size compared to the two other datasets in our study, serves as the benchmark. The outcomes reveal that ShallowNet outperforms the previous model, particularly in the in-distribution scenario.

Performance on unseen data (zero-shot)

We conducted zero-shot performance assessments across various datasets to evaluate the model’s generalization to unseen data. Zero-shot performance means that the model can predict the class of a sample from an unseen dataset without having seen any examples from that dataset during training. The lowest accuracies were observed when the model was evaluated across the TUH datasets and the NMT Scalp EEG Dataset.

Our investigation extended beyond the original training and testing datasets to explore out-of-distribution accuracy, mainly focusing on the Abnormal population. Strikingly, the model exhibited higher accuracy in out-of-distribution scenarios when dealing with the Abnormal population. To comprehensively gauge the generalization of learned features, each model was tested on other datasets to evaluate zero-shot performance. This analysis provided insights into how well the model leverages learned features when confronted with entirely new data.
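The cross-dataset protocol described above can be sketched as a train-by-test matrix in which diagonal entries are in-distribution scores and off-diagonal entries are zero-shot scores. The sketch below is illustrative only: the dataset names "A" and "B", the one-dimensional features, and the threshold classifiers stand in for real EEG datasets and trained networks and are assumptions, not part of the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return sum(recalls) / len(recalls)


def cross_dataset_matrix(models, test_sets):
    """Return {(train_name, test_name): BAcc}.

    Entries with train_name != test_name are zero-shot: the model has seen
    no sample from the test dataset during training.
    """
    results = {}
    for train_name, predict in models.items():
        for test_name, (features, labels) in test_sets.items():
            preds = [predict(x) for x in features]
            results[(train_name, test_name)] = balanced_accuracy(labels, preds)
    return results


# Hypothetical datasets whose feature distributions are shifted relative
# to each other (dataset "B" has a constant offset of +1.0).
test_sets = {
    "A": ([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]),
    "B": ([1.1, 1.2, 1.8, 1.9], [0, 0, 1, 1]),
}
# Stand-ins for trained models: each threshold fits its own dataset only.
models = {
    "A": lambda x: int(x > 0.5),
    "B": lambda x: int(x > 1.5),
}

matrix = cross_dataset_matrix(models, test_sets)
for (train_name, test_name), score in sorted(matrix.items()):
    kind = "in-distribution" if train_name == test_name else "zero-shot"
    print(f"train {train_name} -> test {test_name}: {score:.2f} ({kind})")
```

The toy offset makes each model collapse to chance level on the other dataset, an exaggerated version of the performance drop under distribution shift discussed in the text.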

These results demonstrate that the sex of the subjects is a significant factor that machine learning methods can capture. They also imply that the biological sex of the subjects should be taken into account when developing and evaluating machine learning classifiers for EEG pathology detection, as the biological sex distribution of the training and testing data may affect the generalization and robustness of the models.

Biological sex imbalance’s impact on EEG pathology detection

As we see in the previous sections (SD and Zero-Shot), biological sex is detectable from the EEG signals and is an important biological factor that can influence human brain activity and behaviour. Therefore, considering it in the analysis is essential, especially when the datasets are imbalanced. In this section, we aim to investigate the effect of biological sex on pathology detection from EEG signals using the NMT Scalp EEG Dataset. The NMT dataset has a significant biological sex imbalance, as male samples are two times more frequent than female samples in the dataset. This raises the question of whether biological sex imbalance and the SD in EEG signals can affect the performance of the pathology detection models.

To address this question, we conducted several experiments using different deep-learning architectures (see Sect. "Hyper-parameters selection" for more details). We first verified that biological sex is detectable from the EEG signals using a simple CNN that achieved a good accuracy on the biological sex classification task on several EEG datasets with different sample sizes (see Table 2 and Fig. 1). We then evaluated the pathology detection models on the NMT dataset for different subgroups.

Figure 6 shows that the biological sex imbalance in the NMT dataset does not affect the pathology detection performance. We conducted an independent-samples t-test to assess the significance of differences between male and female groups for pathology detection. The results indicate a non-significant difference (\(t(38) = -0.047\), \(p = 0.962\)) in test accuracy between the two groups. Thus, although the NMT dataset has twice as many male samples as female samples (panel A), this imbalance does not lead to a significant difference in the accuracy of the pathology detection models for the male and female subgroups (panel B). This suggests that the biological sex imbalance in the NMT dataset does not hurt the pathology detection quality.
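For readers unfamiliar with the test, the sketch below computes the statistic and degrees of freedom of an independent-samples (Student's, pooled-variance) t-test. It is a minimal illustration: the function name `independent_t` and the accuracy values are hypothetical, and whether the study used the pooled-variance variant is an assumption. The reported \(t(38)\) is consistent with 40 accuracy values in total, since the degrees of freedom equal \(n_1 + n_2 - 2\).

```python
from math import sqrt
from statistics import mean, variance


def independent_t(a, b):
    """Student's t statistic and degrees of freedom for two independent samples.

    Assumes equal variances in both groups (pooled-variance form).
    """
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical test accuracies of pathology models on each subgroup.
female_acc = [0.70, 0.72, 0.71, 0.69]
male_acc = [0.71, 0.70, 0.72, 0.69]

t_stat, dof = independent_t(female_acc, male_acc)
print(f"t({dof}) = {t_stat:.3f}")
```

The p-value additionally requires the cumulative distribution of Student's t; a library routine such as `scipy.stats.ttest_ind` provides it and is deliberately not called here so that the sketch has no third-party dependencies.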

Feature importance

Fig. 2

AGA of different frequency bands of biological sex classifiers on NMT dataset. (A) Abnormal, (B) Normal, and (C) The difference between Abnormal and Normal. Red indicates a stronger relation with the female class, while blue indicates a stronger relation with the male class.

Figure 2 shows the Amplitude Gradients Analysis (AGA) of the Female/Male classifiers across different frequency bands for Normal and Abnormal subjects. The colour gradient represents the gradient amplitude of the EEG signals. The AGA reveals distinct patterns of feature importance across different conditions and frequency bands. The network strongly prefers features in the theta, alpha, and beta bands, as evidenced by the pronounced gradient values in these regions. This suggests that the network relies heavily on these frequencies when classifying between female and male classes. Figure 2C shows the difference between the Abnormal and Normal groups; the networks show a noticeable difference in these bands.

In addition to exploring the patterns of feature importance in the biological sex classifiers, we examined how specific frequencies in different brain areas contribute to the detection of pathology, that is, to distinguishing between Abnormal and Normal conditions, and whether female and male groups utilize different features. To achieve this, we employed AGA on the pathology classifiers. The results were stratified by sex into Female and Male groups, which allowed us to investigate potential variations in frequency contributions between sexes. Additionally, we calculated the difference in the AGA results between the Female and Male groups. This approach broadened our understanding of the classifiers’ behaviour and unveiled sex-specific nuances in the frequency dynamics of brain areas implicated in the pathology detection task.

Fig. 3

AGA of different frequency bands of Pathology classifiers on NMT dataset. (A) Female (B) Male (C) The difference between the Female and Male classes. Red indicates a stronger relation with the female class, while blue indicates a stronger relation with the male class.

Figure 3 reveals that pathological conditions distinctly influence brain activity patterns across all frequency bands, particularly in the lower range (0–12 Hz). This underscores the classifiers’ heightened sensitivity to anomalies in these lower frequency bands when distinguishing between Abnormal and Normal states, emphasizing their relevance in pathology detection. Interestingly, in Fig. 3C, notable differences emerge in the features employed by the pathology classifiers in Female and Male subjects.

Hyper-parameters selection

In addition to ShallowNet29, we included EEGNet28, Deep4Net29, and Temporal Convolutional Network (TCN-EEG)30 in our experiments to evaluate the model-independence of biological sex and pathology classification in EEG data. These models differ in depth, architecture, and inductive biases, yet they consistently achieved above-chance performance across datasets. This consistency supports our central claim that EEG-based classification of biological sex and pathology relies on a robust signal that is not tied to a specific model architecture. Among these, ShallowNet was chosen as the primary model for its simplicity, interpretability, and competitive performance, which aligns well with the goals of our study.

In our hyperparameter search for the neural network models, we consider a search space for learning rate, weight decay, dropout probability, and data augmentation. The hyperparameters are explored on all four well-established models. Table 3 displays the search space for hyperparameters.

We adapt RandAugment31 to randomly select two transformations from a predefined pool in each epoch for data augmentation. Inspired by prior work on augmenting EEG data32, we employ four augmentation methods, each with a probability of 0.4 during training: (1) SignFlip randomly flips the sign of the EEG signals, simulating a change in electrode polarity; (2) ChannelsDropout randomly drops out some channels of the EEG signals, simulating a loss of contact or electrode malfunction, with a dropout probability of 0.2; (3) FrequencyShift randomly shifts the frequency spectrum of the EEG signals, simulating a change in the sampling rate or frequency drift, with a maximum frequency shift of 2 Hz; (4) SmoothTimeMask randomly masks some time segments of the EEG signals with a smooth transition, simulating a temporary occlusion or signal distortion, with a mask length of 600 samples. Additionally, we include two other methods, BandstopFilter and ChannelsShuffle, with the same probability, aimed at simulating noise reduction or electrode placement changes.
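The augmentation policy described above can be sketched as follows. This is a simplified, dependency-free illustration rather than the implementation used in the study: signals are plain nested lists of shape (channels, time), `time_mask` uses a hard zero mask as a stand-in for SmoothTimeMask, frequency-domain transforms are omitted, and all function names and the `POOL` list are hypothetical.

```python
import random


def sign_flip(x, rng):
    """Invert the polarity of every channel."""
    return [[-v for v in ch] for ch in x]


def channels_dropout(x, rng, p_drop=0.2):
    """Zero out each channel independently with probability p_drop."""
    return [[0.0] * len(ch) if rng.random() < p_drop else list(ch) for ch in x]


def time_mask(x, rng, mask_len=600):
    """Zero a random contiguous window (hard mask, simplified assumption)."""
    n = len(x[0])
    mask_len = min(mask_len, n)
    start = rng.randrange(n - mask_len + 1)
    return [
        [0.0 if start <= t < start + mask_len else v for t, v in enumerate(ch)]
        for ch in x
    ]


# Hypothetical pool; the study's pool also contains frequency-domain transforms.
POOL = [sign_flip, channels_dropout, time_mask]


def augment(x, rng, n_ops=2, p_apply=0.4):
    """RandAugment-style policy: draw n_ops transforms, apply each with p_apply."""
    for op in rng.sample(POOL, n_ops):
        if rng.random() < p_apply:
            x = op(x, rng)
    return x


rng = random.Random(0)
signal = [[1.0] * 1000 for _ in range(4)]  # 4 channels, 1000 samples
augmented = augment(signal, rng)
print(len(augmented), len(augmented[0]))
```

Every transform preserves the (channels, time) shape, so augmented samples can be fed to the same network as the originals. Passing an explicit `random.Random` instance keeps the sketch reproducible for a given seed.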

Table 3 Hyper-Parameters (HP) search space and the HPs selected for experiments in the paper.
Fig. 4

Hyperparameter search results. BAcc for all models.

Performance comparison across models. Table 4 and Fig. 4 summarize the classification performance of four neural models across three EEG datasets. ShallowNet consistently achieved the highest mean balanced accuracy (BAcc), particularly on TUAB and TUEG, while also exhibiting lower standard deviation compared to the other models. This stable and superior performance supports our choice of ShallowNet as the primary model for the main experiments. Although Deep4Net, TCN, and EEGNet achieved above-chance results, their comparatively lower mean scores and higher variability suggest less consistent performance across datasets. These findings confirm that sex-related signals in EEG can be robustly extracted using diverse neural network architectures, reinforcing the generalizability of our results. The specific hyperparameter settings used for training ShallowNet are detailed in Table 3. Notably, our experiments showed no significant improvement when using data augmentation; therefore, we conducted all main experiments using ShallowNet without augmentation.

Table 4 Balanced Accuracy (BAcc %) across datasets and models for biological sex classification. Values are reported as mean ± standard deviation.

Discussion and conclusion

Historically documented biological sex differences in EEG patterns and the successful application of machine learning for automatic biological sex detection suggest that sex-related patterns can act as confounders in machine learning-based EEG assessments16,17. In our investigation of potential confounding factors within the NMT dataset, we explored a scenario involving an imbalance in male and female participants. Our findings indicate that, in this dataset, biological sex does not function as a confounder due to an equal distribution of pathological participants in the male/female splits. However, as demonstrated in the SD section, biological sex remains detectable. Consequently, acknowledging biological sex as a factor is essential for precision medicine in mental health.

A key takeaway from an extensive review spanning three decades of research on human brain biological sex differences is that, despite observable behavioural distinctions between men and women, differences in brain structure and function are minimal and inconsistent when controlling for individual brain size and accounting for inadequate sample sizes13. In contrast, our study employs EEG, which has high temporal but low spatial resolution, to assess functional brain activity. Our findings reveal distinct patterns across datasets with varying subject numbers, highlighting the unique insights provided by EEG in uncovering differences.

Our SD experiments with ShallowNet demonstrate superior performance in biological sex detection on the TUAB dataset in both in-distribution and zero-shot scenarios. Notably, ShallowNet outperforms the previous model in the zero-shot scenario (Table 2), showcasing its robust generalization capabilities. The improvement can be attributed to utilizing the TUEG dataset, which offers several advantages: a considerably larger sample size, seven times more unique participants, and a data distribution closely aligned with TUAB. This enrichment of the training data contributes to the enhancement in ShallowNet’s performance on the TUAB dataset. Importantly, our focus is not solely on surpassing previous benchmarks; we believe a thorough exploration of architecture and hyperparameter settings could further elevate the model’s performance, providing avenues for future refinement and optimization of the biological sex detection task. However, biological sex detection performance remains weaker than that of pathology detection26.

Furthermore, a related experiment33 provided additional insights into the relationship between biological sex prediction and participants’ diagnoses. Significant distinctions were found in only one condition, namely Parkinson’s Disease. However, the authors acknowledged that this condition comprises one of the smallest groups, and its influence on the overall accuracy of the models is likely inconsequential. One potential explanation for the absence of differences could be the variations in prevalence between male and female subtypes within disorders. Specifically, Alzheimer’s disease dementia tends to be more prevalent in females, while males face a higher risk of developing vascular dementia2.

Frequency bands are widely recognized as critical features in quantitative EEG analysis. Despite their prominence, the significance of these features in biological sex detection remains unclear34. Some studies assert that brain rhythms exhibit sex-specific patterns18,35,36,37, while others argue that none of the traditional frequency bands play a particularly crucial role in biological sex detection17. Several studies18,37 demonstrated that a primary distinguishing characteristic lies in the beta activity and its spatial distribution; our results show a similar pattern. However, our results show that the theta and alpha bands also contribute to biological sex classification. Moreover, between the Abnormal and Normal groups, features in the beta band are similar, whereas features in the theta and alpha bands differ. This might indicate that the beta band is more robust. In pathology detection, a comparison of Fig. 3 with Fig. 2 highlights the importance of low-frequency bands over higher bands, with a notable difference between the Female and Male groups in Fig. 3C. This difference emphasizes the need to consider biological sex in pathology detection applications. These findings underscore sex-specific nuances in feature utilization, highlighting the importance of understanding classifier dynamics within different demographic groups.

By comparing Fig. 2 and Supplementary Figure 1, it becomes evident that our model exhibits consistency in the utilization of specific channels within frequency bands for biological sex classification across the TUAB and NMT datasets. The similarities in the AGA patterns for biological sex classification in the two datasets (TUAB and NMT) suggest that certain frequency-related features play a central role in the model’s decision-making process for sex-related distinctions. Conversely, notable differences emerge when examining pathology classification in Fig. 3 and Supplementary Figure 2. The network appears to leverage distinct features and channels within frequency bands when discerning pathology, highlighting dataset-specific nuances in the neural network’s learning patterns. This observation underscores the model’s adaptability, indicating its ability to tailor feature utilization based on each dataset’s specific characteristics and complexities. Such insights from the AGA visualizations contribute to a deeper understanding of the neural network’s behaviour and its capacity to extract relevant information for biological sex and pathology classification across diverse datasets.

Brain connectivity and topography research has yielded diverse and sometimes conflicting perspectives, providing a rich field for future investigations. A seminal study38 involving 949 youths revealed distinct patterns in supratentorial connections between males and females. Its findings suggest that male brains exhibit enhanced connectivity between perception and coordinated action, while female brains are structured to facilitate communication between analytical and intuitive processing modes. Specifically, the authors observed stronger intrahemispheric connections in males and stronger interhemispheric connections in females. Advancements in neuroimaging techniques have further expanded our understanding of brain topography. One study17 demonstrated the significance of EEG topographies in biological sex detection, revealing that even with disrupted waveforms, biological sex could be accurately identified. This finding highlights the potential of EEG topographies as a robust biomarker for sex-specific brain characteristics. However, the field is not without controversy and methodological challenges. For example, one study16 observed that the incorporation of multivariate classification models did not consistently improve performance in brain signal analysis. This finding underscores the need for careful consideration of analytical methods in brain connectivity research. Moreover, a review13 presents a critical perspective on the longstanding belief in sex-based brain lateralization. Despite decades of research examining biological sex effects on lateralized brain function, its authors argue that no substantial evidence supports the widely held belief that male brains are significantly more lateralized than female brains. This challenge to established notions highlights the importance of rigorous, unbiased research. The diversity of findings in the literature underscores the complexity of brain connectivity and topography, making it an intriguing and promising avenue for future research. One potential direction for future studies could be to examine which connections trained neural networks prefer when classifying brain signals, potentially revealing new insights into the functional significance of specific connectivity patterns.

In conclusion, our comprehensive training and evaluation process demonstrated the model’s efficacy in classifying biological sex from EEG signals. We rigorously assessed its generalization to unseen data, analyzed detectability and transferability across varied conditions, and explored its utility for pathology detection in a heterogeneous and imbalanced dataset. These analyses provide a nuanced understanding of the model’s strengths and its potential clinical applications. Our findings contribute to the broader effort to characterize brain connectivity and topography through neural signal processing.

Future work should focus on resolving remaining inconsistencies in the literature, refining methodological approaches, and leveraging emerging technologies to further explore the complex relationships between brain structure, function, and individual differences. As large-scale and general-purpose foundation models for neuroimaging data continue to emerge24, it is crucial to address dataset imbalance, particularly in sensitive applications such as medical classification. The development and deployment of such models require careful attention to demographic and clinical representation to avoid encoding or amplifying biases. This consideration is especially important when subtle physiological differences, such as those related to biological sex or pathology, may influence clinical decisions or scientific conclusions. A particularly promising direction involves investigating the preferential connections utilized by the trained neural network during EEG classification. This approach could reveal the most salient features underlying sex-based differences in brain activity and enhance our understanding of the neurophysiological mechanisms that distinguish individuals. Such insights may ultimately support the development of more personalized and targeted applications in neuroscience and clinical practice.

Materials and methods

This section provides a comprehensive overview of our experimental methodology and analytical approaches. We first describe the three large-scale EEG datasets utilized in this study, along with our preprocessing pipeline for signal preparation and artifact removal. Following this, we detail our training and evaluation procedures, including model selection and performance metrics. An overview of the complete experimental pipeline is provided in Fig. 5, offering a high-level summary of the data, models, tasks, and evaluation procedures described in detail below. We then outline our specific experimental design for investigating biological sex detectability and pathology detection across different population subgroups. Finally, we present our visualization techniques for interpreting neural network decision-making processes through gradient-based analysis methods.

Fig. 5

Overview of the experimental pipeline. The figure summarizes the core components of our methodology, including the EEG datasets used, model architecture, and the two primary tasks: biological sex classification and pathology detection.

Datasets and preprocessing techniques

We analyzed three publicly available EEG datasets, each characterized by distinct sample sizes and conditions, to explore the impact of biological sex and pathology on EEG signals. The utilized datasets are as follows:

TUEG (Temple University Hospital EEG Corpus): This extensive open-source EEG data corpus encompasses over 69,000 recordings from 14,987 subjects, with a cumulative duration of 27,062 hours. The recordings are de-identified and annotated with clinical information, including age and biological sex39,40. Table 1 provides an overview of all three datasets.

TUAB (Temple University Hospital Abnormal EEG Corpus): A subset of the TUEG corpus, this dataset consists of 1,985 recordings from 1,652 subjects, totalling 453 hours. Expert neurologists have labelled the recordings as normal or abnormal, and demographic information such as biological sex and age is provided39,40,41. It is important to note the overlap of participants between the TUAB and TUEG datasets; therefore, we do not present cross-dataset results between the two.

NMT (NUST-MH-TUKL EEG): Comprising 2,417 recordings from Normal and Abnormal subjects, this dataset spans a total duration of 625 hours. Expert neurologists have labelled the recordings as either normal or abnormal. (The terms “Normal” and “Abnormal” are taken from the datasets, which use them to indicate whether a recording shows abnormal EEG features. They describe the condition of the EEG signal and imply no judgment.) Demographic information, including biological sex and age, is also included42.

Data Preprocessing: Patient biological sex or pathology information, encoded as 0 or 1, served as the neural network target. We focused on biological sex rather than gender because of the datasets’ clinical origin, assuming that records reflect sex assigned at birth. Preprocessing comprised selecting the 21 channels common to all datasets, cropping between 1 and 20 minutes, resampling to 100 Hz, removing artifacts with Artifact Subspace Reconstruction (ASR)43, re-referencing to the average, and z-scoring each channel using its own statistics.

Data Splitting: Predefined test sets were used to report model accuracy, with \(15\%\) of each training split reserved for model selection.
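As an illustration only, the cropping, average re-referencing, and per-channel z-scoring steps could be sketched with NumPy as below. The function name `preprocess`, its arguments, and the reading of the cropping step as "keep the segment from minute 1 to minute 20" are assumptions made for this sketch. Channel selection, resampling, and ASR are omitted, and this is not the pipeline code used in the study.

```python
import numpy as np

def preprocess(eeg, sfreq, t_min_s=60.0, t_max_s=1200.0):
    """Crop, average-reference, and z-score an EEG array.

    eeg   : array of shape (n_channels, n_samples)
    sfreq : sampling frequency in Hz
    The crop window (minute 1 to minute 20) is an assumed interpretation.
    """
    start = int(round(t_min_s * sfreq))
    stop = int(round(t_max_s * sfreq))
    cropped = eeg[:, start:stop]

    # Average reference: subtract the mean over channels at every time point.
    reref = cropped - cropped.mean(axis=0, keepdims=True)

    # Z-score each channel with its own mean and standard deviation.
    mean = reref.mean(axis=1, keepdims=True)
    std = reref.std(axis=1, keepdims=True)
    std = np.where(std > 0, std, 1.0)  # guard against flat channels
    return (reref - mean) / std
```

Each output channel then has zero mean and unit variance over the cropped segment, so channels with very different amplitudes contribute on a comparable scale.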

Training and evaluation

We used ShallowNet29 as our model for all experiments. We chose it for its relative computational simplicity and its small number of non-linearities (see Sect. "Hyper-parameters selection" for more details). ShallowNet’s architecture, comprising a single convolutional block (a temporal convolution followed by a spatial convolution, with no non-linearity between them) and a fully connected layer, avoids the computational cost of deeper networks. This streamlined design also benefits model explainability: with so simple an architecture, explainability methods are likely to provide more precise insights into what the model has captured, especially concerning biological sex differences. We implemented ShallowNet using BrainDecode44 and trained it with the AdamW optimizer, using a learning rate of 0.000625, a weight decay of 0, a dropout probability of 0.5, and a batch size of 64. Training ran for 35 epochs, and model selection was based on the balanced accuracy (BAcc) on the validation set. BAcc45, calculated as the arithmetic mean of sensitivity and specificity, offers a more reliable performance assessment for imbalanced data such as ours. It is strongly advised as a robust evaluation metric for brain decoding applications with imbalanced data46. Notably, it is equivalent to standard accuracy when the classes are balanced.
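To make the definition of BAcc concrete, the following minimal sketch computes it for binary labels as the mean of sensitivity and specificity. The function name and the toy labels are illustrative and are not taken from the study.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (0 or 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)  # true-positive rate
    specificity = np.mean(y_pred[y_true == 0] == 0)  # true-negative rate
    return 0.5 * (sensitivity + specificity)

# Toy imbalanced case: 2 positives, 8 negatives, classifier always predicts 0.
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.zeros(10, dtype=int)
accuracy = np.mean(y_true == y_pred)      # 0.8, inflated by the majority class
bacc = balanced_accuracy(y_true, y_pred)  # 0.5, i.e. chance level
```

In this toy case a classifier that ignores the minority class still reaches 80% standard accuracy, whereas BAcc reports chance level.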

Model

The ShallowNet architecture consists of temporal and spatial convolutional layers, followed by a squaring non-linearity and mean pooling. Given an EEG input \(X \in \mathbb {R}^{C \times T}\), where C is the number of channels and T is the number of time points, the model applies:

  • A temporal convolution: \(X' = \textrm{Conv}_\text {temp}(X)\)

  • A spatial convolution: \(X'' = \textrm{Conv}_\text {spat}(X')\)

  • Element-wise squaring: \(X''' = (X'')^2\)

  • Mean pooling over time and flattening

  • A fully connected layer with softmax activation
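The steps listed above can be sketched as a plain NumPy forward pass. This is a simplified illustration under stated assumptions, not the BrainDecode implementation: the weight shapes, the use of a single global mean over time for the pooling step, and the omission of dropout and any further layers are all choices made for this sketch.

```python
import numpy as np

def shallow_forward(x, w_temp, w_spat, w_fc, b_fc):
    """Forward pass following the listed steps.

    x      : (C, T) EEG input
    w_temp : (F, K) temporal filters of length K
    w_spat : (G, F, C) spatial filters mixing F feature maps and C channels
    w_fc   : (n_classes, G) fully connected weights
    b_fc   : (n_classes,) fully connected bias
    """
    kernel = w_temp.shape[1]
    # Sliding windows over time: (C, T - K + 1, K)
    windows = np.lib.stride_tricks.sliding_window_view(x, kernel, axis=1)
    # Temporal convolution applied to each channel: (F, C, T')
    x1 = np.einsum("ctk,fk->fct", windows, w_temp)
    # Spatial convolution across channels and feature maps: (G, T')
    x2 = np.einsum("fct,gfc->gt", x1, w_spat)
    # Element-wise squaring
    x3 = x2 ** 2
    # Mean pooling over time (global mean assumed here), already flat: (G,)
    pooled = x3.mean(axis=1)
    # Fully connected layer followed by a numerically stable softmax
    logits = w_fc @ pooled + b_fc
    logits = logits - logits.max()
    exp = np.exp(logits)
    return exp / exp.sum()

# Example with random weights: 21 channels, 200 time points, 2 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(21, 200))
probs = shallow_forward(
    x,
    w_temp=rng.normal(size=(4, 25)) * 0.1,
    w_spat=rng.normal(size=(4, 4, 21)) * 0.1,
    w_fc=rng.normal(size=(2, 4)) * 0.1,
    b_fc=np.zeros(2),
)
```

Because no non-linearity separates the temporal and spatial convolutions, the two `einsum` calls could be merged into one linear operation; they are kept apart here to mirror the list.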

For classification, the model uses the categorical cross-entropy loss:

$$\begin{aligned} \mathcal {L} = -\sum _{i=1}^{K} y_i \log (\hat{y}_i) \end{aligned}$$

where K is the number of classes, \(y_i\) is the ground-truth label for class i (1 for the true class and 0 otherwise), and \(\hat{y}_i\) is the predicted probability for class i.
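As a small worked example of this loss, assuming one-hot ground-truth labels, the sketch below evaluates it for a two-class prediction. The function name and the clipping constant `eps` are illustrative additions.

```python
import numpy as np

def cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Categorical cross-entropy for a single sample."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_onehot * np.log(y_prob)))

# True class is the second one; the model assigns it probability 0.8.
loss = cross_entropy(np.array([0.0, 1.0]), np.array([0.2, 0.8]))
# Only the true-class term survives, so loss equals -log(0.8).
```

Only the term of the true class contributes, so the loss shrinks towards zero as the predicted probability of that class approaches 1.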

Experiment design

The model was trained and evaluated with the primary objective of classifying biological sex and pathology from EEG signals. Our focus extended beyond the training dataset to include a comprehensive analysis of model performance on both the test split of the training dataset and other unseen datasets. The overarching goal was to assess biological sex detectability (SD) from EEG signals and to evaluate the model’s robustness to distribution shifts in unseen data.

We investigated detectability and transferability under various conditions to examine the model’s capabilities. Specifically, we explored the model’s performance when trained and tested on subsets of the data, considering scenarios where only Normal participants were included, only Abnormal participants were included, or when the entire dataset was utilized. This approach allowed us to understand how well the model generalizes across different participant profiles.

To ensure robust training and evaluation of our models, we randomly sampled 2000 participants from the TUAB and NMT datasets under various conditions. This approach involved oversampling in scenarios with fewer than 2000 participants, such as the NMT Abnormal subset, and undersampling when the dataset exceeded 2000 participants, as observed in the case of the entire participant pool. This balanced sampling strategy aimed to mitigate potential biases from uneven dataset distributions. We selected 14,000 unique participants from the TUEG dataset for training, ensuring a diverse and representative training set for the neural network.
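The fixed-size sampling described above can be sketched as follows. Drawing with replacement when a subset has fewer than 2000 participants is an assumption about how the oversampling was done, and the function and argument names are hypothetical.

```python
import numpy as np

def sample_participants(participant_ids, n_target=2000, seed=0):
    """Draw exactly n_target participants from a pool of any size."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(participant_ids)
    # Oversample (with replacement) small pools, undersample large ones.
    oversample = ids.size < n_target
    return rng.choice(ids, size=n_target, replace=oversample)

small_pool = np.arange(500)    # e.g. a subset with too few participants
large_pool = np.arange(3000)   # e.g. a pool exceeding the target size
small_sample = sample_participants(small_pool)
large_sample = sample_participants(large_pool)
```

Both calls return 2000 entries: the small pool necessarily contains repeats, while the large pool yields 2000 distinct participants.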

Furthermore, we conducted experiments to understand the impact of SD on pathology detection. To this end, we trained the model on the NMT dataset, which is imbalanced in several respects (see Fig. 6A). Our analysis focused on four subgroups within the dataset: Male Normal, Female Normal, Male Abnormal, and Female Abnormal participants. By examining the model’s performance on these subgroups, we aimed to explore potential associations between SD and pathology detection. We ran every experiment with ten random seeds. All error bars show the standard error of the metric across the ten seeds. We used JASP47 to conduct t-tests.
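For reference, the standard error across seeds can be computed as the sample standard deviation divided by the square root of the number of seeds. The sketch below assumes the sample (n − 1) form of the standard deviation, and the scores are made-up values rather than results from the study.

```python
import numpy as np

def mean_and_sem(scores):
    """Mean and standard error of a metric across random seeds."""
    scores = np.asarray(scores, dtype=float)
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    return scores.mean(), sem

# Hypothetical BAcc values from four seeds (illustrative only).
mean_score, sem_score = mean_and_sem([0.80, 0.82, 0.78, 0.80])
```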

EEG signal visualization

We employed a gradient-based visualization technique, AGA, to interpret the deep neural network’s decision-making process. This analysis involved computing the gradients of ten distinct models while they classified the target evaluation set derived from the TUAB dataset, focusing on the frequency domain. We examined the traditional EEG frequency bands: delta (0–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), beta (12–30 Hz), and gamma (30–50 Hz).

The obtained gradients were systematically grouped to discern how changes in specific frequency bands influenced feature importance. This grouping facilitated a nuanced understanding of the neural network’s sensitivity to different aspects of the input signal. We plotted the resulting gradient values on a head diagram to present the spatial distribution of informative brain regions for the network’s predictions. Each region was colour-coded or annotated based on its importance, visually representing the neural network’s focus during the brain decoding task. The resulting patterns on the head diagram served as a key tool for interpreting the neural network’s behaviour. Regions with larger absolute gradient values indicated higher sensitivity to changes in the input signal concerning the output. This interpretation sheds light on areas crucial for the network’s decision-making process and provides valuable insights into the brain decoding task.
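The band-wise grouping step can be illustrated with the sketch below. It assumes the gradients are already available as a channels-by-frequency-bins array and averages them within each band; how the gradients themselves are obtained is outside the sketch, and the names `BANDS` and `band_average` are hypothetical.

```python
import numpy as np

# Band edges in Hz, as listed in the text.
BANDS = {
    "delta": (0, 4),
    "theta": (4, 8),
    "alpha": (8, 12),
    "beta": (12, 30),
    "gamma": (30, 50),
}

def band_average(grads, freqs, bands=BANDS):
    """Average gradients within each frequency band.

    grads : (n_channels, n_freqs) gradient of the output w.r.t. each bin
    freqs : (n_freqs,) frequency in Hz of each bin
    Returns a dict mapping band name to an (n_channels,) array, i.e. one
    value per electrode that could be drawn on a head diagram.
    """
    out = {}
    for name, (low, high) in bands.items():
        mask = (freqs >= low) & (freqs < high)
        out[name] = grads[:, mask].mean(axis=1)
    return out

# Synthetic example: 21 channels, 2-second windows at 100 Hz.
freqs = np.fft.rfftfreq(200, d=1 / 100)
grads = np.ones((21, freqs.size))
grads[:, (freqs >= 8) & (freqs < 12)] = 2.0  # pretend alpha matters more
per_band = band_average(grads, freqs)
```

Averaging signed gradients is one option; taking the mean of absolute values instead would emphasize sensitivity regardless of direction.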

Fig. 6

Effect of biological sex imbalances on pathology detection in the NMT dataset: (A) Distribution of male and female samples in the NMT dataset, with the number of male samples being twice as high as that of females. (B) Performance (accuracies) of subgroups. The discrepancy in sample numbers does not impact pathology detection.