From sleep staging to spindle detection: a case study on end-to-end automated sleep analysis

Grieger, Niklas; Mehrkanoon, Siamak; Ritter, Philipp; Bialonski, Stephan

doi:10.1038/s41598-026-53891-9

Article
Open access
Published: 23 May 2026

Scientific Reports volume 16, Article number: 16014 (2026)

Abstract

Automation of sleep analysis, including both macrostructural (sleep stages) and microstructural (e.g., sleep spindles) elements, promises to enable large-scale sleep studies and to reduce variance due to inter-rater incongruencies. While individual steps, such as sleep staging and spindle detection, have been studied separately, the feasibility of automating multi-step sleep analysis remains unclear. In this case study, we evaluate whether a fully automated analysis using validated machine learning models for sleep staging (RobustSleepNet) and subsequent spindle detection (SUMOv2) can replicate findings from an expert-based study of bipolar disorder. The automated analysis qualitatively reproduced key findings from the expert-based study, including significant differences in fast spindle densities between bipolar patients and healthy controls, accomplishing in minutes what previously took months to complete manually. While the results of the automated analysis differed quantitatively from the expert-based study, possibly due to biases between expert raters or between raters and the models, the models individually performed at or above inter-rater agreement for both sleep staging and spindle detection. Our results demonstrate that fully automated approaches have the potential to facilitate large-scale sleep research. We are providing public access to the tools used in our automated analysis by sharing our code and introducing SomnoBot, a privacy-preserving sleep analysis platform.

Introduction

Automating EEG sleep analysis has the potential to enable scalable and cost-effective studies that were previously impractical, thereby creating opportunities to gain new insights into a wide range of diseases that interact with sleep (such as affective disorders). Recent decades have seen substantial progress in automating various aspects of sleep analysis. These advancements range from the characterization of sleep macroarchitecture, involving the detection of sleep stages^1,2,3, to the identification of EEG graphoelements (sleep microarchitecture)⁴ such as sleep spindles⁵, K-complexes⁶, or rapid-eye movements⁷, which reflect intricate processes at faster temporal scales occurring in the brain during sleep. While early approaches for automated analyses relied upon defining features such as power in spectral bands to indicate particular sleep stages or sleep spindles, recent methods primarily leverage deep neural network models capable of automatically learning distinctive features to achieve state-of-the-art detection performances².

Assessing the performance of models is challenging since their sleep annotations need to be compared with expert annotations that are known to vary between experts (inter-rater variability) and within the same expert over time (intra-rater variability). Inter-rater variability can arise from subjective biases or differing practices in labs and clinics, despite experts’ adherence to annotation guidelines, such as the American Academy of Sleep Medicine (AASM) scoring rules⁸ or the Rechtschaffen and Kales (R&K) criteria⁹. Given this variability, one strategy for assessing a model’s performance is to compare it against consensus annotations created by a group of experts. Such a consensus can provide a more reliable reference for evaluating model performance, and training models to mimic a consensus has been shown to yield annotations that are comparable to or even better than the agreement between individual expert scorers and the consensus¹⁰. Another approach for evaluating model performance is based on comparing the agreement between a model and an expert scorer (model-expert agreement) to a distribution of agreements between pairs of experts (inter-rater agreement distribution). If the model-expert agreement is similar to or better than the average inter-rater agreement between pairs of expert scorers, the model’s performance can be considered robust and comparable to that of an expert.

Given these strategies to assess model performance, it has been shown that individual steps of sleep analysis like sleep staging and spindle detection can be automated with high accuracy^{11,12,13,14,15,16}. However, these steps have been evaluated separately, and it remains unclear whether a multi-step sleep analysis process can be automated end-to-end while still delivering reliable sleep metrics precise enough to test scientific hypotheses or even support clinical diagnostics. In such end-to-end analyses, various interdependencies between the individual steps must be taken into consideration. For instance, the automated detection of sleep spindles is commonly constrained to specific sleep stages (see Fig. 1), such as N2 (and occasionally N3) non-REM sleep, and thus relies on the accurate detection of sleep stages. This limitation arises because models for sleep spindle detection are usually trained on datasets (training sets) containing only those stages where spindles are biologically plausible under AASM scoring rules. When presented with stages outside their training set, such as REM sleep or wakefulness (Wake), that are present in usual overnight recordings, models will often detect spindles in sleep stages where such events should not occur (see Fig. 1A). This challenge extends across methods for analyzing sleep microstructure, including K-complexes, spindles, and rapid eye movements (REMS), highlighting the need for evaluations of multi-step sleep analyses end-to-end.

In this case study, we investigated whether an automated end-to-end sleep analysis using validated deep learning models for sleep staging and subsequent spindle detection can replicate findings from a previous study in which data was manually annotated by expert scorers. This prior study compared spindle metrics between individuals with bipolar disorder (BD) and healthy controls and contained data that the sleep analysis models have not encountered before, emulating a real-world scenario. Beyond replicating the expert-led findings, we validated each step of the automated analysis on separate datasets annotated by multiple experts to compare model-expert agreements with distributions of inter-rater agreements. Our findings indicate that automated multi-step analyses can qualitatively reproduce key differences and similarities in spindle characteristics between bipolar patients and healthy controls, consistent with the original expert-led study. These results highlight the potential of fully automated analyses to replicate expert-led analyses, paving the way for scalable, reproducible sleep research.

We have made the code used for the automated sleep analysis, including our enhanced spindle detection model (SUMOv2), publicly available online. Additionally, we are providing public access to SomnoBot (https://somnobot.fh-aachen.de), a privacy-preserving tool that allows researchers to automatically analyze their data without requiring programming skills or sharing data with third parties.

Results

Our automated end-to-end sleep analysis used two deep neural network models for sleep staging and spindle detection, respectively (see Fig. 1C). We evaluated each model by comparing its agreement with expert annotations to the distribution of inter-rater agreements, which we derived from EEG datasets annotated by multiple expert scorers. Finally, we leveraged the automated analysis process to retrospectively analyze a dataset on bipolar disorder (BD) and examine whether it could replicate findings from a previous expert-led study¹⁷, thus removing the need for manual scoring.

In that study, a human expert manually annotated sleep stages and sleep spindles following AASM guidelines in overnight polysomnography (PSG) recordings from 25 healthy controls and 23 bipolar disorder patients¹⁷. The study’s main findings included a significantly lower density of fast sleep spindles in bipolar subjects compared to healthy controls, suggesting that this feature may serve as a potential biomarker – an intriguing finding given the current lack of reliable biomarkers for bipolar disorder.

Sleep staging performance

We automatically detected sleep stages in the BD study recordings using the RobustSleepNet (RSN) model, which has previously demonstrated strong sleep staging performance¹¹. Notably, the RSN model has never been trained or evaluated on BD recordings, reflecting real-world scenarios, where machine learning models must analyze previously unseen patient data. To compare the RSN model to an expert scorer, we calculated the macro F1 (MF1) score, which ranges from 0 (indicating no agreement) to 1 (perfect agreement). MF1 agreement scores were calculated for each of the 48 BD recordings, obtaining values ranging from 0.68 (first quartile) to 0.79 (third quartile). We observed the average of the MF1 scores to be 0.73 with a standard deviation (SD) of 0.11.

To contextualize the obtained MF1 scores, we quantified inter-rater agreement on the IS-RC dataset, a publicly available dataset consisting of 69 overnight recordings that were scored independently by six experts¹⁸. For each pair of experts and each recording, we determined an MF1 score to measure the inter-rater agreement level and show the distribution of these scores in Fig. 2 (blue bars). Pairs of human expert scorers reached an average MF1 score of 0.61 (SD = 0.10), with MF1 values ranging from 0.54 (first quartile) to 0.68 (third quartile). Additionally, we calculated MF1 scores between the RSN model and the individual IS-RC experts, obtaining an average of 0.67 (SD = 0.09), significantly higher than the MF1 scores observed between expert pairs (\(p<0.001\), one-sided t-test with independent samples). In line with previous findings¹⁹, model performance increased when evaluating RSN against a consensus of multiple experts rather than against single scorers.

Sleep spindle detection performance

We detected sleep spindles with the SUMOv2 model (see Methods), which improves upon the publicly available machine learning model SUMO that achieved state-of-the-art detection performance in a previous study¹⁶. SUMOv2 was developed without prior exposure to the BD recordings. We evaluated its agreement with an expert scorer by calculating the Intersection-over-Union F1 (IoU-F1) score, which requires an overlap of more than 20% between an expert-identified and a model-identified spindle for the pair to count as a valid agreement. The IoU-F1 score takes values between 0 (indicating no agreement) and 1 (perfect agreement). To assess SUMOv2 independently of automated sleep staging, we evaluated its spindle detection performance in N2 sleep epochs identified by the expert scorer (see Fig. 1B), where it achieved an IoU-F1 score of 0.60 in the BD recordings. When sleep stages were automatically detected by the RSN model, thereby enabling a fully automated analysis process (Fig. 1C), the agreement between SUMOv2 and expert-detected spindles decreased slightly, yielding an IoU-F1 of 0.54.

To gain an understanding of the reliability of SUMOv2’s performance, we evaluated additional datasets annotated by different expert scorers: DREAMS (sleep spindles from 8 subjects, annotated by two experts²⁰) and MODA (artifact-free N2 sleep spindles from 180 subjects, annotated by 47 experts²¹). On DREAMS, SUMOv2 achieved IoU-F1 scores of 0.53 and 0.76 when evaluated against the first and second expert, respectively, demonstrating that model-expert agreement can vary substantially across individual raters.

The larger number of expert scorers in MODA allowed us to compare SUMOv2 to inter-rater agreement levels observed between expert pairs. On the MODA subset not used for developing SUMOv2, the model achieved an average IoU-F1 score of 0.72 (SD = 0.08, shaded area in Fig. 3) when evaluated against individual experts. On the same data, the average inter-rater agreement across expert pairs was lower, with a mean IoU-F1 of 0.59 (SD = 0.16), and expert pair scores ranging from 0.50 (first quartile) to 0.71 (third quartile; see histogram in Fig. 3). Furthermore, when evaluated against MODA’s consensus annotations—where spindles were only valid if identified independently by multiple experts—SUMOv2 achieved an IoU-F1 of 0.83.

Spindle characteristics of bipolar and healthy subjects

We performed a fully automated analysis of the BD recordings by first identifying sleep stages using the RSN model, followed by detecting sleep spindles in N2 sleep epochs from frontal and central channels with the SUMOv2 model (see Fig. 1C). Following the original analysis in the BD study¹⁷, we calculated for each detected spindle its duration, dominant frequency, and amplitude. Then we determined the average spindle characteristics and spindle density separately for fast spindles (dominant frequency \(>13\) Hz) and slow spindles (\(\le\) 13 Hz) in both the healthy and bipolar cohort (see Tables 1 and 2).
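The split into fast and slow spindles and the density calculation can be illustrated with a minimal sketch. The function name `spindle_densities` and its arguments are hypothetical, and normalizing by the minutes of analyzed N2 sleep is an assumption; the text above only states that densities are reported in spindles per minute.

```python
from typing import Dict, Sequence


def spindle_densities(
    dominant_freqs_hz: Sequence[float],
    n2_minutes: float,
    split_hz: float = 13.0,
) -> Dict[str, float]:
    """Return fast and slow spindle densities in spindles per minute (SPM).

    Assumption: density = spindle count / minutes of analyzed N2 sleep.
    """
    if n2_minutes <= 0:
        raise ValueError("n2_minutes must be positive")
    # Fast spindles: dominant frequency strictly above 13 Hz.
    n_fast = sum(1 for f in dominant_freqs_hz if f > split_hz)
    # Slow spindles: dominant frequency of 13 Hz or below.
    n_slow = len(dominant_freqs_hz) - n_fast
    return {"fast": n_fast / n2_minutes, "slow": n_slow / n2_minutes}


# Hypothetical example: four spindles detected in two minutes of N2 sleep.
densities = spindle_densities([12.0, 13.0, 13.5, 14.2], n2_minutes=2.0)
```

A spindle at exactly 13 Hz falls into the slow group, matching the \(\le\) 13 Hz definition above.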

We observed significant differences in spindle densities of fast spindles between healthy controls and bipolar subjects (see Fig. 4 and Table 1). These differences were consistent across frontal and central EEG channels, with average spindle densities of 2.16 spindles per minute (SPM) for bipolar patients (BP) and 3.41 SPM for healthy controls (HC) in frontal channels, and 2.79 SPM (BP) and 4.37 SPM (HC) in central channels. Spindle densities varied substantially between individual subjects, both for healthy controls (standard deviations between 1.68 and 2.01) and bipolar patients (standard deviations between 1.39 and 1.98). We found no statistically significant differences in slow spindle densities or other characteristics between the two cohorts (see Tables 1 and 2).

When analyzing spindle characteristics in regard to channel placement, fast spindles showed higher densities and amplitudes in central channels compared to frontal channels, both in bipolar subjects (density: 2.79 vs. 2.16 SPM; amplitude: 8.40 vs. 7.03 \(\mu\)V) and healthy controls (density: 4.37 vs. 3.41 SPM; amplitude: 8.39 vs. 7.04 \(\mu\)V). In contrast, slow spindles demonstrated slightly higher densities and amplitudes in frontal channels for bipolar patients (density: 3.38 SPM vs. 2.75 SPM; amplitude: 9.14 \(\mu\)V vs. 8.79 \(\mu\)V) and healthy controls (density: 3.76 SPM vs. 2.80 SPM; amplitude: 9.37 \(\mu\)V vs. 8.63 \(\mu\)V). Spindle durations remained stable across channels, averaging 0.82–0.88 s, while average frequencies were consistent for both fast (13.88–13.98 Hz) and slow spindles (12.16–12.27 Hz).

Table 1 Characteristics of fast spindles (frequency > 13 Hz) detected by the SUMOv2 model in the BD recordings for healthy controls (HC) and patients with bipolar disorder (BP).

Table 2 Characteristics of slow spindles (frequency \(\le\) 13 Hz) detected by the SUMOv2 model in the BD recordings for healthy controls (HC) and patients with bipolar disorder (BP).

Discussion

The automated analysis based on deep learning models qualitatively replicated key findings of a study on bipolar disorder by Ritter et al.¹⁷, achieving in minutes what had previously required months to complete manually. Our approach found significant differences in fast spindle densities between healthy controls and patients with bipolar disorder, consistent with the prior study¹⁷ and similar to observations made for schizophrenia²² (see Fig. 4). Our analysis found systematically higher spindle densities than those observed in the expert annotations¹⁷ (5.53 ± 3.20 vs. 3.49 ± 2.04 SPM for patients, 7.17 ± 2.78 vs. 4.23 ± 1.79 SPM for controls). This discrepancy might reflect a bias between the model and the expert, akin to those found among individual expert scorers (see the broad distribution of inter-rater agreement in Fig. 3). However, we consider it more likely that the discrepancy stems from the spindle aggregation method used in our analysis of the BD dataset, in which detections from several channels were combined (see section “Automatic detection of sleep spindles”). This interpretation is further supported by our model’s predictions on other single-channel EEG datasets, which yielded lower spindle densities (DREAMS: 2.50 ± 1.28 SPM, MODA: 3.88 ± 3.02 SPM).

Our automated analysis was also able to replicate the second finding of the original study, namely that average spindle frequencies for both fast and slow spindles were slightly lower in bipolar subjects than in healthy controls. While the prior study found a significant difference for frequencies of fast spindles in central channels (\(p<0.02\)), our analysis showed a similar trend but without statistical significance (\(p<0.19\), see Table 1), with the smallest p-values also observed in central channels.

We found differences in the absolute values of spindle durations, frequencies and amplitudes compared to the original study (see Tables 1 and 2). Our detected spindles were generally shorter (less than 0.9 s compared to more than 1 s in the original study), and showed slight variations in frequency (13.9–14.0 Hz in our results vs. 13.5–13.7 Hz for fast spindles, with comparable slow spindle frequencies) and amplitude (7.0–8.5 \(\mu\)V in our results vs. 8.3–10.4 \(\mu\)V for fast spindles, 8.6–9.5 \(\mu\)V vs. 9.5–10.3 \(\mu\)V for slow spindles). These discrepancies can likely be attributed to three main factors: differences in calculation methods for amplitude and frequency, known variability in spindle duration assessments between experts²³, and potential dataset bias, as our SUMOv2 model was trained on the MODA dataset where spindles tend to be shorter (0.75–0.79 s²¹) compared to those in the BD dataset.

Analyzing the model-expert agreement at each of the two analysis steps, we found strong agreement between automatically detected and expert-annotated sleep stages, consistent with previous studies across various datasets^{11,12,13,14,15}. The RSN model achieved a high macro F1 score of 0.73 on the BD dataset, which aligns with the range of inter-rater agreements observed between expert pairs on the IS-RC dataset (see Fig. 2). When we compared this inter-rater agreement with the agreement between RSN and the IS-RC experts on the same data, RSN reached significantly higher macro F1 scores than the expert pairs (\(p < 0.001\)). We further found modest variation in the agreements between expert pairs (first and third quartiles: 0.54–0.68), in line with previously reported inter-rater agreements of 0.54–0.86 (Cohen’s Kappa)^{24,25,26,27,28,29}.

For spindle detection, the SUMOv2 model achieved high IoU-F1 scores on the BD and DREAMS data, indicating that it can reliably identify spindles in datasets outside the training data, a capability that has rarely been studied in previous work^30,31,32. When sleep stages were provided by an expert (see Fig. 1B), SUMOv2 achieved IoU-F1 scores of 0.60 for the BD and 0.53 and 0.76 for the DREAMS dataset. When sleep stages were instead automatically detected by RSN (see Fig. 1C), SUMOv2’s agreement with the BD expert decreased slightly to an IoU-F1 of 0.54. In both cases, the model-expert agreements lay within the range of expert pair agreements observed on the MODA dataset (see Fig. 3) and within the typical agreements of 0.42–0.61 (IoU-F1 scores) reported in the literature^20,23,33. Although this comparison should be interpreted with caution due to differences in datasets and annotators, it provides a sense of how SUMOv2 performs in real-world scenarios with new data. In order to establish an unbiased comparison between SUMOv2 and the MODA expert pairs on a shared set of data and scorers, we also evaluated SUMOv2 against individual experts on a MODA test dataset not used for model training. On this dataset, SUMOv2 achieved substantially higher IoU-F1 scores than those observed between the expert pairs. While this comparison only partially reflects real-world scenarios due to the same experts annotating both training and test data, it is nevertheless an indicator of SUMOv2’s performance compared to typical inter-rater agreements one could expect in practice.

Despite the promising performance of our fully automated sleep analysis approach, our study has several limitations. First, publicly accessible datasets with jointly annotated sleep stages and spindles are scarce, limiting our study to those available. For this reason, we were able to reproduce only a single previously published study, rather than validating our approach across multiple datasets, which would have provided stronger evidence for generalizability. While we were able to include additional datasets in evaluating the individual analysis steps, these datasets were limited in size and diversity, and may not fully represent the broader population. This limitation could impact our comparisons of model-expert agreement with inter-rater agreement scores, as sleep macro- and microarchitecture can vary substantially with factors such as age, sex, and various physiological or pathological conditions³⁴, all of which may influence the achievable level of agreement among experts. Future research could benefit from expanding investigations to more diverse populations, including patients with specific disorders or recordings from mobile EEG devices, and exploring performance stratification by demographic factors. Achieving these goals will require the availability of large, well-annotated datasets and the establishment of a common benchmark to support the development and evaluation of fully automated sleep analysis approaches within the research community.

Second, our analysis approach did not incorporate explicit handling of EEG artifacts, which we suggest as a consideration for future research. Since artifacts can distort EEG recordings, integrating an automated artifact detection mechanism could enhance the robustness of our sleep analysis approach. We note, however, that most state-of-the-art sleep analysis models are trained on EEG data that already includes artifacts, potentially enabling them to adapt and mitigate their effects implicitly.

Third, our study focused on a sequential analysis approach in which sleep staging precedes spindle detection rather than an end-to-end modeling approach that jointly optimizes both tasks. While such an integrated approach could offer advantages, it requires extensive datasets with jointly annotated sleep stages and spindles, which are currently not publicly available as noted above.

Our study advances automated sleep analysis by demonstrating that fully automated sleep staging and spindle detection can match expert-level performance. To enable broader access to automated sleep analysis, we are sharing our code, publishing our novel spindle detection model, SUMOv2, and releasing our privacy-preserving sleep analysis tool, SomnoBot (https://somnobot.fh-aachen.de), which enables researchers to use our analysis approach without requiring programming expertise. We hope that this work and similar efforts will facilitate large-scale, long-term sleep studies, enabling new insights into sleep-related health and disease.

Methods

Automatic detection of sleep stages

Datasets: The IS-RC dataset contains one PSG recording each of 70 women, recorded in a research study investigating sleep disordered breathing in women in midlife (40–57 years old)¹⁸. All 70 recordings were annotated by ten expert scorers following the guidelines of the American Academy of Sleep Medicine (AASM)⁸, and the annotations of six experts are publicly available. We aggregated these annotations into a consensus following the approach outlined by Stephansen et al.¹⁹. One recording was discarded due to a mismatch between the filenames of the annotations and the corresponding EEG recording.

The BD dataset (Bipolar EEG Dataset) contains one PSG recording each of 25 healthy controls and 23 patients with bipolar disorder¹⁷. The 48 recordings were annotated by a single expert following the AASM guidelines. Due to differences in channel setup, we focused our analyses on the set of EEG channels that were common to all recordings: F3-A2, F4-A1, C3-A2, C4-A1, O1-A2, O2-A1, and A1-A2. Sampling rates were 100 Hz, 200 Hz, or 500 Hz, depending on the recording and channel.

All recordings were divided into 30-second epochs and labeled as either Wake, N1, N2, N3, or REM sleep stages by the expert annotators.

Model: For automated sleep staging, we used RobustSleepNet (RSN), a deep learning model designed to be invariant to the number, type, or order of PSG montages¹¹. Guillot et al. provide several checkpoints for RSN that were trained on EEG, EOG, and EMG data from different datasets³⁵. We used the checkpoint trained on 659 recordings from the MESA³⁶, MrOS³⁷, SHHS³⁸, DODO³⁹, DODH³⁹, and MASS⁴⁰ datasets. The model accepts input sequences of 21 sleep epochs (i.e., 10.5 minutes) to ensure sufficient context for the classification of each epoch. Given an input sequence, the model outputs probabilities for each of the five sleep stages for each epoch. Due to the model’s architecture, the input sequences could have an arbitrary number of channels.

Following standard procedures for the use of RSN¹¹, each recording was preprocessed before being scored by the model using a 4th order Butterworth bandpass filter (0.3–30 Hz), downsampled to 60 Hz, and normalized by subtracting the median and dividing by the interquartile range. Amplitudes outside -20 to 20 were clipped to these bounds. We buffered the beginning and the end of each preprocessed recording with 20 sleep epochs of zeros to prevent the model from making predictions based on incomplete sequences. We then created the input sequences to the model by sliding a window of 21 epochs over the buffered recording with a step size of 1 epoch (i.e., each epoch was part of 21 input sequences). The resulting 21 predicted probabilities for each sleep epoch were aggregated by calculating the geometric mean to obtain the final predictions for each epoch¹¹.
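The overlapping-window aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the RSN implementation: `predict_window` is a hypothetical stand-in for the model, each epoch is represented as a flat list of samples, and probabilities are floored at a small `eps` before taking logarithms to avoid `log(0)`.

```python
import math
from typing import Callable, List, Sequence

Epoch = Sequence[float]
Probs = Sequence[float]


def aggregate_overlapping(
    epochs: Sequence[Epoch],
    predict_window: Callable[[List[Epoch]], Sequence[Probs]],
    n_classes: int = 5,
    window: int = 21,
    eps: float = 1e-12,
) -> List[List[float]]:
    """Geometric mean of the `window` predictions made for each epoch."""
    if not epochs:
        raise ValueError("need at least one epoch")
    n = len(epochs)
    pad = window - 1  # 20 zero epochs on each side for a 21-epoch window
    zero_epoch = [0.0] * len(epochs[0])
    padded = [zero_epoch] * pad + list(epochs) + [zero_epoch] * pad

    # Accumulate log-probabilities; exp(mean(log p)) is the geometric mean.
    log_sums = [[0.0] * n_classes for _ in range(n)]
    for start in range(len(padded) - window + 1):  # step size of one epoch
        probs = predict_window(padded[start:start + window])
        for k in range(window):
            i = start + k - pad  # index of the epoch in the unpadded recording
            if 0 <= i < n:
                for c in range(n_classes):
                    log_sums[i][c] += math.log(max(probs[k][c], eps))
    # With this padding, every real epoch is covered by exactly `window` windows.
    return [[math.exp(s / window) for s in row] for row in log_sums]
```

The final stage of an epoch would then be the class with the largest aggregated value. Note that geometric means do not generally sum to one; whether they are renormalized is not specified above and is left out of this sketch.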

Evaluation: Given two sets of annotations for a recording, we determined the agreement between the two sets by calculating the Macro F1 score as follows. For each sleep stage, we first counted the number of epochs that matched in both annotations as true positives (TP). False positives (FP) were defined as epochs labeled as a given sleep stage in one annotation but assigned a different stage in the other annotation. Conversely, false negatives (FN) were epochs that were assigned a different stage in one annotation but labeled as the given stage in the other annotation. Precision and recall for a stage were calculated as TP / (TP + FP) and TP / (TP + FN), respectively, and the F1 score for that stage was then given by 2 \(\times\) (precision \(\times\) recall) / (precision + recall). Finally, the Macro F1 score was calculated by averaging the stage-specific F1 scores.
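The Macro F1 calculation described above can be written compactly, using the identity F1 = 2·TP / (2·TP + FP + FN), which is algebraically equivalent to the precision/recall form. The sketch below is illustrative; the function name is hypothetical, and skipping stages that occur in neither annotation is an assumption, since the text does not state how such stages were handled.

```python
from typing import Sequence

STAGES = ("Wake", "N1", "N2", "N3", "REM")


def macro_f1(ref: Sequence[str], hyp: Sequence[str],
             stages: Sequence[str] = STAGES) -> float:
    """Macro F1 between two epoch-wise stage annotations of one recording."""
    if len(ref) != len(hyp) or not ref:
        raise ValueError("annotations must be non-empty and of equal length")
    scores = []
    for stage in stages:
        tp = sum(1 for r, h in zip(ref, hyp) if r == stage and h == stage)
        fp = sum(1 for r, h in zip(ref, hyp) if r != stage and h == stage)
        fn = sum(1 for r, h in zip(ref, hyp) if r == stage and h != stage)
        if tp + fp + fn == 0:
            continue  # assumption: ignore stages absent from both annotations
        scores.append(2 * tp / (2 * tp + fp + fn))
    if not scores:
        raise ValueError("no listed stage occurs in either annotation")
    return sum(scores) / len(scores)


# Hypothetical four-epoch example with one disagreement (N2 scored as N1).
score = macro_f1(["Wake", "N2", "N2", "REM"], ["Wake", "N2", "N1", "REM"])
```

In this example, the stage-wise F1 scores are 1 (Wake), 0 (N1), 2/3 (N2), and 1 (REM), giving a Macro F1 of 2/3. Because every stage contributes equally, rare stages such as N1 can lower the score considerably.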

Automatic detection of sleep spindles

We detected sleep spindles using SUMOv2, an enhanced version of the publicly available SUMO model¹⁶. Our improvements focused on increasing robustness to variations in amplitude scales, ensuring more reliable spindle detection across diverse datasets.

Datasets: The MASS dataset contains 200 PSG recordings from 200 mostly healthy subjects (15 subjects were diagnosed with mild cognitive impairment), sampled at 256 Hz⁴⁰. The MODA dataset provides spindle annotations for selected sections of 180 recordings from MASS^21,41. Each of these recordings was divided into 10 blocks (30 recordings) or 3 blocks (150 recordings) of 115 s of artifact-free N2 sleep. The blocks were annotated for spindles by up to seven human experts, who used either the C3-A2 or C3-LE EEG channel for annotations, depending on the recording. In total, the MODA dataset contains 749 blocks annotated by 47 experts (one block was not presented to any experts). Annotations were aggregated into a consensus based on the experts’ confidence in their annotations²¹. For simplicity, we refer to the combination of the MASS and MODA datasets as the MODA dataset.

The DREAMS dataset comprises eight 30-minute segments of EEG data from eight subjects with various pathologies (dyssomnia, restless legs syndrome, insomnia, apnoea/hypopnoea syndrome), sampled at 50, 100, or 200 Hz²⁰. The segments were extracted from whole-night recordings without regard to the underlying sleep stages or the presence of artifacts. Each segment was annotated for spindles by two human experts, except for the last two segments, which were annotated only by the first expert. Depending on the segment, the experts were presented with the C3-A1 or CZ-A1 EEG channel. They were not given any information about the sleep stages, which were identified by a separate expert. Based on the sleep stages annotated by this separate expert, we removed all spindle annotations outside N2 sleep.

The BD dataset (see also section “Automatic detection of sleep stages”) includes spindle annotations for artifact-free N1, N2, and N3 sleep stages¹⁷. As in the original study¹⁷, we focused on spindles in N2 sleep in our analyses. The spindle annotations were created by an expert, with a second expert verifying annotations in case of uncertainty. While the BD dataset does not specify which EEG channels were used for the annotations, it is reasonable to assume that the expert analyzed the same channels as the ones investigated in the study presenting the dataset: F3-A2, F4-A1, C3-A2, and C4-A1¹⁷.

Model: Following the SUMO study, we considered the spindle detection task as a segmentation problem for single EEG channels¹⁶. We adopted the SUMO model architecture for SUMOv2, which is a U-Net with two encoder and two decoder blocks. SUMOv2 accepts input sequences of arbitrary length and outputs two segmentation masks that indicate for each data point in the input sequence whether a spindle is present (maximum in the first mask) or not (maximum in the second mask). We joined consecutive indications of spindle presence to form the final spindle annotations, consisting of a starting sample and a duration.
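The step from the two segmentation masks to spindle annotations can be sketched as follows. The function name and the plain-list representation of the masks are hypothetical; treating ties between the two masks as "no spindle" is an assumption.

```python
from typing import List, Sequence, Tuple


def masks_to_events(
    spindle_mask: Sequence[float],
    background_mask: Sequence[float],
) -> List[Tuple[int, int]]:
    """Join consecutive spindle samples into (start_sample, duration) pairs."""
    if len(spindle_mask) != len(background_mask):
        raise ValueError("masks must have the same length")
    events = []
    start = None
    for i, (s, b) in enumerate(zip(spindle_mask, background_mask)):
        if s > b:  # maximum lies in the first mask: spindle present
            if start is None:
                start = i
        elif start is not None:  # a run of spindle samples has just ended
            events.append((start, i - start))
            start = None
    if start is not None:  # the run extends to the end of the sequence
        events.append((start, len(spindle_mask) - start))
    return events


# Hypothetical five-sample example with two separate runs.
events = masks_to_events([0.1, 0.9, 0.8, 0.2, 0.7], [0.9, 0.1, 0.2, 0.8, 0.3])
```

Here the result is `[(1, 2), (4, 1)]`: one event starting at sample 1 with a duration of two samples, and one starting at sample 4 with a duration of one sample. Dividing both values by the sampling rate would convert them to seconds.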

To detect spindles using SUMOv2 or to train the model, we preprocessed the data by splitting it into contiguous blocks of N2 sleep according to sleep stage annotations. Each block of N2 sleep was then filtered with a 20th order Butterworth high-pass filter at 0.3 Hz, followed by a 20th order Butterworth low-pass filter at 30 Hz, downsampled to 100 Hz, and normalized by subtracting the median amplitude and dividing by the interquartile range. Amplitudes outside the range of -20 to 20 were clipped to the respective boundary.
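The normalization and clipping steps can be illustrated with a short standard-library sketch. Filtering and downsampling are deliberately omitted here. The function name is hypothetical, and the choice of quartile interpolation (`method="inclusive"`) is an assumption, since the text does not specify how the interquartile range was computed.

```python
import statistics
from typing import List, Sequence


def robust_normalize(signal: Sequence[float], clip: float = 20.0) -> List[float]:
    """Subtract the median, divide by the IQR, and clip to [-clip, clip]."""
    median = statistics.median(signal)
    # Assumption: quartiles via linear interpolation over the data range.
    q1, _, q3 = statistics.quantiles(signal, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; cannot normalize")
    return [max(-clip, min(clip, (v - median) / iqr)) for v in signal]


# Hypothetical example: the outlier 1000.0 is clipped to the upper bound.
normalized = robust_normalize([0.0, 1.0, 2.0, 3.0, 4.0, 1000.0])
```

Using the median and interquartile range rather than the mean and standard deviation keeps the scaling largely unaffected by isolated high-amplitude samples, which are then bounded by the clipping step.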

Training: For training SUMOv2, we split the MODA dataset into a training and test set. The test set consisted of the data of 36 subjects with 3 blocks of 115 s each (see the SUMO study for further details on how test subjects were selected¹⁶). The training set was further split into six cross validation folds, each containing the data of five subjects with 10 blocks of 115 s each and 19 subjects with 3 blocks of 115 s each. After hyperparameter optimization on this cross validation split, we retrained the final model on the entire training set with 10% of the data reserved for early stopping. The DREAMS and BD datasets were not part of the training or the validation process and were used for testing purposes only.

We trained the model using the Adam optimizer, a batch size of 12, a learning rate of 0.005, and a generalized Dice loss, which is a variant of the Dice loss that is more robust to class imbalance⁴². To prevent overfitting, we trained the model until the IoU-F1 score (see next section) on the validation set did not improve for 300 consecutive training epochs or until 800 training epochs were reached.
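The stopping rule described above can be expressed as a small helper that, given a sequence of validation scores, returns the training epoch at which training would end. This is a sketch; the function name is hypothetical, and counting an equal score as "no improvement" is an assumption.

```python
from typing import Sequence


def stopping_epoch(
    val_scores: Sequence[float],
    patience: int = 300,
    max_epochs: int = 800,
) -> int:
    """Return the 1-based training epoch at which training stops."""
    best = float("-inf")
    since_best = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:  # strict improvement of the validation IoU-F1
            best = score
            since_best = 0
        else:
            since_best += 1
        if since_best >= patience:
            return epoch
    # No early stop: training ends when scores run out or at the epoch limit.
    return min(len(val_scores), max_epochs)


# Hypothetical example with a shortened patience for readability.
stop = stopping_epoch([0.1, 0.2, 0.2, 0.2, 0.2, 0.5], patience=3, max_epochs=10)
```

In the example, the score last improves in the second epoch, so with a patience of three epochs training would stop after the fifth.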

We found the original SUMO model to be sensitive to variations in EEG amplitudes, posing challenges for datasets with differing amplitude distributions, such as those from patient groups with less pronounced spindle activity or other recording setups. To address this issue with SUMOv2, we used data augmentation by random rescaling of EEG amplitudes during training to make the model more robust. Each sample was randomly chosen to be either upscaled (multiplied by a random factor between 1 and 2) or downscaled (multiplied by a random factor between 0.5 and 1) to ensure adaptability to varying amplitude distributions.
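
A minimal sketch of such an amplitude augmentation is shown here. The source states that each sample is randomly upscaled or downscaled but not with which probability or from which distribution the factor is drawn, so the equal probability for both branches and the uniform sampling of the factor are assumptions, as is the function name `random_rescale`.

```python
import random


def random_rescale(sample, rng=random):
    """Multiply all amplitudes of one training sample by a random factor."""
    if rng.random() < 0.5:  # assumption: up- and downscaling are equally likely
        factor = rng.uniform(1.0, 2.0)  # upscale
    else:
        factor = rng.uniform(0.5, 1.0)  # downscale
    return [x * factor for x in sample], factor


rng = random.Random(0)  # seeded generator for a reproducible example
scaled, factor = random_rescale([1.0, -2.0, 0.5], rng)
print(factor, scaled)
```

Choosing the branch first and the factor second gives upscaling and downscaling the same overall weight, which a single uniform draw from 0.5 to 2 would not.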

Evaluation: When evaluating SUMOv2 on the BD dataset, we applied the model separately to each of the four EEG channels present in the data and then aggregated the detected spindles across channels by taking the union of annotations (i.e., overlapping annotations in different channels were merged and non-overlapping annotations were kept separate).
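
The union across channels amounts to merging overlapping intervals, as sketched here. The function name and the `(start, end)` tuple representation are illustrative, and treating annotations that merely touch (one ends exactly where the next begins) as overlapping is an assumption.

```python
def union_across_channels(per_channel):
    """Merge overlapping (start, end) annotations from several channels."""
    intervals = sorted(iv for channel in per_channel for iv in channel)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # overlaps the previous annotation: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # no overlap: keep as a separate annotation
            merged.append((start, end))
    return merged


channels = [[(0, 10), (50, 60)], [(5, 15)], [(100, 110)], []]
print(union_across_channels(channels))
```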

Following standard procedures outlined in the MODA study²¹, we postprocessed detected spindles for all datasets by merging spindles shorter than 0.3 s and separated by less than 0.1 s, and subsequently removing spindles with a duration of less than 0.3 s or longer than 2.5 s.
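
One possible reading of this postprocessing is sketched here. The merge rule is interpreted as "two consecutive spindles are merged if both are shorter than 0.3 s and the gap between them is below 0.1 s"; this interpretation, the function name, and the `(start, end)` representation in seconds are assumptions and may differ from the MODA and SUMOv2 implementations.

```python
def postprocess(spindles, min_dur=0.3, max_dur=2.5, max_gap=0.1):
    """Merge short, close spindles, then drop spindles outside the duration range."""
    merged = []
    for start, end in sorted(spindles):
        if merged:
            prev_start, prev_end = merged[-1]
            both_short = (prev_end - prev_start) < min_dur and (end - start) < min_dur
            if both_short and (start - prev_end) < max_gap:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    # keep only spindles lasting between min_dur and max_dur seconds
    return [(s, e) for s, e in merged if min_dur <= (e - s) <= max_dur]


detections = [(0.0, 0.2), (0.25, 0.45), (1.0, 1.1), (2.0, 5.0), (6.0, 7.0)]
print(postprocess(detections))
```

In the example, the first two detections are merged into one spindle of 0.45 s, while the isolated 0.1 s detection and the 3 s detection are removed.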

To evaluate the performance of the model, we used the Intersection-over-Union (IoU) F1 score. Given two sets of spindle annotations, the IoU-F1 score was calculated on a by-spindle basis. Each spindle in the first set was matched with the temporally closest spindle in the second set. If the overlap between two matched spindles divided by the duration of their union (i.e., the IoU) was greater than 20%, the pair was counted as a true positive (TP). Spindles in the first and second sets that did not have a match meeting the threshold were considered false positives (FP) and false negatives (FN), respectively. TPs, FPs, and FNs were summed over all jointly annotated recordings or data segments (i.e., when calculating the IoU-F1 score between two experts, we summed TPs, FPs, and FNs over all recordings annotated by both experts). The IoU-F1 score was then calculated as \(2 \cdot \text{TP} / (2 \cdot \text{TP} + \text{FP} + \text{FN})\).
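
The by-spindle scoring can be illustrated with a simplified sketch. It deviates from the description above in one respect that should be read as an assumption: each spindle of the first set is greedily paired with the not yet matched spindle of the second set that has the largest IoU, rather than with the temporally closest one. The function names and the convention of returning 0 when both sets are empty are likewise illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    intersection = min(a[1], b[1]) - max(a[0], b[0])
    if intersection <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0])
    return intersection / union


def iou_f1(first, second, threshold=0.2):
    """Return (f1, tp, fp, fn) for two sets of spindle annotations."""
    unmatched = list(second)
    tp = 0
    for a in first:
        best = max(unmatched, key=lambda b: iou(a, b), default=None)
        if best is not None and iou(a, best) > threshold:
            tp += 1
            unmatched.remove(best)  # each spindle can be matched only once
    fp = len(first) - tp
    fn = len(second) - tp
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return f1, tp, fp, fn


first = [(0, 1), (2, 3), (5, 6)]
second = [(0.1, 1.1), (2.9, 4.0)]
print(iou_f1(first, second))
```

Only the first pair exceeds the 20% threshold, so the example has one TP, two FPs, and one FN. To pool several recordings, the TP, FP, and FN counts would be summed before applying the formula.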

For the calculation of pairwise inter-rater agreement levels on the MODA dataset, we only considered expert pairs with at least five jointly annotated blocks equivalent to roughly 9.5 minutes of EEG data (280 expert pairs met this criterion).

Spindle characteristics: For the detected spindles, we computed spindle density, duration, frequency, and amplitude. Spindle density (spindles per minute, SPM) was determined by dividing the number of detected spindles in N2 sleep by the total duration of N2 sleep. Spindle duration was measured as the time between the first and last sample of each spindle. To calculate the frequency and amplitude of a spindle, we first applied a 4th-order Butterworth band-pass filter (10–16 Hz) to the unprocessed EEG signal. Spindle frequency was calculated as the average of the instantaneous frequencies that were determined as half the reciprocals of zero-crossing intervals of the band-pass filtered signal. Spindle amplitude was determined as the mean absolute value of the Hilbert-transformed filtered signal.
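
Two of these characteristics, density and frequency, can be sketched with the Python standard library. The sketch assumes that the input to `spindle_frequency` has already been band-pass filtered, and it locates zero crossings by linear interpolation between neighboring samples, which is an assumption not stated above. Amplitude estimation via the Hilbert transform is omitted, and all function names are illustrative.

```python
import math


def spindle_density(n_spindles, n2_seconds):
    """Spindles per minute of N2 sleep."""
    return n_spindles / (n2_seconds / 60.0)


def spindle_frequency(filtered, fs):
    """Mean instantaneous frequency (Hz) from zero-crossing intervals."""
    crossings = []
    for i in range(len(filtered) - 1):
        x0, x1 = filtered[i], filtered[i + 1]
        if (x0 < 0) != (x1 < 0):  # sign change between neighboring samples
            # assumption: linear interpolation of the crossing time
            crossings.append((i + x0 / (x0 - x1)) / fs)
    intervals = [t1 - t0 for t0, t1 in zip(crossings, crossings[1:])]
    if not intervals:
        raise ValueError("need at least two zero crossings")
    # one interval spans half a cycle, so f = 1 / (2 * interval)
    return sum(0.5 / dt for dt in intervals) / len(intervals)


fs = 100
tone = [math.sin(2 * math.pi * 13 * k / fs + 0.3) for k in range(fs)]
print(spindle_density(120, 30 * 60))
print(spindle_frequency(tone, fs))
```

For the synthetic 13 Hz tone, the estimate lies close to 13 Hz; the small deviation stems from the interpolation of crossing times at a sampling rate of 100 Hz.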

Data availability

The IS-RC dataset¹⁸ is available from Stephansen et al.¹⁹ and can be accessed at https://stanfordmedicine.app.box.com/s/r9e92ygq0erf7hn5re6j51aaggf50jly. The DREAMS dataset²⁰ is publicly available on Zenodo (https://doi.org/10.5281/zenodo.2650141). The MODA dataset²¹,⁴¹ is publicly available on the Open Science Framework (https://osf.io/8bma7/). The MASS dataset⁴⁰, which contains the EEG recordings, is publicly available and can be obtained from the Montreal Archive of Sleep Studies web page (http://ceams-carsm.ca/mass/). The BD dataset is not publicly available due to patient privacy concerns but may be made available from PR (Philipp.Ritter@ukdd.de) on request.

Code availability

The underlying code and definitions of training/validation/test datasets for this study are available on GitHub at https://github.com/dslaborg/sumov2. The code and model file for the RobustSleepNet model are available at https://github.com/Dreem-Organization/RobustSleepNet/tree/main/pretrained_model/0dfcee73-055a-4c4d-929c-8fdf630e14f1.

References

Gaiduk, M., Serrano Alarcón, A., Seepold, R. & Martínez Madrid, N. Current status and prospects of automatic sleep stages scoring: Review. Biomed. Eng. Lett. 13, 247–272. https://doi.org/10.1007/s13534-023-00299-3 (2023).
Phan, H. & Mikkelsen, K. Automatic sleep staging of EEG signals: Recent development, challenges, and future directions. Physiol. Meas. 43, 04TR01. https://doi.org/10.1088/1361-6579/ac6049 (2022).
Fiorillo, L. et al. Automated sleep scoring: A review of the latest approaches. Sleep Med. Rev. 48, 101204. https://doi.org/10.1016/j.smrv.2019.07.007 (2019).
Hermans, L. W. et al. Representations of temporal sleep dynamics: Review and synthesis of the literature. Sleep Med. Rev. 63, 101611. https://doi.org/10.1016/j.smrv.2022.101611 (2022).
Coppieters ’t Wallant, D., Maquet, P. & Phillips, C. Sleep spindles as an electrographic element: Description and automatic detection methods. Neural Plast. 2016, 1–19. https://doi.org/10.1155/2016/6783812 (2016).
Tapia, N. I. & Estevez, P. A. RED: Deep recurrent neural networks for sleep EEG event detection. In 2020 International Joint Conference on Neural Networks (IJCNN), 1–8, https://doi.org/10.1109/ijcnn48605.2020.9207719 (IEEE, 2020).
Yetton, B. D. et al. Automatic detection of rapid eye movements (REMs): A machine learning approach. J. Neurosci. Meth. 259, 72–82. https://doi.org/10.1016/j.jneumeth.2015.11.015 (2016).
Berry, R. B. et al. The AASM manual for the scoring of sleep and associated events: Rules, terminology and technical specifications, Version 2.6 (American Academy of Sleep Medicine, Darien, Illinois, 2020).
Rechtschaffen, A. et al. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects (Public Health Service, U.S. Government Printing Office, 1968).
Bakker, J. P. et al. Scoring sleep with artificial intelligence enables quantification of sleep stage ambiguity: hypnodensity based on multiple expert scorers and auto-scoring. Sleep 46. https://doi.org/10.1093/sleep/zsac154 (2022).
Guillot, A. & Thorey, V. RobustSleepNet: Transfer learning for automated sleep staging at scale. IEEE T. Neur. Sys. Reh. 29, 1441–1451. https://doi.org/10.1109/tnsre.2021.3098968 (2021).
Perslev, M. et al. U-Sleep: Resilient high-frequency sleep staging. npj Digit. Medicine 4, https://doi.org/10.1038/S41746-021-00440-5 (2021).
Olesen, A. N., Jørgen Jennum, P., Mignot, E. & Sorensen, H. B. D. Automatic sleep stage classification with deep residual networks in a mixed-cohort setting. Sleep 44. https://doi.org/10.1093/sleep/zsaa161 (2020).
Vallat, R. & Walker, M. P. An open-source, high-performance tool for automated sleep staging. eLife 10. https://doi.org/10.7554/elife.70092 (2021).
Hanna, J. & Flöel, A. An accessible and versatile deep learning-based sleep stage classifier. Front. Neuroinform. 17. https://doi.org/10.3389/FNINF.2023.1086634 (2023).
Kaulen, L., Schwabedal, J. T. C., Schneider, J., Ritter, P. & Bialonski, S. Advanced sleep spindle identification with neural networks. Sci. Rep. 12. https://doi.org/10.1038/s41598-022-11210-y (2022).
Ritter, P. S. et al. Sleep spindles in bipolar disorder – a comparison to healthy control subjects. Acta Psychiat. Scand. 138, 163–172. https://doi.org/10.1111/acps.12924 (2018).
Kuna, S. T. et al. Agreement in computer-assisted manual scoring of polysomnograms across sleep centers. Sleep 36, 583–589. https://doi.org/10.5665/sleep.2550 (2013).
Stephansen, J. B. et al. Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy. Nat. Commun. 9. https://doi.org/10.1038/s41467-018-07229-3 (2018).
Devuyst, S., Kerkhofs, M. & Dutoit, T. The DREAMS databases and assessment algorithm https://doi.org/10.5281/ZENODO.2650141 (2005).
Lacourse, K., Yetton, B., Mednick, S. & Warby, S. C. Massive online data annotation, crowdsourcing to generate high quality sleep spindle annotations from EEG data. Sci. Data 7, 190. https://doi.org/10.1038/s41597-020-0533-4 (2020).
Ferrarelli, F. Sleep abnormalities in schizophrenia: State of the art and next steps. Am. J. Psychiat. 178, 903–913. https://doi.org/10.1176/appi.ajp.2020.20070968 (2021).
Wendt, S. L. et al. Inter-expert and intra-expert reliability in sleep spindle scoring. Clin. Neurophysiol. 126, 1548–1556. https://doi.org/10.1016/j.clinph.2014.10.158 (2015).
Danker-Hopfe, H. et al. Interrater reliability between scorers from eight European sleep laboratories in subjects with different sleep disorders. J. Sleep Res. 13, 63–69. https://doi.org/10.1046/j.1365-2869.2003.00375.x (2004).
Danker-Hopfe, H. et al. Interrater reliability for sleep scoring according to the Rechtschaffen & Kales and the new AASM standard. J. Sleep Res. 18, 74–84. https://doi.org/10.1111/j.1365-2869.2008.00700.x (2009).
Silber, M. H. et al. The visual scoring of sleep in adults. J. Clin. Sleep Med. 03, 121–131. https://doi.org/10.5664/jcsm.26814 (2007).
Lee, Y. J., Lee, J. Y., Cho, J. H. & Choi, J. H. Interrater reliability of sleep stage scoring: A meta-analysis. J. Clin. Sleep Med. 18, 193–202. https://doi.org/10.5664/jcsm.9538 (2022).
Zhang, X. et al. Process and outcome for international reliability in sleep scoring. Sleep Breath 19, 191–195. https://doi.org/10.1007/s11325-014-0990-0 (2014).
Basner, M., Griefahn, B. & Penzel, T. Inter-rater agreement in sleep stage classification between centers with different backgrounds. Somnologie 12, 75–84. https://doi.org/10.1007/s11818-008-0327-y (2008).
Chambon, S., Thorey, V., Arnal, P., Mignot, E. & Gramfort, A. DOSED: A deep learning approach to detect multiple sleep micro-events in EEG signal. J. Neurosci. Meth. 321, 64–78. https://doi.org/10.1016/j.jneumeth.2019.03.017 (2019).
Kulkarni, P. M. et al. A deep learning approach for real-time detection of sleep spindles. J. Neural Eng. 16, 036004. https://doi.org/10.1088/1741-2552/ab0933 (2019).
You, J., Jiang, D., Ma, Y. & Wang, Y. SpindleU-Net: An adaptive U-Net framework for sleep spindle detection in single-channel EEG. IEEE T. Neur. Sys. Reh. 29, 1614–1623. https://doi.org/10.1109/tnsre.2021.3105443 (2021).
Tamamoto, Y., Fujie, T., Umimoto, K. & Nakamura, H. Factors affecting discrepancies between scorers in manual sleep spindle detections in single-channel electroencephalography in young adult males. Front. Sleep 3. https://doi.org/10.3389/frsle.2024.1427540 (2024).
Kocevska, D. et al. Sleep characteristics across the lifespan in 1.1 million people from the Netherlands, United Kingdom and United States: A systematic review and meta-analysis. Nat. Hum. Behav. 5, 113–122. https://doi.org/10.1038/s41562-020-00965-x (2020).
Guillot, A. & Thorey, V. Source code of the model presented in Guillot et al., “RobustSleepNet: Transfer learning for automated sleep staging at scale”. https://github.com/Dreem-Organization/RobustSleepNet (2021).
Chen, X. et al. Racial/Ethnic differences in sleep disturbances: The multi-ethnic study of atherosclerosis (MESA). Sleep https://doi.org/10.5665/sleep.4732 (2015).
Blackwell, T. et al. Associations between sleep architecture and sleep-disordered breathing and cognition in older community-dwelling men: The osteoporotic fractures in men sleep study. J. Am. Geriatr. Soc. 59, 2217–2225. https://doi.org/10.1111/j.1532-5415.2011.03731.x (2011).
Quan, S. F. et al. The sleep heart health study: Design, rationale, and methods. Sleep 20, 1077–1085. https://doi.org/10.1093/sleep/20.12.1077 (1997).
Guillot, A., Sauvet, F., During, E. H. & Thorey, V. Dreem open datasets: Multi-scored sleep datasets to compare human and automated sleep staging. IEEE T. Neur. Sys. Reh. 28, 1955–1965. https://doi.org/10.1109/tnsre.2020.3011181 (2020).
O’Reilly, C., Gosselin, N., Carrier, J. & Nielsen, T. Montreal archive of sleep studies: An open-access resource for instrument benchmarking and exploratory research. J. Sleep Res. 23, 628–635. https://doi.org/10.1111/jsr.12169 (2014).
Yetton, B., Lacourse, K., Delfrate, J., Mednick, S. & Warby, S. The MODA sleep spindle dataset: A large, open, high quality dataset of annotated sleep spindles https://doi.org/10.17605/OSF.IO/8BMA7 (2022).
Sudre, C. H., Li, W., Vercauteren, T., Ourselin, S. & Cardoso, M. J. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Cardoso, M. J. et al. (eds.) Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support - 3rd Int. Workshop, DLMIA 2017, and 7th Int. Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, vol. 10553 of Lecture Notes in Computer Science, 240–248, https://doi.org/10.1007/978-3-319-67558-9_28 (Springer, Québec City, QC, Canada, 2017).


Funding

Open Access funding enabled and organized by Projekt DEAL. This study was in part funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project-ID 521379614 – SFB/TRR 393 and Project-ID 454245598 – IRTG 2773. The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Author information

Authors and Affiliations

Department of Medical Engineering and Technomathematics, FH Aachen University of Applied Sciences, 52428, Jülich, Germany
Niklas Grieger & Stephan Bialonski
Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
Niklas Grieger & Siamak Mehrkanoon
Institute for Data-Driven Technologies, FH Aachen University of Applied Sciences, 52428, Jülich, Germany
Niklas Grieger & Stephan Bialonski
Department of Psychiatry and Psychotherapy, University Hospital Carl Gustav Carus, Technische Universität Dresden, 01307, Dresden, Germany
Philipp Ritter

Authors

Niklas Grieger
Siamak Mehrkanoon
Philipp Ritter
Stephan Bialonski

Contributions

NG and SB conceived the experiments; NG conducted the experiments; NG, SM, PR, and SB analyzed and discussed the results; NG and SB wrote the first draft of the manuscript; NG, SM, PR, and SB reviewed the manuscript.

Corresponding authors

Correspondence to Niklas Grieger or Stephan Bialonski.

Ethics declarations

Ethical approval

The collection and analysis of the BD dataset was approved by the Institutional Review Board (IRB00001473 and IORG0001076) at the University Hospital Carl Gustav Carus, Dresden. All other datasets were acquired from third-party databases and handled according to the relevant data sharing agreements.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Grieger, N., Mehrkanoon, S., Ritter, P. et al. From sleep staging to spindle detection: a case study on end-to-end automated sleep analysis. Sci Rep 16, 16014 (2026). https://doi.org/10.1038/s41598-026-53891-9


Received: 28 May 2025
Accepted: 14 May 2026
Published: 23 May 2026
Version of record: 23 May 2026
DOI: https://doi.org/10.1038/s41598-026-53891-9