Introduction

The accurate detection of sleep stages is essential for understanding, diagnosing, and treating sleep disorders. Alterations in sleep stages have been linked to conditions like obstructive sleep apnea (N1 changes)1, restless legs syndrome (decreased rapid eye movement (REM) and N2 stages)2, obstructive sleep apnea-hypopnea syndrome (N3 changes)3, and narcolepsy (disrupted N2/N3 to REM transition)4. Moreover, sleep disturbances are commonly associated with neurodegenerative diseases5,6,7,8,9,10, epilepsy11, chronic pain12,13, depression14,15 and anxiety16. These conditions exhibit a bidirectional relationship with sleep, suggesting that addressing sleep disturbances could potentially delay the onset or ameliorate the symptoms, thereby improving patients’ quality of life15,17. Beyond clinical disorders, a recent meta-analysis demonstrated that greater proportions of N2 and REM sleep are positively associated with cognitive performance in healthy older adults, reinforcing the broader importance of accurate sleep staging18.

The American Academy of Sleep Medicine (AASM)19 provides the most widely used framework for sleep staging, grouping sleep into non-REM (NREM), comprising N1, N2 and N3 stages, and REM sleep. In sleep staging, 30-second record segments, known as epochs, are classified based on the most dominant characteristics: alpha rhythms for Wake, theta waves for N1, K complexes and sleep spindles for N2, and delta waves for N3 or deep sleep. REM sleep is characterized by mixed frequency electroencephalography (EEG), rapid eye movements, and sawtooth waves19.

Polysomnography (PSG) is the gold standard for clinical sleep assessment and diagnostics, capturing EEG, electrooculogram (EOG), and electromyogram (EMG) activity, as well as breathing effort, airflow, pulse, and blood oxygen saturation. However, PSG studies are expensive and time-consuming, require professional supervision and manual labelling, and suffer from high inter-rater variability (82.6% average agreement)20. Recent advancements in automatic sleep staging algorithms, as reviewed by Gaiduk et al.21, have shown high agreement with expert scorers and are increasingly being developed for clinical integration. These tools can accelerate diagnosis, reduce manual workload, and enable responsive interventions, such as real-time monitoring in sleep labs. Automated scoring may also help mitigate inter-rater variability, contributing to more consistent and objective assessments. Despite these advances, PSG remains uncomfortable for many patients, often requiring overnight stays in unfamiliar environments. Moreover, one-night PSG studies may not accurately reflect a patient’s typical sleep patterns and fail to capture night-to-night variability22,23. Recent advancements in dry electrodes have facilitated the development of wearable EEG sleep monitoring devices (wEEGs)24. These devices, which use electrodes placed on the forehead, ear, or neck to acquire EEG signals, offer a practical alternative to traditional PSG. Suitable for home use, they enable continuous sleep monitoring without the need for professional supervision.

Despite the growing popularity of wEEGs, their performance in accurately classifying sleep stages remains a subject of ongoing research. Prior systematic reviews have identified EEG as the most accurate sensing modality for detecting all sleep stages, whereas photoplethysmography (PPG)-based systems, while more user-friendly, are limited in their staging accuracy25. A recent meta-analysis of EEG-based wearable devices has further highlighted the growing number of validated systems available, as well as considerable variability in their measurement properties, including accuracy, target population, and device features26. Nevertheless, the lack of standardized evaluation frameworks and inconsistent reporting of stage-specific performance metrics continue to hinder direct comparison between studies and devices. While overall per-epoch accuracy (ACC) is commonly reported, it may not adequately reflect the true performance of wEEGs due to the inherent class imbalance across sleep stages.

This meta-analysis evaluates the overall and sleep-stage-specific performance of wEEGs compared to PSG. Furthermore, we propose the use of Matthews correlation coefficient (MCC) as an alternative balanced evaluation metric. By providing a comprehensive evaluation of the wEEGs’ performance across sleep stages, this meta-analysis contributes valuable insights into their strengths and areas for further development, ultimately contributing to the optimization of wEEGs for clinical and consumer applications.

Results

This section presents the findings from 43 validation studies that compare the performance of wEEGs in sleep stage classification with reference methods. Thirty-two studies22,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57 classified five stages—Wake, N1, N2, N3, and REM (Table 1). Seven studies58,59,60,61,62,63,64 classified four stages—Wake, Light (combining N1 and N2), Deep, and REM Sleep (Table 2). The remaining four studies used a three-stage classification (Wake, NREM, REM)65,66 or non-standard staging systems—one classified Wake, N1, N2 & N3, and REM67, and another reported Wake, N1, N2, and N3 without REM68 (Table 3).

Table 1 Overview of validation studies classifying five stages
Table 2 Overview of validation studies classifying four stages
Table 3 Overview of validation studies using other than five- or four-stage systems

Overview of study characteristics

Figure 1 summarizes key characteristics of the 43 validation studies included in this review. Most validation studies involved single-night sleep recordings (n = 34), were conducted in controlled environments such as sleep laboratories (n = 32), and primarily included healthy participants (n = 29). Home-based studies (n = 11) were generally set up by professionals, with the exception of one study47, where participants fitted the device themselves. Only a few studies included clinical populations (n = 8; e.g., obstructive sleep apnea29,36,39,44,59, epilepsy50, Parkinson’s disease58, and sleep disorders67), and six studies involved mixed cohorts, including both healthy and clinical participants—for example, one study58 primarily enrolled healthy individuals (n = 50) but also included a subgroup of patients with Alzheimer’s disease (n = 12). The number of participants per study ranged from 1 to 197 (mean = 30 ± 37). The volume of data analyzed ranged widely, from 293 to 195,349 epochs (mean = 29,969 ± 37,072), although epoch counts were not available for four studies. Gender representation was approximately balanced among the studies reporting it (57% male, 43% female), though five studies did not provide gender breakdowns. Participant age was skewed toward young to middle-aged adults, with limited inclusion of adolescents and older adults.

wEEGs used across studies differed in electrode placement, form factor, and electrode type. Of the 43 included studies, 19 evaluated commercially available wEEG devices—most frequently Dreem28,32,58, Sleep Profiler22,34,67, Somfit37,38, and Zeo60,61, while 24 assessed prototypes. Electrode placement was most commonly on the forehead (n = 26), followed by the ear (n = 16), with one study64 using a neck-worn device. Most devices with forehead electrodes were headbands (n = 15), but other designs included sheet-like or patch-based systems (n = 8) and sleep masks (n = 3). Electrode usage patterns in forehead devices also revealed strong clustering around a few standard positions. As shown in Fig. 2, Fp1 and Fp2 were the most commonly used electrodes, followed by Fpz, F7, F8, AF7, and AF8. The three most frequent electrode combinations corresponded to well-known devices: the Dreem headband69 (Fpz, F7, F8, O1, O2), the Sleep Profiler70 (Fpz, AF7, AF8), and the Zeo headband71 (Fp1, Fpz, Fp2). Notably, the Fp1–Fp2–Fpz configuration appeared in five studies, including two that used the Zeo device and three others30,62,65. Among ear-based devices, the majority were in-ear designs (n = 12), with a smaller number of around-the-ear configurations (n = 4). The number of electrodes ranged from 1 to 16 (mean = 6 ± 4). Dry electrodes were used in 24 studies, while 19 employed wet electrodes.

Sleep staging methods applied to the wEEG data varied between machine learning (n = 11), manual scoring (n = 10), proprietary algorithms (n = 9), and deep learning (n = 8). Five studies22,27,31,34,41 used a combination of methods, meaning they reported results from more than one approach (e.g., deep learning alongside manual scoring by an expert). To evaluate sleep staging accuracy, most studies used expert-scored PSG as the reference standard, adhering to AASM guidelines (n = 40). Ground truth data were typically scored by a single rater (n = 25).
Reported evaluation metrics varied across studies: overall device performance was most commonly assessed using Cohen’s Kappa (n = 37) and accuracy (n = 24), while stage-specific metrics such as sensitivity (n = 33) and positive predictive value (n = 18) were frequently used. Confusion matrices were provided in 34 studies (Fig. 3).

Fig. 1: Characteristics of the 43 included validation studies.
figure 1

a Electrode positions: Most studies used forehead EEG (n = 26), followed by ear-based devices (n = 16); one study used a neck-worn sensor. b Device type: Prototype devices (n = 24) were more common than commercially available systems (n = 19). c Electrode type: Dry electrodes were used in 24 studies, while 19 employed wet electrodes. d Participants’ health status: Most studies involved healthy participants (n = 29), with fewer including clinical populations (n = 8) or mixed cohorts (n = 6). e Study environment: The majority of studies were conducted in controlled environments (n = 32), while 11 were home-based. f Device scoring method: Machine learning (n = 11), manual scoring (n = 10), and proprietary algorithms (n = 9) were the most common, with some studies using deep learning (n = 8) or multiple methods (n = 5). g Age distribution: Most studies involved adults in their mid-20s to mid-30s. Mean age, standard deviation (black lines), and range (white bars) are shown. Older adults and adolescents were underrepresented.

Fig. 2: Overview of electrode use across forehead-EEG studies.
figure 2

a Electrode usage frequency: Scalp heatmap and bar chart showing the frequency of electrode positions used in studies with forehead-placed EEG, based on the international 10–20 system. The most frequently used electrodes were Fp1 and Fp2, followed by Fpz, F7, F8, AF7, and AF8. b Most common electrode combinations and corresponding devices: The three most common electrode combinations and their associated device types. Top: Dreem headband (Fpz, F7, F8, O1, O2); Middle: Sleep Profiler (Fpz, AF7, AF8); Bottom: Zeo headband (Fp1, Fpz, Fp2). While Dreem and Sleep Profiler were each used in three studies, the Fp1–Fp2–Fpz combination appeared in five studies, including two using the Zeo device.

Fig. 3: Evaluation metrics reported across included studies.
figure 3

a Reported evaluation metrics per study: Presence (green) or absence (red) of overall and sleep-stage-specific evaluation metrics reported in each study. b Frequency of reported metrics across studies. Total number of studies reporting each evaluation metric, grouped by overall vs. stage-specific metrics. For devices’ overall performance, κ (n = 37) and ACC (n = 24) were the most frequently reported metrics. For sleep-stage specific performance, SE (n = 33) and PPV (n = 18) were the most frequently reported metrics. Confusion matrices were provided in 34 studies. ACC: accuracy, CM: confusion matrix, κ: Cohen’s Kappa, MCC: Matthews correlation coefficient, NPV: negative predictive value, PPV: positive predictive value, SE: sensitivity, SP: specificity.

Analysis of validation studies

The overview of overall and sleep-stage-specific evaluation metrics, along with the proportions of each sleep stage for both five- and four-stage studies, are presented in Figs. 4 and 5, respectively. The data for these figures can be found in Supplementary Tables S1–S4. Additionally, the Bland–Altman plots, which offer insights into variations in sleep stage duration and proportion estimations by wEEGs compared to PSG, are detailed in Supplementary Figs. S1 and S2.

Fig. 4: Evaluation metrics for studies classifying five sleep stages using wEEGs.
figure 4

Each panel (a, c–g) includes a boxplot of metric distributions and a bar chart showing the top 20 studies by MCC score for that specific stage. Asterisks (*) indicate statistically significant differences (p < 0.05) from MCC (a, c–g) or between micro- and macro-averaged versions of the same metric in (b). a Overall metrics: Results demonstrated that overall accuracy (ACC) was significantly higher than Cohen’s kappa (κ) and MCC. b Comparison of micro- and macro-averaged metrics: Macro-averaged ACC was significantly higher than micro-averaged ACC, while micro-averaged SE, PPV, and F1 were significantly higher than their macro-averaged counterparts, highlighting class imbalance effects. c Wake stage: Wake showed the highest ACC among all stages, indicating reliable wake-sleep differentiation, though detection (SE) varied across studies. d N1 stage: N1 was the most challenging stage, with the lowest scores across κ, SE, PPV, F1, and MCC. e N2 stage: N2 showed relatively high metrics overall. As the most prevalent class, it exhibited smaller differences between metrics emphasizing TPs (e.g., SE) and TNs (e.g., SP) than other stages. Nevertheless, its MCC and κ scores were noticeably lower, suggesting potential misclassifications. f N3 stage: N3 showed the highest κ and MCC, with strong SE and excellent agreement across studies. g REM stage: REM had the second-highest SE and MCC but the lowest PPV, indicating a tendency for false positives. h Mean proportional distribution of sleep stages across studies: Wake (18%), N1 (7%), N2 (41%), N3 (17%), REM (17%). ACC: accuracy; κ: Cohen’s kappa; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; PPV: positive predictive value; NPV: negative predictive value.

Fig. 5: Evaluation metrics for studies classifying four sleep stages using wEEGs.
figure 5

Each panel (a, c–g) includes a boxplot of metric distributions and a bar chart showing the top studies by MCC score for that specific stage. Asterisks (*) indicate statistically significant differences (p < 0.05) from MCC (a, c–g) or between micro- and macro-averaged versions of the same metric in (b). a Overall metrics: Studies showed an average ACC of 0.80 ± 0.08, significantly higher than κ (0.70 ± 0.11) and MCC (0.70 ± 0.11). b Comparison of micro- and macro-averaged metrics: Macro-averaged ACC was significantly higher than micro-averaged ACC. c Wake: Demonstrated κ = 0.68 ± 0.14, MCC = 0.69 ± 0.16, highest NPV and lowest SE (0.68 ± 0.18), suggesting strong ability to exclude Wake epochs but weaker sensitivity. d Light sleep: The most prevalent stage (53% of epochs) had the lowest κ (0.61 ± 0.11) and MCC (0.65 ± 0.11), along with the lowest ACC, SP, and NPV, but highest PPV. e Deep sleep: Showed the strongest performance overall, with the highest κ (0.77 ± 0.13), MCC (0.75 ± 0.13), ACC, and SE (0.80 ± 0.07), indicating robust detection. f REM: Achieved high SE (0.78 ± 0.14), κ (0.65 ± 0.18), and MCC (0.72 ± 0.10), but had the lowest PPV, suggesting higher false positives in REM classification. g Mean stage distribution across studies: Light sleep (53%), Deep sleep (17%), REM (18%), and Wake (12%). ACC: accuracy; κ: Cohen’s kappa; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; PPV: positive predictive value; NPV: negative predictive value.

Studies classifying Wake, N1, N2, N3, REM

On average, the studies showed an overall ACC of 0.78 ± 0.07 and an overall κ of 0.68 ± 0.10, suggesting substantial agreement between wEEGs and PSG. MCC (0.70 ± 0.08) and κ were significantly lower compared to ACC. Significant differences were also observed between micro- and macro-averaged ACC, SE, PPV and F1, highlighting the effect of class imbalance on evaluation outcomes. The top three overall MCC results were reported by Li et al.27 (WPSG-I headband; MCC = 0.899 for manual scoring and 0.795 for automatic scoring) and Borup et al.46 (in-ear prototype72; MCC = 0.784) (Fig. 4a, b).

The Wake stage, representing 18% of epochs, showed strong agreement between wEEGs and PSG with a mean κ of 0.75 ± 0.12, MCC of 0.76 ± 0.12, and SE of 0.79 ± 0.13. It had the highest accuracy among all stages. Bland–Altman analysis revealed a slight underestimation by wearable devices in duration (–3.20 min; limits of agreement (LoA): –21.13 to +14.73 min) and in proportion (–0.69%; LoA: –4.67 to +3.30%). High intraclass correlation coefficients (ICCs; 1.00 for duration, 0.99 for proportion) further supported the robustness of these measurements. The best Wake-stage MCCs were reported by Li et al.27 (WPSG-I headband; MCC = 0.950 and 0.906) and Borup et al.46 (in-ear prototype72; MCC = 0.883) (Fig. 4c).

The N1 stage, comprising just 7% of all epochs, showed the weakest agreement between wEEGs and PSG. It had the lowest mean κ (0.31 ± 0.15), MCC (0.33 ± 0.15), and SE (0.36 ± 0.18) of all stages, and exhibited high variability across studies, reflecting inconsistent detection. It also had the lowest average PPV. Duration was marginally overestimated (+0.64 min; LoA: –33.83 to +35.12 min), and proportion slightly overestimated (+0.32%; LoA: –7.83% to +8.47%). ICCs were 0.75 and 0.72 for duration and proportion, respectively, indicating moderate-to-good agreement. Top-performing results for N1-stage MCC were reported by Li et al.27 (WPSG-I headband; MCC = 0.646), Seol et al.29 (Insomnograf K2 headband73; MCC = 0.562), and Borges et al.44 (in-ear prototype74,75; MCC = 0.558) (Fig. 4d).

The N2 stage, the most prevalent (41% of epochs), demonstrated strong classification performance with a mean SE of 0.82 ± 0.08, while κ and MCC were slightly lower (both 0.69 ± 0.09). It showed the highest SE and F1 across stages, but the lowest ACC, SP, and NPV. wEEGs slightly overestimated N2 duration (+1.51 min; LoA: –46.75 to +49.78 min) and proportion (+0.23%; LoA: –11.09% to +11.55%). ICCs were 0.79 and 0.73 for duration and proportion, respectively, indicating good agreement. Top MCC results were reported by Li et al.27 (WPSG-I headband; MCC = 0.906 and 0.794 for manual and automatic scoring, respectively) and Matsumori et al.42 (sheet-like prototype device; MCC = 0.829) (Fig. 4e).

The N3 stage, also known as Deep Sleep, representing 17% of epochs, had the highest κ (0.76 ± 0.10), MCC (0.78 ± 0.07), and SP, PPV, and NPV among all stages, with a high SE of 0.79 ± 0.10. wEEGs slightly overestimated N3 duration (+1.93 min; LoA: –21.37 to +25.70 min) and proportion (+0.38%; LoA: –4.95 to +4.20%). ICCs of 0.93 for both duration and proportion reflect excellent agreement. Leading results came from Li et al.27 (WPSG-I headband; MCC = 0.912), Tabar et al.47 (in-ear prototype; MCC = 0.864), and Mikkelsen et al.52 (in-ear prototype72; MCC = 0.863) (Fig. 4f).

The REM stage, also accounting for 17% of epochs, showed strong performance with SE = 0.79 ± 0.13, κ = 0.74 ± 0.13, and MCC = 0.75 ± 0.13. Duration and proportion were slightly overestimated (+1.90 min; LoA: –15.56 to +19.36 min; +0.36%; LoA: –4.95% to +4.20%). ICCs were 0.95 and 0.93, indicating excellent reliability. The highest REM MCCs were observed in Kwon et al.41 (sheet-like patch prototype; MCC = 0.917), Jørgensen et al.50 (in-ear prototype72; MCC = 0.903), and Borges et al.44 (in-ear prototype74,75; MCC = 0.894) (Fig. 4g).

In the Wake, N1, N3, and REM stages, significant differences were observed between ACC, SP, PPV, NPV and the balanced metric MCC, indicating that high values of these metrics may mask poor classification performance in minority classes.

Studies classifying Wake, Light Sleep, Deep Sleep and REM

Studies classifying four stages showed an average overall ACC of 0.80 ± 0.08, and κ and MCC of 0.70 ± 0.11, indicating substantial agreement between wEEGs and PSG. A significant difference was observed between micro- and macro-averaged ACC (Fig. 5a, b).

The Wake stage, comprising 12% of all epochs, showed notable variability across studies. It had a mean κ of 0.68 ± 0.14, MCC of 0.69 ± 0.16, and the lowest SE among all stages (0.68 ± 0.18), but achieved the highest NPV. According to Bland–Altman analysis, Wake duration was slightly overestimated (+2.75 min; LoA: –31.56 to +26.07 min), with near-zero bias in proportion (+0.01%; LoA: –0.07% to +0.07%). ICCs were 0.90 for duration and 0.86 for proportion, indicating strong agreement (Fig. 5c).

The Light Sleep stage, constituting 53% of all epochs, showed the lowest mean κ (0.61 ± 0.11), MCC (0.65 ± 0.11), ACC, SP, and NPV. In contrast, it had the highest PPV and a solid SE (0.77 ± 0.09). Bland–Altman analysis showed a small underestimation in duration (–1.22 min; LoA: –33.39 to +30.95 min) and proportion (–0.03%; LoA: –0.14 to +0.09%). Duration agreement was excellent (ICC = 0.98), but proportion agreement was poor (ICC = –0.07), suggesting inconsistency in proportional estimates (Fig. 5d).

The Deep Sleep stage (N3), making up 17% of epochs, showed the highest κ (0.77 ± 0.13), MCC (0.75 ± 0.13), ACC and SE (0.80 ± 0.07). Duration was slightly overestimated (+0.28 min; LoA: –17.03 to +17.56 min), with minimal bias in proportion (–0.01%; LoA: –0.07% to +0.07%). ICCs were strong for both duration (ICC = 0.97) and proportion (ICC = 0.86) (Fig. 5e).

The REM stage, accounting for 18% of epochs, had high SE (0.78 ± 0.14), MCC (0.72 ± 0.10), and κ (0.65 ± 0.18), but the lowest PPV, suggesting more frequent false positives. Duration was slightly overestimated (+1.28 min; LoA: –17.45 to +20.02 min), with minimal bias in proportion (+0.03%; LoA: –0.04% to +0.13%). Agreement on duration was good (ICC = 0.80), but poor for proportion (ICC = –0.31) (Fig. 5f).

In the Wake, Deep Sleep and REM stages, significant differences between ACC, SP, NPV and MCC were observed.

Influence of methodological variabilities on wEEG performance

The impact of methodological variabilities on wEEG performance in detecting sleep stages was analyzed, focusing on five-stage classification studies due to the limited sample size of four-stage studies. Several significant associations between study design features and evaluation metrics suggest that methodological choices can affect reported performance.

wEEG performance varied by participant health status: studies involving clinical or mixed populations showed significantly higher performance across macro-averaged and Wake, N1, N3, and REM metrics compared to studies including only healthy participants (Fig. 6I). N3 F1 performance was negatively associated with the number of participants, possibly reflecting greater heterogeneity in larger samples (Fig. 6II). In contrast, Wake-stage κ, F1, and MCC were positively associated with the number of epochs, indicating more robust classification with larger datasets (Fig. 6III). Study environment also influenced performance: home-based studies showed significantly higher N1 ACC and N3 κ, SE, F1, and MCC, while controlled environments yielded higher REM-stage ACC and NPV (Fig. 6IV). However, limited results for clinical, mixed, and home-based subgroups may restrict the generalizability of these findings.

Fig. 6: Study-level factors significantly associated with variations in wEEG performance.
figure 6

This figure presents only those comparisons where significant differences (p < 0.05) in evaluation metrics were observed. Full analyses across all stages and variables—including non-significant results—are available in the Supplementary figures. Asterisks (*) indicate statistically significant group differences based on Mann–Whitney U tests. I Influence of participants’ health status on wEEG performance: Boxplots show statistically significant differences between healthy, clinical, and mixed populations in macro-averaged metrics (Ia), Wake (Ib), N1 (Ic), N3 (Id), and REM (Ie) stages. Clinical and mixed-population studies generally reported higher performance, especially for κ, SE, F1, and MCC. II Influence of number of participants: Pearson correlation analysis revealed a significant negative correlation between the number of participants and N3 F1 score, suggesting reduced consistency in larger, more heterogeneous cohorts. III Influence of number of epochs: Significant positive correlations were found between the number of epochs and Wake-stage κ, F1 and MCC, indicating improved Wake classification performance with greater data volume. IV Influence of study environment: Boxplots comparing controlled (e.g., sleep lab or hospital) and home-based settings show significantly better performance in home studies for N1 ACC (IVa) and N3-stage κ, SE, F1, and MCC (IVb). Controlled settings yielded higher REM-stage ACC and NPV (IVc). ACC: accuracy; κ: Cohen’s kappa; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; PPV: positive predictive value; NPV: negative predictive value; F1: F1 score.

Device characteristics had mixed effects on wEEG performance, with few statistically significant group differences. No significant differences were found between commercial and prototype devices (Supplementary Fig. S5, Table S7). The only significant effect of electrode placement was higher REM-stage NPV for forehead-based systems (Fig. 7I). Dry electrodes showed significantly higher N2 PPV and N3 NPV than wet electrodes (Fig. 7II). The scoring method had the most consistent impact: machine learning was significantly outperformed by deep learning (macro κ, F1, MCC; N1 κ, F1, MCC; N2 F1) and by manual scoring (macro F1; N1 κ, F1, MCC; REM ACC, SP). Proprietary algorithms also outperformed machine learning in Wake PPV, N2 NPV, N3 SP, and REM NPV (Fig. 7III), though the small number of studies (n = 5) limits interpretation. Importantly, in the five studies22,27,31,34,41 that evaluated multiple scoring methods on the same data, manual scoring consistently outperformed automatic scoring, and deep learning outperformed classical machine learning (Table 1). Electrode count showed significant positive correlations with N3 κ, SE, MCC, and F1, supporting the value of increased spatial coverage for slow-wave sleep detection, while REM NPV showed a weaker but significant negative correlation (Fig. 7IV). In summary, while most device features had limited impact, electrode count and scoring method significantly influenced N3 and REM-stage performance (Supplementary Figs. S6–S8; Tables S8–S10).

Fig. 7: Device- and electrode-related factors associated with significant differences in wEEG performance.
figure 7

This figure presents only those comparisons where significant differences (p < 0.05) in evaluation metrics were observed. Full analyses across all stages and variables—including non-significant results—are available in the Supplementary figures. Asterisks (*) indicate statistically significant group differences based on Mann–Whitney U tests. I Electrode position: Panel Ia shows a significant difference in REM-stage NPV, which was higher in studies using forehead electrodes. II Electrode type: Panels IIa and IIb show significantly higher N2 PPV and REM NPV in studies using dry electrodes. III Scoring method: Deep learning outperformed machine learning in macro-averaged κ, F1, and MCC, as well as N1-stage κ, F1, and MCC and N2 F1. Manual scoring showed significantly better performance than machine learning in macro-averaged F1 and several Wake- and REM-stage metrics. Proprietary algorithms yielded significantly higher performance than machine learning in Wake-stage PPV, N2-stage NPV, N3-stage SP, and REM-stage NPV. IV Number of electrodes: Pearson correlation analysis revealed significant positive associations between the number of recording electrodes and N3-stage performance metrics, including κ, SE, MCC, and F1. A significant negative correlation was observed for REM-stage NPV. ACC: accuracy; κ: Cohen’s kappa; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; PPV: positive predictive value; NPV: negative predictive value; F1: F1 score.

Sensitivity analysis

To assess robustness, a leave-one-out sensitivity analysis was conducted across all 36 results from the 32 studies classifying five stages. No individual study exceeded the pre-defined ±0.03 threshold for altering pooled values; the largest absolute shifts were ≤0.012, and relative changes stayed under 3.5%. The most sensitive metrics were in the N1 stage (κ, SE, F1, and MCC), where omitting the highest-performing studies (e.g., Li et al.27) slightly lowered scores. Notably, Mikkelsen et al.53 appeared most frequently as the study with the highest influence across multiple metrics, particularly in overall and REM-stage results. Thus, the meta-analytic results can be considered robust when aggregating across all five-stage studies (Supplementary Fig. S9, Supplementary Table S11).

Subgroup-specific analyses revealed less robustness, with some comparisons highly sensitive to individual studies. The N1 stage was most affected, reflecting the inherent difficulty of classifying this stage and highlighting its susceptibility to methodological and sample-related variation. Metrics emphasizing class imbalance and agreement (κ, F1, MCC, PPV) fluctuated more than overall accuracy. Instability was concentrated in subgroups with few studies—especially clinical (n = 5), mixed-population (n = 7), home-based (n = 10), and proprietary-algorithm (n = 5) groups. In contrast, estimates from healthy cohorts, controlled lab settings, and devices using forehead electrode placement or dry electrodes were robust. A few studies—particularly Ravindran et al.28, Li et al.27, and Borges et al.44—consistently drove subgroup results: excluding Ravindran et al.28 raised averages, while omitting Li et al.27 or Borges et al.44 lowered them (Supplementary Table S12). In summary, while aggregated findings are reliable, differences between methodological groups should be interpreted with caution, as they are often driven by just one or two influential studies.

Discussion

The rising popularity of wEEGs underscores the need for a standardized assessment of their ability to classify individual stages. Misclassifications can compromise sleep architecture analysis, and, consequently, affect therapeutic decisions and sleep quality evaluations. Both five- and four-stage studies revealed variability in wEEGs’ performance across studies. Our analysis identified the sleep stages where wEEGs’ classifications were most and least aligned with PSG.

Overall performance results from 32 studies classifying five sleep stages showed clear discrepancies between overall accuracy and more balanced metrics like MCC and κ, highlighting limitations in wEEGs’ ability to classify individual stages. These discrepancies were amplified by differences between micro- and macro-averaged metrics. Micro-averaging, which weighs each epoch equally, inflated SE, PPV, and F1 due to the dominance of the N2 stage, but masked underperformance in less frequent stages such as N1. In contrast, macro-averaging, which gives equal weight to each class, yielded a higher ACC by amplifying true negative counts in less frequent stages like N1. These findings underscore the need to prioritize balanced evaluation metrics when validating wEEG performance to account for class imbalances inherent in sleep stage distributions.

The Wake stage showed the highest classification accuracy, with excellent agreement to PSG despite slight underestimation in duration. This confirms that wake-sleep differentiation remains a key strength of wEEGs. N1, the least frequent stage, emerged as the most challenging to classify, reflecting its transitional nature and less distinct EEG characteristics. Significant variability in its detection reliability across studies, shown by the largest standard deviations in various metrics and the greatest estimation variability in the Bland–Altman analysis, highlights the need for improved automatic detection of this stage by wEEGs. N2, the most dominant stage, yielded high SE but significantly lower scores in the balanced metrics MCC and κ, indicating frequent inclusion of N2 epochs but not necessarily correct exclusion of others. These observations highlight concerns regarding over-reliance on SE, the most frequently reported metric for stage-specific performance, which potentially overlooks N2 classification challenges. N3 emerged as the most reliably detected stage. It consistently yielded high balanced scores with minimal inter-study variability and strong agreement in both duration and proportion, supporting its robust detection by wEEGs. REM sleep was also generally well classified but displayed higher variability across studies. This, along with slight overestimation in duration, indicates a need for further refinement, especially given the clinical importance of REM-related sleep disorders. Importantly, the sensitivity analysis showed that no single study significantly altered the pooled results, confirming the robustness of the overall meta-analytic findings.

When examining the top studies for overall and stage-specific MCC in the five-stage classification, several consistently stood out. Li et al.27 reported strong results using the WPSG-I headband with six dry electrodes in a controlled setting with 20 mixed (healthy and clinical) participants. Both manual and proprietary algorithmic scoring were evaluated, with manual scoring performing particularly well, though automatic results were also among the best. Borup et al.46 achieved high MCCs using a dry in-ear prototype72 in a home setting with 20 healthy participants over four nights, applying a personalized deep learning model. Borges et al.44 achieved comparable results using a wet in-ear prototype74,75 on mostly participants with obstructive sleep apnea (OSA) in a lab setting, with automatic scoring. While the results suggest strong device performance, they likely reflect not only device quality but also study design factors.

Several methodological variables significantly influenced wEEG performance across studies. While electrode placements (forehead or ear) and types (dry or wet) showed limited impact, the scoring method and electrode count consistently influenced stage-specific outcomes. Manual and deep learning-based scoring generally outperformed machine learning, highlighting the importance of algorithm complexity and human oversight, especially for difficult stages like N1. Electrode count was positively associated with N3 performance, reinforcing the importance of spatial resolution for detecting slow-wave sleep. Participant health status and recording environment also affected performance, though these findings should be interpreted cautiously given the limited sample sizes in some subgroups. While these results underscore the importance of methodological context when evaluating wEEG systems, many subgroup-level differences were highly sensitive to the inclusion or exclusion of just one or two studies. This was particularly true for underrepresented methodological groups, such as those involving clinical or mixed populations, home-based recordings, or deep learning and proprietary algorithms, where few studies were available. As a result, differences between these subgroups should be interpreted with caution, as they may reflect study-specific effects rather than consistent, generalizable patterns.

A smaller subset of studies reported results using a four-stage classification system, and although limited in number (n = 7), they offered additional insights that in part mirrored the five-stage findings. As with five-stage studies, ACC tended to overestimate performance relative to balanced metrics such as MCC and κ, again pointing to inconsistencies in individual stage classification. Differences between micro- and macro-averaged ACC were also present, though performance across SE, PPV, and F1 was more uniform, suggesting less skew from dominant stages. Wake stage showed substantial variability across studies, with the lowest SE but high NPV, and relatively strong agreement in duration and proportion, indicating wEEGs effectively detected its absence but often missed its presence. Light Sleep mirrored challenges seen with N2, with low MCC and κ, signaling misclassification issues. Deep Sleep was the most reliably detected, echoing five-stage results, with high MCC and minimal bias in proportion estimates. REM was generally well classified, though lower MCC and overestimation in duration again highlighted areas for refinement.

The significant variability in stage-specific performance across studies highlights the need for a consistent and robust approach to evaluating wEEG-based sleep staging. Commonly used metrics such as overall ACC, SP, and NPV tend to overstate performance, largely due to the disproportionate influence of TNs in class-imbalanced data. This was evident in our analysis, where these metrics consistently outperformed more balanced alternatives like MCC across both five- and four-stage classification systems. To ensure comparability and reliability in future research, we propose a standardized framework for validating wEEG systems against PSG:

  1. Use of balanced metrics (MCC or κ): For both overall and stage-specific evaluation, we recommend MCC and κ. Unlike overall accuracy, which can be misleading in imbalanced datasets, these metrics consider all elements of the confusion matrix and remain informative even when some stages (e.g., N1 or Wake) are underrepresented. MCC, in particular, offers a single, interpretable value that reflects the overall quality of classification and is increasingly recommended for imbalanced settings76. Stage-specific evaluation should rely on per-class MCC or κ, which reflect true classification quality even for minority stages like N1, and be complemented by sensitivity (SE) to highlight detection ability. A minimal numeric illustration follows this list.

  2. Reporting confusion matrices: Confusion matrices detailing the labels from wEEGs and from manually scored PSG should be included to enable detailed analysis.

  3. Adopting standardized staging systems: Sleep should be classified into Wake, N1, N2, N3, and REM, or into Wake, Light Sleep, Deep Sleep, and REM. Four studies in this meta-analysis did not employ five- or four-stage classification systems, making them incomparable to other studies.
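To make the case for balanced metrics concrete, the short Python sketch below scores a hypothetical device that never outputs the rare N1 stage. The 7% prevalence matches the pooled five-stage distribution reported above; all other counts are synthetic and purely illustrative.

```python
# One-vs-rest view of the rare N1 stage for a hypothetical device that
# never predicts N1 (synthetic counts; 7% N1 prevalence as reported above).
n_total = 10_000
n_n1 = 700                       # epochs the PSG reference labels as N1
tp, fn = 0, n_n1                 # no N1 epoch is ever detected...
fp = 0                           # ...and N1 is never falsely predicted
tn = n_total - n_n1

acc = (tp + tn) / n_total        # 0.93 -- looks excellent
sp = tn / (tn + fp)              # 1.00 -- "perfect" specificity
npv = tn / (tn + fn)             # 0.93
se = tp / (tp + fn)              # 0.00 -- the stage is missed entirely
# MCC's denominator (TP+FP)(TP+FN)(TN+FP)(TN+FN) contains a zero factor
# here; by common convention the score is reported as 0, i.e., random-level.
print(acc, sp, npv, se)
```

Accuracy, specificity, and NPV all look excellent while the stage is never detected; only the balanced metrics (and SE) expose the failure.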

This meta-analysis, while comprehensive, faced certain limitations. First, potential language and publication bias may have influenced the included studies, as only English-language, peer-reviewed publications were considered. This may have excluded relevant non-English data, potentially skewing results. Additionally, four analyzed studies did not employ five- or four-stage classification systems, making them incomparable to others. Moreover, most studies included in this meta-analysis focused on participants in their mid-20s to mid-30s, leading to a notable age bias. This overrepresentation limits the generalizability of findings, particularly for older adults who experience age-related changes in sleep architecture, such as reduced N3 and REM sleep, increased fragmentation, and altered EEG signal patterns, all of which may affect wEEG accuracy. Clinical populations were also underrepresented. Most studies were conducted on healthy participants, despite the increasing interest in applying wEEG systems to monitor or diagnose sleep disorders. Altered sleep architecture, comorbidities, medication use, and atypical sleep patterns may alter EEG signals and algorithm performance in ways not captured by current validation studies.

Future research directions should focus on expanding the scope of wEEG validation studies. Dedicated studies involving both older adults and clinical cohorts are critical to establishing the real-world reliability and clinical applicability of wEEG systems. Longitudinal and ambulatory studies capturing night-to-night variability and real-life usage conditions would improve ecological validity. Studies should also report on device wearability and user adherence, especially for home-based applications, to assess long-term feasibility. In addition to staging accuracy, future work should also assess wEEG reliability in estimating other sleep parameters—such as sleep onset latency and wake after sleep onset. Comparative analyses examining the impact of specific electrode placement in different device types on sleep stage classification can yield critical insights into optimizing wEEGs. Incorporating alternative technologies like PPG or actigraphy can also enhance the breadth of comparisons across different sleep staging technologies. Ethical considerations, including data privacy and responsible handling of sleep health data, must also be addressed as wEEG use expands beyond research settings. Crucially, adherence to the standardized framework proposed in this analysis is fundamental for future research. Embracing this framework will aid in systematically assessing and refining wEEGs’ capabilities and limitations in classifying sleep stages.

The increased use of wEEGs in sleep monitoring presents significant potential in both clinical and consumer health. The analysis of 43 validation studies revealed a substantial overall agreement between wEEGs and PSG. However, their performance in accurately classifying specific sleep stages varied considerably. The N1 stage presented significant classification challenges, while the N3 or Deep Sleep stage was the most reliably identified stage. These discrepancies are critical, as misclassifications can lead to incorrect sleep architecture analysis, directly impacting clinical decisions and patient care. Furthermore, manual scoring generally outperformed automatic methods, particularly for the N1 stage, highlighting the need for improved algorithms. Additionally, electrode count was positively associated with N3 detection, suggesting that spatial resolution remains an important factor for optimizing wEEG performance. To address the variabilities in stage-specific performance, we propose a standardized evaluation framework emphasizing balanced metrics like MCC and κ, and consistency in reporting validation results. This approach aims to facilitate advancements in wEEG technology by enhancing comparisons across studies and devices. As wEEGs become more integrated into home-based sleep monitoring, ensuring their reliability becomes crucial for clinical adoption. This meta-analysis paves the way for future research to refine wEEGs, particularly for stages with less distinct features like N1, ensuring their reliability across diverse populations and settings. This is essential to fully realize their potential in enhancing sleep monitoring and analysis in both clinical and consumer applications.

Methods

Search strategy

A systematic literature search was conducted across four databases: PubMed, Scopus, IEEE Xplore, and Web of Science. The objective was to identify studies published within the past 10 years that investigated the use of wearable EEG devices for sleep staging in human subjects and were validated against PSG. The search strategy included the following keywords and Boolean operators: “sleep AND (wearable OR mobile OR headband) AND (EEG OR electroencephalogram OR electroencephalography)”. The search was limited to English-language publications. Additionally, the reference lists of all included articles were manually screened to identify any further relevant studies.

Study selection

A total of 1585 records were identified through database searching, and 45 additional records through manual searches of reference lists and websites of wEEGs. After removing duplicates, 903 records remained for the title and abstract screening. Studies were included if they met the following inclusion criteria: conducted within the last 10 years, conducted on human subjects, published in English, used wEEG for sleep staging, and compared the results to a gold standard such as PSG. Studies were excluded if they were reviews or conducted on non-human subjects. After the application of these criteria, 791 records were excluded. The full texts of the remaining 112 studies were assessed for eligibility. Of these, 69 studies were excluded because they either lacked validation on a wearable EEG device, did not include validation against PSG, or did not report sleep staging results. This resulted in the inclusion of 43 validation studies in the final analysis (Fig. 8).

Fig. 8: PRISMA diagram.
figure 8

Diagram illustrating the systematic search and selection process in this meta-analysis. Initially, 1630 records were identified (1585 from databases, 45 from other sources). After removing duplicates, 903 records were screened based on title and abstract, leading to the exclusion of 791 records. Full-text assessment of 112 articles resulted in the exclusion of 69 for the following reasons: 44 studies developed models using PSG data but did not validate them on a wearable EEG device, 16 did not include validation against PSG, and 7 did not report any sleep staging results. Ultimately, 43 studies were included, categorized into three groups: 32 studies classifying Wake, N1, N2, N3, REM; 7 studies classifying Wake, Light, Deep, REM; and 4 studies using other classification systems.

Data extraction

Extracted information covered study design and device characteristics, including the device name, number of recording electrodes, their placements, and types (wet or dry). Participant-related data included the number of participants, age, and population type (e.g., healthy individuals or clinical populations). Additional details such as the study environment (e.g., home or controlled setting), number of recorded nights per participant, and total number of analyzed epochs were also recorded. For each study, the sleep staging method used by the wEEG (e.g., manual scoring, machine learning, or deep learning), the validation approach (e.g., comparison against PSG scored by an expert), and the sleep stage classification scheme (e.g., five-stage: Wake, N1, N2, N3, REM; or four-stage: Wake, Light, Deep, REM) were documented. Confusion matrices and reported evaluation metrics were collected as the primary data for quantitative analysis. When confusion matrices were not explicitly provided, they were reconstructed based on available information. All confusion matrices and details on their derivation are provided in the Supplementary Materials.
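The per-study reconstruction rules are documented in the Supplementary Materials. As a minimal sketch of the basic algebra involved, when a study reported a stage’s sensitivity, positive predictive value, the number of reference epochs for that stage, and the total epoch count, the one-vs-rest cells can be recovered directly; the function name and example values below are illustrative only.

```python
def reconstruct_ovr_cells(se: float, ppv: float, n_true: int, n_total: int):
    """Recover one-vs-rest TP/FN/FP/TN for a stage from reported
    sensitivity (SE), positive predictive value (PPV), the number of
    epochs the reference assigned to the stage, and the total epochs."""
    tp = se * n_true          # SE = TP / (TP + FN), with TP + FN = n_true
    fn = n_true - tp
    fp = tp / ppv - tp        # PPV = TP / (TP + FP)
    tn = n_total - tp - fn - fp
    return tp, fn, fp, tn

# e.g., a stage with SE = 0.80, PPV = 0.75 covering 300 of 1000 epochs:
print(reconstruct_ovr_cells(0.80, 0.75, 300, 1000))  # (240.0, 60.0, 80.0, 620.0)
```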

Data analysis

(1) The confusion matrices were used to derive true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Using these values, the overall agreement between the wEEG and the reference standard was evaluated with Cohen’s Kappa (κ), calculated as shown in Eq. (1):

$${\kappa }_{\mathrm{multiclass}}=\frac{{p}_{j}-{p}_{e}}{1-{p}_{e}}$$
(1)

where

$${p}_{j\,(\mathrm{multiclass})}=\frac{\mathop{\sum }\limits_{i=1}^{n}{C}_{ii}}{N}$$
(2)

and

$${p}_{e\,(\mathrm{multiclass})}=\mathop{\sum }\limits_{i=1}^{n}\left(\frac{{C}_{i+}\times {C}_{+i}}{{N}^{2}}\right)$$
(3)

with

  • C: confusion matrix.

  • Cii: the diagonal elements (true positives for each class).

  • n: the number of classes.

  • N: total number of epochs (sum of all elements in the confusion matrix).

  • Ci+: sum of elements in row i (true instances of class i).

  • C+i: sum of elements in column i (predicted instances of class i).

κ values are interpreted as follows: <0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and over 0.80 excellent agreement77.
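A minimal Python rendering of Eqs. (1)–(3), assuming the confusion matrix is oriented as in the definitions above (rows = PSG reference stages, columns = device stages):

```python
import numpy as np

def kappa_multiclass(C: np.ndarray) -> float:
    """Cohen's kappa from an n x n confusion matrix C,
    with rows = PSG reference stages and columns = device stages."""
    N = C.sum()
    p_j = np.trace(C) / N                               # Eq. (2): observed agreement
    p_e = (C.sum(axis=1) * C.sum(axis=0)).sum() / N**2  # Eq. (3): chance agreement
    return (p_j - p_e) / (1 - p_e)                      # Eq. (1)
```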

To provide a balanced performance measure that accommodates variations in class sizes, the multiclass Matthews correlation coefficient (MCC) was computed using Eq. (4); MCC scores range from −1 (total disagreement), through 0 (random prediction), to +1 (perfect prediction).

$${\mathrm{MCC}}_{\mathrm{multiclass}}=\frac{c\times N-\mathop{\sum }\limits_{i=1}^{n}{C}_{i+}{C}_{+i}}{\sqrt{\left({N}^{2}-\mathop{\sum }\limits_{i=1}^{n}{C}_{i+}^{2}\right)\left({N}^{2}-\mathop{\sum }\limits_{i=1}^{n}{C}_{+i}^{2}\right)}}$$
(4)

where

$$c=\mathop{\sum }\limits_{i=1}^{n}{C}_{ii}$$
(5)

and

$${C}_{i+}=\mathop{\sum }\limits_{j=1}^{n}{C}_{ij},\qquad {C}_{+i}=\mathop{\sum }\limits_{j=1}^{n}{C}_{ji}$$
(6)

with

  • C: confusion matrix.

  • Cij: number of epochs known to be in class i but predicted to be in class j.

  • Cii: the diagonal elements (true positives for each class).

  • c: total number of correctly classified epochs.

  • n: the number of classes.

  • N: total number of epochs (sum of all elements in the confusion matrix).

  • Ci+: sum of elements in row i (true instances of class i).

  • C+i: sum of elements in column i (predicted instances of class i).
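A direct Python rendering of Eq. (4) follows; this is a minimal sketch, and for raw label sequences (rather than a pre-computed confusion matrix) sklearn.metrics.matthews_corrcoef computes the same quantity.

```python
import numpy as np

def mcc_multiclass(C: np.ndarray) -> float:
    """Multiclass MCC from an n x n confusion matrix C (Eq. (4))."""
    N = C.sum()
    c = np.trace(C)            # correctly classified epochs, Eq. (5)
    t = C.sum(axis=1)          # row marginals C_i+ (reference counts)
    p = C.sum(axis=0)          # column marginals C_+i (device counts)
    num = c * N - (t * p).sum()
    den = np.sqrt((N**2 - (t**2).sum()) * (N**2 - (p**2).sum()))
    return num / den if den else 0.0  # 0 by convention when undefined
```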

For sleep-stage-specific evaluation, the following metrics were computed: ACC (Eq. (7)), κ (Eq. (8)), sensitivity (SE) (Eq. (11)), specificity (SP) (Eq. (12)), positive predictive value (PPV) (Eq. (13)), negative predictive value (NPV) (Eq. (14)), F1 score (Eq. (15)), and MCC (Eq. (16)).

$${\rm{ACC}}=\frac{{\rm{TN}}+{\rm{TP}}}{{\rm{TN}}+{\rm{TP}}+{\rm{FN}}+{\rm{FP}}}$$
(7)
$${\kappa }=\frac{{p}_{j}-{p}_{e}}{1-{p}_{e}}$$
(8)

where

$${p}_{j}=\frac{TN+TP}{N}$$
(9)

and

$${p}_{e}=\frac{(\mathrm{true\; YES}\times \mathrm{predicted\; YES})+(\mathrm{true\; NO}\times \mathrm{predicted\; NO})}{{N}^{2}}$$
(10)

with

  • true YES: number of times the ground truth identifies a particular class (row total in the confusion matrix).

  • predicted YES: number of times the device identifies a particular class (column total in confusion matrix).

  • N: total number of epochs (sum of all elements in the confusion matrix)

$${\rm{SE}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FN}}}$$
(11)
$${\rm{SP}}=\frac{{\rm{TN}}}{{\rm{TN}}+{\rm{FP}}}$$
(12)
$${\rm{PPV}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FP}}}$$
(13)
$${\rm{NPV}}=\frac{{\rm{TN}}}{{\rm{TN}}+{\rm{FN}}}$$
(14)
$${\rm{F}}1=\frac{2\times {\rm{SE}}\times {\rm{PPV}}}{{\rm{SE}}+{\rm{PPV}}}$$
(15)
$${\rm{MCC}}=\frac{{\rm{TP}}\times {\rm{TN}}-{\rm{FP}}\times {\rm{FN}}}{\sqrt{({\rm{TP}}+{\rm{FP}})({\rm{TP}}+{\rm{FN}})({\rm{TN}}+{\rm{FP}})({\rm{TN}}+{\rm{FN}})}}$$
(16)
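A compact sketch of the one-vs-rest computation behind Eqs. (7)–(16), again assuming rows = reference and columns = device, with a zero-based stage index:

```python
import numpy as np

def stage_metrics(C: np.ndarray, i: int) -> dict:
    """Stage-specific metrics (Eqs. (7)-(16)) for stage i of an
    n x n confusion matrix C (rows = reference, columns = device).
    No zero-division guards: a sketch, not production code."""
    N = C.sum()
    tp = C[i, i]
    fn = C[i, :].sum() - tp
    fp = C[:, i].sum() - tp
    tn = N - tp - fn - fp
    se, sp = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    p_j = (tp + tn) / N                                   # Eq. (9)
    p_e = ((tp + fn) * (tp + fp)
           + (tn + fp) * (tn + fn)) / N**2                # Eq. (10)
    return {
        "ACC": p_j,
        "kappa": (p_j - p_e) / (1 - p_e),
        "SE": se, "SP": sp, "PPV": ppv, "NPV": npv,
        "F1": 2 * se * ppv / (se + ppv),
        "MCC": (tp * tn - fp * fn)
               / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```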

Studies were categorized based on their use of either a five-stage classification system (i.e., N1, N2, N3, REM, Wake) or a four-stage system (e.g., Wake, Light, Deep, REM). In multiclass classifications, the above evaluation metrics are often averaged to reflect overall device performance. Micro-averaged metrics, which give equal weight to each epoch, were calculated by aggregating the counts of TPs, FPs, TNs, and FNs (Eqs. (17)–(22)).

$${\mathrm{ACC}}_{micro}=\frac{\mathop{\sum }\limits_{i=1}^{n}{C}_{ii}}{N}$$
(17)

where

  • C: confusion matrix.

  • Cii: the diagonal elements (true positives for each class).

  • n: the number of classes.

  • N: total number of epochs (sum of all elements in the confusion matrix).

$${{SE}}_{micro}=\frac{\sum {\rm{TP}}}{\sum {\rm{TP}}+\sum {\rm{FN}}}$$
(18)
$${{SP}}_{micro}=\frac{\sum {\rm{TN}}}{\sum {\rm{TN}}+\sum {\rm{FP}}}$$
(19)
$${{PPV}}_{micro}=\frac{\sum {\rm{TP}}}{\sum {\rm{TP}}+\sum {\rm{FP}}}$$
(20)
$${{NPV}}_{micro}=\frac{\sum {\rm{TN}}}{\sum {\rm{TN}}+\sum {\rm{FN}}}$$
(21)
$${{F}1}_{micro}=\frac{2\times {{SE}}_{micro}\times {{PPV}}_{micro}}{{{SE}}_{micro}+{{PPV}}_{micro}}$$
(22)

Macro-averaged metrics, which assign equal importance to each class, were calculated by averaging the metrics computed for each stage (Eqs. (23)–(29)).

$${{ACC}}_{macro}=\frac{{{ACC}}_{1}+{{ACC}}_{2}+\ldots +{{ACC}}_{n}}{n}$$
(23)
$${\kappa }_{macro}=\frac{{\kappa }_{1}+{\kappa }_{2}+\ldots +{\kappa }_{n}}{n}$$
(24)
$${{SE}}_{macro}=\frac{{{SE}}_{1}+{{SE}}_{2}+\ldots +{{SE}}_{n}}{n}$$
(25)
$${{SP}}_{macro}=\frac{{{SP}}_{1}+{{SP}}_{2}+\ldots +{{SP}}_{n}}{n}$$
(26)
$${{NPV}}_{macro}=\frac{{{NPV}}_{1}+{{NPV}}_{2}+\ldots +{{NPV}}_{n}}{n}$$
(27)
$${{F}1}_{macro}=\frac{{{F}1}_{1}+{{F}1}_{2}+\ldots +{{F}1}_{n}}{n}$$
(28)
$${{MCC}}_{macro}=\frac{{{MCC}}_{1}+{{MCC}}_{2}+\ldots +{{MCC}}_{n}}{n}$$
(29)
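The practical difference between the two averaging schemes is easiest to see on a small, deliberately imbalanced example; the matrix below is illustrative and not taken from any included study.

```python
import numpy as np

def se_micro(C: np.ndarray) -> float:
    """Micro-averaged sensitivity (Eq. (18)): pool TP/FN counts over stages."""
    tp = np.diag(C)
    fn = C.sum(axis=1) - tp
    return tp.sum() / (tp.sum() + fn.sum())

def se_macro(C: np.ndarray) -> float:
    """Macro-averaged sensitivity (Eq. (25)): mean of per-stage sensitivities."""
    return float(np.mean(np.diag(C) / C.sum(axis=1)))

# Illustrative 3-stage matrix dominated by the first stage (rows = reference).
C = np.array([[80, 5, 5],
              [ 8, 2, 0],
              [ 6, 0, 4]])
print(f"{se_micro(C):.2f}")  # 0.78 -- pulled up by the dominant stage
print(f"{se_macro(C):.2f}")  # 0.50 -- exposes the weak minority stages
```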

For each sleep stage and overall performance, the evaluation metrics were used to compute the mean and standard deviation across studies. When confusion matrices were available, the metrics were computed directly from these; otherwise, reported metrics from the original publications were used. The results of studies classifying five stages and four stages were analyzed separately, as each device’s scoring model was developed and validated for a specific staging structure; reclassification would not reflect intended device behavior or validation protocols.

(2) Significant differences between MCC and all other metrics, for both overall and sleep-stage-specific evaluations, were analyzed using the Mann–Whitney U test, which is suitable for small sample sizes and non-normally distributed data.
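For instance, comparing per-study ACC against per-study MCC could look like the sketch below; the score lists are hypothetical placeholders, not values from the analysis.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-study scores; the real values are in the Supplementary Tables.
acc_scores = [0.78, 0.81, 0.85, 0.77, 0.80, 0.74, 0.83]
mcc_scores = [0.62, 0.70, 0.74, 0.68, 0.71, 0.60, 0.73]

stat, p_value = mannwhitneyu(acc_scores, mcc_scores, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```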

(3) Bland–Altman plots were analyzed for differences in stage durations and proportions between device and ground truth scorings.
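A minimal sketch of the bias and 95% limits of agreement underlying those plots; the arrays hold placeholder per-night stage durations, not study data.

```python
import numpy as np

def bland_altman(device: np.ndarray, reference: np.ndarray):
    """Bias and 95% limits of agreement between paired measurements,
    e.g., per-night N3 duration (minutes) from the wEEG vs. PSG."""
    diff = device - reference
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)  # LoA = bias +/- 1.96 SD
    return bias, bias - half_width, bias + half_width

# Placeholder values for illustration only.
weeg = np.array([92.0, 71.5, 88.0, 64.5, 80.0])
psg = np.array([90.0, 70.0, 85.5, 66.0, 77.5])
print(bland_altman(weeg, psg))
```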

(4) Studies classifying five sleep stages were stratified by device characteristics (electrode placement, e.g., forehead or ear; electrode type; and prototype vs. commercially available status) as well as by participants’ health status, study environment, and device sleep staging method. The Mann–Whitney U test was employed to highlight significant differences in evaluation metrics between the groups.

(5) Relationships between evaluation metrics and the number of epochs and participants were explored using the Pearson correlation in studies classifying five sleep stages.

(6) To evaluate the robustness of pooled performance estimates, a leave-one-out (LOO) sensitivity analysis was conducted on all studies using five-stage sleep classification. For each evaluation metric and sleep stage, pooled values were recalculated iteratively by omitting one study at a time, and the maximum absolute (|Δ|) and relative (Δ %) changes were recorded. Because all performance metrics are scaled 0–1, we considered an absolute shift of |Δ| > 0.03 (i.e., ≥3 percentage points) to indicate a potentially influential study. This analysis was also repeated within each subgroup defined by participant health status, study environment, electrode placement, electrode type, device type (prototype vs. commercial), and scoring method, to assess the stability of results within stratified comparisons.
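A sketch of the leave-one-out procedure for a single metric; the per-study values are placeholders, and the ±0.03 influence threshold matches the pre-defined criterion above.

```python
import numpy as np

def loo_sensitivity(scores: np.ndarray, threshold: float = 0.03):
    """Leave-one-out shifts of a pooled (mean) metric across studies.
    Returns the pooled value, the shift caused by omitting each study,
    and a flag marking shifts beyond the influence threshold."""
    pooled = scores.mean()
    shifts = np.array([np.delete(scores, i).mean() - pooled
                       for i in range(scores.size)])
    return pooled, shifts, np.abs(shifts) > threshold

# Placeholder per-study MCC values for one stage.
mcc = np.array([0.70, 0.66, 0.74, 0.69, 0.90, 0.68])
pooled, shifts, influential = loo_sensitivity(mcc)
print(pooled, shifts.round(3), influential)
```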

All data analyses were conducted in Python.