Introduction

The accurate detection of sleep stages is essential for understanding, diagnosing, and treating sleep disorders. Alterations in sleep stages have been linked to conditions like obstructive sleep apnea (N1 changes)1, restless legs syndrome (decreased rapid eye movement (REM) and N2 stages)2, obstructive sleep apnea-hypopnea syndrome (N3 changes)3, and narcolepsy (disrupted N2/N3 to REM transition)4. Moreover, sleep disturbances are commonly associated with neurodegenerative diseases5,6,7,8,9,10, epilepsy11, chronic pain12,13, depression14,15 and anxiety16. These conditions exhibit a bidirectional relationship with sleep, suggesting that addressing sleep disturbances could potentially delay the onset or ameliorate the symptoms, thereby improving patients’ quality of life15,17. Beyond clinical disorders, a recent meta-analysis demonstrated that greater proportions of N2 and REM sleep are positively associated with cognitive performance in healthy older adults, reinforcing the broader importance of accurate sleep staging18.

The American Academy of Sleep Medicine (AASM)19 provides the most widely used framework for sleep staging, grouping sleep into non-REM (NREM), comprising N1, N2 and N3 stages, and REM sleep. In sleep staging, 30-second record segments, known as epochs, are classified based on the most dominant characteristics: alpha rhythms for Wake, theta waves for N1, K complexes and sleep spindles for N2, and delta waves for N3 or deep sleep. REM sleep is characterized by mixed frequency electroencephalography (EEG), rapid eye movements, and sawtooth waves19.

Polysomnography (PSG) is the gold standard for clinical sleep assessment and diagnostics, capturing EEG, electrooculogram (EOG), and electromyogram (EMG) activity, as well as breathing effort, airflow, pulse, and blood oxygen saturation. However, PSG studies are expensive and time-consuming, require professional supervision and manual labelling, and suffer from high inter-rater variability (82.6% average agreement)20. Recent advancements in automatic sleep staging algorithms, as reviewed by Gaiduk et al.21, have shown high agreement with expert scorers and are increasingly being developed for clinical integration. These tools can accelerate diagnosis, reduce manual workload, and enable responsive interventions, such as real-time monitoring in sleep labs. Automated scoring may also help mitigate inter-rater variability, contributing to more consistent and objective assessments. Despite these advances, PSG remains uncomfortable for many patients, often requiring overnight stays in unfamiliar environments. Moreover, one-night PSG studies may not accurately reflect a patient’s typical sleep patterns and fail to capture night-to-night variability22,23. Recent advancements in dry electrodes have facilitated the development of wearable EEG sleep monitoring devices (wEEGs)24. These devices, which use electrodes placed on the forehead, ear, or neck to acquire EEG signals, offer a practical alternative to traditional PSG. Suitable for home use, they enable continuous sleep monitoring without the need for professional supervision.

Despite the growing popularity of wEEGs, their performance in accurately classifying sleep stages remains a subject of ongoing research. Prior systematic reviews have identified EEG as the most accurate sensing modality for detecting all sleep stages, whereas photoplethysmography (PPG)-based systems, while more user-friendly, are limited in their staging accuracy25. A recent meta-analysis of EEG-based wearable devices has further highlighted the growing number of validated systems available, as well as considerable variability in their measurement properties, including accuracy, target population, and device features26. Nevertheless, the lack of standardized evaluation frameworks and inconsistent reporting of stage-specific performance metrics continue to hinder direct comparison between studies and devices. While overall per-epoch accuracy (ACC) is commonly reported, it may not adequately reflect the true performance of wEEGs due to the inherent class imbalance across sleep stages.

This meta-analysis evaluates the overall and sleep-stage-specific performance of wEEGs compared to PSG. Furthermore, we propose the use of Matthews correlation coefficient (MCC) as an alternative balanced evaluation metric. By providing a comprehensive evaluation of the wEEGs’ performance across sleep stages, this meta-analysis contributes valuable insights into their strengths and areas for further development, ultimately contributing to the optimization of wEEGs for clinical and consumer applications.

Results

This section presents the findings from 43 validation studies that compare the performance of wEEGs in sleep stage classification with reference methods. Thirty-two studies22,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57 classified five stages—Wake, N1, N2, N3, and REM (Table 1). Seven studies58,59,60,61,62,63,64 classified four stages—Wake, Light (combining N1 and N2), Deep, and REM Sleep (Table 2). The remaining four studies used a three-stage classification (Wake, NREM, REM)65,66 or non-standard staging systems—one classified Wake, N1, N2 & N3, and REM67, and another reported Wake, N1, N2, and N3 without REM68 (Table 3).

Table 1 Overview of validation studies classifying five stages
Table 2 Overview of validation studies classifying four stages
Table 3 Overview of validation studies using other than five- or four-stage systems

Overview of study characteristics

Figure 1 summarizes key characteristics of the 43 validation studies included in this review. Most validation studies involved single-night sleep recordings (n = 34), were conducted in controlled environments such as sleep laboratories (n = 32), and primarily included healthy participants (n = 29). Home-based studies (n = 11) were generally set up by professionals, with the exception of one study47, where participants fitted the device themselves. Only a few studies included clinical populations (n = 8; e.g., obstructive sleep apnea29,36,39,44,59, epilepsy50, Parkinson’s disease58, and sleep disorders67), and six studies involved mixed cohorts, including both healthy and clinical participants—for example, one study58 primarily enrolled healthy individuals (n = 50) but also included a subgroup of patients with Alzheimer’s disease (n = 12). The number of participants per study ranged from 1 to 197 (mean = 30 ± 37). The volume of data analyzed ranged widely, from 293 to 195,349 epochs (mean = 29,969 ± 37,072), although epoch counts were not available for four studies. Gender representation was approximately balanced among the studies reporting it (57% male, 43% female), though five studies did not provide gender breakdowns. Participant age was skewed toward young to middle-aged adults, with limited inclusion of adolescents and older adults.

wEEGs used across studies differed in electrode placement, form factor, and electrode type. Of the 43 included studies, 19 evaluated commercially available wEEG devices—most frequently Dreem28,32,58, Sleep Profiler22,34,67, Somfit37,38, and Zeo60,61, while 24 assessed prototypes. Electrode placement was most commonly on the forehead (n = 26), followed by the ear (n = 16), with one study64 using a neck-worn device. Most devices with forehead electrodes were headbands (n = 15), but other designs included sheet-like or patch-based systems (n = 8) and sleep masks (n = 3). Electrode usage patterns in forehead devices also revealed strong clustering around a few standard positions. As shown in Fig. 2, Fp1 and Fp2 were the most commonly used electrodes, followed by Fpz, F7, F8, AF7, and AF8. The three most frequent electrode combinations corresponded to well-known devices: the Dreem headband69 (Fpz, F7, F8, O1, O2), the Sleep Profiler70 (Fpz, AF7, AF8), and the Zeo headband71 (Fp1, Fpz, Fp2). Notably, the Fp1–Fp2–Fpz configuration appeared in five studies, including two that used the Zeo device and three others30,62,65. Among ear-based devices, the majority were in-ear designs (n = 12), with a smaller number of around-the-ear configurations (n = 4). The number of electrodes ranged from 1 to 16 (mean = 6 ± 4). Dry electrodes were used in 24 studies, while 19 employed wet electrodes.

Sleep staging methods applied to the wEEG data varied between machine learning (n = 11), manual scoring (n = 10), proprietary algorithms (n = 9), and deep learning (n = 8). Five studies22,27,31,34,41 used a combination of methods, meaning they reported results from more than one approach (e.g., deep learning alongside manual scoring by an expert). To evaluate sleep staging accuracy, most studies used expert-scored PSG as the reference standard, adhering to AASM guidelines (n = 40). Ground truth data were typically scored by a single rater (n = 25).
Reported evaluation metrics varied across studies: overall device performance was most commonly assessed using Cohen’s Kappa (n = 37) and accuracy (n = 24), while stage-specific metrics such as sensitivity (n = 33) and positive predictive value (n = 18) were frequently used. Confusion matrices were provided in 34 studies (Fig. 3).

Fig. 1: Characteristics of the 43 included validation studies.
figure 1

a Electrode positions: Most studies used forehead EEG (n = 26), followed by ear-based devices (n = 16); one study used a neck-worn sensor. b Device type: Prototype devices (n = 24) were more common than commercially available systems (n = 19). c Electrode type: Dry electrodes were used in 24 studies, while 19 employed wet electrodes. d Participants’ health status: Most studies involved healthy participants (n = 29), with fewer including clinical populations (n = 8) or mixed cohorts (n = 6). e Study environment: The majority of studies were conducted in controlled environments (n = 32), while 11 were home-based. f Device scoring method: Machine learning (n = 11), manual scoring (n = 10), and proprietary algorithms (n = 9) were the most common, with some studies using deep learning (n = 8) or multiple methods (n = 5). g Age distribution: Most studies involved adults in their mid-20s to mid-30s. Mean age, standard deviation (black lines), and range (white bars) are shown. Older adults and adolescents were underrepresented.

Fig. 2: Overview of electrode use across forehead-EEG studies.
figure 2

a Electrode usage frequency: Scalp heatmap and bar chart showing the frequency of electrode positions used in studies with forehead-placed EEG, based on the international 10–20 system. The most frequently used electrodes were Fp1 and Fp2, followed by Fpz, F7, F8, AF7, and AF8. b Most common electrode combinations and corresponding devices: The three most common electrode combinations and their associated device types. Top: Dreem headband (Fpz, F7, F8, O1, O2); Middle: Sleep Profiler (Fpz, AF7, AF8); Bottom: Zeo headband (Fp1, Fpz, Fp2). While Dreem and Sleep Profiler were each used in three studies, the Fp1–Fp2–Fpz combination appeared in five studies, including two using the Zeo device.

Fig. 3: Evaluation metrics reported across included studies.
figure 3

a Reported evaluation metrics per study: Presence (green) or absence (red) of overall and sleep-stage-specific evaluation metrics reported in each study. b Frequency of reported metrics across studies. Total number of studies reporting each evaluation metric, grouped by overall vs. stage-specific metrics. For devices’ overall performance, κ (n = 37) and ACC (n = 24) were the most frequently reported metrics. For sleep-stage specific performance, SE (n = 33) and PPV (n = 18) were the most frequently reported metrics. Confusion matrices were provided in 34 studies. ACC: accuracy, CM: confusion matrix, κ: Cohen’s Kappa, MCC: Matthews correlation coefficient, NPV: negative predictive value, PPV: positive predictive value, SE: sensitivity, SP: specificity.

Analysis of validation studies

The overview of overall and sleep-stage-specific evaluation metrics, along with the proportions of each sleep stage for both five- and four-stage studies, are presented in Figs. 4 and 5, respectively. The data for these figures can be found in Supplementary Tables S1–S4. Additionally, the Bland–Altman plots, which offer insights into variations in sleep stage duration and proportion estimations by wEEGs compared to PSG, are detailed in Supplementary Figs. S1 and S2.

Fig. 4: Evaluation metrics for studies classifying five sleep stages using wEEGs.
figure 4

Each panel (a, c–g) includes a boxplot of metric distributions and a bar chart showing the top 20 studies by MCC score for that specific stage. Asterisks (*) indicate statistically significant differences (p < 0.05) from MCC (a, c–g) or between micro- and macro-averaged versions of the same metric in (b). a Overall metrics: Results demonstrated that overall accuracy (ACC) was significantly higher than Cohen’s kappa (κ) and MCC. b Comparison of micro- and macro-averaged metrics: Macro-averaged ACC was significantly higher than micro-averaged ACC, while micro-averaged SE, PPV, and F1 were significantly higher than their macro-averaged counterparts, highlighting class imbalance effects. c Wake stage: Wake showed the highest ACC among all stages, indicating reliable wake-sleep differentiation, though detection (SE) varied across studies. d N1 stage: N1 was the most challenging stage, with the lowest scores across κ, SE, PPV, F1, and MCC. e N2 stage: N2 showed relatively high metrics overall. As the most prevalent class, it exhibited smaller differences between metrics emphasizing TPs (e.g., SE) and TNs (e.g., SP) than other stages. Nevertheless, its MCC and κ scores were noticeably lower, suggesting potential misclassifications. f N3 stage: N3 showed the highest κ and MCC, with strong SE and excellent agreement across studies. g REM stage: REM had the second-highest SE and MCC but the lowest PPV, indicating a tendency for false positives. h Mean proportional distribution of sleep stages across studies: Wake (18%), N1 (7%), N2 (41%), N3 (17%), REM (17%). ACC: accuracy; κ: Cohen’s kappa; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; PPV: positive predictive value; NPV: negative predictive value.

Fig. 5: Evaluation metrics for studies classifying four sleep stages using wEEGs.
figure 5

Each panel (a, c–g) includes a boxplot of metric distributions and a bar chart showing the top studies by MCC score for that specific stage. Asterisks (*) indicate statistically significant differences (p < 0.05) from MCC (a, c–g) or between micro- and macro-averaged versions of the same metric in (b). a Overall metrics: Studies showed an average ACC of 0.80 ± 0.08, significantly higher than κ (0.70 ± 0.11) and MCC (0.70 ± 0.11). b Comparison of micro- and macro-averaged metrics: Macro-averaged ACC was significantly higher than micro-averaged ACC. c Wake: Demonstrated κ = 0.68 ± 0.14, MCC = 0.69 ± 0.16, highest NPV and lowest SE (0.68 ± 0.18), suggesting strong ability to exclude Wake epochs but weaker sensitivity. d Light sleep: The most prevalent stage (53% of epochs) had the lowest κ (0.61 ± 0.11) and MCC (0.65 ± 0.11), along with the lowest ACC, SP, and NPV, but highest PPV. e Deep sleep: Showed the strongest performance overall, with the highest κ (0.77 ± 0.13), MCC (0.75 ± 0.13), ACC, and SE (0.80 ± 0.07), indicating robust detection. f REM: Achieved high SE (0.78 ± 0.14), κ (0.65 ± 0.18), and MCC (0.72 ± 0.10), but had the lowest PPV, suggesting higher false positives in REM classification. g Mean stage distribution across studies: Light sleep (53%), Deep sleep (17%), REM (18%), and Wake (12%). ACC: accuracy; κ: Cohen’s kappa; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; PPV: positive predictive value; NPV: negative predictive value.

Studies classifying Wake, N1, N2, N3, REM

On average, the studies showed an overall ACC of 0.78 ± 0.07 and an overall κ of 0.68 ± 0.10, suggesting substantial agreement between wEEGs and PSG. MCC (0.70 ± 0.08) and κ were significantly lower compared to ACC. Significant differences were also observed between micro- and macro-averaged ACC, SE, PPV and F1, highlighting the effect of class imbalance on evaluation outcomes. The top three overall MCC results were reported by Li et al.27 (WPSG-I headband; MCC = 0.899 for manual scoring and 0.795 for automatic scoring) and Borup et al.46 (in-ear prototype72; MCC = 0.784) (Fig. 4a, b).

The Wake stage, representing 18% of epochs, showed strong agreement between wEEGs and PSG with a mean κ of 0.75 ± 0.12, MCC of 0.76 ± 0.12, and SE of 0.79 ± 0.13. It had the highest accuracy among all stages. Bland–Altman analysis revealed a slight underestimation by wearable devices in duration (–3.20 min; limits of agreement (LoA): –21.13 to +14.73 min) and in proportion (–0.69%; LoA: –4.67 to +3.30%). High intraclass correlation coefficients (ICCs; 1.00 for duration, 0.99 for proportion) further supported the robustness of these measurements. The best Wake-stage MCCs were reported by Li et al.27 (WPSG-I headband; MCC = 0.950 and 0.906) and Borup et al.46 (in-ear prototype72; MCC = 0.883) (Fig. 4c).

The N1 stage, comprising just 7% of all epochs, showed the weakest agreement between wEEGs and PSG. It had the lowest mean κ (0.31 ± 0.15), MCC (0.33 ± 0.15), and SE (0.36 ± 0.18) of all stages, and exhibited high variability across studies, reflecting inconsistent detection. It also had the lowest average PPV. Duration was marginally overestimated (+0.64 min; LoA: –33.83 to +35.12 min), and proportion slightly overestimated (+0.32%; LoA: –7.83% to +8.47%). ICCs were 0.75 and 0.72 for duration and proportion, respectively, indicating moderate-to-good agreement. Top-performing results for N1-stage MCC were reported by Li et al.27 (WPSG-I headband; MCC = 0.646), Seol et al.29 (Insomnograf K2 headband73; MCC = 0.562), and Borges et al.44 (in-ear prototype74,75; MCC = 0.558) (Fig. 4d).

The N2 stage, the most prevalent (41% of epochs), demonstrated strong classification performance with a mean SE of 0.82 ± 0.08, while κ and MCC were slightly lower (both 0.69 ± 0.09). It showed the highest SE and F1 across stages, but the lowest ACC, SP, and NPV. wEEGs slightly overestimated N2 duration (+1.51 min; LoA: –46.75 to +49.78 min) and proportion (+0.23%; LoA: –11.09% to +11.55%). ICCs were 0.79 and 0.73 for duration and proportion, respectively, indicating good agreement. Top MCC results were reported by Li et al.27 (WPSG-I headband; MCC = 0.906 and 0.794 for manual and automatic scoring, respectively) and Matsumori et al.42 (sheet-like prototype device; MCC = 0.829) (Fig. 4e).

The N3 stage, also known as Deep Sleep, representing 17% of epochs, had the highest κ (0.76 ± 0.10), MCC (0.78 ± 0.07), and SP, PPV, and NPV among all stages, with a high SE of 0.79 ± 0.10. wEEGs slightly overestimated N3 duration (+1.93 min; LoA: –21.37 to +25.70 min) and proportion (+0.38%; LoA: –4.95 to +4.20%). ICCs of 0.93 for both duration and proportion reflect excellent agreement. Leading results came from Li et al.27 (WPSG-I headband; MCC = 0.912), Tabar et al.47 (in-ear prototype; MCC = 0.864), and Mikkelsen et al.52 (in-ear prototype72; MCC = 0.863) (Fig. 4f).

The REM stage, also accounting for 17% of epochs, showed strong performance with SE = 0.79 ± 0.13, κ = 0.74 ± 0.13, and MCC = 0.75 ± 0.13. Duration and proportion were slightly overestimated (+1.90 min; LoA: –15.56 to +19.36 min; +0.36%; LoA: –4.95% to +4.20%). ICCs were 0.95 and 0.93, indicating excellent reliability. The highest REM MCCs were observed in Kwon et al.41 (sheet-like patch prototype; MCC = 0.917), Jørgensen et al.50 (in-ear prototype72; MCC = 0.903), and Borges et al.44 (in-ear prototype74,75; MCC = 0.894) (Fig. 4g).

In the Wake, N1, N3, and REM stages, significant differences were observed between ACC, SP, PPV, NPV and the balanced metric MCC, indicating that high values of these metrics may mask poor classification performance in minority classes.

Studies classifying Wake, Light Sleep, Deep Sleep and REM

Studies classifying four stages showed an average overall ACC of 0.80 ± 0.08, and κ and MCC of 0.70 ± 0.11, indicating substantial agreement between wEEGs and PSG. A significant difference was observed between micro- and macro-averaged ACC (Fig. 5a, b).

The Wake stage, comprising 12% of all epochs, showed notable variability across studies. It had a mean κ of 0.68 ± 0.14, MCC of 0.69 ± 0.16, and the lowest SE among all stages (0.68 ± 0.18), but achieved the highest NPV. According to Bland–Altman analysis, Wake duration was slightly overestimated (+2.75 min; LoA: –31.56 to +26.07 min), with near-zero bias in proportion (+0.01%; LoA: –0.07% to +0.07%). ICCs were 0.90 for duration and 0.86 for proportion, indicating strong agreement (Fig. 5c).

The Light Sleep stage, constituting 53% of all epochs, showed the lowest mean κ (0.61 ± 0.11), MCC (0.65 ± 0.11), ACC, SP, and NPV. In contrast, it had the highest PPV and a solid SE (0.77 ± 0.09). Bland–Altman analysis showed a small underestimation in duration (–1.22 min; LoA: –33.39 to +30.95 min) and proportion (–0.03%; LoA: –0.14 to +0.09%). Duration agreement was excellent (ICC = 0.98), but proportion agreement was poor (ICC = –0.07), suggesting inconsistency in proportional estimates (Fig. 5d).

The Deep Sleep stage (N3), making up 17% of epochs, showed the highest κ (0.77 ± 0.13), MCC (0.75 ± 0.13), ACC and SE (0.80 ± 0.07). Duration was slightly overestimated (+0.28 min; LoA: –17.03 to +17.56 min), with minimal bias in proportion (–0.01%; LoA: –0.07% to +0.07%). ICCs were strong for both duration (ICC = 0.97) and proportion (ICC = 0.86) (Fig. 5e).

The REM stage, accounting for 18% of epochs, had high SE (0.78 ± 0.14), MCC (0.72 ± 0.10), and κ (0.65 ± 0.18), but the lowest PPV, suggesting more frequent false positives. Duration was slightly overestimated (+1.28 min; LoA: –17.45 to +20.02 min), with minimal bias in proportion (+0.03%; LoA: –0.04% to +0.13%). Agreement on duration was good (ICC = 0.80), but poor for proportion (ICC = –0.31) (Fig. 5f).

In the Wake, Deep Sleep and REM stages, significant differences between ACC, SP, NPV and MCC were observed.

Influence of methodological variabilities on wEEG performance

The impact of methodological variabilities on wEEG performance in detecting sleep stages was analyzed, focusing on five-stage classification studies due to the limited sample size of four-stage studies. Several significant associations between study design features and evaluation metrics suggest that methodological choices can affect reported performance.

wEEG performance varied by participant health status: studies involving clinical or mixed populations showed significantly higher performance across macro-averaged and Wake, N1, N3, and REM metrics compared to studies including only healthy participants (Fig. 6I). N3 F1 performance was negatively associated with the number of participants, possibly reflecting greater heterogeneity in larger samples (Fig. 6II). In contrast, Wake-stage κ, F1, and MCC were positively associated with the number of epochs, indicating more robust classification with larger datasets (Fig. 6III). Study environment also influenced performance: home-based studies showed significantly higher N1 ACC and N3 κ, SE, F1, and MCC, while controlled environments yielded higher REM-stage ACC and NPV (Fig. 6IV). However, limited results for clinical, mixed, and home-based subgroups may restrict the generalizability of these findings.

Fig. 6: Study-level factors significantly associated with variations in wEEG performance.
figure 6

This figure presents only those comparisons where significant differences (p < 0.05) in evaluation metrics were observed. Full analyses across all stages and variables—including non-significant results—are available in the Supplementary figures. Asterisks (*) indicate statistically significant group differences based on Mann–Whitney U tests. I Influence of participants’ health status on wEEG performance: Boxplots show statistically significant differences between healthy, clinical, and mixed populations in macro-averaged metrics (Ia), Wake (Ib), N1 (Ic), N3 (Id), and REM (Ie) stages. Clinical and mixed-population studies generally reported higher performance, especially for κ, SE, F1, and MCC. II Influence of number of participants: Pearson correlation analysis revealed a significant negative correlation between the number of participants and N3 F1 score, suggesting reduced consistency in larger, more heterogeneous cohorts. III Influence of number of epochs: Significant positive correlations were found between the number of epochs and Wake-stage κ, F1 and MCC, indicating improved Wake classification performance with greater data volume. IV Influence of study environment: Boxplots comparing controlled (e.g., sleep lab or hospital) and home-based settings show significantly better performance in home studies for N1 ACC (IVa) and N3-stage κ, SE, F1, and MCC (IVb). Controlled settings yielded higher REM-stage ACC and NPV (IVc). ACC: accuracy; κ: Cohen’s kappa; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; PPV: positive predictive value; NPV: negative predictive value; F1: F1 score.

Device characteristics had mixed effects on wEEG performance, with few statistically significant group differences. No significant differences were found between commercial and prototype devices (Supplementary Fig. S5, Table S7). The only significant effect of electrode placement was higher REM-stage NPV for forehead-based systems (Fig. 7I). Dry electrodes showed significantly higher N2 PPV and N3 NPV than wet electrodes (Fig. 7II). The scoring method had the most consistent impact: machine learning was significantly outperformed by deep learning (macro κ, F1, MCC; N1 κ, F1, MCC; N2 F1) and by manual scoring (macro F1; N1 κ, F1, MCC; REM ACC, SP). Proprietary algorithms also outperformed machine learning in Wake PPV, N2 NPV, N3 SP, and REM NPV (Fig. 7III), though the small number of studies (n = 5) limits interpretation. Importantly, in the five studies22,27,31,34,41 that evaluated multiple scoring methods on the same data, manual scoring consistently outperformed automatic scoring, and deep learning outperformed classical machine learning (Table 1). Electrode count showed significant positive correlations with N3 κ, SE, MCC, and F1, supporting the value of increased spatial coverage for slow-wave sleep detection, while REM NPV showed a weaker but significant negative correlation (Fig. 7IV). In summary, while most device features had limited impact, electrode count and scoring method significantly influenced N3 and REM-stage performance (Supplementary Figs. S6–S8; Tables S8–S10).

Fig. 7: Device- and electrode-related factors associated with significant differences in wEEG performance.
figure 7

This figure presents only those comparisons where significant differences (p < 0.05) in evaluation metrics were observed. Full analyses across all stages and variables—including non-significant results—are available in the Supplementary figures. Asterisks (*) indicate statistically significant group differences based on Mann–Whitney U tests. I Electrode position: Panel Ia shows a significant difference in REM-stage NPV, which was higher in studies using forehead electrodes. II Electrode type: Panels IIa and IIb show significantly higher N2 PPV and REM NPV in studies using dry electrodes. III Scoring method: Deep learning outperformed machine learning in macro-averaged κ, F1, and MCC, as well as N1-stage κ, F1, and MCC and N2 F1. Manual scoring showed significantly better performance than machine learning in macro-averaged F1 and several Wake- and REM-stage metrics. Proprietary algorithms yielded significantly higher performance than machine learning in Wake-stage PPV, N2-stage NPV, N3-stage SP, and REM-stage NPV. IV Number of electrodes: Pearson correlation analysis revealed significant positive associations between the number of recording electrodes and N3-stage performance metrics, including κ, SE, MCC, and F1. A significant negative correlation was observed for REM-stage NPV. ACC: accuracy; κ: Cohen’s kappa; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; PPV: positive predictive value; NPV: negative predictive value; F1: F1 score.

Sensitivity analysis

To assess robustness, a leave-one-out sensitivity analysis was conducted across all 36 results from the 32 studies classifying five stages. No individual study exceeded the pre-defined ±0.03 threshold for altering pooled values; the largest absolute shifts were ≤0.012, and relative changes stayed under 3.5%. The most sensitive metrics were in the N1 stage (κ, SE, F1, and MCC), where omitting the highest-performing studies (e.g., Li et al.27) slightly lowered scores. Notably, Mikkelsen et al.53 appeared most frequently as the study with the highest influence across multiple metrics, particularly in overall and REM-stage results. Thus, the meta-analytic results can be considered robust when aggregating across all five-stage studies (Supplementary Fig. S9, Supplementary Table S11).

Subgroup-specific analyses revealed less robustness, with some comparisons highly sensitive to individual studies. The N1 stage was most affected, reflecting the inherent difficulty of classifying this stage and highlighting its susceptibility to methodological and sample-related variation. Metrics emphasizing class imbalance and agreement (κ, F1, MCC, PPV) fluctuated more than overall accuracy. Instability was concentrated in subgroups with few studies—especially clinical (n = 5), mixed-population (n = 7), home-based (n = 10), and proprietary-algorithm (n = 5) groups. In contrast, estimates from healthy cohorts, controlled lab settings, and devices using forehead electrode placement or dry electrodes were robust. A few studies—particularly Ravindran et al.28, Li et al.27, and Borges et al.44—consistently drove subgroup results: excluding Ravindran et al.28 raised averages, while omitting Li et al.27 or Borges et al.44 lowered them (Supplementary Table S12). In summary, while aggregated findings are reliable, differences between methodological groups should be interpreted with caution, as they are often driven by just one or two influential studies.

Discussion

The rising popularity of wEEGs underscores the need for a standardized assessment of their ability to classify individual stages. Misclassifications can compromise sleep architecture analysis, and, consequently, affect therapeutic decisions and sleep quality evaluations. Both five- and four-stage studies revealed variability in wEEGs’ performance across studies. Our analysis identified the sleep stages where wEEGs’ classifications were most and least aligned with PSG.

Overall performance results from 32 studies classifying five sleep stages showed clear discrepancies between overall accuracy and more balanced metrics like MCC and κ, highlighting limitations in wEEGs’ ability to classify individual stages. These discrepancies were amplified by differences between micro- and macro-averaged metrics. Micro-averaging, which weighs each epoch equally, inflated SE, PPV, and F1 due to the dominance of the N2 stage, but masked underperformance in less frequent stages such as N1. In contrast, macro-averaging, which gives equal weight to each class, yielded a higher ACC by amplifying true negative counts in less frequent stages like N1. These findings underscore the need to prioritize balanced evaluation metrics when validating wEEG performance to account for class imbalances inherent in sleep stage distributions.

The Wake stage showed the highest classification accuracy, with excellent agreement to PSG despite slight underestimation in duration. This confirms that wake-sleep differentiation remains a key strength of wEEGs. N1, the least frequent stage, emerged as the most challenging to classify, reflecting its transitional nature and less distinct EEG characteristics. Significant variability in its detection reliability across studies, shown by the largest standard deviations in various metrics and the greatest estimation variability in the Bland–Altman analysis, highlights the need for improved automatic detection of this stage by wEEGs. N2, the most dominant stage, yielded high SE but significantly lower scores in the balanced metrics MCC and κ, indicating frequent inclusion of N2 epochs but not necessarily correct exclusion of others. These observations highlight concerns regarding over-reliance on SE, the most frequently reported metric for stage-specific performance, which potentially overlooks N2 classification challenges. N3 emerged as the most reliably detected stage. It consistently yielded high balanced scores with minimal inter-study variability and strong agreement in both duration and proportion, supporting its robust detection by wEEGs. REM sleep was also generally well classified but displayed higher variability across studies. This, along with slight overestimation in duration, indicates a need for further refinement, especially given the clinical importance of REM-related sleep disorders. Importantly, the sensitivity analysis showed that no single study significantly altered the pooled results, confirming the robustness of the overall meta-analytic findings.

When examining the top studies for overall and stage-specific MCC in the five-stage classification, several consistently stood out. Li et al.27 reported strong results using the WPSG-I headband with six dry electrodes in a controlled setting with 20 mixed (healthy and clinical) participants. Both manual and proprietary algorithmic scoring were evaluated, with manual scoring performing particularly well, though automatic results were also among the best. Borup et al.46 achieved high MCCs using a dry in-ear prototype72 in a home setting with 20 healthy participants over four nights, applying a personalized deep learning model. Borges et al.44 achieved comparable results using a wet in-ear prototype74,75 on mostly participants with obstructive sleep apnea (OSA) in a lab setting, with automatic scoring. While the results suggest strong device performance, they likely reflect not only device quality but also study design factors.

Several methodological variables significantly influenced wEEG performance across studies. While electrode placements (forehead or ear) and types (dry or wet) showed limited impact, the scoring method and electrode count consistently influenced stage-specific outcomes. Manual and deep learning-based scoring generally outperformed machine learning, highlighting the importance of algorithm complexity and human oversight, especially for difficult stages like N1. Electrode count was positively associated with N3 performance, reinforcing the importance of spatial resolution for detecting slow-wave sleep. Participant health status and recording environment also affected performance, though these findings should be interpreted cautiously given the limited sample sizes in some subgroups. While these results underscore the importance of methodological context when evaluating wEEG systems, many subgroup-level differences were highly sensitive to the inclusion or exclusion of just one or two studies. This was particularly true for underrepresented methodological groups, such as those involving clinical or mixed populations, home-based recordings, or deep learning and proprietary algorithms, where few studies were available. As a result, differences between these subgroups should be interpreted with caution, as they may reflect study-specific effects rather than consistent, generalizable patterns.

A smaller subset of studies reported results using a four-stage classification system, and although limited in number (n = 7), they offered additional insights that in part mirrored the five-stage findings. As with five-stage studies, ACC tended to overestimate performance relative to balanced metrics such as MCC and κ, again pointing to inconsistencies in individual stage classification. Differences between micro- and macro-averaged ACC were also present, though performance across SE, PPV, and F1 was more uniform, suggesting less skew from dominant stages. Wake stage showed substantial variability across studies, with the lowest SE but high NPV, and relatively strong agreement in duration and proportion, indicating wEEGs effectively detected its absence but often missed its presence. Light Sleep mirrored challenges seen with N2, with low MCC and κ, signaling misclassification issues. Deep Sleep was the most reliably detected, echoing five-stage results, with high MCC and minimal bias in proportion estimates. REM was generally well classified, though lower MCC and overestimation in duration again highlighted areas for refinement.

The significant variability in stage-specific performance across studies highlights the need for a consistent and robust approach to evaluating wEEG-based sleep staging. Commonly used metrics such as overall ACC, SP, and NPV tend to overstate performance, largely due to the disproportionate influence of TNs in class-imbalanced data. This was evident in our analysis, where these metrics consistently outperformed more balanced alternatives like MCC across both five- and four-stage classification systems. To ensure comparability and reliability in future research, we propose a standardized framework for validating wEEG systems against PSG:

  1. Use of balanced metrics (MCC or κ): For both overall and stage-specific evaluation, we recommend MCC and κ. Unlike overall accuracy, which can be misleading in imbalanced datasets, these metrics consider all elements of the confusion matrix and remain informative even when some stages (e.g., N1 or Wake) are underrepresented. MCC, in particular, offers a single, interpretable value that reflects the overall quality of classification and is increasingly recommended for imbalanced settings76. Stage-specific evaluation should rely on per-class MCC or κ, which reflect true classification quality even for minority stages like N1, and be complemented by sensitivity (SE) to highlight detection ability. A minimal numeric illustration follows this list.

  2. Reporting confusion matrices: Confusion matrices detailing the labels from wEEGs and from manually scored PSG should be included to enable detailed analysis.

  3. Adopting standardized staging systems: Sleep should be classified into Wake, N1, N2, N3, and REM, or into Wake, Light Sleep, Deep Sleep, and REM. Four studies in this meta-analysis did not employ five- or four-stage classification systems, making them incomparable to other studies.
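To make the case for balanced metrics concrete, the short Python sketch below scores a hypothetical device that never outputs the rare N1 stage. The 7% prevalence matches the pooled five-stage distribution reported above; all other counts are synthetic and purely illustrative.

```python
# One-vs-rest view of the rare N1 stage for a hypothetical device that
# never predicts N1 (synthetic counts; 7% N1 prevalence as reported above).
n_total = 10_000
n_n1 = 700                       # epochs the PSG reference labels as N1
tp, fn = 0, n_n1                 # no N1 epoch is ever detected...
fp = 0                           # ...and N1 is never falsely predicted
tn = n_total - n_n1

acc = (tp + tn) / n_total        # 0.93 -- looks excellent
sp = tn / (tn + fp)              # 1.00 -- "perfect" specificity
npv = tn / (tn + fn)             # 0.93
se = tp / (tp + fn)              # 0.00 -- the stage is missed entirely
# MCC's denominator (TP+FP)(TP+FN)(TN+FP)(TN+FN) contains a zero factor
# here; by common convention the score is reported as 0, i.e., random-level.
print(acc, sp, npv, se)
```

Accuracy, specificity, and NPV all look excellent while the stage is never detected; only the balanced metrics (and SE) expose the failure.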

This meta-analysis, while comprehensive, faced certain limitations. First, potential language and publication bias may have influenced the included studies, as only English-language, peer-reviewed publications were considered. This may have excluded relevant non-English data, potentially skewing results. Additionally, four analyzed studies did not employ five- or four-stage classification systems, making them incomparable to others. Moreover, most studies included in this meta-analysis focused on participants in their mid-20s to mid-30s, leading to a notable age bias. This overrepresentation limits the generalizability of findings, particularly for older adults who experience age-related changes in sleep architecture, such as reduced N3 and REM sleep, increased fragmentation, and altered EEG signal patterns, all of which may affect wEEG accuracy. Clinical populations were also underrepresented. Most studies were conducted on healthy participants, despite the increasing interest in applying wEEG systems to monitor or diagnose sleep disorders. Altered sleep architecture, comorbidities, medication use, and atypical sleep patterns may alter EEG signals and algorithm performance in ways not captured by current validation studies.

Future research directions should focus on expanding the scope of wEEG validation studies. Dedicated studies involving both older adults and clinical cohorts are critical to establishing the real-world reliability and clinical applicability of wEEG systems. Longitudinal and ambulatory studies capturing night-to-night variability and real-life usage conditions would improve ecological validity. Studies should also report on device wearability and user adherence, especially for home-based applications, to assess long-term feasibility. In addition to staging accuracy, future work should also assess wEEG reliability in estimating other sleep parameters—such as sleep onset latency and wake after sleep onset. Comparative analyses examining the impact of specific electrode placement in different device types on sleep stage classification can yield critical insights into optimizing wEEGs. Incorporating alternative technologies like PPG or actigraphy can also enhance the breadth of comparisons across different sleep staging technologies. Ethical considerations, including data privacy and responsible handling of sleep health data, must also be addressed as wEEG use expands beyond research settings. Crucially, adherence to the standardized framework proposed in this analysis is fundamental for future research. Embracing this framework will aid in systematically assessing and refining wEEGs’ capabilities and limitations in classifying sleep stages.

The increased use of wEEGs in sleep monitoring presents significant potential in both clinical and consumer health. The analysis of 43 validation studies revealed a substantial overall agreement between wEEGs and PSG. However, their performance in accurately classifying specific sleep stages varied considerably. The N1 stage presented significant classification challenges, while the N3 or Deep Sleep stage was the most reliably identified stage. These discrepancies are critical, as misclassifications can lead to incorrect sleep architecture analysis, directly impacting clinical decisions and patient care. Furthermore, manual scoring generally outperformed automatic methods, particularly for the N1 stage, highlighting the need for improved algorithms. Additionally, electrode count was positively associated with N3 detection, suggesting that spatial resolution remains an important factor for optimizing wEEG performance. To address the variabilities in stage-specific performance, we propose a standardized evaluation framework emphasizing balanced metrics like MCC and κ, and consistency in reporting validation results. This approach aims to facilitate advancements in wEEG technology by enhancing comparisons across studies and devices. As wEEGs become more integrated into home-based sleep monitoring, ensuring their reliability becomes crucial for clinical adoption. This meta-analysis paves the way for future research to refine wEEGs, particularly for stages with less distinct features like N1, ensuring their reliability across diverse populations and settings. This is essential to fully realize their potential in enhancing sleep monitoring and analysis in both clinical and consumer applications.

Methods

Search strategy

A systematic literature search was conducted across four databases: PubMed, Scopus, IEEE Xplore, and Web of Science. The objective was to identify studies published within the past 10 years that investigated the use of wearable EEG devices for sleep staging in human subjects and were validated against PSG. The search strategy included the following keywords and Boolean operators: “sleep AND (wearable OR mobile OR headband) AND (EEG OR electroencephalogram OR electroencephalography)”. The search was limited to English-language publications. Additionally, the reference lists of all included articles were manually screened to identify any further relevant studies.

Study selection

A total of 1585 records were identified through database searching, and 45 additional records through manual searches of reference lists and websites of wEEGs. After removing duplicates, 903 records remained for the title and abstract screening. Studies were included if they met the following inclusion criteria: conducted within the last 10 years, conducted on human subjects, published in English, used wEEG for sleep staging, and compared the results to a gold standard such as PSG. Studies were excluded if they were reviews or conducted on non-human subjects. After the application of these criteria, 791 records were excluded. The full texts of the remaining 112 studies were assessed for eligibility. Of these, 69 studies were excluded because they either lacked validation on a wearable EEG device, did not include validation against PSG, or did not report sleep staging results. This resulted in the inclusion of 43 validation studies in the final analysis (Fig. 8).

Fig. 8: PRISMA diagram.
figure 8

Diagram illustrating the systematic search and selection process in this meta-analysis. Initially, 1630 records were identified (1585 from databases, 45 from other sources). After removing duplicates, 903 records were screened based on title and abstract, leading to the exclusion of 791 records. Full-text assessment of 112 articles resulted in the exclusion of 69 for the following reasons: 44 studies developed models using PSG data but did not validate them on a wearable EEG device, 16 did not include validation against PSG, and 7 did not report any sleep staging results. Ultimately, 43 studies were included, categorized into three groups: 32 studies classifying Wake, N1, N2, N3, REM; 7 studies classifying Wake, Light, Deep, REM; and 4 studies using other classification systems.

Data extraction

Extracted information covered study design and device characteristics, including the device name, number of recording electrodes, their placements, and types (wet or dry). Participant-related data included the number of participants, age, and population type (e.g., healthy individuals or clinical populations). Additional details such as the study environment (e.g., home or controlled setting), number of recorded nights per participant, and total number of analyzed epochs were also recorded. For each study, the sleep staging method used by the wEEG (e.g., manual scoring, machine learning, or deep learning), the validation approach (e.g., comparison against PSG scored by an expert), and the sleep stage classification scheme (e.g., five-stage: Wake, N1, N2, N3, REM; or four-stage: Wake, Light, Deep, REM) were documented. Confusion matrices and reported evaluation metrics were collected as the primary data for quantitative analysis. When confusion matrices were not explicitly provided, they were reconstructed based on available information. All confusion matrices and details on their derivation are provided in the Supplementary Materials.
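The per-study reconstruction rules are documented in the Supplementary Materials. As a minimal sketch of the basic algebra involved, when a study reported a stage’s sensitivity, positive predictive value, the number of reference epochs for that stage, and the total epoch count, the one-vs-rest cells can be recovered directly; the function name and example values below are illustrative only.

```python
def reconstruct_ovr_cells(se: float, ppv: float, n_true: int, n_total: int):
    """Recover one-vs-rest TP/FN/FP/TN for a stage from reported
    sensitivity (SE), positive predictive value (PPV), the number of
    epochs the reference assigned to the stage, and the total epochs."""
    tp = se * n_true          # SE = TP / (TP + FN), with TP + FN = n_true
    fn = n_true - tp
    fp = tp / ppv - tp        # PPV = TP / (TP + FP)
    tn = n_total - tp - fn - fp
    return tp, fn, fp, tn

# e.g., a stage with SE = 0.80, PPV = 0.75 covering 300 of 1000 epochs:
print(reconstruct_ovr_cells(0.80, 0.75, 300, 1000))  # (240.0, 60.0, 80.0, 620.0)
```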

Data analysis

(1) The confusion matrices were used to derive true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Using these values, the overall agreement between the wEEG and the reference standard was evaluated with Cohen’s Kappa (κ), calculated as shown in Eq. (1):

$${\kappa }_{\mathrm{multiclass}}=\frac{{p}_{j}-{p}_{e}}{1-{p}_{e}}$$
(1)

where

$${p}_{j\,(\mathrm{multiclass})}=\frac{\mathop{\sum }\limits_{i=1}^{n}{C}_{ii}}{N}$$
(2)

and

$${p}_{e\,(\mathrm{multiclass})}=\mathop{\sum }\limits_{i=1}^{n}\left(\frac{{C}_{i+}\times {C}_{+i}}{{N}^{2}}\right)$$
(3)

with

  • C: confusion matrix.

  • Cii: the diagonal elements (true positives for each class).

  • n: the number of classes.

  • N: total number of epochs (sum of all elements in the confusion matrix).

  • Ci+: sum of elements in row i (true instances of class i).

  • C+i: sum of elements in column i (predicted instances of class i).

κ values are interpreted as follows: <0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and over 0.80 excellent agreement77.
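A minimal Python rendering of Eqs. (1)–(3), assuming the confusion matrix is oriented as in the definitions above (rows = PSG reference stages, columns = device stages):

```python
import numpy as np

def kappa_multiclass(C: np.ndarray) -> float:
    """Cohen's kappa from an n x n confusion matrix C,
    with rows = PSG reference stages and columns = device stages."""
    N = C.sum()
    p_j = np.trace(C) / N                               # Eq. (2): observed agreement
    p_e = (C.sum(axis=1) * C.sum(axis=0)).sum() / N**2  # Eq. (3): chance agreement
    return (p_j - p_e) / (1 - p_e)                      # Eq. (1)
```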

To provide a balanced performance measure that accommodates variations in class sizes, the multiclass Matthews correlation coefficient (MCC) was computed using Eq. (4); MCC scores range from −1 (total disagreement), through 0 (random prediction), to +1 (perfect prediction).

$${\mathrm{MCC}}_{\mathrm{multiclass}}=\frac{c\times N-\mathop{\sum }\limits_{i=1}^{n}{C}_{i+}{C}_{+i}}{\sqrt{\left({N}^{2}-\mathop{\sum }\limits_{i=1}^{n}{C}_{i+}^{2}\right)\left({N}^{2}-\mathop{\sum }\limits_{i=1}^{n}{C}_{+i}^{2}\right)}}$$
(4)

where

$$c=\mathop{\sum }\limits_{i=1}^{n}{C}_{ii}$$
(5)

and

$${C}_{i+}=\mathop{\sum }\limits_{j=1}^{n}{C}_{ij},\qquad {C}_{+i}=\mathop{\sum }\limits_{j=1}^{n}{C}_{ji}$$
(6)

with

  • C: confusion matrix.

  • Cij: number of epochs known to be in class i but predicted to be in class j.

  • Cii: the diagonal elements (true positives for each class).

  • c: total number of correctly classified epochs.

  • n: the number of classes.

  • N: total number of epochs (sum of all elements in the confusion matrix).

  • Ci+: sum of elements in row i (true instances of class i).

  • C+i: sum of elements in column i (predicted instances of class i).
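A direct Python rendering of Eq. (4) follows; this is a minimal sketch, and for raw label sequences (rather than a pre-computed confusion matrix) sklearn.metrics.matthews_corrcoef computes the same quantity.

```python
import numpy as np

def mcc_multiclass(C: np.ndarray) -> float:
    """Multiclass MCC from an n x n confusion matrix C (Eq. (4))."""
    N = C.sum()
    c = np.trace(C)            # correctly classified epochs, Eq. (5)
    t = C.sum(axis=1)          # row marginals C_i+ (reference counts)
    p = C.sum(axis=0)          # column marginals C_+i (device counts)
    num = c * N - (t * p).sum()
    den = np.sqrt((N**2 - (t**2).sum()) * (N**2 - (p**2).sum()))
    return num / den if den else 0.0  # 0 by convention when undefined
```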

For sleep-stage-specific evaluation, the following metrics were computed: ACC (Eq. (7)), κ (Eq. (8)), sensitivity (SE) (Eq. (11)), specificity (SP) (Eq. (12)), positive predictive value (PPV) (Eq. (13)), negative predictive value (NPV) (Eq. (14)), F1 score (Eq. (15)), and MCC (Eq. (16)).

$${\rm{ACC}}=\frac{{\rm{TN}}+{\rm{TP}}}{{\rm{TN}}+{\rm{TP}}+{\rm{FN}}+{\rm{FP}}}$$
(7)
$${\kappa }=\frac{{p}_{j}-{p}_{e}}{1-{p}_{e}}$$
(8)

where

$${p}_{j}=\frac{TN+TP}{N}$$
(9)

and

$${p}_{e}=\frac{(\mathrm{true\; YES}\times \mathrm{predicted\; YES})+(\mathrm{true\; NO}\times \mathrm{predicted\; NO})}{{N}^{2}}$$
(10)

with

  • true YES: number of times the ground truth identifies a particular class (row total in the confusion matrix).

  • predicted YES: number of times the device identifies a particular class (column total in confusion matrix).

  • N: total number of epochs (sum of all elements in the confusion matrix)

$${\rm{SE}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FN}}}$$
(11)
$${\rm{SP}}=\frac{{\rm{TN}}}{{\rm{TN}}+{\rm{FP}}}$$
(12)
$${\rm{PPV}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FP}}}$$
(13)
$${\rm{NPV}}=\frac{{\rm{TN}}}{{\rm{TN}}+{\rm{FN}}}$$
(14)
$${\rm{F}}1=\frac{2\times {\rm{SE}}\times {\rm{PPV}}}{{\rm{SE}}+{\rm{PPV}}}$$
(15)
$${\rm{MCC}}=\frac{{\rm{TP}}\times {\rm{TN}}-{\rm{FP}}\times {\rm{FN}}}{\sqrt{({\rm{TP}}+{\rm{FP}})({\rm{TP}}+{\rm{FN}})({\rm{TN}}+{\rm{FP}})({\rm{TN}}+{\rm{FN}})}}$$
(16)
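A compact sketch of the one-vs-rest computation behind Eqs. (7)–(16), again assuming rows = reference and columns = device, with a zero-based stage index:

```python
import numpy as np

def stage_metrics(C: np.ndarray, i: int) -> dict:
    """Stage-specific metrics (Eqs. (7)-(16)) for stage i of an
    n x n confusion matrix C (rows = reference, columns = device).
    No zero-division guards: a sketch, not production code."""
    N = C.sum()
    tp = C[i, i]
    fn = C[i, :].sum() - tp
    fp = C[:, i].sum() - tp
    tn = N - tp - fn - fp
    se, sp = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    p_j = (tp + tn) / N                                   # Eq. (9)
    p_e = ((tp + fn) * (tp + fp)
           + (tn + fp) * (tn + fn)) / N**2                # Eq. (10)
    return {
        "ACC": p_j,
        "kappa": (p_j - p_e) / (1 - p_e),
        "SE": se, "SP": sp, "PPV": ppv, "NPV": npv,
        "F1": 2 * se * ppv / (se + ppv),
        "MCC": (tp * tn - fp * fn)
               / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```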

Studies were categorized based on their use of either a five-stage classification system (i.e., N1, N2, N3, REM, Wake) or a four-stage system (e.g., Wake, Light, Deep, REM). In multiclass classifications, the above evaluation metrics are often averaged to reflect overall device performance. Micro-averaged metrics, which give equal weight to each epoch, were calculated by aggregating the counts of TPs, FPs, TNs, and FNs (Eqs. (17)–(22)).

$${\mathrm{ACC}}_{micro}=\frac{\mathop{\sum }\limits_{i=1}^{n}{C}_{ii}}{N}$$
(17)

where

  • C: confusion matrix.

  • Cii: the diagonal elements (true positives for each class).

  • n: the number of classes.

  • N: total number of epochs (sum of all elements in the confusion matrix).

$${{SE}}_{micro}=\frac{\sum {\rm{TP}}}{\sum {\rm{TP}}+\sum {\rm{FN}}}$$
(18)
$${{SP}}_{micro}=\frac{\sum {\rm{TN}}}{\sum {\rm{TN}}+\sum {\rm{FP}}}$$
(19)
$${{PPV}}_{micro}=\frac{\sum {\rm{TP}}}{\sum {\rm{TP}}+\sum {\rm{FP}}}$$
(20)
$${{NPV}}_{micro}=\frac{\sum {\rm{TN}}}{\sum {\rm{TN}}+\sum {\rm{FN}}}$$
(21)
$${{F}1}_{micro}=\frac{2\times {{SE}}_{micro}\times {{PPV}}_{micro}}{{{SE}}_{micro}+{{PPV}}_{micro}}$$
(22)

Macro-averaged metrics, which assign equal importance to each class, were calculated by averaging the metrics computed for each stage (Eqs. (23)–(29)).

$${{ACC}}_{macro}=\frac{{{ACC}}_{1}+{{ACC}}_{2}+\ldots +{{ACC}}_{n}}{n}$$
(23)
$${\kappa }_{macro}=\frac{{\kappa }_{1}+{\kappa }_{2}+\ldots +{\kappa }_{n}}{n}$$
(24)
$${{SE}}_{macro}=\frac{{{SE}}_{1}+{{SE}}_{2}+\ldots +{{SE}}_{n}}{n}$$
(25)
$${{SP}}_{macro}=\frac{{{SP}}_{1}+{{SP}}_{2}+\ldots +{{SP}}_{n}}{n}$$
(26)
$${{NPV}}_{macro}=\frac{{{NPV}}_{1}+{{NPV}}_{2}+\ldots +{{NPV}}_{n}}{n}$$
(27)
$${{F}1}_{macro}=\frac{{{F}1}_{1}+{{F}1}_{2}+\ldots +{{F}1}_{n}}{n}$$
(28)
$${{MCC}}_{macro}=\frac{{{MCC}}_{1}+{{MCC}}_{2}+\ldots +{{MCC}}_{n}}{n}$$
(29)
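The practical difference between the two averaging schemes is easiest to see on a small, deliberately imbalanced example; the matrix below is illustrative and not taken from any included study.

```python
import numpy as np

def se_micro(C: np.ndarray) -> float:
    """Micro-averaged sensitivity (Eq. (18)): pool TP/FN counts over stages."""
    tp = np.diag(C)
    fn = C.sum(axis=1) - tp
    return tp.sum() / (tp.sum() + fn.sum())

def se_macro(C: np.ndarray) -> float:
    """Macro-averaged sensitivity (Eq. (25)): mean of per-stage sensitivities."""
    return float(np.mean(np.diag(C) / C.sum(axis=1)))

# Illustrative 3-stage matrix dominated by the first stage (rows = reference).
C = np.array([[80, 5, 5],
              [ 8, 2, 0],
              [ 6, 0, 4]])
print(f"{se_micro(C):.2f}")  # 0.78 -- pulled up by the dominant stage
print(f"{se_macro(C):.2f}")  # 0.50 -- exposes the weak minority stages
```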

For each sleep stage and overall performance, the evaluation metrics were used to compute the mean and standard deviation across studies. When confusion matrices were available, the metrics were computed directly from these; otherwise, reported metrics from the original publications were used. The results of studies classifying five stages and four stages were analyzed separately, as each device’s scoring model was developed and validated for a specific staging structure; reclassification would not reflect intended device behavior or validation protocols.

(2) Significant differences between MCC and all other metrics, for both overall and sleep-stage-specific evaluations, were analyzed using the Mann–Whitney U test, which is suitable for small sample sizes and non-normally distributed data.
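For instance, comparing per-study ACC against per-study MCC could look like the sketch below; the score lists are hypothetical placeholders, not values from the analysis.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-study scores; the real values are in the Supplementary Tables.
acc_scores = [0.78, 0.81, 0.85, 0.77, 0.80, 0.74, 0.83]
mcc_scores = [0.62, 0.70, 0.74, 0.68, 0.71, 0.60, 0.73]

stat, p_value = mannwhitneyu(acc_scores, mcc_scores, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```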

(3) Bland–Altman plots were analyzed for differences in stage durations and proportions between device and ground truth scorings.
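A minimal sketch of the bias and 95% limits of agreement underlying those plots; the arrays hold placeholder per-night stage durations, not study data.

```python
import numpy as np

def bland_altman(device: np.ndarray, reference: np.ndarray):
    """Bias and 95% limits of agreement between paired measurements,
    e.g., per-night N3 duration (minutes) from the wEEG vs. PSG."""
    diff = device - reference
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)  # LoA = bias +/- 1.96 SD
    return bias, bias - half_width, bias + half_width

# Placeholder values for illustration only.
weeg = np.array([92.0, 71.5, 88.0, 64.5, 80.0])
psg = np.array([90.0, 70.0, 85.5, 66.0, 77.5])
print(bland_altman(weeg, psg))
```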

(4) Studies classifying five sleep stages were stratified by device characteristics (electrode placement, e.g., forehead or ear; electrode type; and prototype vs. commercially available status) as well as by participants’ health status, study environment, and device sleep staging method. The Mann–Whitney U test was employed to highlight significant differences in evaluation metrics between the groups.

(5) Relationships between evaluation metrics and the number of epochs and participants were explored using the Pearson correlation in studies classifying five sleep stages.

(6) To evaluate the robustness of pooled performance estimates, a leave-one-out (LOO) sensitivity analysis was conducted on all studies using five-stage sleep classification. For each evaluation metric and sleep stage, pooled values were recalculated iteratively by omitting one study at a time, and the maximum absolute (|Δ|) and relative (Δ %) changes were recorded. Because all performance metrics are scaled 0–1, we considered an absolute shift of |Δ| > 0.03 (i.e., ≥3 percentage points) to indicate a potentially influential study. This analysis was also repeated within each subgroup defined by participant health status, study environment, electrode placement, electrode type, device type (prototype vs. commercial), and scoring method, to assess the stability of results within stratified comparisons.
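A sketch of the leave-one-out procedure for a single metric; the per-study values are placeholders, and the ±0.03 influence threshold matches the pre-defined criterion above.

```python
import numpy as np

def loo_sensitivity(scores: np.ndarray, threshold: float = 0.03):
    """Leave-one-out shifts of a pooled (mean) metric across studies.
    Returns the pooled value, the shift caused by omitting each study,
    and a flag marking shifts beyond the influence threshold."""
    pooled = scores.mean()
    shifts = np.array([np.delete(scores, i).mean() - pooled
                       for i in range(scores.size)])
    return pooled, shifts, np.abs(shifts) > threshold

# Placeholder per-study MCC values for one stage.
mcc = np.array([0.70, 0.66, 0.74, 0.69, 0.90, 0.68])
pooled, shifts, influential = loo_sensitivity(mcc)
print(pooled, shifts.round(3), influential)
```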

All data analyses were conducted in Python.