Abstract
Diagnosing sleep disordered breathing requires manual annotation of events from sleep studies, such as nocturnal polysomnography, a process that is time-intensive, costly, and prone to inter-rater variability. Automatic approaches exist but lack generalizability due to signal variability across centers. We develop an automatic apneic breathing event detector to localize and classify obstructive apneas, central apneas, hypopneas, and isolated respiratory events without arousals or desaturations. The model, which uses an end-to-end deep learning architecture, is trained on 5456 polysomnographies and tested on 1099 polysomnographies from six cohorts. The model’s predictions show a strong correlation with expert annotations for apnea-hypopnea index (r² = 0.84) and achieve an F1 score of 0.78 across apnea event types, with specific F1 scores of 0.71, 0.51, and 0.65 for obstructive apnea, central apnea, and hypopnea events, respectively. In two independent, multi-scored datasets, the model performs comparably to or better than individual expert raters. The model’s probabilistic output, termed “apnotyping,” provides insights into sleep disordered breathing etiology, with event probabilities correlating more strongly with key sleep apnea traits (such as loop gain and pharyngeal muscle compensation) than traditional apnea indexes. This probabilistic approach may enhance diagnostic accuracy and support personalized treatment strategies, leading to improved patient outcomes.
Introduction
Sleep-related respiratory diseases, or sleep disordered breathing (SDB), is an umbrella term that includes multiple conditions, of which central sleep apnea (CSA) and obstructive sleep apnea (OSA) are the most common1. The most used metric of CSA and OSA severity is the apnea-hypopnea index (AHI). In this context, sleep “apneas” are defined in adults as complete cessations of breathing lasting at least 10 s (≥ 2 breaths), typically measured using a respiratory cannula measuring inhalation and exhalation. Because respiratory effort is concurrently measured using chest and abdominal belts, apneas can be differentiated as centrally (CA) mediated (cessation of breathing effort) or obstructive (OA). The primary distinction between CSA and OSA lies in their underlying mechanisms: CSA results from reduced ventilation due to decreased central drive of breathing, while OSA stems from upper airway obstruction2. In obstructive events, partial airway collapse often occurs, leading to reduced breathing airflow with or without oxygen desaturation. These partial events are termed hypopneas (HYPO), which are difficult to differentiate as obstructive or centrally mediated.
The system for scoring SDB events is complex, subjective, and has been extensively revised. The American Academy of Sleep Medicine (AASM)3 has established clear definitions for two main types of apnea events, OA and CA. AASM defines OA as ≥ 90% drop in airflow for at least 10 s with breathing effort shown in the thoracoabdominal respiratory movements. CA are defined by a ≥ 90% drop in combined flow amplitudes for at least 10 s without respiratory effort, shown as a lack of thoracoabdominal excursions3. Single sleep apnea events can have both central and obstructive components and are known as “mixed apnea” events. In complex sleep apnea cases, central events typically emerge more clearly when treating obstructive-looking events with Positive Airway Pressure (PAP) therapy4.
Scoring inconsistencies and evolving HYPO definitions over the last 20 years contribute to discrepancies in OSA prevalence across studies5,6,7. HYPO has multiple definitions3,8 and can be defined as either a ≥ 30% (AASM2012) or ≥ 50% (prior to 2012) drop in combined amplitude of an airflow signal lasting at least 10 s and producing a ≥ 3% oxygen desaturation and/or an arousal. The Centers for Medicare and Medicaid Services (CMS) defines hypopneas as requiring a desaturation ≥ 4% regardless of the presence of arousals6, making the resulting hypopnea index (HI) correlate highly with the oxygen desaturation index (ODI). Scoring variation across sites and laboratories, together with the constant evolution of the HYPO scoring definition, has also contributed to inconsistencies in the epidemiological literature5,6,7. In a recent epidemiological study, 60% of men and 30% of women ≥ 40 years old met criteria for significant sleep apnea (AHI ≥ 15/h) using the most recently used HYPO criteria. Complicating the picture further, SDB events that do not meet criteria for HYPO, OA or CA also exist. For example, breathing cessation can present without resulting arousals or desaturations. These events, denoted isolated respiratory events without arousals or desaturations (IRE) in this work, are mere instabilities in breathing that can be observed but are essentially not tabulated in clinical practice9,10. However, they could be clinically meaningful as they are associated with respiratory and heart rate changes, which can be reduced with optimal Continuous PAP (CPAP) treatment11.
The complexity in scoring and high variability between and within scorers have led some investigators to question the AHI as a measure of SDB severity12. Alternative metrics, such as the hypoxic burden, have been proposed. The hypoxic burden, calculated by summing the area between the baseline oxygen saturation level and the oxygen saturation curve for drops of 3% or more associated with respiratory events, has been suggested as a potentially better measure of SDB severity and has been demonstrated to predict the incidence of cardiovascular disease and mortality13,14,15. While the hypoxic burden effectively quantifies oxygen desaturation, it does not capture SDB events that occur without significant drops in blood oxygen levels16,17. This limitation is clinically important, as these non-desaturating events can still cause sleep fragmentation and daytime symptoms. Individuals with these events, usually young and likely characterized by a low arousal threshold, may not be at high risk for chronic cardiovascular disease18. These subjects are likely to have their SDB events evolve into desaturating events with weight gain or aging16. As these events become more frequent, chronic sleepiness increases, potentially raising the arousal threshold and leading to longer events that do cause desaturation. Additionally, higher Body Mass Index (BMI) contributes to SDB severity not only by exacerbating upper airway obstruction but also by potentially inducing hypoventilation due to the increased effort required to breathe against a heavier chest wall, further contributing to reduced ventilation and oxygen desaturation19.
Considering these complex definitions, there is a clear need for revamping the definition of SDB events, on the one hand differentiating central versus obstructive events, and on the other distinguishing events with and without desaturations and/or arousals. Better definitions based on pathophysiology and outcome variables are also needed. Indeed, even just for obstructive events, multiple mechanisms are involved so that subjects with the same exact “passive” anatomy (i.e., the anatomical configuration observed under conditions without significant muscle tone, such as during anesthesia) can have vastly different indices20. Central control of breathing plays a role, as central detection of reduced ventilation directly affects breathing frequency and respiratory effort, a concept called loop gain. Obstructive events lead to breathing instability, which in some subjects, such as those with high loop gain (subjects reacting strongly to changes in carbon dioxide), destabilizes breathing, resulting in more events. These subjects at the extreme would have obstructive, central and mixed events. A second mechanism involves progressive activation of the upper airway dilator muscle, which can contract and tone up to avoid collapse as negative pressure due to the obstruction and increased effort of breathing occurs. These subjects may be more able to move their tongue forward and tense up their airway to avoid collapse during a significant part of the night. The third mechanism, mentioned above, is the ability to wake up to interrupt the event before the desaturation occurs; these subjects are deemed to have a low arousal threshold (younger subjects with arousal-only events are the extreme in this category). Finally, events can be highly positional (worse supine, as the tongue falls backward) or REM sleep dependent (because REM sleep removes any muscle tone compensation).
Although evaluation of these phenotypes in individual patients mandates complex experimentation, such as putting patients on therapeutic PAP and suddenly or progressively dropping pressure21, imperfect proxy measures of loop gain (Lgn), arousal threshold (ArT) and compensatory pharyngeal dilation muscle activation (Vcomp) have been developed based on summary sleep study statistics22,23,24,25,26. These pathophysiological mechanisms are not simply academic, as differential responses to specific therapies should result27,28.
Presumably due to the different definitions and heterogeneous application thereof, several studies show large inter- and intra-scorer variability in expert annotations of SDB events. A recent study8 has demonstrated that agreement on OA, CA, and HYPO events ranges from 0.52 to 0.77. This variability persists even when allowing for temporal flexibility in event identification (i.e., presence/absence of an event was deemed concordant when an event was present within a 30-s window). Several factors contribute to this limited reliability. Subjectivity in determining baseline breathing amplitude is a primary issue, as it is crucial for estimating the 30% drop in flow required for hypopnea identification. The evaluation of hypoxemic episode amplitude also varies depending on the selection of baseline oxygen saturation values. Inconsistencies in arousal annotation, particularly during REM sleep, further complicate the scoring process. Expert fatigue during long, repetitive scoring sessions can lead to inconsistencies, as can “cultural” biases within different sleep laboratories. Changes in event definitions over time have also contributed to scoring inconsistencies. Finally, variations in signal filtering can alter the appearance of recordings, notably for nasal flow, potentially affecting how events are interpreted8,29.
As shown above, defining events, their nature and their pathophysiology remains challenging even in facility-based Type I PSGs where sleep, breathing flow, effort and other measures are measured. Because SDB is so common, but time consuming to score manually, studies have attempted to detect SDB automatically, using simpler setups such as electrocardiogram (ECG)30,31,32, electroencephalogram (EEG)33, or using full night polysomnography (PSG) with all respiratory channels32,34,35,36,37,38,39,40,41. Unfortunately, however, only a few of these studies have used more than a few hundred subjects32,36,37,38,39,40,41, and even fewer have attempted to predict event types (OA, CA and HYPO)35,37,41. Further, many studies have only trained their models on data generated at a single sleep center30,31,32,33,37,38,40,41, a problem known to decrease generalizability as shown by Olesen et al.42. Therefore, there remains an unmet need for a large-scale multi-cohort model that can detect and classify different types of SDB, while simultaneously providing data-driven insights into the pathophysiology of SDB, which remains complex.
In this study, we designed and validated an end-to-end deep learning model, the Apneic Breathing Event Detector (ABED), for detecting OA, CA, HYPO, and IRE events in full-night PSGs. We used multiple independent cohorts for training and testing. ABED detects the location of events precisely as an ideal sleep scoring expert would. Interestingly, comparing with established pathophysiological model measures, we also show that ABED provides insight into the cause of SDB in individual subjects. With this, we aim to increase focus on the definition of sleep apnea and create a robust, readily available detection system that could make sleep apnea diagnosis less costly and more reliable, facilitating the investigation of sleep apnea on a larger scale. We also show that the probability distribution of event type within each event, a term we coin “apnotyping”, contains additional pathophysiological information, showing that deep learning methods evaluating sleep apnea offer the prospect of not only improving reliability but could also better define SDB phenotyping, with potential differential correlations with outcomes of interest, such as cardiovascular disease, disturbed sleep and sleepiness43 and personalized therapy.
Results
ABED was trained on 5456 PSGs from four cohorts available through the National Sleep Research Resource (NSRR)44: the Multi-Ethnic Study of Atherosclerosis (MESA), the Osteoporotic Fractures in Men Study (MrOS), the Cleveland Family Study (CFS), and the Wisconsin Sleep Cohort (WSC). For validation, 546 PSGs were reserved, while 1,099 PSGs were allocated for final testing, including 249 from MESA, 500 from MrOS, 250 from WSC, and 100 from CFS. Additional evaluation was performed on 113 PSGs from DREEM and Alliance datasets, which were scored by multiple experts. In these cohorts, there are three HYPO scoring criteria: the recommended (used in MESA, MrOS, CFS, and DREEM), the alternative (used in Alliance), and the 4% desaturation criteria (similar to how Medicare defines hypopneas and used in WSC).
ABED’s inputs include Nasal Flow, Naso-oral Thermistor, Abdominal and Thoracic Plethysmography, Blood/Oxygen Saturation (SpO2), and Arousal and Wake Probabilities, as derived from our previously described multimodal arousal detector45. Pre-processing involved applying a 4th-order digital Butterworth band-pass filter (0.1–4 Hz) to breathing channels (Nasal Flow, Naso-oral Thermistor, Abdominal and Thoracic Plethysmography) before downsampling to 8 Hz. Other modalities (SpO2, Arousal and Wake Probabilities) were upsampled to 8 Hz, and standardization was applied per PSG, except for SpO2 which was scaled to range from 0 to 1 based on saturation levels of 60% to 100%, with SpO2 below 60% being set to 0, and wake and arousal probabilities that were already in the range 0 to 1. ABED utilizes a 240-s window with 50% overlap for detection. The architecture incorporates channel mixing, feature extraction using a flattened ResNet50, bi-directional LSTM, and classification and localization layers. During inference, non-maximum suppression and classification rules were applied to refine event predictions. An overview of the input is shown in Fig. 1a, and the architecture is shown in Fig. 1b. This approach allowed for a comprehensive detection and classification of obstructive apneas (OA), central apneas (CA), hypopneas (HYPO), and isolated respiratory events without arousals or desaturations (IRE), providing a robust tool for SDB analysis. To simplify, any annotated mixed apneas were labeled as OA due to there being an effort to breathe during the event.
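The scaling and windowing steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Butterworth filtering and resampling steps are omitted, and clipping SpO2 values above 100% to 1 is an assumption (the text only specifies the behaviour below 60%).

```python
from statistics import mean, pstdev

FS = 8          # target sampling rate in Hz (from the text)
WINDOW_S = 240  # window length in seconds (from the text)
OVERLAP = 0.5   # 50% overlap between consecutive windows (from the text)


def scale_spo2(spo2_percent):
    """Map SpO2 from [60, 100] % to [0, 1]; values below 60 % become 0.

    Clipping values above 100 % to 1 is an assumption of this sketch.
    """
    return [min(max((v - 60.0) / 40.0, 0.0), 1.0) for v in spo2_percent]


def standardize(signal):
    """Per-recording z-score: zero mean and unit variance over one PSG."""
    mu = mean(signal)
    sigma = pstdev(signal)
    if sigma == 0:
        # A flat signal carries no information; return zeros instead of dividing by 0.
        return [0.0 for _ in signal]
    return [(v - mu) / sigma for v in signal]


def window_starts(n_samples, fs=FS, window_s=WINDOW_S, overlap=OVERLAP):
    """Start indices of all full windows for a recording of n_samples samples."""
    size = int(window_s * fs)
    step = int(size * (1.0 - overlap))
    return list(range(0, n_samples - size + 1, step))
```

With these defaults each window holds 1920 samples and consecutive windows start 960 samples (120 s) apart, so every sample away from the recording edges is seen in two windows.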
a Flow chart of ABED using Nasal flow, Naso-oral thermistor, Abdominal and Thoracic belts, SpO2, EEG, EOG, and EMG, with the latter three modalities being passed through MAD45 to compute arousal and wake probability. Then, a 240 s window of the 7 input modalities is shown for a subject from the MrOS cohort, including annotations and predictions. Annotations are lines starting and ending with the same color, while boxes are predictions from ABED. b ABED architecture: Modalities are mixed into a single feature channel, amplified, and fed through 16 bottleneck blocks with residual connections79. Two Bi-LSTM layers process temporal features, followed by two 1 × 60 kernel layers to determine 146 object class probabilities and positions. These are then passed to non-maximum suppression to remove overlapping predictions, to only focus on the most certain prediction. c Sleep apnea diagnosis confusion matrix on single-scored cohorts combined (MESA, MrOS, WSC, CFS). The values represent the correctly diagnosed subjects as a fraction of the total number of true diagnoses in that category. d Confusion matrices of event predictions on all single-scored cohorts combined (MESA, MrOS, WSC, CFS). The values represent the ratio of predicted events as a fraction of the total number of true events. e Pearson correlation plots between true and predicted apnea-hypopnea index (AHI) values. The blue line indicates the linear relationship, and the light blue hue shows the 95% confidence intervals. For correlation per cohort, please refer to Supplementary Table 1. f Latent space made using PHATE46 zoomed to the area of apneas. A total of 200 PSGs were used to make the latent space, 50 from each of MESA, MrOS, WSC, and CFS. g, h F1-score distributions for ABED and individual experts on the Alliance (g; n = 58) and DREEM (h; n = 55) datasets. Each n represents independent biological measurements without replicas (one F1-score per unique subject).
Data are shown as box plots: center line = median; box = 25th–75th percentiles (IQR); whiskers = 1.5 × IQR; and minima and maxima are suppressed. A star (★) above a box indicates a significant difference between ABED and the respective expert (two-sided t-test with Bonferroni correction, family-wise α = 0.05). PSG, polysomnography; EEG, electroencephalography; EOG, electrooculography; EMG, electromyography; MAD, multimodal arousal detector; OA, obstructive apnea; CA, central apnea; HYPO, hypopnea; IRE, isolated respiratory event without arousals or desaturations; Ch. mix, channel mixing; Ch. amp, channel amplification; ReLU, rectified linear unit; Bi-LSTM, bidirectional long short-term memory; Batch norm, batch normalization; Conv, convolutional layer; None, mild, moderate, severe, categories defined by AHI < 5, 5–15, 15–30, > 30 events h−1; AHI, apnea–hypopnea index; ICC, intraclass correlation; ABED, Apneic Breathing Event Detector. Source data are provided as a Source Data file.
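The non-maximum suppression step named in the architecture description can be illustrated for one-dimensional (time-interval) predictions. The sketch below is a generic greedy variant written for illustration only; the tuple layout, the function names, and the IoU threshold of 0.5 are assumptions, since the text does not state the settings ABED uses.

```python
def interval_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(candidates, iou_threshold=0.5):
    """Greedy non-maximum suppression over time intervals.

    candidates: list of (start, end, score, label) tuples.
    Candidates are visited from most to least certain; one is kept only if
    it does not overlap an already kept candidate by more than the threshold.
    """
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    kept = []
    for cand in ranked:
        if all(interval_iou(cand[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


# Two heavily overlapping candidates and one isolated candidate.
predictions = [
    (10, 30, 0.9, "OA"),
    (12, 32, 0.6, "HYPO"),
    (100, 120, 0.7, "CA"),
]
kept = nms_1d(predictions)
```

In this toy input the HYPO candidate overlaps the more certain OA candidate (IoU of about 0.82) and is dropped, leaving one prediction per underlying event.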
SDB event detection performance of ABED per cohort and event type
Table 1 reports ABED’s classification metrics, F1, precision (P.), and recall (R.), for each single-scored cohort and event type, along with the average intersection over union (IOU). Metrics are also shown for the detection of apnea events regardless of apnea event class (by definition, IREs are not considered apnea events). In Supplementary Table 1 we show the average of similar metrics per PSG, and in Supplementary Fig. 1 the event detection F1 score is shown as a function of the log-transformed number of events, showing that ABED has a better detection performance in PSGs with many events. Figure 1 summarizes the characteristics of ABED. Figure 1a shows the flow of data from PSG into ABED, including EEG, EOG, and EMG through the multimodal arousal detector (MAD)45 (represented as arousal and wake probabilities), along with the respiratory modalities, nasal flow, naso-oral thermistor, abdominal and thoracic belt, and SpO2. ABED returns the location and type of the predicted event in the 240 s window, shown as transparent boxes, with expert annotations shown as vertical lines. In this example, ABED correctly finds the type and approximate location of four of six annotated events, with one OA overlapping with a nearby unpredicted CA, and one CA misclassified as OA. Figure 1c presents a confusion matrix of ABED’s diagnostic accuracy for sleep apnea on the four severity groups (none (AHI < 5/h); mild (5–15/h); moderate (15–30/h); severe (> 30/h)) across the four single-scored cohorts MESA, MrOS, WSC, and CFS. Additionally, ABED demonstrates 90.6% accuracy in distinguishing central sleep apnea across severity classes. Figure 1d presents a confusion matrix where each cell’s value represents the fraction of events relative to the true annotations within each event category. Figure 1e shows a regression model of predicted AHI versus the annotated AHI with a Pearson correlation of 0.92 and a coefficient of determination of 0.84.
The slope of the regression line is 0.90 with an intercept of 0.84. In Supplementary Table 2 similar results for each cohort in the test set are shown. In Fig. 1f, ABED’s latent space is shown using PHATE46 in 200 randomly selected test PSGs (50 each from MESA, MrOS, WSC, and CFS). Each 240-s window is plotted in a latent space representation, with windows containing events (colored) and those without events (black). In the latent space, IRE (red) is contiguous with HYPO (blue) and OA (green), while the rarer CA (orange) forms a distinct subpopulation, almost orthogonal to the contiguous IRE-HYPO-OA clusters. The HYPO (blue) appears to consist of two separate subpopulations on either side of CA, creating a nearly symmetrical latent space orthogonal to the CA phenotype. Figure 1g, h presents ABED performance in the unseen, multi-expert DREEM47 and Alliance8 datasets by class in comparison to 4–5 experts when scoring 113 PSGs. ABED demonstrates equivalent or better performance than the scorers on these new, unseen datasets. In the plot, the F1 scores of ABED (blue) and of each expert (red) are computed against the remaining experts, excluding the expert being tested. We applied a two-tailed t-test to the F1 scores from experts and ABED predictions to find out if there was any significant difference in F1 scores. We found that ABED performed equal to or better than experts in all cases, with ABED performing better in 8 of 36 comparisons, denoted by a star on the expert F1 score median.
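The event-level metrics and severity groups reported above can be illustrated with a short sketch. It is written under assumptions that the text does not specify: predicted and annotated events are paired greedily one-to-one, a pair counts as a match when its IoU reaches a hypothetical minimum of 0.1, and the boundary values 15/h and 30/h are assigned to the moderate group. The AHI is computed as apneas plus hypopneas per hour of sleep, with IREs excluded as stated in the text.

```python
def interval_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_events(true_events, pred_events, min_iou=0.1):
    """Greedy one-to-one matching; returns (tp, fp, fn, mean IoU of matches)."""
    unmatched = list(pred_events)
    ious = []
    for t in true_events:
        best = max(unmatched, key=lambda p: interval_iou(t, p), default=None)
        if best is not None and interval_iou(t, best) >= min_iou:
            ious.append(interval_iou(t, best))
            unmatched.remove(best)
    tp = len(ious)
    fp = len(unmatched)           # predictions with no annotated counterpart
    fn = len(true_events) - tp    # annotated events that were missed
    mean_iou = sum(ious) / tp if tp else 0.0
    return tp, fp, fn, mean_iou


def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from matched-event counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def ahi(n_apneas_and_hypopneas, total_sleep_hours):
    """Apnea-hypopnea index: apneas plus hypopneas per hour of sleep."""
    return n_apneas_and_hypopneas / total_sleep_hours


def severity(ahi_value):
    """Severity group; assigning 15 and 30 to 'moderate' is an assumption."""
    if ahi_value < 5:
        return "none"
    if ahi_value < 15:
        return "mild"
    if ahi_value <= 30:
        return "moderate"
    return "severe"


annotated = [(0, 10), (20, 30), (50, 60)]
predicted = [(1, 11), (21, 29), (80, 90)]
tp, fp, fn, mean_iou = match_events(annotated, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Because a prediction is removed from the pool once matched, a single long prediction cannot be credited for several annotated events, which keeps precision and recall comparable.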
We investigated the impact of signal quality, as determined by metrics directly obtained from the NSRR, on the F1-score for detecting apnea events. Signal quality was classified as Poor based on the lowest category defined by the NSRR. Our findings indicate that only thoracic and EEG signals classified as Poor significantly worsened the F1-score. Conversely, factors such as an increase in age, AHI, and Asian race positively influenced the F1-score. All significant effects were observed with p-values less than 0.05 and can be seen in Supplementary Fig. 2.
Apnotyping, probability of each event within each majority class and its association with endotypes
ABED not only scores and outlines the location of each event, but also outputs a probability for each event to belong to a specific category; for example, if the highest probability is for OA, the event will be scored as OA, while the sum of the remaining event probabilities can be as high as 0.49. As such, ABED’s probability for each event also denotes how “certain” the algorithm is of this assignment, and which other type of event it is closest to. Because CA, OA, HYPO and IRE form a continuum, the probability of one blends into that of another type of event, as illustrated in Fig. 2. Figure 2a illustrates apnotyping, the probability density of detected apneas per event, representing the phenotype of each apnea event. In Fig. 2a, we compare subjects exhibiting high fractions of different sleep-disordered breathing events: CA (subject 1 from the top), OA (subject 2 from the top), HYPO (subject 3 from the top), and IRE (subject 4 from the top). These subjects are arranged from high to low on the loop gain spectrum. In subjects 1 and 2, who have relatively high loop gain, we observe multiple OA events with a relatively large CA-probability. Subject 2 specifically shows an IRE event with high OA- and CA-probability and an OA event with high IRE-probability. Subjects 3 and 4, who have relatively low loop gain, both primarily exhibit uniprobable events. Only a few of the HYPO and IRE events in these low loop gain subjects show some none-probability.
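One way to summarize apnotyping per subject is a matrix of mean class probabilities conditioned on the scored (highest-probability) class. The sketch below illustrates that idea; the function names and data layout are hypothetical, and only the four event classes are included for brevity even though the text also refers to a none-probability.

```python
CLASSES = ("OA", "CA", "HYPO", "IRE")


def majority_class(probs):
    """The class an event is scored as: the one with the highest probability."""
    return max(CLASSES, key=lambda c: probs[c])


def apnotype_matrix(events):
    """Mean probability of class x among events scored as class y.

    events: list of dicts mapping class name -> probability, one per event.
    Returns matrix[y][x]; rows with no events are filled with None.
    """
    sums = {y: {x: 0.0 for x in CLASSES} for y in CLASSES}
    counts = {y: 0 for y in CLASSES}
    for probs in events:
        y = majority_class(probs)
        counts[y] += 1
        for x in CLASSES:
            sums[y][x] += probs[x]
    return {
        y: {x: (sums[y][x] / counts[y] if counts[y] else None) for x in CLASSES}
        for y in CLASSES
    }


# Three made-up events for one subject: two scored as OA, one as HYPO.
subject_events = [
    {"OA": 0.6, "CA": 0.3, "HYPO": 0.1, "IRE": 0.0},
    {"OA": 0.8, "CA": 0.1, "HYPO": 0.1, "IRE": 0.0},
    {"OA": 0.1, "CA": 0.0, "HYPO": 0.7, "IRE": 0.2},
]
matrix = apnotype_matrix(subject_events)
```

Here `matrix["OA"]["CA"]` is the average CA-probability carried by events scored as OA, the kind of quantity the text describes for high loop gain subjects.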
a Apnotyping plot showing the seventh hour of four subjects from the MESA cohort, arranged with decreasing loop gain from top to bottom. b Bar plot of linear stepwise regression models using summary statistics and demographics (base) to predict loop gain (Lgn), arousal threshold (ArT), and pharyngeal muscle compensation (Vcomp). The models utilize summary statistics from true annotations, predicted annotations, and apnotyping from ABED. c Confusion matrix of apnotyping with the most certain events on the rows and their mean probabilities shown for each event in the columns. The diagonal represents ABED’s certainty that the events are predicted correctly. d Forest plots displaying the univariate effect of the chosen summary statistics in each stepwise model: base, true, predicted, and predicted + apnotyping for each of the predicted values: Lgn, Vcomp, and ArT. Coefficients (β) were estimated using univariate ordinary least squares (OLS) models on n = 249 independent subjects. Horizontal bars indicate 95% confidence intervals from the student-t distribution (β ± t₀.₉₇₅ × SE), reflecting statistical uncertainty. The dashed line at 0 denotes no effect. Data are shown as β ± SE (95% CI). LGN, loop gain; ART, arousal threshold; VCOMP, compensatory pharyngeal dilation muscle activation; BMI, body mass index; OA, obstructive apnea; CA, central apnea; HYPO, hypopnea; IRE, isolated respiratory event without arousals or desaturations; OAI, CAI, HI, IREI, respective event indices; CAI_pred, IREI_pred, predicted central apnea and isolated respiratory event indices; P(X|Y), conditional probability of event X given event Y. Source data are provided as a Source Data file.
Sleep apnea pathophysiology involves (1) passive anatomy, i.e., how restricted the airway is in the absence of any effort, often defined as the critical closing pressure, the minimal negative pressure that needs to occur to collapse the airway; (2) loop gain (Lgn), a measure of the respiratory system’s sensitivity to CO₂ fluctuations and its ability to stabilize breathing adjustments; (3) compensation, the ability of the upper airway to resist collapse by dilating the airway through the stimulation of dilator muscles (Vcomp); and (4) arousal threshold (ArT), how sensitive the sleeper is to a collapsing airway, resulting in an arousal. Each trait is defined by fluctuations in ventilation and ventilatory drive, estimated using a chemoreflex model with least-squares regression. Ventilatory drive is calculated by fitting a feedback control model to the ventilation signal during open-airway breaths. Similarly, upper airway dilator compensation is measured by comparing peak ventilation during muscle activation at the arousal threshold with normal ventilation. These metrics correlate with traditional methods like PAP pressure drops and were derived from the MESA cohort48. We therefore studied how apnotypes correlate with physiological subtypes of sleep apnea on the 249-subject subset from the MESA study, reserved for testing in this work. We applied stepwise linear models to predict loop gain, arousal threshold, and pharyngeal muscle compensation using four groups of information: (1) demographics, including BMI, age, and sex; (2) manually annotated indexes, such as the apnea-hypopnea index (AHI), obstructive apnea index (OAI), central apnea index (CAI), hypopnea index (HI), and the isolated respiratory events without arousals or desaturations index (IREI); (3) predicted indexes matching the same categories; and (4) apnotyping, which included predicted indexes and the per-subject average ABED probabilities of each event type given each predicted event type, P(X|Y).
Figure 2b presents the results of the stepwise linear regression models applied separately to four groups of information (demographics, annotations, predictions and apnotyping) for the three predicted variables (Lgn, Vcomp, and ArT), reporting their adjusted R2. All models included demographics, and the results indicate that apnotyping is the strongest feature set for predicting loop gain and pharyngeal muscle compensation, outperforming manual annotation. In contrast, manual annotations demonstrated stronger predictive power in the arousal threshold prediction task, with apnotyping being a close second. Figure 2c illustrates the probabilistic overlap between different classes, showing that CA and OA share the highest probability overlap, followed by OA with HYPO, and then HYPO with IRE. These patterns reflect their physiological similarities and the most commonly observed pathophysiological overlaps. Figure 2d presents a forest plot displaying the univariate effects of each information group (demographics, annotations, predictions, and apnotyping) on predicting loop gain, arousal threshold, and pharyngeal muscle compensation. The multivariate effects are shown in Supplementary Table 4. Our analysis highlights specific physiological correlations. For example, high loop gain is most strongly predicted by a high isolated respiratory event index and central apnea index. Events classified as central apneas within IREs and IREs within hypopneas appear to be protective for loop gain. A high probability of IREs within hypopneas is associated with a lower arousal threshold, whereas a high probability of hypopneas, whether within hypopneas or obstructive apneas, correlates with a higher arousal threshold. Similarly, a high OA probability within IREs is linked to an elevated arousal threshold. Interestingly, pharyngeal muscle compensation shows a strong positive association with isolated respiratory events and a negative association with obstructive apnea index.
A high probability of hypopneas within obstructive apneas suggests low pharyngeal muscle compensation.
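Because the four feature groups differ in how many predictors they offer, the models above are compared by adjusted R2, which discounts the fit for every added predictor. The helper below shows the standard formula only; the function name is hypothetical and the R2 values in the example are made up for illustration, not taken from the study.

```python
def adjusted_r2(r2, n_subjects, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_subjects - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more subjects than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_subjects - 1) / dof


# Same raw fit on n = 249 subjects with 3 versus 10 predictors.
small_model = adjusted_r2(0.5, 249, 3)
large_model = adjusted_r2(0.5, 249, 10)
```

With an identical raw R2, the model with more predictors receives the lower adjusted value, so a richer feature set such as apnotyping only ranks higher when its extra predictors add explanatory power.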
ABED includes the detection of IREs, events that share features with HYPOs but are not associated with arousals or desaturations, and which are associated with changes in respiratory and heart rate11. This class had the weakest performance, with 22% of IREs being missed completely. We included these events for two main reasons: first, introducing IREs was expected to improve sleep apnea diagnosis performance by providing a class between regular breathing and HYPOs, potentially reducing over-detection of HYPOs and improving AHI estimation. Second, although IREs are seldom studied, they clearly occur in cohorts where arousal and breathing channels were scored independently, suggesting they are genuine SDB events deserving of study. The inclusion of IRE scoring in ABED decreased the over-detection of HYPOs.
OAs are most frequently misclassified as HYPOs, and HYPOs as IREs, reflecting their shared mechanisms of flow reduction and progression as variations of the same obstructive phenomenon. In contrast, few HYPOs are misclassified as CAs, aligning with the relative rarity of central events. At the macro level, ABED’s predicted AHI correlates highly with expert-annotated AHI (R² = 0.84), with a regression slope close to unity (β = 0.90) and limited bias (intercept of 0.84 events/h). Furthermore, ABED accurately determines severity in 75.3% of cases, with very few cases more than one severity group away. This indicates that ABED can reliably diagnose sleep apnea with a fast inference time of 14 s per average PSG, outperforming human experts in speed. The use of ABED could ultimately increase diagnosis rates and help alleviate the economic burden caused by undiagnosed sleep apnea49.
ABED may also facilitate large-scale global studies, such as genetic or epidemiological research, by providing more consistent apnea event scoring than human experts. Despite the gold standard guidelines from the AASM, the actual annotation process varies based on individual clinics and experts’ interpretations. As shown in this and other studies8,29, inter-clinic and inter-expert annotation bias and variance are high, leading to uncertainty in diagnoses across different clinics. ABED’s objective detections offer a reliable and consistent diagnosis, which could enhance global investigations into the consequences and causes of sleep apnea. Our signal quality analysis suggests that the model is resilient to variations in signal quality. ABED demonstrates strong capabilities in identifying signal quality, allowing for accurate annotations even in noisy conditions. However, its robustness to low signal quality is limited for EEG and thoracic modalities. Based on the visualization of the latent space of 240-s windows, two notable findings emerge: (1) ABED shows clear latent structures for windows containing each event type, and (2) in the latent space, regular breathing transitions to IREs, then HYPOs, which merge into OAs and CAs. This indicates that ABED understands breathing irregularities as a severity continuum of similar phenomena.
When comparing ABED to multiple experts and previous multi-class and single-class apnea detectors32,36,37,38,39,40,41, ABED consistently outperforms them on several key metrics. Specifically, ABED achieves an F1 score of 0.78 for event detection (apnea vs. non-apnea), surpassing the previous highest F1 score of 0.70 by Olesen et al.38. Furthermore, ABED demonstrates a significant improvement in the mean F1 scores for obstructive, central, and hypopnea events, achieving a mean F1 score of 0.62, compared to the previous highest reported score of 0.43 by Nassi et al.41. While ABED excels in event detection, its performance is not uniformly superior across all areas. It achieved a slightly inferior R² for AHI correlation (0.844), with Nassi et al.41 reporting the highest R² of 0.880. ABED ranked first in diagnosing the presence of sleep apnea (AHI > 15) but placed second in determining apnea severity, indicating a slight gap in its ability to differentiate between various levels of severity compared to other models. An overview of this comparison is shown in Supplementary Table 3 in the Supplementary Information.
Although ABED performs well in estimating AHI (R² = 0.84, slope = 0.90, ICC = 0.92), we acknowledge that recent single-channel approaches, using only SpO₂, have demonstrated similarly high diagnostic accuracy, with ICC values ranging from 0.92 to 0.94 on the MESA, MrOS, and CFS cohorts50. However, ABED is designed not merely to estimate AHI but to characterize the temporal structure and subtype of each respiratory disturbance. By providing interpretable, event-level annotations, ABED enables detailed phenotyping and supports endotyping of sleep apnea beyond what AHI-based systems can offer.
As shown in Supplementary Table 2, many previous studies do not report all performance metrics, often presenting only a fragmented picture of their detection and diagnosis capabilities. Those that do offer comprehensive results tend to perform well in specific areas but fall short in others. In contrast, ABED demonstrates a more balanced and robust performance across multiple evaluation dimensions, making it a more generalizable apnea detector than previous models. Notably, ABED achieved these results despite being trained on multiple cohorts rather than one. While training on a single cohort often improves performance on that specific dataset, it can hinder the model’s ability to generalize to unseen data. This challenge of generalization is well-documented in the literature42, where Olesen et al. advocate for using more diverse training data to enhance model robustness across cohorts. Finally, ABED was tested on two unseen datasets (Alliance and DREEM), where it performed as well as or better than expert annotations, further supporting the argument for training on diverse data to improve generalizability. Even when confronted with hypopnea events based on definitions it was not specifically trained on (in the Alliance dataset), ABED outperformed expert scorers.
The probabilistic view of apnea scoring provided by ABED informs apnea endotyping and has multiple implications for apnea research, diagnosis, and phenotyping. Firstly, ABED’s automated predictions outperformed manual annotations in modeling loop gain and pharyngeal muscle compensation and closely matched manual annotations in modeling the arousal threshold. This suggests that automated systems can capture subtle patterns that human categorical scoring might miss, potentially leading to more accurate endotyping of sleep apnea. Secondly, incorporating ABED’s prediction probabilities, apnotypes, alongside derived indexes improved model performance further, particularly for Lgn and Vcomp. This indicates that including uncertainty in event detection can enhance the accuracy of sleep apnea endotyping. The finding that manual annotations did not provide significant information in predicting Vcomp highlights potential limitations in current manual scoring methods for capturing certain aspects of sleep apnea physiology. Eventually, apnotyping could lead to more accurate and efficient clinical assessments of sleep apnea endotypes, potentially improving treatment planning and outcomes. This paves the way for apnotyping directly predicting treatment response in obstructive sleep apnea, similar to previous studies investigating the treatment effect of oral appliance therapy in both laboratory and home sleep settings51,52.
Higher central apnea index (CAI) was associated with a high loop gain, which is expected for both annotated and predicted CAI because central apneas are closely linked to instabilities in the body’s ventilatory control system. A high loop gain indicates that the system overreacts to small changes in blood gas levels, leading to oscillations in breathing and an increased number of CAs. HYPO events with high IRE probability have an inverse relationship with loop gain. When the IRE probability is high—indicating events like HYPOs but without causing arousal or desaturation—it may correlate with lower arousal than regular HYPOs, thereby avoiding respiratory oscillations during sleep. Interestingly, compensatory upper airway dilation is the least accurately predicted among the endotypes. However, a high number of IREs per hour indicates strong pharyngeal muscle compensation. Clinically, effective pharyngeal muscle compensation during SDB reduces the severity of respiratory events, preventing them from escalating to apneas. Conversely, a high probability of experiencing a HYPO following an OA suggests low pharyngeal muscle compensation. Inadequate compensation leads to more frequent arousals and oxygen desaturations, which are key features of HYPO events. Our results also align with existing literature regarding sex and age differences, which have shown that men generally have higher loop gain, while older subjects exhibit higher loop gain and arousal threshold values53.
The endotypic traits, loop gain and pharyngeal muscle compensation, have been shown to predict therapeutic responses to various therapeutic approaches54,55,56,57 and have high within- and across-night repeatability (intraclass correlations of ~0.8 for collapsibility and loop gain, ~0.7 for compensation)26,48. The observed relationships between ABED probability distributions and endotypic traits suggest that probabilistic event detection may offer valuable insights into the underlying physiological characteristics of sleep apnea. It is, however, crucial to note that the endotyping model we employed, while more accessible than original invasive techniques, relies on assumptions that have been questioned in the field. For instance, the model assumes a fixed effect of arousals, potentially oversimplifying the known variability in arousal intensity. Therefore, our findings—particularly the potential of ABED-derived probabilities to enhance endotype estimation—should be viewed as exploratory rather than definitive. Future work should focus on validating endotyping approaches across multiple nights, comparing them to gold-standard measurements, and investigating how variability in arousal intensity might impact endotype estimation.
The pathophysiology of SDB is complex, with events varying based on mechanisms such as abnormal central control of breathing (loop gain, Lgn), pharyngeal muscle compensation (Vcomp), or the rapidity of arousals interrupting events (arousal threshold, ArT). Algorithms have been established to derive these traits from PSGs, correlating with parameters measured using classic methods involving positive airway pressure drops22,23 and responses to oxygen and acetazolamide for loop gain. In our analysis, we controlled for variables such as age, sex, and body mass index (BMI). Prior studies have shown that women have lower loop gain, less airway collapsibility, and lower arousal thresholds during NREM sleep58, which could explain their lower AHI. Similarly, elevated collapsibility and greater loop gain have been observed with increased age and BMI48. The variables considered included both annotated and predicted values for AHI, OAI, CAI, ODI, and IREI, alongside the average certainty of ABED predictions. Additionally, we used ABED’s prediction probabilities, termed “apnotyping”, to investigate whether probabilistic detection of event subtypes within each majority class could benefit sleep apnea endotyping.
ABED has many limitations. It does not apply to children, as children have different definitions for scoring SDB events. Full generalization of performance would likely require training on additional datasets. Further, it has its lowest diagnostic accuracy in the none-to-mild apnea range, which was expected as AHI is the least stable in that range59. This is likely also due to the nature of the MESA, MrOS, CFS, and WSC cohorts, which mostly include older subjects, almost all with some SDB; incorporating a low-AHI dataset during training could increase diagnostic accuracy in the none-to-mild sleep apnea range. Future investigations should involve testing ABED on a large dataset from an unseen cohort with low AHI ranges and extending studies to children. Further, adding annotations of respiratory effort-related arousals (RERAs), another type of SDB event characterized by ≥10 s of increasing upper airway effort terminating in arousal without meeting hypopnea criteria60,61, could further enhance the detector for more comprehensive event detection; correlating scored RERAs with low probability HYPO events leading to arousals would be an interesting direction. Finally, to deepen the investigation into SDB endotypes, it may be beneficial to explore respiratory signals, ECG, and EEG using a self-supervised learning setting62. This approach would allow the model to learn from any pattern rather than focusing on specific, predetermined patterns (i.e., OA, CA, HYPO, IRE), potentially leading to new discoveries within the field of SDB. ABED is limited by having mixed apneas labeled as OA, which reduces the granularity of the detection. However, this is necessary due to the substantial class imbalance between mixed apneas and any other event type. Additionally, ABED’s probabilistic apnea detection should be used to directly model SDB-related outcomes, including daytime sleepiness, cardiovascular disease, CPAP compliance and effect, and stroke.
Exploring these phenotypes in correlation with comorbid insomnia or other clinical presentations of SDB described by others63 could also be of great interest. Despite ABED’s promise via apnotyping, it remains to be shown whether ABED is as effective as endotyping in clinical practice. Future research should ascertain endotyping’s role in predicting patients at risk of adverse health outcomes related to OSA. Furthermore, we have not demonstrated ABED’s ability to identify other informative metrics such as hypoxic burden64, a strong predictor of OSA consequences. Finally, to gain a comprehensive understanding of a PSG, it is necessary to use ABED in conjunction with other automatic sleep scoring algorithms, including sleep staging, cortical and autonomic arousal detection, and leg movement detection45,65,66,67.
ABED is limited by resampling all input signals to 8 Hz to balance computational efficiency with sufficient temporal resolution. This rate captures the breathing dynamics (0.2–0.33 Hz, 12–20 breaths per minute) relevant to sleep apnea while reducing processing demands. However, it is below the AASM-recommended ≥25 Hz for respiratory signals. Prior studies show that sampling rates as low as 4 Hz retain key respiratory features68,69. We acknowledge this as a limitation, particularly for applications requiring fine-grained waveform detail.
In conclusion, we developed and tested an automatic sleep disordered breathing event detector (ABED) that allows for the detection of a new event type: isolated respiratory events without arousals or desaturations (IRE). We showed that ABED performs well in the cohorts it was trained on and that it is better than or equal to 9 experts in two left-out datasets. Moreover, ABED provides a probability for each event type associated with each event it scores. Endotyping apnea with three main traits, i.e., loop gain, arousal threshold, and pharyngeal compensation, we find that automatically scored events, notably IRE, and their probabilities help endotype sleep apnea compared to manually annotated apnea indexes. Although imperfect in its current iteration, this work offers new directions on how to better characterize the full spectrum of SDB events, a much-needed direction in the field61,70,71.
Methods
Data description
All data used in this study were collected under research protocols approved by the local institutional review boards (IRB) at each participating institution, with written informed consent obtained from individuals prior to participation. Specifically, MESA was approved by the IRBs of its six field centers (Columbia University, Johns Hopkins Medicine, Northwestern University, UCLA Office of the Human Research Protection Program, University of Minnesota Human Research Protection Program, and Wake Forest Baptist Health). CFS was approved by the IRBs of Case Western Reserve University and Mass General Brigham. MrOS received approval from the IRBs of the Kuakini Medical Center and the University of Hawai’i at Mānoa. WSC was approved by the University of Wisconsin–Madison Health Sciences IRB. DREEM was approved by the Stanford University IRB. Alliance received approval from IRBs at the University of Pennsylvania, Harvard University, the University of Wisconsin–Madison, St. Luke’s Hospital, and Stanford University.
To train and evaluate ABED, we used 6555 full-night PSGs from MESA (Multi-Ethnic Study of Atherosclerosis)72, MrOS (The Osteoporotic Fractures in Men Study)73, WSC (Wisconsin Sleep Cohort)74, and CFS (Cleveland Family Study)75. These cohorts are available at the NSRR (National Sleep Research Resource)44. We also included 113 PSGs scored based on five and four experts’ annotations, datasets that we refer to as DREEM47 and Alliance8, respectively, for evaluation only. All cohorts contain OA, CA, and HYPO annotations. For MESA and MrOS, we leveraged partial breathing cessations that were annotated independently of desaturations and arousals. When these cessations coincided with desaturations or arousals, we created HYPO annotations. The remaining unlinked breathing cessations were used to create an additional class of breathing event denoted as isolated respiratory events without arousals or desaturations (IRE), which were used during training. For WSC and CFS, we used the original annotations without modification. An overview of demographics and sleep apnea statistics is shown in Table 2. Seven modalities were used as input: Nasal Flow (NF), Naso-oral Thermistor (OT), Abdominal Plethysmography (AbP), Thoracic Plethysmography (ThP), Blood Oxygen Saturation (OS), Arousal Probability (AP), and Wake Probability (WP). NF, OT, AbP, ThP, and OS were directly obtained from the PSG, while AP and WP were obtained using our previously described multimodal arousal detector (MAD), which predicts arousals based on central EEG, right and left EOG, and EMG45. All modalities used in this detector are used in the definition of one or more of the events (OA, CA, HYPO, and IRE) and are therefore helpful and important for SDB event subtype detection.
NF and OT measure a decrease in airflow, while AbP and ThP indicate if there is a concomitant effort to breathe. OS shows blood oxygen desaturations that occur after SDB event onset (typically with a 30 s delay). AP shows if there is an arousal following the decreased airflow and is typically associated with a resumption of flow. Finally, as SDB events such as OA, CA, and HYPO can only occur during sleep, WP is important to distinguish irregular breathing during wakefulness from SDB.
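The HYPO/IRE relabeling described above for MESA and MrOS can be illustrated with a minimal sketch. The linking windows (`DESAT_WINDOW_S`, `AROUSAL_WINDOW_S`), the `Interval` type, and the `relabel` function are hypothetical names and values chosen for illustration; the exact linking rule used for ABED is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float  # seconds from recording start
    end: float    # seconds from recording start


# Assumed linking windows (illustrative only, not taken from the paper).
DESAT_WINDOW_S = 45.0    # a desaturation may begin this long after the event ends
AROUSAL_WINDOW_S = 5.0   # an arousal may begin this long after the event ends


def is_linked(event, markers, window_s):
    """True if any marker starts between event onset and event end + window_s."""
    return any(event.start <= m.start <= event.end + window_s for m in markers)


def relabel(cessations, desaturations, arousals):
    """Label each partial breathing cessation as HYPO (linked) or IRE (unlinked)."""
    labels = []
    for ev in cessations:
        linked = (is_linked(ev, desaturations, DESAT_WINDOW_S)
                  or is_linked(ev, arousals, AROUSAL_WINDOW_S))
        labels.append("HYPO" if linked else "IRE")
    return labels


cessations = [Interval(100.0, 115.0), Interval(400.0, 412.0)]
desaturations = [Interval(125.0, 150.0)]
arousals = []
labels = relabel(cessations, desaturations, arousals)
```

In this toy example the first cessation is followed by a desaturation within the assumed window and becomes a HYPO, while the second has no linked desaturation or arousal and becomes an IRE.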
Pre-processing and detection pipeline
Signal preprocessing included bandpass filtering (0.1–4 Hz) and standardization of breathing channels, with all signals resampled to 8 Hz68,69. ABED processes 240-s windows with 50% overlap, using a convolutional neural network to detect and classify SDB events as seen in Fig. 1a. Following DOSED76, non-maximum suppression was used to retain only the most certain event detection77. The network architecture combines channel mixing layers, feature extraction through a modified ResNet5078 with batch normalization79 and ReLU80 activations, and a bi-directional LSTM for temporal processing, both with dropout81 for regularization. The specific architecture was chosen to capture both local morphological patterns and long-range dependencies in respiratory signals. The modified ResNet50 efficiently extracts local features, such as respiratory amplitude and frequency, while the bi-directional LSTM models delayed relationships across time, such as those between flow limitation, desaturation, and arousal, which are crucial for distinguishing event types like HYPOs and IREs. More details can be found in Appendix A3 sections Pre-Processing and Detection pipeline.
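As a rough illustration of the windowing and non-maximum suppression steps, the sketch below computes window start indices for 240-s windows with 50% overlap at 8 Hz and applies a greedy one-dimensional non-maximum suppression. The function names and the IoU threshold of 0.5 are assumptions for illustration and are not taken from the ABED code base.

```python
FS_HZ = 8                    # sampling rate after resampling
WINDOW_S = 240               # window length in seconds
STRIDE_S = WINDOW_S // 2     # 50% overlap


def window_starts(n_samples):
    """Start indices of all full windows that fit in a signal of n_samples."""
    win = WINDOW_S * FS_HZ   # 1920 samples per window
    step = STRIDE_S * FS_HZ  # 960 samples between window starts
    return list(range(0, n_samples - win + 1, step))


def iou_1d(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (start, end, score) tuples.

    Detections are visited from highest to lowest score; a detection is kept
    only if it does not overlap an already kept one by more than the
    (assumed) IoU threshold.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(iou_1d(det[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d[0])


candidates = [(10, 30, 0.9), (12, 32, 0.6), (100, 120, 0.7)]
final = non_max_suppression(candidates)
```

Here the second candidate overlaps the first with an IoU of 18/22 and is suppressed, so only the two non-overlapping detections remain.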
Training
ABED was trained on 5456 PSGs across four cohorts, with 546 PSGs for validation and 1099 for testing. Training used balanced sampling across event classes and optimized both event classification and localization. The model employs classification and localization loss functions rooted in the cross entropy and Huber82 loss, with Adam83 optimization and carefully tuned detection thresholds based on F1 score for each event type. More details are present in Appendix A3 sections Training, Network optimization, and Detection threshold.
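A minimal sketch of a combined detection loss of this kind is given below, assuming one predicted class distribution and one (center, duration) localization pair per candidate event. The weighting term `loc_weight`, the Huber threshold `delta`, and the box parameterization are illustrative assumptions; the actual formulation is described in Appendix A3.

```python
import math


def cross_entropy(probs, true_class):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(max(probs[true_class], 1e-12))


def huber(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def detection_loss(probs, true_class, pred_box, true_box, loc_weight=1.0):
    """Classification loss plus weighted localization loss for one candidate.

    pred_box and true_box are (center, duration) pairs in normalized units;
    this parameterization is an assumption made for the example.
    """
    cls = cross_entropy(probs, true_class)
    loc = sum(huber(p - t) for p, t in zip(pred_box, true_box))
    return cls + loc_weight * loc


loss = detection_loss([0.7, 0.1, 0.1, 0.1], 0, (0.50, 0.20), (0.55, 0.10))
```

The Huber term keeps the localization gradient bounded for badly misplaced candidates while remaining smooth near zero error, which is the usual motivation for preferring it over a pure squared error.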
Evaluation
Model performance was assessed through precision, recall, F1-score, and intersection-over-union (IoU) metrics. We evaluated ABED’s performance per event, per PSG, and for AHI-based diagnosis classification. Sleep apnea severity was categorized as: No SA (AHI < 5), Mild (5 ≤ AHI < 15), Moderate (15 ≤ AHI < 30), and Severe (AHI ≥ 30). For full details of the evaluation, see Appendix A3, section Evaluation.
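The severity thresholds above translate directly into code. The sketch below maps an AHI value to its severity group and computes how often predicted and annotated AHI fall in the same group or within one group of each other; the function names are illustrative.

```python
SEVERITY_LEVELS = ["No SA", "Mild", "Moderate", "Severe"]


def severity(ahi):
    """Map an apnea-hypopnea index (events per hour) to a severity group."""
    if ahi < 5:
        return "No SA"
    if ahi < 15:
        return "Mild"
    if ahi < 30:
        return "Moderate"
    return "Severe"


def severity_agreement(true_ahi, pred_ahi):
    """Return (fraction in the same group, fraction within one group)."""
    gaps = [
        abs(SEVERITY_LEVELS.index(severity(t)) - SEVERITY_LEVELS.index(severity(p)))
        for t, p in zip(true_ahi, pred_ahi)
    ]
    n = len(gaps)
    exact = sum(g == 0 for g in gaps) / n
    within_one = sum(g <= 1 for g in gaps) / n
    return exact, within_one


# Toy values, not study data.
exact, within_one = severity_agreement([3, 12, 40, 20], [4, 16, 10, 22])
```

For these toy values, two of four recordings fall in the same group and three of four are within one group.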
Endotype-apnotyping analysis
We performed apnotyping using ABED, an analysis where we examine the probability of each event type within each predicted event. To evaluate the capabilities of this approach, we applied apnotyping to a stepwise linear regression model. We compared how ABED’s apnotyping performed relative to baseline demographics and manual annotations when used to predict sleep apnea endotypes such as loop gain (Lgn), arousal threshold (ArT), and pharyngeal muscle compensation (Vcomp).
Each endotypic trait is defined by spontaneous fluctuations in ventilation (mean-normalized nasal pressure) and ventilatory drive (intended ventilation estimated using a chemoreflex model via least-squares regression). For example, ventilatory drive is calculated by inputting the ventilation signal—breath-to-breath tidal volume multiplied by respiratory rate, mean-normalized—into a chemoreflex feedback control model characterized by parameters such as gain, response time, and delay. This model outputs a ventilatory drive signal, and parameters are adjusted to best fit the drive signal to the ventilation signal during breaths when the airway is open (i.e., ventilation between scored obstructive events) using least squares regression.
Similarly, upper airway dilator compensation is defined as the difference between ventilation at peak muscle activation (at the arousal threshold) and ventilation at normal (eupneic) ventilatory drive. These measures have been found to correlate with more traditional methods that use positive airway pressure (PAP) drops. Notably, these metrics have been derived for the MESA cohort48, which is one of the cohorts included in our study, and were available for 249 recordings in the test set.
In our endotype-apnotyping analysis, we employed a stepwise linear regression approach. This method iteratively adds the variable with the lowest p-value, provided it is below 0.01, and removes the variable with the highest p-value, provided it exceeds 0.05, continuing this process until no further variables can be added or removed. The effectiveness of the stepwise linear regression models was assessed using the adjusted R-squared (R²) coefficient. This measure increases only if an added variable improves the model’s explanatory power beyond what would be expected by chance.
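For reference, the adjusted R² penalizes the ordinary R² by the number of predictors. The sketch below uses the standard formula, 1 − (1 − R²)(n − 1)/(n − p − 1), with n samples and p predictors; the function names and the numbers in the example are illustrative and are not results from this study.

```python
def r_squared(y, y_hat):
    """Coefficient of determination for observations y and fitted values y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Toy numbers: the same raw R-squared with one extra predictor
# yields a lower adjusted value.
adj_two = adjusted_r_squared(0.5, 11, 2)
adj_three = adjusted_r_squared(0.5, 11, 3)
```

An added variable therefore has to raise the raw R² by enough to offset the penalty before the adjusted value improves.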
We undertook feature engineering to construct variables used within our model, creating measures based on the average probability of each event type for each subject. For the loop gain test, detected events were chosen based on the highest probability rather than the heuristic described earlier. Specifically, we engineered a feature for each event that encapsulates the probabilities of each event type occurring within each event. We define these features, called apnotypes, as follows:
This represents the average probability of event type i given event type j.
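One possible reading of this definition is sketched below: for each subject, events are grouped by their most probable type j, and the predicted probability of every type i is averaged within each group. The data layout, the function name `apnotypes`, and the use of the highest probability to assign j are assumptions made for illustration.

```python
from collections import defaultdict

EVENT_TYPES = ("OA", "CA", "HYPO", "IRE")


def apnotypes(events):
    """Average probability of type i over events whose most probable type is j.

    events: list of dicts mapping each event type to its predicted
    probability for one detected event of a single subject.
    Returns a dict keyed by (i, j).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for probs in events:
        j = max(EVENT_TYPES, key=lambda t: probs[t])  # assumed majority class
        counts[j] += 1
        for i in EVENT_TYPES:
            sums[(i, j)] += probs[i]
    return {(i, j): total / counts[j] for (i, j), total in sums.items()}


# Toy probabilities, not study data.
subject_events = [
    {"OA": 0.6, "CA": 0.1, "HYPO": 0.2, "IRE": 0.1},
    {"OA": 0.5, "CA": 0.0, "HYPO": 0.4, "IRE": 0.1},
    {"OA": 0.1, "CA": 0.1, "HYPO": 0.5, "IRE": 0.3},
]
features = apnotypes(subject_events)
```

In this toy example, `features[("HYPO", "OA")]` is the mean HYPO probability over the two events classified as OA, a quantity of the kind discussed above in relation to pharyngeal muscle compensation.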
Finally, we present the linear correlations of our apnotypes with each of the endotyping measures—loop gain (Lgn), arousal threshold (ArT), and pharyngeal muscle compensation (Vcomp)—while controlling for the effects of age, body mass index (BMI), and sex.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The polysomnography data from the DREEM dataset are publicly available via Zenodo (https://zenodo.org/records/15900394). The remaining polysomnography data in this study are available under restricted access due to ethical and legal constraints. Data from the Multi-Ethnic Study of Atherosclerosis (MESA), the Osteoporotic Fractures in Men Study (MrOS), the Cleveland Family Study (CFS), Wisconsin Sleep Cohort (WSC), and ALLIANCE can be obtained through the National Sleep Research Resource (NSRR; https://sleepdata.org) following data use agreement approval. The data will be made available within a month through the NSRR and is then available for the expected duration of the study. Source data are provided with this paper.
Code availability
All custom source code, along with software packages and versions used for this project, is available at https://github.com/RuudeResearch/ABED.
References
American Academy of Sleep Medicine. International classification of sleep disorders: Diagnostic and coding manual, third edition. American Academy of Sleep Medicine https://learn.aasm.org/Public/Catalog/Details.aspx?id=%2FgqQVDMQIT%2FEDy86PWgqgQ%3D%3D&returnurl=%2FUsers%2FUserOnlineCourse.aspx%3FLearningActivityID%3D%252fgqQVDMQIT%252fEDy86PWgqgQ%253d%253d (2014).
Lévy, P. et al. Obstructive sleep apnoea syndrome. Nat. Rev. Dis. Prim. 1, 15015 (2015).
Berry, R. B. et al. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications, Version 2.2. www.aasmnet.org (2015).
Stanchina, M. et al. Clinical use of loop gain measures to determine continuous positive airway pressure efficacy in patients with complex sleep apnea. A pilot study. Ann. Am. Thorac. Soc. 12, 1351–1357 (2015).
Hirotsu, C. et al. Effect of three hypopnea scoring criteria on OSA prevalence and associated comorbidities in the general population. J. Clin. Sleep. Med 15, 183–194 (2019).
Berry, R. B. et al. A transition to the American Academy of Sleep Medicine-recommended hypopnea definition in adults: initiatives of the Hypopnea Scoring Rule Task Force. J. Clin. Sleep. Med 18, 1419–1425 (2022).
Benjafield, A. V. et al. Estimation of the global prevalence and burden of obstructive sleep apnoea: a literature-based analysis. Lancet Respir. Med. 7, 687 (2019).
Kuna, S. T. et al. Agreement in computer-assisted manual scoring of polysomnograms across sleep centers. Sleep 36, 583–589 (2013).
Blaha, M. J. & DeFilippis, A. P. Multi-Ethnic Study of Atherosclerosis (MESA): JACC focus seminar 5/8. J. Am. Coll. Cardiol. 77, 3195–3216 (2021).
Khan, A. et al. Obstructive sleep apnea during rapid eye movement sleep, daytime sleepiness, and quality of life in older men with osteoporotic fractures in men (MrOS) sleep study. J. Clin. Sleep Med. 9, 191–198 (2013).
Sankari, A. et al. Characteristics and consequences of non-apneic respiratory events during sleep. Sleep 40, zsw024 (2017).
Malhotra, A. et al. Metrics of sleep apnea severity: beyond the apnea-hypopnea index. Sleep 44, zsab030 (2021).
Pinilla, L. et al. Hypoxic burden to guide CPAP treatment allocation in patients with obstructive sleep apnoea: a post hoc study of the ISAACC trial. Eur. Respir. J. 62, 2300828 (2023).
Labarca, G. et al. Sleep apnea physiological burdens and cardiovascular morbidity and mortality. Am. J. Respir. Crit. Care Med 208, 802–813 (2023).
Esmaeili, N. et al. Hypoxic burden based on automatically identified desaturations is associated with adverse health outcomes. Ann. Am. Thorac. Soc. 20, 1633–1641 (2023).
Koch, H. et al. Breathing disturbances without hypoxia are associated with objective sleepiness in sleep apnea. Sleep 40, zsx152 (2017).
Mann, D. L. et al. Flow limitation is associated with excessive daytime sleepiness in individuals without moderate or severe obstructive sleep apnea. Ann. Am. Thorac. Soc. 21, 1186–1193 (2024).
Azarbarzin, A. et al. Relevance of cortical arousals for risk stratification in sleep apnea: a 3-cohort analysis. J. Clin. Sleep. Med 19, 1475–1484 (2023).
Kjar, M. R. et al. Polysomnographic plethysmography excursions are reduced in obese elderly men. Annu Int Conf. IEEE Eng. Med Biol. Soc. 2021, 2396–2399 (2021).
Malhotra, A., Mesarwi, O., Pepin, J. L. & Owens, R. L. Endotypes and phenotypes in obstructive sleep apnea. Curr. Opin. Pulm. Med 26, 609 (2020).
Eckert, D. J., White, D. P., Jordan, A. S., Malhotra, A. & Wellman, A. Defining phenotypic causes of obstructive sleep apnea. Identification of novel therapeutic targets. Am. J. Respir. Crit. Care Med 188, 996–1004 (2013).
Azarbarzin, A. Estimation of pharyngeal collapsibility during sleep by peak inspiratory airflow. Sleep 40, zsw005 (2017).
Terrill, P. I. et al. Quantifying the ventilatory control contribution to sleep apnoea using polysomnography. Eur. Respir. J. 45, 408–418 (2015).
Anwar, A. I. et al. Novel physiologic predictors of positive airway pressure effectiveness (NICEPAP) study: rationale, design and methods. Sleep. Breath. 28, 2005–2015 (2024).
Dutta, R. et al. A novel model to estimate key obstructive sleep apnea endotypes from standard polysomnography and clinical data and their contribution to obstructive sleep apnea severity. Ann. Am. Thorac. Soc. 18, 656–667 (2021).
Alex, R. M. et al. Within-night repeatability and long-term consistency of sleep apnea endotypes: the Multi-Ethnic Study of Atherosclerosis and Osteoporotic Fractures in Men Study. Sleep 45, zsac129 (2022).
Chu, Y. & Zinchuk, A. The present and future of the clinical use of physiological traits for the treatment of patients with OSA: a narrative review. J. Clin. Med. 13, 1636 (2024).
Aishah, A. et al. Stepwise add-on and endotype-informed targeted combination therapy to treat obstructive sleep apnea: a proof-of-concept study. Ann. Am. Thorac. Soc. 20, 1316–1325 (2023).
Rosenberg, R. S. & Van Hout, S. The American Academy of Sleep Medicine inter-scorer reliability program: respiratory events. J. Clin. Sleep. Med. 10, 447–454 (2014).
Urtnasan, E., Park, J. U. & Lee, K. J. Automatic detection of sleep-disordered breathing events using recurrent neural networks from an electrocardiogram signal. Neural Comput. Appl. 32, 4733–4742 (2020).
Tripathy, R. K., Gajbhiye, P. & Acharya, U. R. Automated sleep apnea detection from cardio-pulmonary signal using bivariate fast and adaptive EMD coupled with cross-time-frequency analysis. Comput. Biol. Med. 120, 103769 (2020).
Van Steenkiste, T., Groenendaal, W., Deschrijver, D. I. & Dhaene, T. Automated sleep apnea detection in raw respiratory signals using long short-term memory neural networks. IEEE J. Biomed. Health Inf. 23, 2354–2364 (2019).
Zhao, X. et al. Classification of sleep apnea based on EEG sub-band signal characteristics. Sci. Rep. 11, 1–11 (2021).
Mostafa, S. S., Mendonca, F., Ravelo-Garcia, A. G., Julia-Serda, G. & Morgado-Dias, F. Multi-objective hyperparameter optimization of a convolutional neural network for obstructive sleep apnea detection. IEEE Access 8, 129586–129599 (2020).
Yu, H. et al. A sleep apnea-hypopnea syndrome automatic detection and subtype classification method based on LSTM-CNN. Biomed. Signal Process Control 71, 103240 (2022).
Biswal, S. et al. Expert-level sleep scoring with deep neural networks. J. Am. Med. Inform. Assoc. 25, 1643–1650 (2018).
Haidar, R., McCloskey, S., Koprinska, I. & Jeffries, B. Convolutional neural networks on multiple respiratory channels to detect hypopnea and obstructive apnea events. Proceedings of the International Joint Conference on Neural Networks. 2018-July, (2018).
Zahid, A. N., Jennum, P., Mignot, E. & Sorensen, H. B. D. MSED: a multi-modal sleep event detection model for clinical sleep analysis. IEEE Trans. Biomed. Eng. 70, 2508–2518 (2023).
Yeo, M. et al. Respiratory event detection during sleep using electrocardiogram and respiratory-related signals: using polysomnogram and patch-type wearable device data. IEEE J. Biomed. Health Inf. 26, 550–560 (2022).
Lakhan, P., Ditthapron, A., Banluesombatkul, N. & Wilaiprasitporn, T. Deep neural networks with weighted averaged overnight airflow features for sleep apnea-hypopnea severity classification. IEEE Region 10 Annual International Conference, Proceedings/TENCON 2018-October, 441–445 (2018).
Nassi, T. E. et al. Automated scoring of respiratory events in sleep with a single effort belt and deep neural networks. IEEE Trans. Biomed. Eng. 69, 2094 (2022).
Olesen, A. N., Jorgen Jennum, P., Mignot, E. & Sorensen, H. B. D. Automatic sleep stage classification with deep residual networks in a mixed-cohort setting. Sleep 44, zsaa161 (2021).
Cohen, O. et al. Achieving better understanding of obstructive sleep apnea treatment effects on cardiovascular disease outcomes through machine learning approaches: a narrative review. J. Clin. Med. 13, 1415 (2024).
Dean, D. A. et al. Scaling up scientific discovery in sleep medicine: the national sleep research resource. Sleep 39, 1151–1164 (2016).
Brink-Kjaer, A. et al. Automatic detection of cortical arousals in sleep and their contribution to daytime sleepiness. Clin. Neurophysiol. 131, 1187–1203 (2020).
Moon, K. R. et al. PHATE: A dimensionality reduction method for visualizing trajectory structures in high-dimensional biological data. bioRxiv https://doi.org/10.1101/120378 (2017).
Guillot, A., Sauvet, F., During, E. H. & Thorey, V. Dreem open datasets: multi-scored sleep datasets to compare human and automated sleep staging. IEEE Trans. Neural Syst. Rehab. Eng. 28, 1955–1965 (2020).
Sands, S. A. et al. Pathophysiology underlying demographic and obesity determinants of sleep apnea severity. Ann. Am. Thorac. Soc. 20, 440–449 (2023).
Faria, A., Allen, A. H., Fox, N., Ayas, N. & Laher, I. The public health burden of obstructive sleep apnea. Sleep. Sci. 14, 257 (2021).
Levy, J., Álvarez, D., Del Campo, F. & Behar, J. A. Deep learning for obstructive sleep apnea diagnosis based on single channel oximetry. Nat. Commun. 14, 1–12 (2023).
Eckert, D., Dutta, R. & Levendowski, D. 0602 Application of a physiology-based OSA endotype model to predict oral appliance outcomes using home sleep study data. Sleep 47, A257 (2024).
Dutta, R., Tong, B. K. & Eckert, D. J. Development of a physiological-based model that uses standard polysomnography and clinical data to predict oral appliance treatment outcomes in obstructive sleep apnea. J. Clin. Sleep. Med. 18, 861–870 (2022).
Hang, L.-W. et al. Sex-specific age-related worsening of pathological endotypic traits in patients with obstructive sleep apnea. Sleep https://doi.org/10.1093/SLEEP/ZSAE185 (2024).
Sands, S. A. Identifying obstructive sleep apnoea patients responsive to supplemental oxygen therapy. Eur. Respir. J. 52, 1800674 (2018).
Joosten, S. A. Loop gain predicts the response to upper airway surgery in patients with obstructive sleep apnea. Sleep https://doi.org/10.1093/sleep/zsx094 (2017).
Bamagoos, A. A. et al. Polysomnographic endotyping to select patients with obstructive sleep apnea for oral appliances. Ann. Am. Thorac. Soc. 16, 1422–1431 (2019).
Op de Beeck, S. et al. Endotypic mechanisms of successful hypoglossal nerve stimulation for obstructive sleep apnea. Am. J. Respir. Crit. Care Med. 203, 746–755 (2021).
Won, C. H. J. et al. Sex differences in obstructive sleep apnea phenotypes, the multi-ethnic study of atherosclerosis. Sleep 43, zsz274 (2020).
Maury, G. et al. Mandible behaviour interpretation during wakefulness, sleep and sleep-disordered breathing. J. Sleep Res. 23, 709–716 (2014).
Pépin, J. L., Guillot, M., Tamisier, R. & Lévy, P. The upper airway resistance syndrome. Respiration 83, 559–566 (2012).
Azarbarzin, A., Labarca, G., Kwon, Y. & Wellman, A. Physiologic consequences of upper airway obstruction in sleep apnea. Chest 166, 1209–1217 (2024).
Thapa, R. et al. SleepFM: multi-modal representation learning for sleep across brain activity, ECG and respiratory signals. 48019–48037 (2024).
Morris, J. L. et al. Symptom subtype progression in obstructive sleep apnea over 5 years. J. Clin. Sleep Med. 20, 1773–1783 (2024).
Azarbarzin, A. et al. The hypoxic burden of sleep apnoea predicts cardiovascular disease-related mortality: the Osteoporotic Fractures in Men Study and the Sleep Heart Health Study. Eur. Heart J. 40, 1149–1157 (2019).
Nasiri, S. et al. CAISR: achieving human-level performance in automated sleep analysis across all clinical sleep metrics. Sleep 48, zsaf134 (2025).
Stephansen, J. B. et al. Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy. Nat. Commun. 9, 5229 (2018).
Carvelli, L. et al. Design of a deep learning model for automatic scoring of periodic and non-periodic leg movements during sleep validated against multiple human experts. Sleep Med. 69, 109–119 (2020).
Aittokallio, T., Nevalainen, O., Pursiheimo, U., Saaresranta, T. & Polo, O. Classification of nasal inspiratory flow shapes by attributed finite automata. Comput. Biomed. Res. 32, 34–55 (1999).
Aittokallio, T., Malminen, J. S., Pahikkala, T., Polo, O. & Nevalainen, O. S. Inspiratory flow shape clustering: an automated method to monitor upper airway performance during sleep. Comput. Methods Prog. Biomed. 85, 8–18 (2007).
Cohen, O. et al. The great controversy of obstructive sleep apnea treatment for cardiovascular risk benefit: advancing the science through expert consensus. An official American Thoracic Society workshop report. Ann. Am. Thorac. Soc. https://doi.org/10.1513/ANNALSATS.202409-981ST (2024).
Pépin, J. L. et al. Multidimensional phenotyping to distinguish among central (CSA), obstructive (OSA) and co-existing central and obstructive sleep apnea (CSA-OSA) phenotypes in real-world data. Sleep Med. 124, 426–433 (2024).
Chen, X. et al. Racial/Ethnic differences in sleep disturbances: the Multi-Ethnic Study of Atherosclerosis (MESA). Sleep 38, 877–888 (2015).
Blackwell, T. et al. Associations between sleep architecture and sleep-disordered breathing and cognition in older community-dwelling men: the osteoporotic fractures in men sleep study. J. Am. Geriatr. Soc. 59, 2217–2225 (2011).
Young, T. et al. Burden of sleep apnea: rationale, design, and major findings of the Wisconsin Sleep Cohort Study. WMJ 108, 246 (2009).
Redline, S. et al. The familial aggregation of obstructive sleep apnea. Am. J. Respir. Crit. Care Med. 151, 682–687 (1995).
Chambon, S., Thorey, V., Arnal, P. J., Mignot, E. & Gramfort, A. DOSED: A deep learning approach to detect multiple sleep micro-events in EEG signal. J. Neurosci. Methods 321, 64–78 (2019).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 779–788 (2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML) 448–456 (PMLR, 2015).
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML). 807–814 (2010).
Srivastava, N., Hinton, G., Krizhevsky, A. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Huber, P. J. Robust estimation of a location parameter. Ann. Math. Stat. 35, 73–101 (1964).
Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. https://arxiv.org/abs/1412.6980v9 (2014).
Acknowledgments
This research was funded by Stanford University, Mignot Lab, Danish Center for Sleep Medicine, and the Technical University of Denmark. M.R.K. was awarded grants from the Stibo, Augustinus, Knud Højgaard, Otto Mønsted, William Demant, Director Einar Hansen’s and wife Mrs. Vera Hansen’s, DDSA, Viet-Jacobsen, Vera and Carl Johan Michaelsen, Marie and M.B. Richters, Idella, and Rienholdt W. Jorck and Wifes foundations. NIH NHLBI (R01HL146697) funded S.S. The Osteoporotic Fractures in Men (MrOS) Study is supported by National Institutes of Health funding. The following institutes provide support: the National Institute on Aging (NIA), the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), the National Center for Advancing Translational Sciences (NCATS), and NIH Roadmap for Medical Research under the following grant numbers: U01 AG027810, U01 AG042124, U01 AG042139, U01 AG042140, U01 AG042143, U01 AG042145, U01 AG042168, U01 AR066160, and UL1 TR000128. The National Heart, Lung, and Blood Institute (NHLBI) provides funding for the MrOS Sleep ancillary study “Outcomes of Sleep Disorders in Older Men” under the following grant numbers: R01 HL071194, R01 HL070848, R01 HL070847, R01 HL070842, R01 HL070841, R01 HL070837, R01 HL070838, and R01 HL070839. The Multi-Ethnic Study of Atherosclerosis (MESA) Sleep Ancillary study was funded by NIH-NHLBI Association of Sleep Disorders with Cardiovascular Health Across Ethnic Groups (RO1 HL098433). MESA is supported by NHLBI-funded contracts HHSN268201500003I, N01-HC-95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC-95164, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168 and N01-HC-95169 from the National Heart, Lung, and Blood Institute, and by cooperative agreements UL1-TR-000040, UL1-TR-001079, and UL1-TR-001420 funded by NCATS. The National Sleep Research Resource was supported by the National Heart, Lung, and Blood Institute (R24 HL114473, 75N92019R002).
The Cleveland Family Study (CFS) was supported by grants from the National Institutes of Health (HL46380, M01 RR00080-39, T32-HL07567, RO1-46380). We would like to acknowledge the contribution of Helge B. D. Sorensen, who passed away before the work was submitted. We would also like to acknowledge Jan Ruud Hansen for his assistance with the linear regression analysis of ABED AHI detection performance.
Author information
Contributions
M.R.K. designed the study, conducted the analyses, wrote the source code for the project from preprocessing through evaluation and analysis, optimized the deep learning architecture and objectives for sleep apnea detection, and implemented the stepwise linear regression and apnotyping. U.H. assisted in writing, optimized the deep learning architecture and learning objectives for sleep apnea detection, and supervised the analyses. A.B. provided wake and arousal probabilities for all cohorts, assisted in optimizing the deep learning architecture, objectives, and apnotyping, and supervised the analyses. M.O. assisted in creating figures based on the analyses. S.S. and S.R. contributed datasets and provided the loop gain, arousal threshold, and pharyngeal muscle compensation measures. K.L.S. contributed data. P.J. participated in the design of the study. E.M. participated in the design of the study and supervised the analyses and the writing. O.S. supervised the analyses. O.C. reviewed signal-processing-related issues. All authors contributed to manuscript writing and helped revise the manuscript.
Ethics declarations
Competing interests
S.S. received grant support from Apnimed, Prosomnus, and Dynaflex and has served as a consultant for Apnimed, Nox Medical, Inspire Medical Systems, Eli Lilly, Respicardia, LinguaFlex, Forepont, and Achaemenid. He receives royalties for intellectual property pertaining to combination pharmacotherapy for sleep apnea via his Institution. He is a co-inventor of intellectual property pertaining to wearable sleep apnea phenotyping, also via his Institution. His industry interactions are actively managed by his Institution. S.R. received consulting fees from Eli Lilly (not related to the project) and funding from NIH. E.M. received a grant from the ResMed Foundation to study questionnaire predictors of SDB, unrelated to this work. U.H. is an employee of BioSerenity. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kjaer, M.R., Hanif, U., Brink-Kjaer, A. et al. Expert-level probabilistic breathing event detector informs phenotyping of sleep apnea. Nat Commun 17, 2548 (2026). https://doi.org/10.1038/s41467-026-69163-z
DOI: https://doi.org/10.1038/s41467-026-69163-z