Introduction

Obstructive sleep apnea (OSA) is the most common sleep-related respiratory disorder, characterized by airway obstruction, decreased blood oxygen, and interrupted sleep cycles1. Affecting nearly one billion people globally, OSA is linked to serious health risks, including cardiovascular diseases, cognitive impairments, and traffic accidents2. In 2015, the cost of diagnosing and treating OSA in the USA was approximately US$12.4 billion3. Traditional diagnostic methods for OSA rely on multiple intrusive sensors, such as respiratory inductive plethysmography (RIP), electromyography (EMG), nasal/oral airflow sensors, and pulse oximeters, which can be uncomfortable and disruptive to patients4. This complexity contributes to underdiagnosis in the general population5,6. Simplifying the diagnostic process while maintaining accuracy remains a challenge, especially in patients with comorbidities.

Wrist actigraphy, an energy-efficient and non-invasive method for monitoring respiratory rate (RR), has shown promise for OSA detection7. For healthy and OSA individuals with no comorbidities, wrist actigraphy demonstrates comparable accuracy to traditional chest-worn sensors. Hernandez et al. reported a mean absolute error (MAE) of 0.97 ± 2.20 breaths per minute (bpm) in RR estimation8. Xiao Sun et al. integrated a Kalman filter, achieving MAEs of 0.72 and 1.08 bpm for healthy and simple OSA patients, respectively9. Leube et al. demonstrated the potential of reconstructing respiratory waveforms using electrocardiogram and accelerometer data10. Chen et al. applied a random forest algorithm to detect OSA without comorbidities, achieving an F1 score of 0.96 and an Apnea-Hypopnea Index (AHI) intraclass correlation of 0.8811. Zhao et al. suggested nonlinear Poincaré analysis to enhance respiratory pattern recognition12.

Despite these advancements, real-world applications of respiratory monitoring face additional challenges. Comorbidities such as snoring and restless leg syndrome (RLS) can significantly disrupt measurements, reducing their accuracy. Schipper et al. assessed respiratory effort using a chest-worn accelerometer, observing a fourfold increase in mean squared errors during leg movements and a tenfold increase during wakefulness compared to non-rapid-eye-movement (NREM) sleep13. D. Liaqat et al. used Convolutional Neural Networks (CNN) to extract features from accelerometer and gyroscope data. By discarding “unsuitable” data, they achieved an MAE of 2.05 bpm for ambulatory “real-life” RR estimation, with varying accuracies for healthy individuals and chronic obstructive pulmonary disease (COPD) patients14. Papini et al. evaluated OSA detection using evenly spaced 30-second windows, achieving an AHI correlation of 0.72 and an error of 0.56 ± 14.74 events/h. However, this short window may miss prolonged OSA events, possibly reducing positive predictive value for severe cases15. Although these studies were conducted in real-world settings, the influence of comorbidities on OSA detection was not thoroughly analyzed, and there is a need for further improvement in performance.

To address these challenges, recent research has increasingly turned to advanced machine learning architectures, such as Fully Convolutional Networks (FCN), CNN, Long Short-Term Memory networks (LSTM), and U-Nets14,16,17,18,19. The integration of transformer architectures, typically seen in token-based Large Language Models (LLMs)20, signifies a new exploration in this domain. Nonetheless, these methodologies face limitations: they either adhere to a predetermined window size without refining the token definition, thereby risking misclassification due to incomplete capture of events, or they employ pixel-to-pixel sequence classification approaches that demand meticulous labeling and may prove unfeasible for extensive datasets with unavoidable temporal misalignments.

In this work, we made several key contributions to address these problems:

  1. Infer oxygen levels from motion patterns: By leveraging physiological principles, we demonstrated that a single energy-efficient motion sensor can reliably detect apnea events and oxygen desaturation episodes, even in the presence of RLS and snoring21,22,23.

  2. Apex-centric flexible windowing: We introduce an apex-centric flexible windowing approach to enable comprehensive and precise event detection.

  3. Multi-Head Causal Attention (MHCA): We leverage MHCA to focus on physiologically plausible event sequences, enhancing the accurate detection and classification of diverse OSA-related desaturation patterns24,25.

By distinguishing these events, our method enables more informative, accurate and low-cost actigraphy-based monitoring, enhancing its clinical utility for large-scale retrospective studies.

Methods

This section details our experimental protocol, data processing pipeline, and analytical framework. First, we describe the study design and data acquisition procedures. Then, we present our standardized preprocessing workflow, comprising signal filtering, data alignment, and a novel flexible apex-centered windowing approach. Lastly, we introduce our two-stage machine learning architecture, emphasizing its physiologically-grounded design principles alongside the comprehensive validation process.

Data collection

Our study involved 58 participants who underwent full-night digital polysomnography (PSG, Grael, Compumedics, Australia) and concurrent bilateral wrist actigraphy (Geneactiv, Activinsights Ltd, UK; Axivity, Axivity Ltd, UK) at the Department of Sleep Medicine, Guangdong Mental Health Center, Guangdong Province People’s Hospital. The participants included 30 males, with ages ranging from 12 to 73 years. The study was approved by the Ethics Committee (IRB KY-Q-2022-171-01) and followed the Declaration of Helsinki. Informed consent was obtained from all subjects. The two actigraphy brands were selected due to their widespread use in retrospective studies and their demonstrated equivalent performance26,27.

Figure 1 illustrates the experimental setup. Actigraphy data were collected at a 100 Hz sampling rate. A registered technologist annotated PSG data according to the American Academy of Sleep Medicine (AASM) criteria24. OSA is characterized by episodes of upper airway obstruction during sleep, including apneas (complete cessation of airflow for ≥ 10 s) and hypopneas (≥ 30% reduction in airflow for ≥ 10 s with ≥ 3–4% oxygen desaturation or arousal). Limb movements were measured by EMG and were not scored within 0.5 s of a respiratory event. Electroencephalogram (EEG) was used to annotate sleep stages, although this is not shown in the illustration.

Fig. 1

(A) Illustration of experimental setup. (B) Detailed data processing flowchart for OSA detection. PaCO2: arterial partial pressure of carbon dioxide; BiLSTM: Bi-directional Long and Short-Term Memory.

Data processing

Preprocessing

Actigraphy data contain rich spectral information. Gravitational and seismic vibrations were separated using different filters. A Butterworth bandpass filter (0.05–0.6 Hz) was used to obtain respiratory information. A low-pass filter with a 0.01 Hz cutoff was used to obtain posture-related information. Principal Component Analysis (PCA) was then applied to the three-axis signal, and the first principal component was used.

Complete ensemble empirical mode decomposition (CEEMD)

To extract physiologically meaningful motions, the CEEMD method was used to decompose the data into intrinsic mode functions (IMFs)28,29. Unlike traditional EMD, CEEMD reduces mode mixing, enhances robustness, and improves signal separation by adding white noise multiple times and averaging the results, as shown in the Appendix. Using a noise standard deviation of 0.1 and maximum 100 iterations, we decomposed each epoch into ten IMFs. The fourth IMF was selected for its strong respiratory pattern correlation.

The Fourier synchrosqueezed transform (FSST) and RR estimation

FSST is a time-frequency analysis technique that combines ideas from the Continuous Wavelet Transform and the Fourier Transform, as shown in the Appendix30. It provides a sharper localization of signal components in both time and frequency domains simultaneously. RR was estimated as the highest amplitude peak in each epoch’s spectrum. In our study, RR was used as an auxiliary feature for apnea detection. Therefore, we did not employ more advanced techniques to refine RR estimation.
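The peak-picking step alone can be sketched as follows. This is a minimal illustration, not the FSST: it projects one epoch onto candidate frequencies with a plain Fourier sum and returns the strongest one. The function name `estimate_rr_bpm`, the search band (borrowed from the respiratory bandpass range), the frequency step, and the 10 Hz synthetic example are all assumptions.

```python
import math

def estimate_rr_bpm(signal, fs, f_lo=0.05, f_hi=0.6, f_step=0.005):
    """Return the respiratory rate (breaths/min) at the dominant spectral peak.

    A plain Fourier projection at candidate frequencies stands in for the
    FSST here; only the "highest amplitude peak" selection is illustrated.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    best_f, best_mag = f_lo, -1.0
    steps = int(round((f_hi - f_lo) / f_step))
    for i in range(steps + 1):
        f = f_lo + i * f_step
        re = sum(x * math.cos(2 * math.pi * f * k / fs) for k, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * f * k / fs) for k, x in enumerate(centered))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_f, best_mag = f, mag
    return best_f * 60.0

# Synthetic 100-second epoch at 10 Hz (hypothetical, for brevity),
# with breathing at 0.25 Hz, i.e. 15 breaths per minute.
fs = 10.0
epoch = [math.sin(2 * math.pi * 0.25 * k / fs) for k in range(1000)]
rr = estimate_rr_bpm(epoch, fs)
```

Because RR serves only as an auxiliary feature, a coarse frequency grid of this kind may be adequate for illustration; a synchrosqueezed spectrum would give sharper ridges.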

Signal alignment

Due to the offline nature of PSG clocks, PSG and actigraphy data can become misaligned. To correct this, we implemented a two-step realignment:

Posture Alignment: We aligned the sleep posture changes obtained from actigraphy data with the PSG annotations.

Respiratory Synchronization: We then calculated the lag for maximum cross-correlation between the thoracic RIP signal and the preprocessed respiratory band signal from actigraphy, as described in the “Preprocessing” section.

This process ensured accurate synchronization between the PSG and actigraphy data. However, 16 out of 58 subjects had noisy posture-change peaks and even noisier respiratory signals due to restless motion disturbances, leading to their exclusion from the training set. Since “per-subject” AHI and ODI calculations do not require precise alignment, these 16 subjects were used as testing data, providing a rigorous test of the algorithm’s performance on noisy, unseen data.
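The respiratory synchronization step can be illustrated with a brute-force lag search. The sketch below assumes both signals are already resampled to a common rate; the function name `best_lag`, the search range, and the toy signals are hypothetical.

```python
def best_lag(reference, candidate, max_lag):
    """Return the shift (in samples) of `candidate` relative to `reference`
    that maximizes their cross-correlation, searched over [-max_lag, max_lag].
    A positive result means `candidate` is delayed with respect to `reference`."""
    def centered(x):
        m = sum(x) / len(x)
        return [v - m for v in x]

    ref, cand = centered(reference), centered(candidate)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(cand):  # ignore samples shifted outside the record
                score += r * cand[j]
        if score > best_score:
            best, best_score = lag, score
    return best

# Toy example: the "actigraphy" trace is the "RIP" trace delayed by 3 samples.
rip = [0, 0, 1, 3, 1, 0, 0, 2, 5, 2, 0, 0, 0, 1, 0, 0]
acc = [0, 0, 0] + rip[:-3]
lag = best_lag(rip, acc, max_lag=6)
```

Shifting the actigraphy trace back by the returned lag aligns it with the PSG reference. For full-night recordings an FFT-based correlation would be much faster than this direct loop.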

Apex-centric tokenization

Tokenization, a concept borrowed from natural language processing (NLP), is a critical step in machine learning that significantly influences model outcomes. Respiratory events like OSA and hypopnea last over 20 s and are often followed by oxygen desaturation, while RLS and snoring are much shorter31. To capture these events, we used 100-second epochs centered on the potential apex of apnea episodes, labeling each epoch with the dominant event to minimize secondary interferences.

The signal was segmented using a sliding 3-minute window with a 1-minute step. Potential events (PEs) were identified as points exceeding two standard deviations from the median of each window. Only the apex of successive PEs was used as the center of respiratory events, as defined by Eqs. (1)–(2). PEs falling within 0.05 s of one another were included in the same cluster. Here ACC3min refers to the 3-minute window used to search for PEs, and PEcluster refers to the group of successive PE points, as shown in Fig. 2.

$$PE = \left| {ACC_{{3min}} - median\left( {ACC_{{3min}} } \right)} \right| > 2 \times std\left( {ACC_{{3min}} } \right)$$
(1)
$$Apex = \max\left( {\left| {PE_{cluster} } \right|} \right)$$
(2)
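Eqs. (1) and (2) can be sketched for a single search window as shown below. The function name `find_apexes` is hypothetical. Two readings are assumptions: the population standard deviation is used for "std", and the magnitude in Eq. (2) is taken as the absolute deviation from the window median.

```python
import statistics

def find_apexes(window, fs, k=2.0, max_gap_s=0.05):
    """Return sample indices of cluster apexes inside one search window.

    A sample is a potential event (PE) when it deviates from the window
    median by more than k standard deviations (Eq. 1). PEs separated by at
    most max_gap_s are grouped, and the sample with the largest absolute
    deviation in each group is taken as its apex (Eq. 2).
    """
    med = statistics.median(window)
    thr = k * statistics.pstdev(window)  # assumption: population std
    pe_idx = [i for i, x in enumerate(window) if abs(x - med) > thr]

    max_gap = max_gap_s * fs  # allowed gap expressed in samples
    clusters, current = [], []
    for i in pe_idx:
        if current and i - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(i)
    if current:
        clusters.append(current)

    return [max(c, key=lambda i: abs(window[i] - med)) for c in clusters]

# Toy window at 100 Hz: a three-sample burst near index 200 and a lone spike at 700.
window = [0.0] * 1000
window[200], window[201], window[203] = 4.0, 9.0, 5.0
window[700] = -8.0
apexes = find_apexes(window, fs=100)
```

In this toy window the burst collapses to its largest sample and the isolated spike forms its own cluster, so two apexes are returned.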
Fig. 2

Apex-centric tokenization procedure for an OSA event.

A full-night scan isolates these potential events, while the remaining signal is deemed largely devoid of events. For each apex, a symmetrical 100-second window is established, centered on the apex, encapsulating potential OSA events. PEs caused by noise are quite different from the OSA patterns, as shown in Supplementary Fig. 1. Segments without any apex are deemed ‘clean’ and are partitioned into evenly spaced, non-overlapping 100-second intervals. If the edges of these ‘clean’ segments fall within 1 min of the edge of a PE window, the corresponding segments are discarded to prevent potential artifacts.
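One possible reading of this windowing rule is sketched below, with apex times given in seconds. The function name `build_epochs` is hypothetical. Tiling the clean epochs on a fixed grid from the start of the record, and dropping apex windows that would extend past the record, are assumptions made for brevity.

```python
def build_epochs(apex_times, record_len_s, epoch_s=100.0, guard_s=60.0):
    """Return (event_epochs, clean_epochs) as lists of (start, end) in seconds.

    Event epochs are centered on each apex. Clean epochs are non-overlapping
    tiles that stay at least guard_s away from every event epoch.
    """
    half = epoch_s / 2.0
    event = [(t - half, t + half) for t in sorted(apex_times)
             if t - half >= 0 and t + half <= record_len_s]

    clean = []
    start = 0.0
    while start + epoch_s <= record_len_s:
        end = start + epoch_s
        # Discard tiles that come within guard_s of any event window.
        near_event = any(start < e_end + guard_s and end > e_start - guard_s
                         for e_start, e_end in event)
        if not near_event:
            clean.append((start, end))
        start = end
    return event, clean

# Two apexes 30 s apart yield two overlapping event epochs.
event_epochs, clean_epochs = build_epochs([430.0, 460.0], record_len_s=1000.0)
```

Apexes that lie close together produce overlapping event epochs, which is why epoch spacing is uneven in unstable regions and regular in clean ones.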

Given the uneven distribution of apex positions, the inter-epoch intervals vary, with a minimum interval of 0.06 s. The median length of a 30-epoch sequence is 430.7 ± 127.5 s, while a “clean” 30-epoch sequence can have a maximum length of 3000 s. This tokenization method ensures the meticulous capture of potential events, while the strategic sampling in highly unstable regions forces the subsequent MHCA model to focus on detailed information about apnea, enhancing discrimination.

BiLSTM-MHCA algorithm

Algorithm design

To enhance specificity in the presence of motion artifacts, such as limb movements or respiratory asynchrony between the chest and abdomen, we employed a BiLSTM-MHCA architecture15. This model leverages the principle of respiratory control: apneic pauses elevate PaCO₂ levels, triggering compensatory increases in respiratory rate and tidal volume until normocapnia is restored32. Our model utilizes BiLSTM to compare the post-apneic ventilation patterns against expected PaCO₂ responses to differentiate true OSA events.

The BiLSTM detects epoch-wise OSA events but may struggle with severe OSA, where recurrent apneas interrupt PaCO₂ recovery. We therefore added a second-stage MHCA to model longer-term respiratory patterns across multiple cycles.

The detailed procedures are as follows:

The BiLSTM first classifies respiratory epochs into five categories, generating similarity scores. The scores are generated by encoding the input X into hidden states h and then transforming h into class logits z = W^T h + b. Here, W and b are learned parameters, and y is the ground truth label. Class probabilities are computed by Eq. (3).

$$p\left( {y = c{\text{|}}X} \right) = \frac{{e^{{z_{c} }} }}{{\mathop \sum \nolimits_{k} e^{{z_{k} }} }},~~{\text{c}},{\text{ k}} \in \left\{ {{\text{1}},...,{\text{5}}} \right\}.$$
(3)
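As a small worked instance of Eq. (3), the sketch below turns five logits into class probabilities. The logit values are invented for illustration, and the label order is an assumption.

```python
import math

def softmax(logits):
    """Convert class logits z into probabilities (Eq. 3). The maximum logit
    is subtracted first for numerical stability; the result is unchanged."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One logit per epoch label (hypothetical values and ordering).
labels = ["normal", "OSA", "hypopnea", "RLS", "snoring"]
probs = softmax([0.2, 2.1, 1.0, -0.5, 0.0])
predicted = labels[probs.index(max(probs))]
```

The five probabilities, rather than only the winning label, are what a second stage can use as per-epoch similarity scores.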

These scores are combined with entropy, amplitude, posture, and respiratory rate into 30-epoch non-overlapping sequences, which are subsequently fed into the MHCA for refined predictions. The workflow is illustrated in Fig. 3.

Fig. 3

(A) Illustration of the multi-stage data flow. (B) Detailed design of each module. AHIseq and ODIseq are defined in the “Target labelling and recovering” section.

The key component of this structure is causal self-attention, also known as masked self-attention. It ensures that the model aligns with the natural progression of events, and refines the epoch-wise predictions, reducing erratic epoch-wise prediction and enhancing the model’s sensitivity to specific occurrences.

To enforce the causal (or autoregressive) property, a mask is applied to the attention scores. The attention scores are computed as the dot product between the Query and Key vectors, scaled by the square root of the key dimension dk. In the attention mechanism, Query (Q), Key (K), and Value (V) matrices are standard. Q queries information from the Keys, K determines the relevance of the corresponding Values, and V contains the actual content to be extracted.

The mask M sets the attention scores of future positions to negative infinity, ensuring that they do not influence the attention weights.

$$Attention(Q,K,V) = softmax\left( {\frac{{QK^{T} }}{{\sqrt {d_{k} } }} + M} \right)V$$
(4)
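The effect of the mask in Eq. (4) can be seen in a single-head sketch written with plain lists. The function name `causal_attention` and the toy matrices are hypothetical. A multi-head version would run several such heads in parallel on learned projections of the input, which are omitted here.

```python
import math

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask (Eq. 4).

    Q, K, V are lists of row vectors, one row per epoch in the sequence.
    Position i may only attend to positions j <= i.
    """
    d_k = len(K[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        # Future positions (j > i) receive the mask value -inf.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k) if j <= i else float("-inf")
            for j in range(len(K))
        ]
        m = max(scores)  # finite, because j == i is never masked
        exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out, weights

# Toy sequence of three epochs with two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(X, X, X)
```

The first epoch can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]`. Later epochs spread their weight over the current and earlier epochs only.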

To demonstrate the advantages of the apex-centric tokenization and MHCA, we conducted an ablation study. We compared the performance of our approach with variations that used an even-spaced window size and those without the MHCA module. To ensure comparable sequence lengths, we used a 100-second epoch length with a 15-second sliding step to create 450-second sequences, which is similar to the median sequence length of our flexible apex-centric tokenization method, as described in “Apex-centric tokenization” section.

Target labelling and recovering

AHI is a measure used in the diagnosis and assessment of sleep-disordered breathing, particularly in cases of OSA and other sleep-related breathing disorders33. The definition of AHI used in this study is shown in Eqs. (5)–(6). It represents the average number of OSA (complete pauses in breathing) and hypopnea (shallow breaths or reduced airflow) events per hour of sleep34, computed either from the whole-night data (AHIsub) or from a 30-epoch sequence (AHIseq). Our study uses five labels for each epoch: normal, OSA, hypopnea, RLS, and snoring.

For each subject, we use AHIsub to stratify the OSA severity.

$$AHI_{sub} = \frac{{\text{Apnea counts} + \text{Hypopnea counts}}}{{\text{Total sleep time}}}$$
(5)
$$AHI_{seq} = \frac{{\text{Apnea counts} + \text{Hypopnea counts}}}{{\text{Sequence duration}}}$$
(6)
$$ODI_{sub} = \frac{{\text{Oxygen desaturation counts}}}{{\text{Total sleep time}}}$$
(7)
$$ODI_{seq} = \frac{{\text{Oxygen desaturation counts}}}{{\text{Sequence duration}}}$$
(8)
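All four indices share the same form, so a single helper covers Eqs. (5)–(8). The sketch below assumes durations are given in seconds and results are reported in events/h; the function name `events_per_hour` and the counts are hypothetical.

```python
def events_per_hour(event_count, duration_s):
    """Event index in events/h, the common form of AHI and ODI (Eqs. 5-8)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return event_count * 3600.0 / duration_s

# Whole night: 12 apneas and 30 hypopneas over 7 h of sleep.
ahi_sub = events_per_hour(12 + 30, 7 * 3600)
# One sequence: 3 apneas and 2 hypopneas within 430.7 s.
ahi_seq = events_per_hour(3 + 2, 430.7)
```

Because a sequence covers only a few minutes, a handful of events already gives a high AHIseq. The sequence-level and subject-level values therefore sit on very different scales.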

ODI measures the frequency of oxygen desaturation events35 and is used to stratify hypoxia severity, as defined in Eqs. (7)–(8). ODIsub and AHIsub are positively correlated but not interchangeable, as shown by the red arrow in Fig. 4. RLSsub and Snoringsub were defined similarly.

Fig. 4

Correlation of AHIsub with (A) ODIsub, (B) snoringsub and (C) RLSsub.

Since we adopted an apex-centric tokenization method, which may produce significant overlap between epochs, an event counting procedure was necessary to derive AHIsub or AHIseq. For a sequence with 30 estimated labels, we identified the epochs labeled as OSA and hypopnea. We then evaluated the distances between the apexes of these events and defined a 60-second ‘dead-response’ time. An apex that fell within this ‘dead-response’ zone was absorbed by the prior apex. This process was repeated until all intervals exceeded 60 s, as illustrated in Supplementary Fig. 2. The threshold was chosen by optimizing the consistency with the reference counts.
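The counting rule can be sketched as a single pass over sorted apex times. The function name `count_events` is hypothetical. Treating a gap of exactly 60 s as still inside the dead-response zone is an assumption about the boundary case.

```python
def count_events(apex_times_s, dead_time_s=60.0):
    """Count respiratory events after absorbing every apex that falls within
    the dead-response time of the previously retained apex."""
    kept = []
    for t in sorted(apex_times_s):
        # Keep an apex only if it lies beyond the dead-response zone.
        if not kept or t - kept[-1] > dead_time_s:
            kept.append(t)
    return len(kept), kept

# Apex times (s) of epochs labeled OSA or hypopnea within one sequence.
n_events, retained = count_events([10, 40, 65, 130, 150, 400])
```

In this toy sequence six labeled epochs reduce to three counted events. Comparing each apex with the last retained one leaves every remaining interval longer than the dead time, so no second pass is needed.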

Data partitioning

Due to misalignment in PSG and actigraphy data, 16 challenging-to-align subjects were excluded from training and used as testing data. For the remaining 42 subjects:

First-Tier Training: To prevent data leakage from overlapping epochs, each training iteration randomly selected a single epoch from each overlapping set. Adjacent epochs were excluded to prevent data leakage. This produced 592 OSA, 638 hypopnea, 804 RLS, and 1518 snoring events per iteration. We balanced categories by resampling to match the least frequent class and split data 80/10/10 for training, validation, and testing. By randomly selecting a single epoch from each cluster of overlapping epochs during each training iteration, we enhanced training diversity while effectively controlling overfitting.
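The sampling logic of one first-tier iteration might look like the sketch below. The function names, epoch identifiers, and cluster contents are hypothetical. Balancing is shown as downsampling to the smallest class, which is one assumed reading of "resampling".

```python
import random

def sample_balanced(clusters_by_class, rng):
    """Pick one epoch per overlapping cluster, then downsample every class
    to the size of the least frequent one."""
    picked = {label: [rng.choice(cluster) for cluster in clusters]
              for label, clusters in clusters_by_class.items()}
    n_min = min(len(epochs) for epochs in picked.values())
    return {label: rng.sample(epochs, n_min) for label, epochs in picked.items()}

def split_80_10_10(items, rng):
    """Shuffle and split into training, validation, and testing subsets."""
    items = list(items)
    rng.shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

rng = random.Random(0)  # fixed seed so the iteration is reproducible
clusters = {
    "OSA": [["o1", "o2"], ["o3"]],               # two clusters of overlapping epochs
    "snoring": [["s1"], ["s2", "s3"], ["s4"]],   # three clusters
}
balanced = sample_balanced(clusters, rng)
train, val, test = split_80_10_10(range(10), rng)
```

Drawing a fresh epoch from each cluster on every iteration varies the training set while ensuring that overlapping epochs never appear together.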

Second-Tier Training: Data was sorted into three AHIseq categories: <10, 10–30, and > 30 events/h. An equal number of sequences was selected from each category, with the same 80/10/10 split for training, validation and testing. The remaining sequences were allocated to the testing set. Since the sequences were non-overlapping to begin with, no additional screening was required. This division helped maintain a large training pool while minimizing data contamination.

The training process was repeated 20 times to ensure robust evaluation.

Performance evaluation

Median absolute error (MAE) and correlation evaluation

MAE was used to evaluate AHIseq and ODIseq estimation performance. Pearson correlation coefficients assessed the similarity between reference and estimated values.
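Both measures are simple to compute, as the sketch below shows on invented reference and estimated values. The function names are hypothetical. The median of the absolute errors follows the subsection heading, and the mean is shown alongside only for comparison.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def absolute_errors(reference, estimate):
    """Element-wise absolute differences between reference and estimate."""
    return [abs(r - e) for r, e in zip(reference, estimate)]

# Hypothetical AHIseq values (events/h) for five sequences.
reference = [0, 5, 12, 30, 45]
estimate = [1, 4, 15, 26, 40]
errors = absolute_errors(reference, estimate)
median_ae = statistics.median(errors)
mean_ae = statistics.fmean(errors)
r = pearson_r(reference, estimate)
```

Correlation and absolute error answer different questions: an estimate can track the reference closely in rank and trend while still carrying a systematic offset.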

Classification evaluation

We used clinical standards to classify the subjects into normal, mild, moderate, and severe OSA patients. For our analysis, we also employed a simplified classification, dividing the subjects into two groups: normal/mild and moderate/severe. The first group (normal/mild) does not require clinical intervention, while the second group (moderate/severe) does.
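The two grouping schemes can be written as small helpers, using the AHIsub cut-offs of 5, 15, and 30 events/h reported with the cohort description. The function names are hypothetical.

```python
def osa_severity(ahi):
    """Map an AHI value (events/h) to the four-class clinical severity."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"

def needs_intervention(ahi):
    """Simplified two-class grouping: moderate/severe versus normal/mild."""
    return osa_severity(ahi) in ("moderate", "severe")

groups = [osa_severity(a) for a in (2.0, 9.5, 22.0, 41.0)]
```

The two-class grouping depends only on the 15 events/h boundary, so errors between neighboring classes on the same side of it do not change the intervention decision.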

To evaluate the performance of our classification model, we employed sensitivity and specificity, as defined in Eqs. (9)–(10). For event-level or subject-level classification, TP (True Positives) is the number of correct identifications of an event or subject-level OSA severity. FN (False Negatives) is the number of events or severity labels that were missed, i.e., incorrectly assigned to the non-event or opposite label. FP (False Positives) is the number of incorrect detections of an event or label, and TN (True Negatives) is the number of correct non-event or opposite label detections.

$$Sensitivity = \frac{{TP}}{{TP + FN}}$$
(9)
$$Specificity = \frac{{TN}}{{TN + FP}}$$
(10)

For multi-class classification, the overall sensitivity and specificity are the weighted sum of the class-wise evaluations. We also used Cohen’s kappa to test agreement between estimated and reference classes.
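These metrics can be sketched for the two-class case as follows, with Cohen's kappa in its standard unweighted form. The function names, labels, and predictions are hypothetical, and the class-weighted multi-class variant is not shown.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Sensitivity and specificity for one positive class (Eqs. 9-10)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

def cohens_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Ten hypothetical subjects: four moderate/severe, six normal/mild.
y_true = ["mod/sev"] * 4 + ["norm/mild"] * 6
y_pred = ["mod/sev"] * 3 + ["norm/mild"] * 6 + ["mod/sev"]
sens, spec = sensitivity_specificity(y_true, y_pred, positive="mod/sev")
kappa = cohens_kappa(y_true, y_pred)
```

Here eight of the ten subjects agree, yet kappa is only about 0.58 because part of that agreement would be expected by chance given the class proportions.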

These metrics were analyzed on a per-event, per-measurement, and per-subject basis to address their different application scenarios. For decisions regarding medical intervention, a per-subject analysis is sufficient. For a more detailed analysis of event distribution, a per-event analysis is necessary. The per-measurement analysis can compare the performance of the dominant and non-dominant hands and help in designing a more effective protocol.

Influence of handedness, co-existing diseases, sleep stages and postures

In many clinical studies, the wrist watches were worn on the dominant hand. To evaluate the effect of handedness, we compared the AHIsub and ODIsub estimation accuracy between the dominant and non-dominant hands.

We also analyzed the impact of co-existing conditions such as RLS and snoring. Additionally, we examined variations in sleep stages (N1, N2, N3, and REM) and body postures (supine, left-side, and right-side), which account for 93.8% of all observed postures. These factors were essential for a comprehensive understanding of the data.

Results and discussion

Patient characterization

The cohort’s characteristics, sleep durations, and event distribution are shown in Fig. 5 and Tables 1 and 2. Sleep patterns are heterogeneous, with various events including spontaneous awakenings (ARO SPONT), respiratory-related awakenings (ARO RES), limb movement-related awakenings (ARO Limb), periodic leg movement-related awakenings (ARO PLM), blood oxygenation (SpO2) artifacts, central sleep apnea (CSA), and mixed apnea. CSA occurs when the brain fails to send proper signals to breathing muscles, while mixed apnea combines elements of OSA and CSA.

Snoring happens most frequently, followed by oxygen desaturation. OSA incidents last an average of 20 s, while oxygen desaturation events span about 30 s. This distribution allowed our apex-centered tokenization to achieve complete or near-complete encapsulation of OSA events. Among the study cohort, 41 subjects were normal (AHIsub < 5), nine had mild OSA (5 ≤ AHIsub < 15), four had moderate OSA (15 ≤ AHIsub < 30), and four had severe OSA (AHIsub ≥ 30).

Fig. 5

Event distribution and durations.

Table 1 Subject characteristics and sleep-related parameters (mean ± SD).
Table 2 Event counts and durations (mean ± SD).

Comparison of RIP, nasal flow and processed actigraphy data

Our investigation revealed that transforming raw triaxial accelerometer data through PCA to IMFs effectively retained the subtle vibrations induced by respiratory activity. Simultaneous examination of signals during OSA episodes demonstrated that the chosen IMF closely aligns with the thoracic movements, as shown in Fig. 6. The signal comparisons during RLS, snoring, and hypopnea are shown in Supplementary Fig. 3.

Fig. 6

Concurrent signals and processed data during OSA events. First row: the raw data, the first principal component and the used IMF; Second row: experimental thoracic, abdomen and nasal flow signal.

Apnea severity estimation

The distribution of AHIseq and AHIsub is shown in Supplementary Fig. 4, with AHIsub being the weighted sum of AHIseq for each subject.

Oxygen desaturation can result from both hypopneas and OSA. Our event classification system effectively identifies OSA-related oxygen desaturation events, achieving a per-event sensitivity of 67.4% and specificity of 92.5%.

On a per-measurement level, the Pearson correlation coefficient between the estimated and reference AHIsub is 0.87, and for ODIsub, it is 0.89. The red circle with dashed line in Fig. 7A and B represents measurements when subjects have high AHI but low ODI, as illustrated in Fig. 4. These results suggest that the system is less sensitive to hypopneas without accompanying oxygen desaturation. The comparison of reference and estimated AHIsub, ODIsub, and RLSsub is shown in Fig. 7. However, snoring detection showed a lack of correlation with reference data, which is presented in Supplementary Fig. 5.

On a per-subject level, we used left-hand data due to its better performance. Given that AHIsub was underestimated at higher values (Fig. 7A), we stratified AHIsub−est using thresholds of 5, 10, and 15, similar to ref. 15. For the standard four-class apnea severity (normal, mild, moderate, severe), Cohen’s kappa for AHIsub and ODIsub on the left hand was 0.68 and 0.54, respectively (Table 3). When simplified into two categories (normal/mild and moderate/severe), Cohen’s kappa improved significantly. For ODIsub, it increased to 0.84. The sensitivity and specificity for moderate/severe OSA detection were 85.7% and 98.1%, respectively.

Fig. 7

(A–C) Correlation between reference AHIsub, ODIsub, and RLSsub and their estimates; (D–F) corresponding Bland–Altman plots. The limits of agreement (±1.96 SD) are drawn as dotted lines.

This result is reflected in the confusion matrix (Fig. 8), where the algorithm nearly perfectly distinguishes between conditions requiring intervention (moderate/severe) and those that do not. Notably, the algorithm effectively distinguishes cases with high AHI but relatively lower ODI, highlighting its ability to capture subtle, diagnostically relevant distinctions. Only one subject with moderate OSA was misclassified as normal; this subject had 58 arousals and 7 central apneas during sleep, indicating potential signal degradation. Other comorbidities, such as snoring or RLS, slightly reduce quantification accuracy but do not significantly impact diagnostic value, as illustrated in Fig. 9.

Notably, while right-hand data performed less optimally for four-class classification, it achieved similar results when using the two-class system.

Table 3 Cohen’s kappa between estimated and expected AHIsub and ODIsub.

The performance for the 16 subjects reserved for testing is shown in Supplementary Fig. 6. Using four AHIsub classes, 6 out of 16 were misclassified, primarily normal and mild OSA patients. When using the two-category system, only 1 subject was misclassified.

Fig. 8

Comparison of OSA severity when classification is based on (A) AHIsub and (B) ODIsub.

Influencing factors

Handedness and posture

OSA detection performance was enhanced when using left-hand data or when the individual adopted a left-side sleeping posture, as shown in Table 3 and Fig. 9. Handedness did not significantly influence outcomes, as both left-handed and right-handed individuals achieved optimal results with their left hand. One possible explanation is that left-side positioning was associated with a ~50% higher average AHIseq compared to right-side sleeping, consistent with reports that left decubitus may promote airway collapse in OSA patients36. We speculate that left-side episodes generate more characteristic signals, potentially reflecting more severe airway obstructions.

Influence of sleep stages and co-existing conditions on OSA detection

In the absence of RLS or snoring, the MAE of AHIseq for sequences with OSA (AHIseq>0) was 7.37 events/h, with a standard deviation of 11.36 events/h, lower than the 14.74 events/h reported standard deviation in a previous study15. The presence of RLS and snoring did not negatively impact AHIseq accuracy (Fig. 9).

AHIseq estimation shows stage-dependent variability: In REM sleep, accuracy decreases, probably due to motion artifacts from wrist movements and irregular breathing patterns, compounded by reduced respiratory muscle activity that weakens motion signals37,38. During N2 sleep, diminished PaCO₂ chemosensitivity results in mild hypoventilation and less distinct post-apneic ventilation patterns, increasing AHIseq variability39. Apnea rates rose moderately in N3 (30% increase) and substantially in REM (66% increase), with ODIseq following similar trends.

It is worth noting that for sequences without OSA events (AHIseq=0), the mean AHIseq error is 1.67 events/h, with a median of 0.

Fig. 9

For sequences with OSA events, AHIseq estimation error under various influences.

Influence of device brands

In this study, subjects were matched across groups by sample size and gender. Preliminary interviews also indicated comparable self-reported sleep apnea risk levels in daily life. However, during actual testing, we observed that subjects wearing Axivity devices exhibited significantly more snoring events, indicating that before objective testing, individuals may misinterpret severe snoring as OSA.

Both devices demonstrated comparable performance in AHIseq estimation, as confirmed by a two-sample t-test (Geneactiv: 2.64 events/hour vs. Axivity: 2.42 events/hour; p > 0.05), indicating that both are equally suitable for OSA detection and AHI/ODI estimation.

Ablation study

We found that the apex-centric tokenization method enhanced OSA event detection sensitivity while maintaining specificity. The MHCA module significantly improved both sensitivity and specificity by optimizing physiological sequential predictions, but only when used with apex-centric tokenization.

Interestingly, adding the MHCA mechanism decreased the overall performance when even-spaced tokenization was used, likely due to low sensitivity to OSA events (Table 4). In contrast, apex-centric tokenization retained most potential events, resulting in higher sensitivity. The sensitivity and specificity comparison in Table 4 was based on a per-event evaluation.

Table 4 Performance enhancement by proper tokenization and multi-stage structure.

Comparison with previous studies

Few studies have used wrist actigraphy for apnea detection, so we compared our work with similar sensor types (Table 5). Our multi-stage approach outperformed previous studies in a ‘real-world’ population15. While our per-event classification in a heterogeneous population had slightly lower performance than studies using simpler cohorts, our per-subject evaluation showed better or comparable sensitivity and specificity for moderate to severe OSA.

Table 5 Comparison with previous studies using advanced machine learning, with and without heterogeneous populations.

Traditional actigraphy devices like Geneactiv and Axivity have lower resolution (3.9 mg) compared to newer sensors (< 0.1 mg), making respiratory detection challenging. Our algorithm was developed to work with these devices due to their retrospective research value. A wealth of data is available from these devices, such as the UK Biobank, in which over 100,000 participants have 7-day actigraphy data41, which can provide valuable clinical insights if well utilized.

Novelties and implications

Several key technological innovations distinguished this study:

  1. ODI estimation using only motion sensors: AHI is commonly used for OSA diagnosis, but ODI may be more relevant for long-term complications and prognosis. Wrist actigraphy lacks optical sensors for direct oxygen saturation measurement. However, we successfully inferred ODI using accelerometer data, advancing the use of this low-cost, power-efficient technology.

  2. Apex-centric tokenization method: We designed a novel apex-centric tokenization method based on apex detection and clustering. This method addressed the issue of incomplete event coverage and enhanced the sensitivity to potential events. By strategically sampling unstable regions and using physiological reasoning, we forced the MHCA to focus on meaningful temporal patterns. This resulted in better specificity while maintaining good sensitivity.

  3. Multi-stage approach with causal attention mechanism: This design integrates the strengths of BiLSTM and MHCA. The causal attention emphasizes physical plausibility at different levels of granularity, significantly reducing the impact of noise and motion artifacts caused by disturbances such as RLS and snoring.

  4. Comprehensive evaluation in practical application: Using a novel bilateral experimental design, we demonstrated that wearing the actigraphy device on the left hand led to improved performance. RLS and snoring were shown to have minimal impact on OSA detection. However, the accuracy of OSA detection varies with OSA severity, which correlates with different sleep stages and postures. These factors should be considered in clinical practice.

The findings of this study showed that actigraphy could be a practical alternative to PSG for OSA screening, which is robust against disturbances from RLS and snoring.

Although our algorithm produced promising results, several limitations need to be addressed. The low resolution of the accelerometer sensors limits the reliable quantification of mild OSA. In addition, the duration of each event cannot be accurately determined. Our method is more effective for detecting apnea with oxygen desaturation than pure apnea. Finally, most subjects had mental comorbidities; however, since the algorithm does not rely on biometric information, this limitation does not significantly affect the results.

Conclusion

In conclusion, our innovative combination of apex-centric tokenization and MHCA demonstrated the feasibility of using low-resolution wrist actigraphy to detect moderate to severe OSA with high accuracy, even in the presence of snoring and RLS. This method shows significant potential for large-scale public health studies.