Diagnosing sleep-related paroxysmal motor events accurately remains a significant clinical challenge, particularly when differentiating epileptic seizures from parasomnias1. Although these conditions are distinct in terms of underlying mechanisms, their external manifestations during sleep often share overlapping motor characteristics, leading to potential diagnostic confusion. Experienced clinicians rely on comprehensive clinical history, video-polysomnography, and prolonged video-EEG recordings to make accurate distinctions. However, these methods can be resource-intensive, time-consuming, and prone to variability between observers, particularly in borderline cases or in institutions lacking subspecialty expertise1.

The clinical overlap between disorders such as Sleep-Related Hypermotor Epilepsy (SHE), Disorders of Arousal (DOA), and REM Sleep Behavior Disorder (RBD) has been well documented. For example, episodes in both SHE and parasomnias may present with complex motor behaviors, including sudden arousals, limb movements, or vocalizations, complicating the diagnostic process2,3,4. This is particularly relevant in children and young adults, where semiologic differences can be subtle.

Recent advances in artificial intelligence have introduced the possibility of supporting this diagnostic process. Video-based action recognition methods have gained traction, leveraging deep learning to extract motion patterns from raw video data without the need for wearable sensors or external markers5. These approaches offer the potential to streamline diagnostic workflows, enhance reproducibility, and support clinicians, especially in environments lacking full neurophysiological monitoring5,6,7,8,9,10.

Building on our earlier pilot work, which showed that SHE and DOA could be distinguished using automated video classification5, we now extend this framework by incorporating RBD alongside SHE and DOA and by leveraging a larger and more heterogeneous dataset. In this multicenter study, we analyzed 253 annotated video recordings from 167 participants. The recordings were acquired under heterogeneous conditions, reflecting real-world clinical variability. As a further advance over our previous work, we employed the SlowFast neural network architecture, which combines dual temporal-resolution pathways to analyze both fast and slow visual cues, thereby capturing a wide range of motor patterns11. This approach was evaluated as a fully automated video-based classifier of SHE, DOA, and RBD. A complete overview of the study workflow is shown in Fig. 1.

Fig. 1: Overview of the workflow.

A Sketch of the video acquisition setup. B Schematic representation of the SlowFast network, illustrating its dual-pathway design: the slow pathway processes a temporally down-sampled sequence to capture overall spatial context (height H, width W, time T, channels C), while the fast pathway operates at a higher temporal resolution to capture more rapid motion patterns. Features from both pathways are fused to generate the final classification output. C Test procedure, showing how the trained network provides the final classification (SHE/DOA/RBD) for each input video.

To determine the most effective model architecture for this classification task, we first benchmarked several leading video action-recognition networks, including Temporal Segment Networks (TSN)12 and the R(2 + 1)D model13. However, these models demonstrated limited accuracy, generally around 50%, and were therefore deemed inadequate.

The SlowFast model11 was selected as the optimal solution based on its performance. It was tested across three independently constructed data splits, in which no individual’s data appeared in more than one set (train/validation/test); in particular, each test set contained a single video per participant. All reported metrics therefore reflect patient-level classification performance. These partitions ensured that the evaluation was robust against overfitting and participant-specific bias.
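
The participant-level partitioning described above can be sketched with a small standard-library helper. This is an illustrative reconstruction, not the authors' code: the function name, the ID-to-clip mapping, and the seed are hypothetical, but the invariant it enforces (each participant wholly assigned to one set, with a single clip per participant in validation and test) matches the split design described here.

```python
import random

def participant_level_split(videos_by_participant, n_val=8, n_test=8, seed=0):
    """Partition videos so that each participant appears in exactly one set.

    `videos_by_participant` maps a participant ID to that person's clips.
    Validation and test each receive `n_val` / `n_test` participants with
    one clip per participant; all remaining participants (and all of their
    clips) go to training.
    """
    rng = random.Random(seed)
    ids = sorted(videos_by_participant)
    rng.shuffle(ids)
    test_ids = ids[:n_test]
    val_ids = ids[n_test:n_test + n_val]
    train_ids = ids[n_test + n_val:]
    return {
        # One clip per participant in val/test, as in the splits above.
        "test": [videos_by_participant[i][0] for i in test_ids],
        "val": [videos_by_participant[i][0] for i in val_ids],
        # Training keeps every clip from its participants.
        "train": [v for i in train_ids for v in videos_by_participant[i]],
    }
```

A split produced this way can be checked by mapping each clip back to its participant and verifying that the three sets of participant IDs are pairwise disjoint.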

As can be seen in Table 1, across the three validation splits the model achieved a mean classification accuracy of 83% ± 3.6% (95% Wilson confidence interval: 73–90%), with consistently high performance in identifying SHE (mean F1 = 88%) and slightly lower but comparable performance for DOA (F1 = 79%) and RBD (F1 = 83%). The confusion matrix in Fig. 2 highlights this pattern, showing that most errors occurred between DOA and RBD, reflecting their clinical and motor overlap. Performance was most stable across splits for SHE (recall = 92%), while greater variability was observed for RBD (recall range 62–100%). In addition to the recall and F1 trends, the model achieved consistent overall specificity across splits (Split 1: 93.7%, Split 2: 91.7%, Split 3: 89.6%; overall: 91.7%), indicating stable performance in correctly rejecting non-target classes.

A slight reduction in overall accuracy was observed in Split 3 (79%), mainly due to misclassifications between DOA and RBD, likely related to borderline or atypical examples within this subset. Both DOA and RBD can present with overlapping or ambiguous motor manifestations, particularly when dream-enactment-like or subtle motor behaviors occur. In Split 3, the test data included cases with greater variability in movement patterns: several DOA episodes displayed complex motor behaviors partly resembling RBD, while some RBD cases were characterized by limited or less distinctive activity. This heterogeneity likely contributed to the reduced discriminability between the two classes. Nevertheless, performance remained stable across the other splits, supporting the robustness of the proposed model despite interindividual variability in behavioral expression.

Fig. 2: General confusion matrix combining the results of all three splits.

Rows represent the actual classes, and columns represent the predicted classes. SHE: Sleep-Related Hypermotor Epilepsy, DOA: Disorders of Arousal, RBD: REM Sleep Behavior Disorder.

Table 1 Precision (P), Recall (R) and F1-score for each diagnostic group (Sleep-Related Hypermotor Epilepsy (SHE), Disorders of Arousal (DOA), and REM Sleep Behavior Disorder (RBD)) in each split and overall

Two false-negative SHE cases were identified, both in Split 2: one from a very young participant misclassified as RBD, and another misclassified as DOA. The first involved brief myoclonic jerks resembling RBD-like twitches, while the second showed agitated movements followed by sitting up and an attempt to get out of bed, mimicking a confusional arousal or sleepwalking episode, as shown in Fig. 3. This qualitative visualization provides an example of overlap in motor patterns, such as partial arousals or complex motor sequences, that can lead to model confusion between epileptic and parasomnic events. Despite these challenges, the model demonstrated robust and generalizable performance across all splits, particularly in distinguishing SHE from parasomnias.

Fig. 3: Example of a misclassified event (SHE predicted as DOA).

Representative anonymized frames show a Sleep-Related Hypermotor Epilepsy (SHE) episode characterized by slow, agitated movements followed by partial rising and an attempt to leave the bed.

To assess inter-center generalization, we conducted an additional experiment excluding all RBD videos from one of the centers during training and validation, with 8 of them randomly selected and used solely for testing. In this configuration, the model achieved an overall accuracy of 83% (20/24 videos correctly classified). All SHE cases were correctly identified, while two RBD episodes were misclassified as DOA and two DOA episodes were misclassified (one as RBD, one as SHE). This finding indicates that the model retains good generalization capability when applied to data from an unseen clinical site.

This study highlights that deep learning, when applied to nocturnal video recordings, can offer a reliable, automated method for classifying three major categories of sleep-related motor disorders: SHE, DOA, and RBD. The application of the SlowFast architecture, with its dual temporal pathway design, was especially effective in extracting complex motor features spanning multiple time scales. Compared to other 3D CNNs tested, the SlowFast model delivered superior performance and generalizability.

One of the model’s strongest results was in the classification of SHE, which was consistently identified with high precision across all test splits. This is notable because SHE is often difficult to diagnose due to its behavioral overlap with parasomnias. The model’s accuracy in this regard underscores its potential role as a diagnostic aid, especially in cases where expert neurophysiologic interpretation may not be available. However, the model showed reduced accuracy in distinguishing between DOA and RBD, particularly in Split 3. This limitation mirrors clinical challenges, where these parasomnia types often require careful consideration of contextual factors such as sleep stage, age, comorbidities, or even associated vocalizations, none of which were available to the model in this study. This underscores the value of a multi-modal approach and highlights opportunities for future development.

To further explore model robustness across acquisition sites, we performed a complementary analysis excluding RBD recordings from one center (Bellaria Hospital in Bologna) during training and validation and using them only at test time. The model maintained satisfactory performance (83% accuracy), comparable to the overall performance of the three original splits, correctly identifying all SHE events. This result supports the potential generalizability of video-based models to unseen clinical environments, while emphasizing the need for more balanced multi-center datasets to fully assess inter-site performance. A complete leave-one-center-out analysis was not feasible at this stage due to the imbalanced distribution of participants across classes among centers, an aspect that we plan to address in future work. Nonetheless, our dataset, drawn from multiple sleep centers using heterogeneous recording protocols and equipment, provides a realistic and ecologically valid testbed for evaluating generalizability. The variability in video quality, lighting, and resolution adds robustness to our findings, suggesting that similar models could be deployed across diverse clinical settings without extensive recalibration.

Future work will expand the dataset to include additional paroxysmal events and continuous overnight recordings, enabling assessment of age-related variability, event detection performance, and false positive rates across entire nights. Additional data acquisition will also be necessary to balance the number of individuals per class across centers, ensuring a more even class distribution and allowing for a meaningful leave-one-center-out analysis. It will also be of interest to explore multimodal approaches: first by integrating audio signals to capture vocalizations, then by incorporating textual information such as demographic data and physicians’ reports, and ultimately by extending the analysis to include EEG recordings. The integration of these complementary data sources is expected to enhance the model’s accuracy and overall diagnostic reliability. Finally, future work should also investigate the impact of varying the dimensions of the two pathways in the SlowFast network on model accuracy. For prospective applications, automated anonymization and controlled access pipelines will be implemented to ensure data privacy and reproducibility across centers.

In summary, our findings represent a promising proof of concept that warrants prospective and on-site validation. When validated further, such tools could assist in triage, diagnosis, or longitudinal monitoring of people with suspected nocturnal motor events, reducing diagnostic delays and relieving the burden on expert clinical teams.

Methods

Dataset

This retrospective study was conducted using video recordings acquired from five centers: Niguarda Hospital and IRCCS San Raffaele Hospital in Milan, Giannina Gaslini Hospital in Genoa, the Neurocenter of Southern Switzerland in Lugano, and Bellaria Hospital in Bologna. Ethical approval was granted by the Niguarda Hospital ethics committee (ID 939–12.12.2013), and all participants or their guardians provided written informed consent to the use of their recordings for research purposes.

The dataset included 253 video clips from 167 participants: 73 diagnosed with SHE, 53 with DOA, and 41 with RBD. Recordings were acquired using a variety of video-polysomnographic setups and reflect wide heterogeneity in temporal resolution (24–30 frames per second), camera angle, lighting, and background. Event durations ranged from 3 to 138 s, with a mean duration ± standard deviation of 28 ± 22 s. Event annotation was performed independently by two experienced experts at each participating center. A third senior expert subsequently reviewed all annotated videos across centers to ensure inter-center consistency. Only unequivocal events from patients with confirmed diagnoses, based on comprehensive clinical, neurophysiological, neuroradiological (when needed), and follow-up data, were included. All annotations corresponded to diagnostically certain events, and full agreement was reached among raters; therefore, no formal inter-rater reliability statistics were computed.

Pre-processing

Only minimal preprocessing was applied: videos were resized to 224 × 224 pixels to match the SlowFast input size, kept at their original frame rates, and uniformly subsampled to 32 and 8 frames to feed, respectively, the fast and slow pathways of the SlowFast network, while preserving the original temporal dynamics. This approach was intended to assess model robustness under heterogeneous recording conditions.
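
The subsampling step can be sketched as follows. This is a simplified illustration under our reading of the setup, not the authors' implementation: it picks 32 evenly spaced frame indices for the fast pathway and keeps every fourth of those for the slow pathway, so both pathways span the same clip at different temporal rates.

```python
def pathway_indices(num_frames, fast_len=32, slow_len=8):
    """Uniformly spaced frame indices for the two SlowFast input pathways.

    The fast pathway receives `fast_len` frames spread over the whole clip;
    the slow pathway keeps every (fast_len // slow_len)-th of them, so both
    cover the same time window at different sampling rates.
    """
    fast = [min(round(i * (num_frames - 1) / (fast_len - 1)), num_frames - 1)
            for i in range(fast_len)]
    slow = fast[:: fast_len // slow_len]
    return slow, fast
```

For example, a 30 s clip at 24 fps (720 frames) yields 32 fast indices from frame 0 to frame 719, with the slow pathway reading every fourth of them.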

Deep learning models

We treated the classification task as a multiclass action recognition problem14. Several deep learning architectures were evaluated. At first, the Temporal Segment Network (TSN)12 served as a baseline 2D CNN model, aggregating frame-level features over time to capture coarse motion dynamics. We next tested the R(2 + 1)D architecture13, a 3D convolutional model that decomposes spatiotemporal filters into separate spatial and temporal components, allowing finer motion modelling. Finally, given the poor performance of these baseline models, we adopted the SlowFast network11, which employs two parallel pathways operating at different temporal resolutions: a slow branch for detailed spatial semantics and a fast branch for rapid motion cues. The slow pathway processed low-frequency spatial patterns (i.e., contextual cues) by sampling 8 temporally spaced frames, while the fast pathway handled 32 densely sampled frames to capture short-term motion. Frame selection was dynamically adapted to each video’s duration. The network was initialized with pretrained weights from the Kinetics-400 dataset15.

Data splitting strategy

To ensure unbiased generalization, we adopted a three-split cross-validation design at the participant level. For each split, data were divided into training, validation, and test sets, ensuring that no participant contributed to more than one set. As detailed in Table 2, the training set comprised 119 participants (205 videos), with each individual contributing between 1 and 4 recordings (median = 1, range = 1–4), while the validation and test sets contained 8 participants each (8 unique videos), with one video per participant and no repetition of clips across or within splits. This approach produced three independent train–validation–test configurations, each with a distinct validation and test cohort, allowing us to evaluate model stability and generalization across different participant compositions. Hyperparameters were tuned using validation performance, while the final classification accuracy was computed as the average across the three test splits. To mitigate class imbalance, we employed a class-weighted focal loss during training.
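
The class-weighted focal loss combines a per-class weight with focal down-weighting of easy examples. A minimal scalar version of the standard formulation is sketched below; the default γ and the unit class weight are illustrative, not the values used in training.

```python
from math import log

def focal_loss(prob_true_class, class_weight=1.0, gamma=2.0):
    """Class-weighted focal loss for a single example.

    `prob_true_class` is the model's softmax probability for the correct
    class. The (1 - p)^gamma factor shrinks the loss contribution of
    well-classified examples, while `class_weight` counteracts class
    imbalance by up-weighting under-represented classes.
    """
    # Clamp to avoid log(0) for saturated predictions.
    p = min(max(prob_true_class, 1e-7), 1 - 1e-7)
    return -class_weight * (1 - p) ** gamma * log(p)
```

With γ = 0 the expression reduces to weighted cross-entropy; with γ > 0, a confidently correct prediction (p = 0.9) contributes far less loss than an uncertain one (p = 0.5), which keeps training focused on the harder minority-class episodes.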

Table 2 Demographic and clinical characteristics of the study cohort

To further assess inter-center generalization, we conducted an additional preliminary leave-one-center-out (LOCO) experiment. Although a full LOCO analysis is not feasible at this stage due to the strong imbalance in class distributions across centers (an aspect we plan to address in future work), we nonetheless performed a targeted experiment to obtain an initial indication of cross-site generalizability. In this experiment, all RBD videos from one center (Bellaria Hospital in Bologna, 14 clips from 14 unique participants) were excluded from training and validation, with 8 of them randomly selected and used exclusively for testing. This configuration simulated a previously unseen acquisition environment, providing an independent evaluation of the model’s robustness across clinical sites.