Introduction

Preterm birth (PTB), which is generally defined as delivery before 37 weeks of gestation, is the single largest cause of death in children under the age of 51, with ~1 million deaths occurring per year2. While some etiologies of PTB have been identified, many remain unknown. Previous literature has shown that disrupted maternal sleep patterns are associated with PTB outcomes3,4,5,6.

One major limitation of previous studies is their reliance on self-reported sleep patterns, which depends on a patient’s ability to recall their sleep accurately and consistently7. Wearable devices can alleviate this problem, as they provide a more reliable and detailed stream of data8,9. Previous literature has found that wearable sensor data can be used to make predictions regarding both physical and mental health issues, ranging from pancreatic complications10 to depression11.

Using data collected from wearables, we evaluate predictions of binary PTB outcomes for patients from a cohort study conducted at Washington University in St. Louis/BJC HealthCare12. Participants in this cohort study were given actigraphy watches to wear for 2 weeks during each trimester, capturing high-resolution sleep data. The collected actigraphy data are then transformed into interpretable quantitative features and used as input for several shallow machine learning (ML) models. These models are then evaluated to assess the relative impact of these features, offering clinical insights into the importance of individual sleep and non-sleep behaviors, as well as guidance for more complex ML models.

Previous work with this dataset has evaluated regression models of unengineered time-series data to predict the entire spectrum of gestational age (GA) directly from individual actigraphy samples13, which is intrinsically different in both objective and approach from predicting binary-outcome PTB from statistics across a pregnancy. The authors noted that the mean absolute error between actual and predicted GA was higher overall in PTB patients, but did not evaluate any classifier performance with respect to binary-outcome PTB. Moreover, the models presented in ref. 13 are limited in their explainability as a result of both learning non-linear representations and attempting to predict GA at the sample level. In addition, previous work has also examined direct correlations between engineered actigraphy features and PTB, evaluating the risk associated with each individual feature5,6.

This paper evaluates the performance of binary-outcome classification of PTB from engineered actigraphy features and selected patient history features. The models presented here are computationally simple and interpretable, offering engineering and clinical insights about potential approaches for more complicated models. Overall, we validate the use of sleep measures derived from actigraphy data in ML models for the prediction of binary-outcome PTB. From these models, relative comparisons of the impact of actigraphy and patient history features on predictions are examined. We finally offer interpretations of each of the tested models, and guidance for future work.

Results

Among the 1523 patients who participated in the cohort study, we analyze the 665 patients who had actigraphy data in at least the first or second trimester of their pregnancy and had a recorded delivery date. The average patient had 39.1 (±32.2) day-level samples throughout the duration of their pregnancy, with the first trimester having 15.7 (±10.4) samples on average, the second trimester having an average of 24.0 (±18.9) samples, and the third trimester having an average of 17.0 (±10.3) samples. The overall distribution of samples collected from all patients can be seen in Fig. 1. Of these patients, the mean age was 29.2 (±5.29) years, and the majority (55.34%) of the patients were multiparous. A minority of patients (14.18%) experienced a PTB outcome. Full details about the demographics of the patients used in this dataset can be found in Section 1 in the Supplementary Materials, and details about the actigraphy features and numerical case report form features can be found in Table 1.

Fig. 1: Histogram of the collected actigraphy samples.
figure 1

Results are stratified by whether the patient experienced a positive or negative preterm birth outcome. In the first trimester, 263 negative and 39 positive patients have a sample; in the second trimester, 262 negative and 47 positive patients have a sample; and in the third trimester, 46 positive and 8 negative patients have a sample.

Table 1 Actigraphy and numerical case report form features, stratified by trimester where applicable

We compare the performance of models trained on the two primary sources of data, the engineered actigraphy features and the case report form responses collected at each visit, in Table 2 and Fig. 2. Performance curves of the models trained only on the actigraphy or case report form data can be found in Section 3 of the Supplementary Material. Confusion matrices for the best model by area under the receiver-operator curve (AUROC) are provided in Tables 3 and 4, and Tukey’s honest significant difference test (HSD) results comparing each model are provided in Tables 5 and 6.

Table 2 Comparison of models for all patients and nulliparous patients
Fig. 2: Receiver-operator and precision-recall curves for models using all features.
figure 2

Pooled a receiver-operator curves and b precision-recall curves for all models using all data sources.

Table 3 Confusion matrices for classifiers with the best area under the receiver-operator curve (AUROC) among all patients, with the threshold set to match a 50% true positive rate
Table 4 Confusion matrices for classifiers with the best area under the receiver-operator curve (AUROC) among nulliparous patients, with the threshold set to match a 50% true positive rate
Table 5 Tukey’s honest significant difference test for area under the receiver-operator curve (AUROC) across all models stratified by feature set
Table 6 Tukey’s honest significant difference test for area under the precision-recall curve (AUPRC) across all models stratified by feature set

We find that, using actigraphy features and case report form survey data, it is possible to make reasonable predictions about binary-outcome PTB. As seen in Table 2, actigraphy features appear to underperform features from case report forms at predicting PTB when comparing the best models for each configuration. The combined performance is better than either source of data individually.

Gestational age and model performance

Figure 3 shows the performance of each model as samples up to a specified GA are included. As shown, model performance does not change consistently as the GA upper bound is increased, although it does increase noticeably once the full GA spectrum is included.

Fig. 3: Receiver-operator and precision-recall curves for data up to a given gestational age (GA).
figure 3

Selected a receiver-operator curves and b precision-recall curves for models using all features, calculated from data up to a maximum GA with one random seed.

This lack of consistent performance change likely occurs for several reasons. First, the set of study participants who have data up to a given GA varies, and for those who do have data up to a specified GA, the duration of their recordings also varies. In addition, the aggregations used for all actigraphy features, the mean and standard deviation, do not change linearly as the amount of data increases. This variability in AUROC and area under the precision-recall curve (AUPRC) appears to weakly correspond to the sample trends seen in Fig. 1, which are roughly centered around the trimester boundaries.

Feature explanations

To assess the importance of each feature in each model, we evaluate the features with SHapley Additive exPlanations (SHAP) scores14, which provide relative estimates of how the output of a model will change as the input features change. Figure 4 shows the feature explanations for the best performing model with all features.
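As an illustration of how such scores can be obtained, the sketch below estimates SHAP values for a Gaussian NB classifier with the model-agnostic KernelExplainer and ranks features by their mean absolute SHAP value. The function and variable names are placeholders, and this is a minimal sketch rather than the exact analysis pipeline used here.

```python
# Minimal sketch (not the exact pipeline used in this study): SHAP values for a
# Gaussian Naive Bayes classifier via the model-agnostic KernelExplainer,
# ranking features by mean absolute SHAP value. Inputs are placeholders.
import numpy as np
import shap
from sklearn.naive_bayes import GaussianNB

def rank_features_by_shap(X_train, y_train, X_test, feature_names):
    model = GaussianNB().fit(X_train, y_train)

    # A small background sample keeps the kernel estimate tractable.
    background = shap.sample(X_train, 50)
    explainer = shap.KernelExplainer(
        lambda X: model.predict_proba(X)[:, 1], background)
    shap_values = explainer.shap_values(X_test)  # shape: (n_samples, n_features)

    # Global importance: mean absolute SHAP value per feature.
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]
```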

Fig. 4: SHapley Additive exPlanations (SHAP) analysis of Gaussian NB with all features.
figure 4

Features indicative of socioeconomic status are highlighted green, and other patient history variables are highlighted red.

When all features are used, we find that the features that affect the output of the model the most are related to the number of complications that occurred during previous births. This is consistent with the literature, which finds that past PTB is a strong predictor of future PTB outcomes15,16. Features relating to socioeconomic status, highlighted in green, also rank highly, which is consistent with prior literature as race, ethnicity, and employment status are associated with preterm birth17,18.

Actigraphy features were impactful to a lesser degree, with the highest-ranked feature being the average day-to-day variability in sleep start time. Other similarly ranked features included sleep start time, the variance of the start of the sleep cycle, and the day-to-day variability in the duration of the sleep cycle. Overall, actigraphy features relating to variability in sleep patterns appeared to rank higher than those derived from averages across a patient’s pregnancy. Prior studies have shown that shift work is associated with a higher incidence of PTB19. The variability captured by our actigraphy features may help explain this association, as shift work often disrupts regular sleep patterns and leads to increased day-to-day variability.

When we evaluate the best performing actigraphy-only model, shown in Fig. 5, we find a similar ordering of relevant features, with features reflecting variance between daily actigraphy measurements appearing towards the top. Some of this difference in ordering can be attributed to the reduced dimensionality, as the number of features is smaller, although the limited sample size prevents us from drawing any conclusive orderings.

Fig. 5: SHapley Additive exPlanations (SHAP) analysis of Gaussian Naïve Bayes (Gaussian NB) with actigraphy features only.
figure 5

Features aggregating average patient behavior are highlighted in gold, and features aggregating the standard deviation of patient behaviors are highlighted in blue.

First-time/nulliparous pregnancies

Prior PTB complications are a strong predictor of future PTB complications, but such foresight does not exist in the case of nulliparous pregnancies. To evaluate these pregnancies, we train separate models on nulliparous patients. For training, we replace case report form features relating to delivery history with empty values. Results from training on nulliparous patients only are reported in Table 2.

As seen in Table 2, we find that the performance of actigraphy features is distinctive when we use Gaussian Naïve Bayes as the classifier. For all remaining model types, the performance of actigraphy features is comparable both against and in combination with case report form data, differing by a relatively small amount in both AUROC and specificity at 90% sensitivity. This indicates that actigraphy data may provide performance comparable to or better than what can be assessed in a clinical survey, specifically for nulliparous patients.

In addition, the actigraphy features become a larger component of the most impactful features, as seen in Fig. 6, although part of this can be attributed to the reduced dimensionality. Of the case report form features included, those relating to socioeconomic status appear to be the most impactful. When examining the actigraphy features only, as seen in Fig. 7, we find that features evaluating variability are still among the most impactful features, whether they are averages of the day-to-day variability features or variances of the daily features.

Fig. 6: SHapley Additive exPlanations (SHAP) analysis of logistic regression with all features for nulliparous patients.
figure 6

Features indicative of socioeconomic status are highlighted green, and other patient history variables are highlighted red.

Fig. 7: SHapley Additive exPlanations (SHAP) analysis of Gaussian Naïve Bayes (Gaussian NB) with actigraphy features only.
figure 7

Features aggregating average patient behavior are highlighted in gold, and features aggregating the standard deviation of patient behaviors are highlighted in blue.

Discussion

Overall, we find that actigraphy data compiled into simple measures of sleep can aid in the prediction of PTB, and that simpler ML architectures appear to perform better at this task. For all ablations tested, we find that Gaussian Naïve Bayes (Gaussian NB) has the highest average AUROC. This is remarkable since it is architecturally simpler than the other models, and suggests that the underlying features exhibit some independence from each other. This independence argument is furthered by the lower performance of our XGBoost models, as they learn decision trees in which learned relationships may have dependencies. We do note that the small sample size and reduced dimensionality may contribute to this difference.

We also find that for the actigraphy-only models, there is a noticeable split in explainability between features aggregating variability and those aggregating averages. Among the highest performing features, we find that those capturing variability in sleep patterns, either at the day-to-day or whole-sample level, were the most explanatory. Conversely, features examining a patient’s average behavior generally ranked lower, which suggests that consistent sleep patterns are more important than any specific sleep metric. This insight could inform the development of intervention strategies focused on sleep hygiene, emphasizing the importance of reducing variability in sleep patterns rather than targeting sleep duration or timing alone.

For the case report form features included, we find that some of the most explanatory features are past pregnancy complications, which is consistent with previous literature18,20. We also find that various socioeconomic features are predictive of PTB, as they may be proxy measures of maternal sleep. Race and ethnicity have been linked to increased sleep disturbances and poorer sleep quality in tandem with more frequent PTB outcomes21,22. Similarly, employment status and income have also been associated with differences in PTB outcomes23, as the effects of employment range from physical overexertion24 to direct conflicts with sleep25.

For nulliparous patients, we find that the overall performance of models trained on actigraphy data alone is more comparable to that of models trained on the case report form data only. When compared to whole-cohort models, the performance of the actigraphy-based models is similar, while the performance of the models trained on the case report form data drops noticeably. Given that past PTB is a strong predictor of future PTB but is unavailable for nulliparous patients, this may suggest that monitoring sleep patterns is especially valuable for this group.

One limitation of this approach is that we do not encode categorical features as one-hot values, as the sample size would not be able to counterbalance the large number of features generated by one-hot encoding. As a result, it is more difficult to interpret the impact of some categorical variables that do not actually have an ordinality to them (e.g., race, marital status). Similarly, we discard non-numerical features from the case report forms, as incorporating them with vision/language models would significantly increase the overall dimensionality; future models may incorporate these for improved performance.

Sample size, particularly with regards to nulliparous pregnancies, is another limiting issue, as it makes noise more prominent when training and evaluating these models. To mitigate this issue, we employed multiple random shufflings of the data for training and evaluation. However, we note that this mitigation is limited, given the notable discrepancy between the pooled AUROC/AUPRC metrics and their corresponding confidence intervals, which may result from the wide performance differences between shufflings and how they interact when averaged together. Sample size is a limitation not only in cohort size, but also in the amount of actigraphy data, as the duration and frequency with which study participants wore their actigraphy watches were not consistent. Further studies should evaluate larger cohorts of patients to ensure accurate performance measurements, as well as cohorts from other locations to validate the performance with respect to different demographics. Moreover, longer and more consistent usage of actigraphy watches may also reveal more reliable patterns of motion behavior that predict PTB.

In addition, future work with actigraphy data could incorporate luminosity sensor data, as it may provide additional signals and corroborate those captured by the actigraphy sensor. Another area of future work is models trained with self-supervised learning (SSL), which learn relationships between input features before being fine-tuned for a downstream task. SSL models are particularly effective as these learned relationships between features generalize well in supervised tasks26.

In conclusion, our findings show that actigraphy data can help predict preterm birth (PTB) in both multiparous and nulliparous patients, with sleep variability emerging as a key predictive feature. These results highlight the potential of unobtrusive wearable measurements to enable early detection and intervention for PTB. Future work could explore larger or more diverse cohorts and develop targeted intervention strategies informed by these predictions to improve pregnancy outcomes.

Methods

Study characteristics

This study was completed as a part of the March of Dimes Prematurity Research Center at Washington University in St. Louis/BJC HealthCare12, which was approved by the Washington University IRB (reference #201612070) in accordance with FDA Good Clinical Practices and the Declaration of Helsinki. Written informed consent was obtained from participants for the usage of their clinical, biospecimen, imaging, and questionnaire data. Patients were recruited at the Washington University Medical Campus if they had a singleton pregnancy with an estimated GA under 20 weeks, planned to deliver at Barnes-Jewish Hospital, and were age 18 or older.

Trained obstetric research staff used a series of case report forms to collect baseline maternal demographics, medical history, antepartum data and obstetric outcomes as previously described in ref. 12. Patient data were collected at scheduled study visits during each trimester and at delivery, where biological samples, imaging, actigraphy, and responses to standardized surveys were obtained from each patient.

Survey data included questions from eleven different validated surveys and standalone questions covering stress, schedule, sleep quality, physical activity, postnatal depression, diet, demographics, and overall lifestyle. We derive the label of PTB from the reported estimated date of confinement (EDC), labeling births that occur 3 full weeks before the listed EDC as PTB. EDC was derived from the patient’s last menstrual period or first ultrasound27.
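As a minimal sketch of this labeling rule, the snippet below derives a binary PTB label from the delivery date and EDC. The column names are hypothetical, and the boundary convention is an assumption based on the <37-week definition of PTB.

```python
# Hedged sketch of the PTB labeling rule described above. Column names
# ('delivery_date', 'edc') are assumptions, and the strict '<' boundary below
# follows the <37-weeks definition of PTB; the exact boundary handling in the
# study may differ.
import pandas as pd

def label_ptb(deliveries: pd.DataFrame) -> pd.Series:
    """Return 1 for preterm births, 0 otherwise.

    `deliveries` is expected to hold one row per patient with datetime columns
    'delivery_date' and 'edc' (estimated date of confinement).
    """
    cutoff = deliveries["edc"] - pd.Timedelta(weeks=3)  # 37 weeks of gestation
    return (deliveries["delivery_date"] < cutoff).astype(int)
```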

Actigraphy feature design

Actigraphy measurements were collected over a 2-week period in each trimester (first trimester: 0–13 weeks and 6 days, second trimester: 14–27 weeks and 6 days, third trimester: ≥28 weeks) using the CamNtech MotionWatch 8. Measurements were collected at a one-minute frequency for the duration that a study participant wore their actigraphy watch. Patients were reminded through calls, emails, and texts to return their actigraphy watches after the capture period either at the next study visit or through a courier12. Patients who did not have actigraphy data in either their first or second trimester were excluded from this analysis.

These measurements are very high-resolution, and to keep the data tractable for shallow ML model training, we engineer the raw time-series signals into aggregate features over day-level windows. On top of the day-level measurements, we also compute the absolute change between consecutive days where data are present. Section 2 of the Supplementary Materials contains a summary of these engineered features.

To generate these features, all actigraphy data are separated into days centered around midnight, from which we then estimate the sleep cycle that occurred on each given day, as sketched below.
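The sketch below illustrates this windowing and day-to-day differencing under assumed column names; the specific day-level statistics shown are placeholders, with the actual feature list given in Section 2 of the Supplementary Materials.

```python
# Simplified sketch of the day-level feature engineering: minute-level activity
# counts are grouped into noon-to-noon windows centered on midnight, summarized,
# and absolute day-to-day changes are taken between consecutive windows. Column
# names and the summary statistics are illustrative placeholders.
import pandas as pd

def day_level_features(actigraphy: pd.DataFrame) -> pd.DataFrame:
    # actigraphy: one row per minute with 'timestamp' (datetime) and 'activity'.
    # Shifting by 12 hours before flooring yields windows centered on midnight,
    # so a single night of sleep falls within one window.
    window = (actigraphy["timestamp"] + pd.Timedelta(hours=12)).dt.floor("D")
    daily = actigraphy.groupby(window)["activity"].agg(
        total_activity="sum",
        mean_activity="mean",
        minutes_recorded="count",
    )
    # Absolute change between consecutive recorded days (day-to-day variability).
    deltas = daily.diff().abs().add_prefix("abs_change_")
    return daily.join(deltas)
```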

Model design

For each study participant, we aggregate the day-level actigraphy features down to their mean and standard deviation across the entire duration of the pregnancy. When evaluating a window of GAs below a full-term pregnancy, we drop all actigraphy data with a GA above a set limit (e.g., if we set the upper limit at 140 days, all data after 140 days are dropped, and the remainder is aggregated).
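A hedged sketch of this per-patient aggregation, with assumed column names, is shown below.

```python
# Sketch of the per-patient aggregation: day-level features are reduced to their
# mean and standard deviation, optionally restricted to days at or below a GA
# cutoff (e.g., 140 days). Column names ('patient_id', 'ga_days') are assumed.
from typing import Optional

import pandas as pd

def aggregate_patient_features(daily: pd.DataFrame,
                               max_ga_days: Optional[int] = None) -> pd.DataFrame:
    # daily: one row per patient-day with 'patient_id', 'ga_days', and day-level
    # feature columns.
    if max_ga_days is not None:
        daily = daily[daily["ga_days"] <= max_ga_days]
    grouped = daily.drop(columns=["ga_days"]).groupby("patient_id").agg(["mean", "std"])
    grouped.columns = [f"{feature}_{stat}" for feature, stat in grouped.columns]
    return grouped
```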

For the survey data, we select features with both domain knowledge and automatic techniques. We first select a predefined set of features based on pre-determined clinical knowledge, and sum the values of questions regarding individual births together. After these, we select an additional 10 features with the minimal-redundancy-maximal-relevance (mRMR) algorithm, using semantic textual similarity scores generated with PubMedBERT28 fine-tuned on several clinical and general datasets, as described in ref. 29. Features not represented numerically are dropped. The full list of features that we used can be found in Section 2 of the Supplementary Materials.
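For illustration only, the sketch below shows the greedy structure of mRMR selection. It substitutes mutual information for relevance and absolute feature-feature correlation for redundancy; the study instead uses semantic textual similarity scores from a fine-tuned PubMedBERT model, so this is a structural sketch of the algorithm, not our implementation.

```python
# Structural sketch of greedy mRMR selection of 10 features. Relevance is
# mutual information with the PTB label and redundancy is absolute
# feature-feature correlation; the study uses PubMedBERT-derived semantic
# similarity for the redundancy term, so this is only an illustration.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr_select(X: np.ndarray, y: np.ndarray, n_features: int = 10) -> list:
    relevance = mutual_info_classif(X, y, random_state=0)
    similarity = np.abs(np.corrcoef(X, rowvar=False))  # proxy redundancy matrix

    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < n_features:
        remaining = [i for i in range(X.shape[1]) if i not in selected]
        # mRMR score: relevance minus mean similarity to already-selected features.
        scores = [relevance[i] - similarity[i, selected].mean() for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```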

After this, we concatenate both sources of data, scale all numerical features to a standard normal distribution, and encode all categorical features as ordinal values. Missing values are imputed with either the mean, median, most common value, or the mean of the 5 nearest neighbors, with the choice of imputer selected during cross-validation (CV). The data are randomly split into an 80%/20% train/test split. For the whole cohort, 532 and 133 patients appear in the train and test splits, respectively, with 66 and 28 PTB patients in each. For the nulliparous cohort, this becomes 238 patients in the train set and 59 in the test set, with 24 and 9 PTB patients, respectively.
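The sketch below mirrors this preprocessing under assumed column groupings and variable names; the imputer is exposed as a searchable option, consistent with the CV described below.

```python
# Hedged sketch of the preprocessing: ordinal-encode categorical features,
# impute missing values (with the imputer itself searchable during CV),
# standardize numerical features, and make a random 80%/20% split. Column
# groupings and variable names are placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

def make_preprocessor(numeric_cols, categorical_cols, imputer="mean"):
    # Candidate imputers: mean, median, most frequent value, or mean of 5-NN.
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "median": SimpleImputer(strategy="median"),
        "most_frequent": SimpleImputer(strategy="most_frequent"),
        "knn": KNNImputer(n_neighbors=5),
    }
    numeric = Pipeline([
        ("impute", imputers[imputer]),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])

# Example split (X, y hold the concatenated features and PTB labels):
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```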

We train several standard ML models, including logistic regression, linear support vector machine (SVM), kernelized/non-linear SVM30, XGBoost31, and Gaussian NB32. Logistic regression predicts the output class using the sigmoid of a linear combination of the input features. Linear SVM predicts the class using a linearly separating hyperplane, and kernelized SVM uses a kernel function to learn a non-linear separation of each class30. XGBoost is a gradient boosting method that builds an ensemble of decision trees to optimize predictive performance31, and Gaussian NB models the output class conditioned on normal distributions of each feature32. We evaluate the results across 10 random initializations for each model in Section “Results”, and report the average AUROC and AUPRC through pooling33, as well as the 95% confidence interval over all initializations. SHAP values are averaged across all random initializations. A graphical summary of this training pipeline can be seen in Section 2 of the Supplementary Materials.
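A condensed sketch of this training and evaluation loop is shown below; the model names, default hyperparameters, and confidence-interval calculation are illustrative stand-ins rather than the tuned configuration.

```python
# Sketch of the evaluation loop: each classifier family is retrained over
# several random train/test shuffles, and AUROC/AUPRC are computed on the
# pooled predictions across shuffles as well as per shuffle for a confidence
# interval. Hyperparameters shown are defaults, not the tuned values.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier

MODELS = {
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "linear_svm": lambda: LinearSVC(max_iter=1000),
    "kernel_svm": lambda: SVC(kernel="rbf"),
    "xgboost": lambda: XGBClassifier(eval_metric="auc"),
    "gaussian_nb": lambda: GaussianNB(),
}

def evaluate(name, X, y, n_seeds=10):
    pooled_true, pooled_score, per_seed_auroc = [], [], []
    for seed in range(n_seeds):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = MODELS[name]().fit(X_tr, y_tr)
        # SVMs without probability estimates fall back to the decision function.
        if hasattr(model, "predict_proba"):
            scores = model.predict_proba(X_te)[:, 1]
        else:
            scores = model.decision_function(X_te)
        pooled_true.extend(y_te)
        pooled_score.extend(scores)
        per_seed_auroc.append(roc_auc_score(y_te, scores))
    return {
        "pooled_auroc": roc_auc_score(pooled_true, pooled_score),
        "pooled_auprc": average_precision_score(pooled_true, pooled_score),
        "auroc_95ci_halfwidth": 1.96 * np.std(per_seed_auroc) / np.sqrt(n_seeds),
    }
```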

To find the best hyperparameters for each of the tested models, we use 5-fold stratified CV on the training set, which preserves the class proportions across each fold. For XGBoost, the hyperparameter space covers 1 to 3 estimators, a maximum depth of 1 to 3, a learning rate of 0.1, and AUROC as the fitting objective. For linear SVM, we test regularization parameters ranging logarithmically from 0.001 to 10, with 1000 iterations of training. For non-linear SVM, we evaluate polynomial and radial basis function kernels on top of the linear SVM parameters. For logistic regression, we evaluate regularization parameters from 0.001 to 10 with an L2 penalty and 1000 maximum iterations of training. For Gaussian NB, we use a fixed smoothing parameter of 10⁻⁹.
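The grids described above could be searched as in the following hedged sketch, scored by AUROC with 5-fold stratified CV; the exact search settings and estimator options may differ from our implementation.

```python
# Sketch of the hyperparameter search: 5-fold stratified cross-validation over
# grids mirroring the ranges described above, scored by AUROC. Exact settings
# may differ from the implementation used in this study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier

C_RANGE = np.logspace(-3, 1, 5)  # 0.001 ... 10, spaced logarithmically

SEARCH_SPACES = {
    "xgboost": (XGBClassifier(learning_rate=0.1, eval_metric="auc"),
                {"n_estimators": [1, 2, 3], "max_depth": [1, 2, 3]}),
    "linear_svm": (LinearSVC(max_iter=1000), {"C": C_RANGE}),
    "kernel_svm": (SVC(max_iter=1000), {"C": C_RANGE, "kernel": ["poly", "rbf"]}),
    "logistic_regression": (LogisticRegression(penalty="l2", max_iter=1000),
                            {"C": C_RANGE}),
    # Gaussian NB keeps a fixed smoothing parameter (var_smoothing=1e-9), so no grid.
}

def tune(name, X_train, y_train):
    estimator, grid = SEARCH_SPACES[name]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=cv)
    return search.fit(X_train, y_train).best_estimator_
```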