Introduction

Preterm birth (PTB), which is generally defined as delivery before 37 weeks of gestation, is the single largest cause of death in children under the age of 51, with ~1 million deaths occurring per year2. While some etiologies of PTB have been identified, many remain unknown. Previous literature has shown that disrupted maternal sleep patterns are associated with PTB outcomes3,4,5,6.

One major limitation of previous studies is their reliance on self-reported sleep patterns, which depends on a patient’s ability to recall their sleep accurately and consistently7. Wearable devices can alleviate this problem, as they provide a more reliable and detailed stream of data8,9. Previous literature has found that wearable sensor data can be used to make predictions regarding both physical and mental health issues, ranging from pancreatic complications10 to depression11.

Using data collected from wearables, we evaluate predictions of binary PTB outcomes for patients from a cohort study conducted at Washington University in St. Louis/BJC HealthCare12. Participants in this cohort study were given actigraphy watches to wear for 2 weeks during each trimester, capturing high-resolution sleep data. The collected actigraphy data are then transformed into interpretable quantitative features and used as input for several shallow machine learning (ML) models. These models are then evaluated to assess the relative impact of these features, offering clinical insights into the importance of individual sleep and non-sleep behaviors, as well as guidance for more complex ML models.

Previous work with this dataset has evaluated regression models of unengineered time-series data to predict the entire spectrum of gestational age (GA) directly from individual actigraphy samples13, which is intrinsically different in both objective and approach from predicting binary-outcome PTB from statistics across a pregnancy. The authors noted that the mean absolute error between actual and predicted GA was higher overall in PTB patients, but did not evaluate any classifier performance with respect to binary-outcome PTB. Moreover, the models presented in ref. 13 are limited in their explainability as a result of both learning non-linear representations and attempting to predict GA at the sample level. In addition, previous work has also examined direct correlations between engineered actigraphy features and PTB, evaluating the risk associated with each individual feature5,6.

This paper evaluates the performance of binary-outcome classification of PTB from engineered actigraphy features and selected patient history features. The models presented here are computationally simple and interpretable, offering engineering and clinical insights about potential approaches for more complicated models. Overall, we validate the use of sleep measures derived from actigraphy data in ML models for the prediction of binary-outcome PTB. From these models, relative comparisons of the impact of actigraphy and patient history features on predictions are examined. We finally offer interpretations of each of the tested models, and guidance for future work.

Results

Among the 1523 patients who participated in the cohort study, we analyze the 665 patients who had actigraphy data in at least the first or second trimester of their pregnancy and had a recorded delivery date. The average patient had 39.1 (±32.2) day-level samples throughout the duration of their pregnancy, with the first trimester having 15.7 (±10.4) samples on average, the second trimester having an average of 24.0 (±18.9) samples, and the third trimester having an average of 17.0 (±10.3) samples. The overall distribution of samples collected from all patients can be seen in Fig. 1. Of these patients, the mean age was 29.2 (±5.29) years, and the majority (55.34%) of the patients were multiparous. A minority of patients (14.18%) experienced a PTB outcome. Full details about the demographics of the patients used in this dataset can be found in Section 1 in the Supplementary Materials, and details about the actigraphy features and numerical case report form features can be found in Table 1.

Fig. 1: Histogram of the collected actigraphy samples.
figure 1

Results are stratified by whether the patient experienced a positive or negative preterm birth outcome. In the first trimester, 263 negative and 39 positive patients have a sample; in the second trimester, 262 negative and 47 positive patients have a sample; and in the third trimester, 46 positive and 8 negative patients have a sample.

Table 1 Actigraphy and numerical case report form features, stratified by trimester where applicable

We compare the performance of models trained on the two primary sources of data, the engineered actigraphy features and the case report form responses collected at each visit, in Table 2 and Fig. 2. Performance curves of the models trained only on the actigraphy or case report form data can be found in Section 3 of the Supplementary Material. Confusion matrices for the best model by area under the receiver-operator curve (AUROC) are provided in Tables 3 and 4, and Tukey’s honest significant difference test (HSD) results comparing each model are provided in Tables 5 and 6.

Table 2 Comparison of models for all patients and nulliparous patients
Fig. 2: Receiver-operator and precision-recall curves for models using all features.
figure 2

Pooled a receiver-operator curves and b precision-recall curves for all models using all data sources.

Table 3 Confusion matrices for classifiers with the best area under the receiver-operator curve (AUROC) among all patients, with the threshold set to match a 50% true positive rate
Table 4 Confusion matrices for classifiers with the best area under the receiver-operator curve (AUROC) among nulliparous patients, with the threshold set to match a 50% true positive rate
Table 5 Tukey’s honest significant difference test for area under the receiver-operator curve (AUROC) across all models stratified by feature set
Table 6 Tukey’s honest significant difference test for area under the precision-recall curve (AUPRC) across all models stratified by feature set

We find that, using actigraphy features and case report form survey data, it is possible to make reasonable predictions about binary-outcome PTB. As seen in Table 2, actigraphy features appear to underperform features from case report forms at predicting PTB when comparing the best models for each configuration. The combined performance is better than either source of data individually.

Gestational age and model performance

Figure 3 shows the performance of each model as samples up to a specified GA are included. As shown, model performance does not change consistently as the GA upper bound is increased, although it does increase noticeably once the full GA spectrum is included.

Fig. 3: Receiver-operator and precision-recall curves for data up to a given gestational age (GA).
figure 3

Selected a receiver-operator curves and b precision-recall curves for models using all features, calculated from data up to a maximum GA with one random seed.

This lack of consistent performance change likely occurs for several reasons. First, the set of study participants who have data up to a given GA varies, and for those who do have data up to a specified GA, the duration of their recordings also varies. In addition, the aggregations used for all actigraphy features, the mean and standard deviation, do not change linearly as the amount of data increases. This variability in AUROC and area under the precision-recall curve (AUPRC) appears to weakly correspond to the sample trends seen in Fig. 1, which are roughly centered around the trimester boundaries.

Feature explanations

To assess the importance of each feature in each model, we evaluate the features with SHapley Additive exPlanations (SHAP) scores14, which provide relative estimates of how the output of a model will change as the input features change. Figure 4 shows the feature explanations for the best performing model with all features.
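As an illustration of how such scores can be obtained, the sketch below estimates SHAP values for a Gaussian NB classifier with the model-agnostic KernelExplainer and ranks features by their mean absolute SHAP value. The function and variable names are placeholders, and this is a minimal sketch rather than the exact analysis pipeline used here.

```python
# Minimal sketch (not the exact pipeline used in this study): SHAP values for a
# Gaussian Naive Bayes classifier via the model-agnostic KernelExplainer,
# ranking features by mean absolute SHAP value. Inputs are placeholders.
import numpy as np
import shap
from sklearn.naive_bayes import GaussianNB

def rank_features_by_shap(X_train, y_train, X_test, feature_names):
    model = GaussianNB().fit(X_train, y_train)

    # A small background sample keeps the kernel estimate tractable.
    background = shap.sample(X_train, 50)
    explainer = shap.KernelExplainer(
        lambda X: model.predict_proba(X)[:, 1], background)
    shap_values = explainer.shap_values(X_test)  # shape: (n_samples, n_features)

    # Global importance: mean absolute SHAP value per feature.
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]
```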

Fig. 4: SHapley Additive exPlanations (SHAP) analysis of Gaussian NB with all features.
figure 4

Features indicative of socioeconomic status are highlighted green, and other patient history variables are highlighted red.

When all features are used, we find that the features that affect the output of the model the most are related to the number of complications that occurred during previous births. This is consistent with the literature, which finds that past PTB is a strong predictor of future PTB outcomes15,16. Features relating to socioeconomic status, highlighted in green, also rank highly, which is consistent with prior literature as race, ethnicity, and employment status are associated with preterm birth17,18.

Actigraphy features were impactful to a lesser degree, with the highest-ranked feature being the average day-to-day variability in sleep start time. Other similarly ranked features included sleep start time, the variance of the start of the sleep cycle, and the day-to-day variability in the duration of the sleep cycle. Overall, actigraphy features relating to variability in sleep patterns appeared to rank higher than those derived from averages across a patient’s pregnancy. Prior studies have shown that shift work is associated with a higher incidence of PTB19. The variability captured by our actigraphy features may help explain this association, as shift work often disrupts regular sleep patterns and leads to increased day-to-day variability.

When we evaluate the best performing actigraphy-only model, shown in Fig. 5, we find a similar ordering of relevant features, with features reflecting variance between daily actigraphy measurements appearing towards the top. Some of this difference in ordering can be attributed to the reduced dimensionality, as the number of features is smaller, although the limited sample size prevents us from drawing any conclusive orderings.

Fig. 5: SHapley Additive exPlanations (SHAP) analysis of Gaussian Naïve Bayes (Gaussian NB) with actigraphy features only.
figure 5

Features aggregating average patient behavior are highlighted in gold, and features aggregating the standard deviation of patient behaviors are highlighted in blue.

First-time/nulliparous pregnancies

Prior PTB complications are a strong predictor of future PTB complications, but such foresight does not exist in the case of nulliparous pregnancies. To evaluate these pregnancies, we train separate models on nulliparous patients. For training, we replace case report form features relating to delivery history with empty values. Results from training on nulliparous patients only are reported in Table 2.

As seen in Table 2, we find that the performance of actigraphy features is distinctive when we use Gaussian Naïve Bayes as the classifier. For all remaining model types, the performance of actigraphy features is comparable both against and in combination with case report form data, differing by a relatively small amount in both AUROC and specificity at 90% sensitivity. This indicates that actigraphy data may provide performance comparable to or better than what can be assessed in a clinical survey, specifically for nulliparous patients.

In addition, the actigraphy features become a larger component of the most impactful features, as seen in Fig. 6, although part of this can be attributed to the reduced dimensionality. Of the case report form features included, those relating to socioeconomic status appear to be the most impactful. When examining the actigraphy features only, as seen in Fig. 7, we find that features evaluating variability are still among the most impactful features, whether they are averages of the day-to-day variability features or variances of the daily features.

Fig. 6: SHapley Additive exPlanations (SHAP) analysis of logistic regression with all features for nulliparous patients.
figure 6

Features indicative of socioeconomic status are highlighted green, and other patient history variables are highlighted red.

Fig. 7: SHapley Additive exPlanations (SHAP) analysis of Gaussian Naïve Bayes (Gaussian NB) with actigraphy features only.
figure 7

Features aggregating average patient behavior are highlighted in gold, and features aggregating the standard deviation of patient behaviors are highlighted in blue.

Discussion

Overall, we find that actigraphy data compiled into simple measures of sleep can aid in the prediction of PTB, and that simpler ML architectures appear to perform better at this task. For all ablations tested, we find that Gaussian Naïve Bayes (Gaussian NB) has the highest average AUROC. This is remarkable since it is architecturally simpler than the other models, and suggests that the underlying features exhibit some independence from each other. This independence argument is furthered by the lower performance of our XGBoost models, as they learn decision trees in which learned relationships may have dependencies. We do note that the small sample size and reduced dimensionality may contribute to this difference.

We also find that for the actigraphy-only models, there is a noticeable split in explainability between features aggregating variability and those aggregating averages. Among the highest performing features, we find that those capturing variability in sleep patterns, either at the day-to-day or whole-sample level, were the most explanatory. Conversely, features examining a patient’s average behavior generally ranked lower, which suggests that consistent sleep patterns are more important than any specific sleep metric. This insight could inform the development of intervention strategies focused on sleep hygiene, emphasizing the importance of reducing variability in sleep patterns rather than targeting sleep duration or timing alone.

For the case report form features included, we find that some of the most explanatory features are past pregnancy complications, which is consistent with previous literature18,20. We also find that various socioeconomic features are predictive of PTB, as they may be proxy measures of maternal sleep. Race and ethnicity have been linked to increased sleep disturbances and poorer sleep quality in tandem with more frequent PTB outcomes21,22. Similarly, employment status and income have also been associated with differences in PTB outcomes23, as the effects of employment range from physical overexertion24 to direct conflicts with sleep25.

For nulliparous patients, we find that the overall performance of models trained on actigraphy data alone is more comparable to that of models trained on the case report form data only. When compared to whole-cohort models, the performance of the actigraphy-based models is similar, while the performance of the models trained on the case report form data drops noticeably. Given that past PTB is a strong predictor of future PTB but is unavailable for nulliparous patients, this may suggest that monitoring sleep patterns is especially valuable for this group.

One limitation of this approach is that we do not encode categorical features as one-hot values, as the sample size would not be able to counterbalance the large number of features generated by one-hot encoding. As a result, it is more difficult to interpret the impact of some categorical variables that do not actually have an ordinality to them (e.g., race, marital status). Similarly, we discard non-numerical features from the case report forms, as incorporating them with vision/language models would significantly increase the overall dimensionality; future models may incorporate these for improved performance.

Sample size, particularly with regards to nulliparous pregnancies, is another limiting issue, as it makes noise more prominent when training and evaluating these models. To mitigate this issue, we employed multiple random shufflings of the data for training and evaluation. However, we note that this mitigation is limited, given the notable discrepancy between the pooled AUROC/AUPRC metrics and their corresponding confidence intervals, which may result from the wide performance differences between shufflings and how they interact when averaged together. Sample size is a limitation not only in cohort size, but also in the amount of actigraphy data, as the duration and frequency with which study participants wore their actigraphy watches were not consistent. Further studies should evaluate larger cohorts of patients to ensure accurate performance measurements, as well as cohorts from other locations to validate the performance with respect to different demographics. Moreover, longer and more consistent usage of actigraphy watches may also reveal more reliable patterns of motion behavior that predict PTB.

In addition, future work with actigraphy data could incorporate luminosity sensor data, as it may provide additional signals and corroborate those captured by the actigraphy sensor. Another area of future work is models trained with self-supervised learning (SSL), which learn relationships between input features before being fine-tuned for a downstream task. SSL models are particularly effective as these learned relationships between features generalize well in supervised tasks26.

In conclusion, our findings show that actigraphy data can help predict preterm birth (PTB) in both multiparous and nulliparous patients, with sleep variability emerging as a key predictive feature. These results highlight the potential of unobtrusive wearable measurements to enable early detection and intervention for PTB. Future work could explore larger or more diverse cohorts and develop targeted intervention strategies informed by these predictions to improve pregnancy outcomes.

Methods

Study characteristics

This study was completed as a part of the March of Dimes Prematurity Research Center at Washington University in St. Louis/BJC HealthCare12, which was approved by the Washington University IRB (reference #201612070) in accordance with FDA Good Clinical Practices and the Declaration of Helsinki. Written informed consent was obtained from participants for the usage of their clinical, biospecimen, imaging, and questionnaire data. Patients were recruited at the Washington University Medical Campus if they had a singleton pregnancy with an estimated GA under 20 weeks, planned to deliver at Barnes-Jewish Hospital, and were age 18 or older.

Trained obstetric research staff used a series of case report forms to collect baseline maternal demographics, medical history, antepartum data and obstetric outcomes as previously described in ref. 12. Patient data were collected at scheduled study visits during each trimester and at delivery, where biological samples, imaging, actigraphy, and responses to standardized surveys were obtained from each patient.

Survey data included questions from eleven different validated surveys and standalone questions covering stress, schedule, sleep quality, physical activity, postnatal depression, diet, demographics, and overall lifestyle. We derive the label of PTB from the reported estimated date of confinement (EDC), labeling births that occur 3 full weeks before the listed EDC as PTB. EDC was derived from the patient’s last menstrual period or first ultrasound27.
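As a minimal sketch of this labeling rule, the snippet below derives a binary PTB label from the delivery date and EDC. The column names are hypothetical, and the boundary convention is an assumption based on the <37-week definition of PTB.

```python
# Hedged sketch of the PTB labeling rule described above. Column names
# ('delivery_date', 'edc') are assumptions, and the strict '<' boundary below
# follows the <37-weeks definition of PTB; the exact boundary handling in the
# study may differ.
import pandas as pd

def label_ptb(deliveries: pd.DataFrame) -> pd.Series:
    """Return 1 for preterm births, 0 otherwise.

    `deliveries` is expected to hold one row per patient with datetime columns
    'delivery_date' and 'edc' (estimated date of confinement).
    """
    cutoff = deliveries["edc"] - pd.Timedelta(weeks=3)  # 37 weeks of gestation
    return (deliveries["delivery_date"] < cutoff).astype(int)
```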

Actigraphy feature design

Actigraphy measurements were collected over a 2-week period in each trimester (first trimester: 0–13 weeks and 6 days, second trimester: 14–27 weeks and 6 days, third trimester: ≥28 weeks) using the CamNtech MotionWatch 8. Measurements were collected at a one-minute frequency for the duration that a study participant wore their actigraphy watch. Patients were reminded through calls, emails, and texts to return their actigraphy watches after the capture period either at the next study visit or through a courier12. Patients who did not have actigraphy data in either their first or second trimester were excluded from this analysis.

These measurements are very high-resolution, and to keep the data tractable for shallow ML model training, we engineer the raw time-series signals into aggregate features over day-level windows. On top of the day-level measurements, we also compute the absolute change between consecutive days where data are present. Section 2 of the Supplementary Materials contains a summary of these engineered features.

To generate these features, all actigraphy data are separated into days centered around midnight, from which we then estimate the sleep cycle that occurred on each given day, as sketched below.
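The sketch below illustrates this windowing and day-to-day differencing under assumed column names; the specific day-level statistics shown are placeholders, with the actual feature list given in Section 2 of the Supplementary Materials.

```python
# Simplified sketch of the day-level feature engineering: minute-level activity
# counts are grouped into noon-to-noon windows centered on midnight, summarized,
# and absolute day-to-day changes are taken between consecutive windows. Column
# names and the summary statistics are illustrative placeholders.
import pandas as pd

def day_level_features(actigraphy: pd.DataFrame) -> pd.DataFrame:
    # actigraphy: one row per minute with 'timestamp' (datetime) and 'activity'.
    # Shifting by 12 hours before flooring yields windows centered on midnight,
    # so a single night of sleep falls within one window.
    window = (actigraphy["timestamp"] + pd.Timedelta(hours=12)).dt.floor("D")
    daily = actigraphy.groupby(window)["activity"].agg(
        total_activity="sum",
        mean_activity="mean",
        minutes_recorded="count",
    )
    # Absolute change between consecutive recorded days (day-to-day variability).
    deltas = daily.diff().abs().add_prefix("abs_change_")
    return daily.join(deltas)
```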

Model design

For each study participant, we aggregate the day-level actigraphy features down to their mean and standard deviation across the entire duration of the pregnancy. When evaluating a window of GAs below a full-term pregnancy, we drop all actigraphy data with a GA above a set limit (e.g., if we set the upper limit at 140 days, all data after 140 days are dropped, and the remainder is aggregated).
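A hedged sketch of this per-patient aggregation, with assumed column names, is shown below.

```python
# Sketch of the per-patient aggregation: day-level features are reduced to their
# mean and standard deviation, optionally restricted to days at or below a GA
# cutoff (e.g., 140 days). Column names ('patient_id', 'ga_days') are assumed.
from typing import Optional

import pandas as pd

def aggregate_patient_features(daily: pd.DataFrame,
                               max_ga_days: Optional[int] = None) -> pd.DataFrame:
    # daily: one row per patient-day with 'patient_id', 'ga_days', and day-level
    # feature columns.
    if max_ga_days is not None:
        daily = daily[daily["ga_days"] <= max_ga_days]
    grouped = daily.drop(columns=["ga_days"]).groupby("patient_id").agg(["mean", "std"])
    grouped.columns = [f"{feature}_{stat}" for feature, stat in grouped.columns]
    return grouped
```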

For the survey data, we select features with both domain knowledge and automatic techniques. We first select a predefined set of features based on pre-determined clinical knowledge, and sum the values of questions regarding individual births together. After these, we select an additional 10 features with the minimal-redundancy-maximal-relevance (mRMR) algorithm, using semantic textual similarity scores generated with PubMedBERT28 fine-tuned on several clinical and general datasets, as described in ref. 29. Features not represented numerically are dropped. The full list of features that we used can be found in Section 2 of the Supplementary Materials.
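For illustration only, the sketch below shows the greedy structure of mRMR selection. It substitutes mutual information for relevance and absolute feature-feature correlation for redundancy; the study instead uses semantic textual similarity scores from a fine-tuned PubMedBERT model, so this is a structural sketch of the algorithm, not our implementation.

```python
# Structural sketch of greedy mRMR selection of 10 features. Relevance is
# mutual information with the PTB label and redundancy is absolute
# feature-feature correlation; the study uses PubMedBERT-derived semantic
# similarity for the redundancy term, so this is only an illustration.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr_select(X: np.ndarray, y: np.ndarray, n_features: int = 10) -> list:
    relevance = mutual_info_classif(X, y, random_state=0)
    similarity = np.abs(np.corrcoef(X, rowvar=False))  # proxy redundancy matrix

    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < n_features:
        remaining = [i for i in range(X.shape[1]) if i not in selected]
        # mRMR score: relevance minus mean similarity to already-selected features.
        scores = [relevance[i] - similarity[i, selected].mean() for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```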

After this, we concatenate both sources of data, scale all numerical features to a standard normal distribution, and encode all categorical features as ordinal values. Missing values are imputed with either the mean, median, most common value, or the mean of the 5 nearest neighbors, with the choice of imputer selected during cross-validation (CV). The data are randomly split into an 80%/20% train/test split. For the whole cohort, 532 and 133 patients appear in the train and test splits, respectively, with 66 and 28 PTB patients in each. For the nulliparous cohort, this becomes 238 patients in the train set and 59 in the test set, with 24 and 9 PTB patients, respectively.
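The sketch below mirrors this preprocessing under assumed column groupings and variable names; the imputer is exposed as a searchable option, consistent with the CV described below.

```python
# Hedged sketch of the preprocessing: ordinal-encode categorical features,
# impute missing values (with the imputer itself searchable during CV),
# standardize numerical features, and make a random 80%/20% split. Column
# groupings and variable names are placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

def make_preprocessor(numeric_cols, categorical_cols, imputer="mean"):
    # Candidate imputers: mean, median, most frequent value, or mean of 5-NN.
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "median": SimpleImputer(strategy="median"),
        "most_frequent": SimpleImputer(strategy="most_frequent"),
        "knn": KNNImputer(n_neighbors=5),
    }
    numeric = Pipeline([
        ("impute", imputers[imputer]),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])

# Example split (X, y hold the concatenated features and PTB labels):
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```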

We train several standard ML models, including logistic regression, linear support vector machine (SVM), kernelized/non-linear SVM30, XGBoost31, and Gaussian NB32. Logistic regression predicts the output class using the sigmoid of a linear combination of the input features. Linear SVM predicts the class using a linearly separating hyperplane, and kernelized SVM uses a kernel function to learn a non-linear separation of each class30. XGBoost is a gradient boosting method that builds an ensemble of decision trees to optimize predictive performance31, and Gaussian NB models the output class conditioned on normal distributions of each feature32. We evaluate the results across 10 random initializations for each model in Section “Results”, and report the average AUROC and AUPRC through pooling33, as well as the 95% confidence interval over all initializations. SHAP values are averaged across all random initializations. A graphical summary of this training pipeline can be seen in Section 2 of the Supplementary Materials.
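A condensed sketch of this training and evaluation loop is shown below; the model names, default hyperparameters, and confidence-interval calculation are illustrative stand-ins rather than the tuned configuration.

```python
# Sketch of the evaluation loop: each classifier family is retrained over
# several random train/test shuffles, and AUROC/AUPRC are computed on the
# pooled predictions across shuffles as well as per shuffle for a confidence
# interval. Hyperparameters shown are defaults, not the tuned values.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier

MODELS = {
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "linear_svm": lambda: LinearSVC(max_iter=1000),
    "kernel_svm": lambda: SVC(kernel="rbf"),
    "xgboost": lambda: XGBClassifier(eval_metric="auc"),
    "gaussian_nb": lambda: GaussianNB(),
}

def evaluate(name, X, y, n_seeds=10):
    pooled_true, pooled_score, per_seed_auroc = [], [], []
    for seed in range(n_seeds):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = MODELS[name]().fit(X_tr, y_tr)
        # SVMs without probability estimates fall back to the decision function.
        if hasattr(model, "predict_proba"):
            scores = model.predict_proba(X_te)[:, 1]
        else:
            scores = model.decision_function(X_te)
        pooled_true.extend(y_te)
        pooled_score.extend(scores)
        per_seed_auroc.append(roc_auc_score(y_te, scores))
    return {
        "pooled_auroc": roc_auc_score(pooled_true, pooled_score),
        "pooled_auprc": average_precision_score(pooled_true, pooled_score),
        "auroc_95ci_halfwidth": 1.96 * np.std(per_seed_auroc) / np.sqrt(n_seeds),
    }
```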

To find the best hyperparameters for each of the tested models, we use 5-fold stratified CV on the training set, which preserves the class proportions across each fold. For XGBoost, the hyperparameter space covers 1 to 3 estimators, a maximum depth of 1 to 3, a learning rate of 0.1, and AUROC as the fitting objective. For linear SVM, we test regularization parameters ranging logarithmically from 0.001 to 10, with 1000 iterations of training. For non-linear SVM, we evaluate polynomial and radial basis function kernels on top of the linear SVM parameters. For logistic regression, we evaluate regularization parameters from 0.001 to 10 with an L2 penalty and 1000 maximum iterations of training. For Gaussian NB, we use a fixed smoothing parameter of 10⁻⁹.
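The grids described above could be searched as in the following hedged sketch, scored by AUROC with 5-fold stratified CV; the exact search settings and estimator options may differ from our implementation.

```python
# Sketch of the hyperparameter search: 5-fold stratified cross-validation over
# grids mirroring the ranges described above, scored by AUROC. Exact settings
# may differ from the implementation used in this study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier

C_RANGE = np.logspace(-3, 1, 5)  # 0.001 ... 10, spaced logarithmically

SEARCH_SPACES = {
    "xgboost": (XGBClassifier(learning_rate=0.1, eval_metric="auc"),
                {"n_estimators": [1, 2, 3], "max_depth": [1, 2, 3]}),
    "linear_svm": (LinearSVC(max_iter=1000), {"C": C_RANGE}),
    "kernel_svm": (SVC(max_iter=1000), {"C": C_RANGE, "kernel": ["poly", "rbf"]}),
    "logistic_regression": (LogisticRegression(penalty="l2", max_iter=1000),
                            {"C": C_RANGE}),
    # Gaussian NB keeps a fixed smoothing parameter (var_smoothing=1e-9), so no grid.
}

def tune(name, X_train, y_train):
    estimator, grid = SEARCH_SPACES[name]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=cv)
    return search.fit(X_train, y_train).best_estimator_
```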