Generating cervical anatomy labels using a deep ensemble multi-class segmentation model applied to transvaginal ultrasound images

Dagle, Alicia B.; Liu, Yucheng; Skeel, Madeline; Trigo, Gabriel G.; Crosby, David; Feltovich, Helen; House, Michael; Yan, Qi; Myers, Kristin M.; Jambawalikar, Sachin

doi:10.1038/s44294-025-00075-x


Article
Open access
Published: 10 May 2025


npj Women's Health volume 3, Article number: 28 (2025)

Abstract

Preterm birth (PTB) is the leading cause of perinatal death, affecting 10% of pregnancies. Currently, transvaginal ultrasound (TVUS) measurement of cervical length (CL) is the sole quantitative imaging metric for PTB risk, but offers limited predictive value. While computational models of cervical biomechanics show promise as PTB risk predictors, they require precise clinician-provided measurements. AI-enabled ultrasound segmentation offers a solution by automatically extracting anatomical features, thus addressing the labeling bottleneck. This study utilizes an ensemble of deep learning-based multi-class segmentation models trained on diverse TVUS data (N = 246) and evaluated on an out-of-distribution dataset (N = 29). High agreement (Dice metric ~ 0.8) between expert and model labels demonstrates the utility of AI tools in accurately measuring cervical geometry. Ultimately, this can enhance biomechanical models and more sophisticated AI-based models to better predict birth timing, specifically targeting PTB risk.

Introduction

Preterm birth (PTB), defined as delivery before 37 weeks of gestation, is the leading cause of perinatal death^1,2 and a major contributor to long-term disabilities³, where earlier gestational age at birth corresponds to longer hospital stays, increased risk of long-term sequelae, and increased medical costs⁴. With persistently high global rates of PTB and 15 million premature births yearly, PTB remains a major public health problem with high emotional and financial burden^2,5. Despite significant advances in prenatal and perinatal care, 80% of preterm birth cases are considered spontaneous PTB (sPTB). Unlike induced, medically-indicated delivery to address maternal or fetal complications, sPTB is defined by premature labor, dilation (cervical insufficiency) or rupture of fetal membranes leading to preterm delivery. A major challenge to the treatment and prevention of sPTB is the lack of accurate diagnostic methods.

The cervix is a complex, 3D structure⁶ that, in normal pregnancy, maintains the growing fetus in utero and safely remodels to allow for delivery at term. This process, though certainly driven by molecular processes, is fundamentally biomechanical. Premature cervical shortening, a common feature of sPTB⁷ that is captured by TVUS-CL assessment (Fig. 1), can be thought of as structural biomechanical “failure” of the tissue. Biomechanical models can further explain 3D tissue behaviors by determining how overall shape, volume, intrinsic material properties, and alignment between the cervix/uterus against the load of the growing fetus affect structural biomechanical performance^8,9,10,11. These 3D biomechanical models, sometimes called digital twins, are powered by finite element analysis. Finite element analysis is a numerical computation method that discretizes 3D geometry and solves the equations of equilibrium to investigate tissue responses to mechanical forces given set boundary conditions and tissue material properties¹². A large amount of robust, clinical data describing the shape and size of maternal anatomy is needed to leverage these digital twins and quantify cervical structural biomechanical performance in pregnancy.

**Fig. 1: TVUS-measured CL for sPTB screening.**

While our group has put forth digital twins of pregnancy^9,11, these are inherently low-throughput and restricted by scarce labeled data. Because pregnancy is a highly protected condition, public datasets of images, let alone images with labeled maternal anatomy, are essentially nonexistent. This is dissimilar to parallel fields such as cardiac biomechanics, which have several highly annotated open source datasets in multiple image modalities and host collaborative data challenges to propel the field forward (https://www.cardiacatlas.org/). Instead, pregnancy biomechanics research currently relies on the generosity of clinicians who volunteer to provide time-consuming labels of ultrasounds, which are required to build these insightful 3D biomechanical models. If an artificial intelligence (AI) platform can provide fast, reliable measurements of maternal anatomy, this could revolutionize research, enabling more detailed finite element models of pregnancy and faster turnover of simulation results.

Labeling images with the level of detail necessary to extract relevant ultrasound biomarkers (measurements derived from US) is time-consuming, labor-intensive, and subject to inter-observer variation¹³. To combat this, convolutional neural networks (CNNs) are increasingly applied to image segmentation¹⁴. In pregnancy, CNNs have been used to segment the placenta and measure fetal biometry^15,16,17,18, which is important for understanding fetal health, but the application of CNNs to the cervix remains somewhat limited. Wlodarczyk et al. used a single-class UNet to segment curves approximating cervical shape, from which CL and anterior uterocervical angle (AUCA) measurements were then extracted. This group further utilized a traditional single-class UNet and DeepLabV3 for cervical segmentations¹⁹, coupled with original TVUS images, in PTB prediction models²⁰. This work demonstrates the capability of a machine to learn single features, such as cervical length, and reproduce the clinical workflow linking cervical length to sPTB prediction. However, this research did not delve into multi-class networks and other more complex architectures, and the segmentations provided were only designed to capture the part of the cervix along the cervical canal, not the outer boundaries of the cervical tissue, which are important for cervical shape classification and 3D model development.

Our group has previously employed a deep learning framework to segment the entirety of anterior and posterior cervical tissue differentially using multi-class segmentation²¹. This work laid a foundation for a more complex, multi-class segmentation scheme, which can be used to characterize cervical shape in more detail. High label reproducibility (reported as Dice metric, where 1 is the optimal value) indicates that this work can ultimately be used to characterize the shape of the cervix for 3D modeling applications, as well as prediction models of sPTB²¹. Expanding upon our work, Pegios et al. labeled cervical outlines, trained a DTU-Net segmentation model to extract CL, and trained an additional SA-SonoNet classification model to predict PTB from CL and cervical outlines²². The findings from this research support our hypothesis that diverse biomechanical factors are important in understanding sPTB outcomes. However, the positive predictive value (PPV) of SA-SonoNet is indistinguishable from that of TVUS-CL screening for the reported data; this indicates that more research is needed to understand the implications of these new cervical boundary features. Even though this model was trained on a large dataset, it is difficult to compare the prediction results to the existing standard of care for sPTB (TVUS-CL screening, with different CL cutoffs depending on patient demographics) because the parity and sPTB history of the patient population are not reported. Taken together, this prior work leaves a gap in the field: highly accurate, whole-shape segmentations of cervical features that generalize to out-of-distribution data and that can be used to derive cervical geometries as inputs to 3D digital twins of pregnancy. These models may ultimately inform machine learning (ML) based prediction models of sPTB, and we hypothesize that the biomechanical insight will push these models to outperform our existing clinical standards for sPTB screening.

The work presented here demonstrates multi-class segmentation as an automated tool to provide pixel-by-pixel predictions and identify boundaries between neighboring anatomical tissue regions. Specifically, we explored patient variations in cervical geometry during the second and third trimesters, and we developed a novel tool to segment the entire 2D cervical region from TVUS images into multiple anatomical classes, including anterior cervical tissue, posterior cervical tissue, and cervical canal space²¹. We improved on existing work by training and tuning additional model architectures, combining several architectures into an ensemble model, and leveraging the predicted masks to demonstrate how cervical features may be extracted in an explainable way, using predefined anatomical landmarks. In this work, CL is used as a representative example and an additional data point to demonstrate model accuracy, but predicting CL itself is not the final goal of our work. Rather than relying on deep-learning-based methods to measure CL as a singular feature of cervical anatomy and predictor of sPTB, this work highlights the importance of establishing and leveraging anatomical boundary conditions of the cervix to directly inform cervical measurements in an explainable fashion. This approach will enable future algorithms to extract structural features beyond CL that are likely to be biomechanically relevant to delivery¹⁰, and will be explored in future 3D simulations and prediction models of sPTB.

Results

Dataset overview

The Cervical Length Education and Review (CLEAR) dataset was divided into training, validation, and test sets using a 70:20:10 split. A separate out-of-distribution dataset was also used for testing. Images were labeled according to maternal anatomy, shown in Fig. 2. A more detailed description is available in the Methods section.
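A 70:20:10 partition can be illustrated with a minimal sketch. It assumes image identifiers are shuffled once with a fixed seed and cut by rounded fractions; the function name `split_dataset`, the seed, the rounding rule, and the pool size of 246 (borrowed from the abstract for illustration) are assumptions rather than the procedure documented in the Methods.

```python
import random

def split_dataset(image_ids, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle IDs and cut them into train/validation/test partitions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# Hypothetical pool of 246 image IDs (assumed size, for illustration only).
train_ids, val_ids, test_ids = split_dataset(range(246))
print(len(train_ids), len(val_ids), len(test_ids))
```

Because the test partition takes the remainder, the three partitions always cover the pool exactly once, even when the fractions do not divide the pool size evenly.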

Model training and selection

All three similarity metrics (Dice, Jaccard, Hausdorff) indicated that SegResNet, Residual UNet, Attention UNet, and nn-UNet were the highest performing models. The hyperparameters found to optimize model performance are indicated by a preceding asterisk (*) in Table 1 with Dice metrics and Hausdorff distance reported for each individually optimized model in Supplementary Tables 1 and 2. Basic (or vanilla) UNet and transformer UNet also offered strong model performance, but had lower segmentation overlap scores. The Transformer UNet performed reasonably well, but the boundaries suffered from a pixelation-like quality (Supplementary Figs. 1 and 2). Each top-performing model (SegResNet, Residual UNet, Attention UNet, and nn-UNet) differed with statistical significance (adjusted p < 0.01) from the less well-performing models (basic UNet and transformer UNet). All comparisons were made with consistent results across Dice metric, Hausdorff distance, and Jaccard index. Hausdorff distance indicated a difference between basic UNet and transformer UNet (p < 0.01), whereas Dice metric and Jaccard index indicate no difference between the performance of basic UNet and transformer UNet.

Table 1 The depicted hyperparameter space was explored during model training on the CLEAR dataset


We further compare model performance on the reserved (CLEAR) test set by plotting Dice metric and Hausdorff distance for each class across all model types (Fig. 3). These models were plotted in descending order from left to right with respect to time required for training (Supplementary Table 3). The 4 best performing architectures (SegResNet, Attention UNet, nn-UNet, and Residual UNet) also had the lowest training time (Supplementary Table 3), indicating sufficiency of these less complex models. Our previous work explored cross validation of individual models with this framework to demonstrate that model improvements were consistent across random seeds²¹. More detailed visual and numerical results of individual model performance within the validation set are available in Supplementary Figs. 1 and 3.

**Fig. 3: Model performance on CLEAR and out-of-distribution test data.**

Out-of-distribution dataset

To interrogate generalizability, models were evaluated on the separate out-of-distribution cohort from IH, comparing performance using class-specific Dice metric (Fig. 3). As expected with application of a model to an out-of-distribution test dataset, all models experienced a small performance drop compared to the CLEAR reserved test set. The 4-unit Residual UNet (shorthand ResUNet4) likely over-fit the CLEAR dataset, evidenced by the large drop in segmentation performance. Since this demonstrates lack of generalizability, it was excluded from further analysis. The 4 best performing models maintained high Dice metrics of approximately 0.8 for anterior and posterior cervix classes. Of these, the 2-unit Residual UNet (shorthand ResUNet2) had the highest model performance on the out-of-distribution test dataset (Dice metrics: 0.81 and 0.85 on the anterior and posterior cervix, respectively), the nn-UNet (0.79 and 0.84) and Attention UNet (0.79 and 0.82) performed similarly well, and the SegResNet (0.76 and 0.80) performed slightly less well. More detailed class-specific Dice metrics and Hausdorff distances are available in Supplementary Tables 1 and 2.

Final model selection

Among the 4 best performing individual models, no single model outperformed the others on the reserved or out-of-distribution test sets. Therefore, an ensemble approach was used to leverage the strength of each model and mitigate pixel-wise segmentation errors of individual predictions, thereby improving overall performance and reducing risk of over-fitting to the training dataset. This method concatenates all 4 best-performing model outputs and employs pixel-wise voting to determine the final model output. Per majority voting, our ensemble model incorporated 3 out of the 4 best performing models. This demonstrated an improvement in the Dice metric compared to individual models (Supplementary Tables 1 and 4).
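Pixel-wise majority voting can be sketched as follows, assuming each model has already produced an integer label map of the same shape. The class indices (0 for background, 1 to 4 for the anatomical classes), the tie-breaking rule, and the name `majority_vote` are illustrative assumptions, not details taken from the study's implementation.

```python
import numpy as np

def majority_vote(label_maps, num_classes):
    """Pixel-wise majority vote over integer label maps of identical shape."""
    stacked = np.stack(label_maps, axis=0)  # shape: (n_models, H, W)
    # For every pixel, count how many models chose each class.
    votes = np.stack(
        [(stacked == c).sum(axis=0) for c in range(num_classes)], axis=0
    )
    # argmax picks the lowest class index on ties (an arbitrary tie-break).
    return votes.argmax(axis=0)

# Toy 2x3 predictions from three hypothetical models.
pred_a = np.array([[0, 1, 1], [2, 2, 0]])
pred_b = np.array([[0, 1, 2], [2, 0, 0]])
pred_c = np.array([[0, 3, 1], [2, 2, 4]])
ensemble = majority_vote([pred_a, pred_b, pred_c], num_classes=5)
print(ensemble)
```

With an odd number of voters, a pixel on which two of three models agree always takes the agreed label, so an isolated error from one model is outvoted.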

Attention UNet, nn-UNet and SegResNet were combined in an ensemble model that was used for final evaluation. Similar model performance was observed across all 4 combinations of 3 models (Supplementary Tables 4–5), but this combination achieved a higher Dice metric for the anterior cervix, on the reserved test set. Hausdorff distance, which is more representative of performance for small segmentation classes, also indicates improved bladder performance (smaller Hausdorff distance) for the ensemble of Attention UNet, nn-UNet and SegResNet compared to other combinations (Supplementary Table 5). This ensemble model was thus used to generate predictions for the reserved and out-of-distribution test set, both of which demonstrated that the model generalizes well to new data. When applied to the reserved test set, the model performed well across diverse cervical presentations such as cervices that were of average length/width, curved, linear, long, short/squat, funneled, and adjacent to a full bladder (Fig. 4). Across all reserved test set images, the model achieved a high Dice metric for the anterior and posterior cervix of roughly 0.93 and 0.91, respectively.

**Fig. 4: Ensemble model segmentation on CLEAR test data**

Evaluation on the out-of-distribution dataset similarly indicated high model performance for the aforementioned diverse cervical shapes as well as in the presence of fetal anatomy near the internal os (Fig. 5). On this out-of-distribution dataset, the Dice metric dropped slightly to 0.80 and 0.85 for the anterior and posterior cervix class, respectively. However, visual inspection of the prediction images confirmed high model performance.

**Fig. 5: Ensemble model segmentation on out-of-distribution test data.**

Inter-operator metrics

To evaluate inter-operator variability, measures of similarity were calculated between the majority ground truth label and each expert label on the test set. These metrics were then averaged across all experts to derive inter-operator values (Supplementary Table 6). For the reserved (CLEAR) test set, the inter-operator Dice metric averaged across all classes except background was 0.82, with class-specific Dice metrics of 0.94 for both anterior and posterior cervix classes. When evaluated on the reserved (CLEAR) test set, the combined model architecture achieved a high Dice metric of 0.77 averaged across every class except the background, with class-specific Dice metrics of 0.93 and 0.91 for the anterior and posterior cervix class, respectively. The inter-operator and model Dice metrics are similar, with the model performing slightly below the level of agreement among clinical experts.
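The inter-operator calculation described above can be sketched under a few assumptions: expert labels are integer maps of equal shape, the majority label is a pixel-wise vote, and Dice is set to 1.0 when a class is absent from both masks. The function names and the toy labels are hypothetical.

```python
import numpy as np

def dice(a, b):
    """Dice overlap of two boolean masks; defined as 1.0 when both are empty."""
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

def inter_operator_dice(expert_labels, class_ids):
    """Mean Dice of each expert against the pixel-wise majority label."""
    stacked = np.stack(expert_labels, axis=0)
    n_classes = int(stacked.max()) + 1
    votes = np.stack([(stacked == c).sum(axis=0) for c in range(n_classes)])
    majority = votes.argmax(axis=0)  # ties resolve to the lowest class index
    per_class = {}
    for c in class_ids:
        scores = [dice(lab == c, majority == c) for lab in expert_labels]
        per_class[c] = float(np.mean(scores))
    return per_class

# Three hypothetical expert labels for one tiny image (0 = background).
e1 = np.array([[1, 1, 0], [2, 2, 0]])
e2 = np.array([[1, 1, 0], [2, 2, 0]])
e3 = np.array([[1, 0, 0], [2, 2, 2]])
scores = inter_operator_dice([e1, e2, e3], class_ids=[1, 2])
print(scores)
```

Averaging the per-class values over every class except background would give a single summary figure comparable in form to the one reported in the text.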

Cervical length

The proposed models accurately reproduce TVUS-CL (Fig. 6), with methods that leverage underlying geometry from the image inputs and predicted segmentation masks. Of the 29 patients in the out-of-distribution test dataset, 4 had anatomically improbable predicted segmentation labels (due to poor image quality) and were excluded from subsequent analysis (Supplementary Fig. 4). For the remaining 25 patients, CL was binned in 0.5 cm increments and plotted as histograms for the algorithm and each sonographer; normal curves were then fit to these CL distributions and overlaid on the same graph (Supplementary Fig. 5). To further examine differences between algorithm-reported and sonographer-reported values, the percent error (PE) was plotted for each patient across the dataset (Supplementary Fig. 6). Most images have a positive percent error, indicating that the algorithm-reported value is larger than the sonographer-reported value. Examples with relatively high absolute error (PE < −25% or PE > 25%) of CL measurements demonstrate the chain effect wherein CL measurements follow the underlying segmentation shape, which is determined by the TVUS image itself; a shadowing artifact or poor image quality is expected to create a poor segmentation mask, which in turn results in an unreliable CL measurement (Supplementary Fig. 7). Bland-Altman plots compare the CL measures from the algorithm against the expert measures (Supplementary Fig. 8), finding a mean bias of 0.14 cm.
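The two agreement summaries used here, signed percent error and Bland-Altman bias, can be sketched as below. The sketch assumes percent error is taken relative to the sonographer value and that limits of agreement are the bias plus or minus 1.96 sample standard deviations; the CL values are invented for illustration and are not study data.

```python
from statistics import mean, stdev

def percent_error(algorithm_cl, sonographer_cl):
    """Signed percent error; positive when the algorithm value is larger."""
    return [100.0 * (a - s) / s for a, s in zip(algorithm_cl, sonographer_cl)]

def bland_altman(algorithm_cl, sonographer_cl):
    """Mean bias and 95% limits of agreement for paired measurements."""
    diffs = [a - s for a, s in zip(algorithm_cl, sonographer_cl)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of differences
    return bias, (bias - spread, bias + spread)

# Hypothetical CL values in cm, for illustration only (not study data).
algorithm = [3.6, 4.1, 2.9, 3.8, 4.4]
sonographer = [3.5, 3.9, 3.0, 3.6, 4.2]

pe = percent_error(algorithm, sonographer)
bias, (lower, upper) = bland_altman(algorithm, sonographer)
print([round(p, 1) for p in pe])
print(round(bias, 2), round(lower, 2), round(upper, 2))
```

A positive bias corresponds to the pattern described in the text, where the algorithm tends to report a slightly longer CL than the sonographer.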

**Fig. 6: Algorithm-reported CL displayed on out-of-distribution test data.**

To test whether these CL values were drawn from the same distribution, a Wilcoxon signed rank test was performed with the null hypothesis that there is no difference between the average sonographer-reported and corresponding algorithm-reported CL value. The test failed to reject the null hypothesis, which is consistent with the algorithm and sonographer measurements being drawn from the same cervical length distribution. Visually, the experts had nearly perfect agreement, and the statistical test provided no evidence of a difference between the values reported by the algorithm and by the experts.
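A paired comparison of this kind can be run with `scipy.stats.wilcoxon`, as in the minimal sketch below. The paired values are invented for illustration, and the 0.05 significance level is an assumption, since the threshold used in the study is not stated in this section.

```python
from scipy.stats import wilcoxon

# Hypothetical paired CL values in cm, for illustration only (not study data).
sonographer = [3.5, 3.9, 3.0, 3.6, 4.2, 2.8, 3.3, 4.0]
offsets = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8]
algorithm = [s + d for s, d in zip(sonographer, offsets)]

# Two-sided Wilcoxon signed rank test on the paired differences.
result = wilcoxon(algorithm, sonographer)
fail_to_reject = result.pvalue >= 0.05  # 0.05 is an assumed significance level
print(f"W = {result.statistic}, p = {result.pvalue:.3f}")
print("fail to reject null:", fail_to_reject)
```

A non-significant result is compatible with agreement between the two sets of measurements, but it does not by itself establish equivalence, particularly with a small sample.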

Discussion

AI tools are rapidly being integrated into medical practice, creating new opportunities to leverage this technology in maternal and fetal health. We have developed an AI algorithm for mapping cervical shape to measure maternal anatomic features. Using CL as an example and an additional confirmation of model performance, this work describes an automated multi-class segmentation framework that labels cervical tissue in its entirety on TVUS images and automates CL measurement. Compared to prior single-class or cervical outline segmentation approaches^19,22, our multi-class ensemble model segments the cervix in its entirety, expanding upon our previous experiments²¹. Compared to previous work^21,23, this ensemble model achieves a similar, slightly elevated Dice metric of 0.93 and 0.91 on in-distribution data for the anterior and posterior cervix classes, respectively. Unlike previous work, this model was deployed on an out-of-distribution dataset for the first time and maintains high model performance with a Dice metric of 0.80 and 0.85 for the anterior and posterior cervix classes, respectively. Furthermore, our model was trained on diverse data from multiple institutions and ultrasound manufacturers (including Siemens, General Electric, Toshiba, Philips, etc.), performing as well as human experts for CL measurement. The 0.14 cm mean bias in CL measurement suggests that expert readers and the algorithm can be used interchangeably (Supplementary Fig. 8). The small positive percent error pattern in CL prediction is expected, as expert measurements were taken as a series of line segments, whereas the algorithm follows inherently longer, curvilinear traces. This study is strengthened by diverse, multi-institution training data with known quality measures (CLEAR scores) and multiple expert labels to develop a segmentation model. Similarly, high performance on a separate clinical dataset, drawn from a different distribution, reinforces trust in model generalizability across new, multi-site, diverse demographic data.
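The geometric reason a curvilinear trace reads longer than a few straight segments can be shown with a small sketch. It assumes, purely for illustration, that the canal follows a quarter-circle arc of radius 3 cm; the function names and the geometry are hypothetical and are not the study's CL extraction method.

```python
import math

def polyline_length(points):
    """Sum of straight-line distances between consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def sample_arc(radius, angle, n_points):
    """Evenly spaced points along a circular arc starting at angle zero."""
    step = angle / (n_points - 1)
    return [
        (radius * math.cos(i * step), radius * math.sin(i * step))
        for i in range(n_points)
    ]

# Assumed canal shape: quarter-circle arc with a 3 cm radius.
coarse = polyline_length(sample_arc(3.0, math.pi / 2, 3))    # two segments
fine = polyline_length(sample_arc(3.0, math.pi / 2, 200))    # dense trace
print(round(coarse, 3), round(fine, 3))
print(round(100.0 * (fine - coarse) / coarse, 1), "percent longer")
```

Chords always cut inside a curve, so a measurement built from a few line segments cannot exceed the length of the densely sampled trace through the same endpoints.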

Three metrics were used to evaluate the performance of the segmentation model. While all three similarity measurements have merit, they fall into two main categories: overlap metrics such as Dice and Jaccard (interrelated) and distance metrics such as Hausdorff distance (HD). Overlap metrics measure how many pixels are shared between the ground truth and the predicted image, but they are highly dependent on the shape and size of the structure, which is challenging in small, elongated structures that could be displaced by a few millimeters and have no overlapping pixels^24,25. This explains lower Dice metrics for bladder and cervical canal versus anterior and posterior cervix classes. Distance metrics, by contrast, compare the surface distance between the ground truth and predicted image, explaining how close the masks are to each other, but they are highly sensitive to outliers^24,26,27. These distance metrics are particularly important when evaluating small classes, such as the bladder, which is the highest performing (smallest HD value) class across the 4 individual models considered for the ensemble approach. Standard Hausdorff distance specifies the minimum distance (or expansion) that needs to be applied to both sets (ground truth and predicted segmentation) such that each expanded set contains all segmentation pixels of the other set. A single boundary error can lead to large Hausdorff distances even when the majority of the profile is accurate. When possible, it is best to report both overlap and distance metrics, as they offer different insight into model performance and an important baseline against which to judge future models. In this particular use case and in line with the Medical Imaging and Data Resource Center guidelines (https://www.midrc.org/performance-metrics-decision-tree), our results suggest that Hausdorff distance did not reflect overall segmentation quality of the larger classes (anterior and posterior cervix) as reliably as Dice metric in our multi-label segmentation task. In the future, average Hausdorff distance may better capture performance of the model, as it is less sensitive to outliers than standard Hausdorff distance. In agreement with qualitative images, the ensemble model demonstrated slightly better Dice metric performance than individual models, a distinction not captured by standard Hausdorff distance.
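The contrast between overlap and distance metrics can be made concrete with a short sketch. It computes Dice, Jaccard, and a brute-force symmetric Hausdorff distance on binary masks; as a simplification, the distance is taken over all mask pixels rather than extracted surfaces, and the toy shapes and the two-pixel shift are illustrative assumptions.

```python
import numpy as np

def dice(a, b):
    """Dice overlap of two non-empty boolean masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def jaccard(a, b):
    """Jaccard index (intersection over union) of two boolean masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def hausdorff(a, b):
    """Symmetric Hausdorff distance in pixels between two non-empty masks."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    # Pairwise Euclidean distances between every pixel of a and of b.
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

def mask(rows, cols, shape=(40, 40)):
    m = np.zeros(shape, dtype=bool)
    m[rows, cols] = True
    return m

# Large block and thin line, each compared with a copy shifted by 2 rows.
big_gt, big_pred = mask(slice(5, 25), slice(5, 25)), mask(slice(7, 27), slice(5, 25))
thin_gt, thin_pred = mask(10, slice(5, 25)), mask(12, slice(5, 25))

for name, gt, pred in [("large", big_gt, big_pred), ("thin", thin_gt, thin_pred)]:
    print(name, round(dice(gt, pred), 3), round(jaccard(gt, pred), 3), hausdorff(gt, pred))
```

The same two-pixel displacement leaves the large block with a high Dice score but drops the thin line to zero overlap, while the Hausdorff distance is identical for both shapes.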

Overall, the ensemble-based segmentation model accurately reproduced cervical geometry and CL measurements. Dice metric indicates that segmentation performance remains limited for small anatomical landmarks (bladder and cervical canal), while Hausdorff distance indicates the bladder performs as well as the larger cervix classes. Both Dice metric and Hausdorff distance reiterate that the small, often elongated, cervical canal class suffers a drop in segmentation performance. Although ensemble models are computationally complex, requiring either more computational power or time than individual models, the qualitative benefit can justify the computational expense. If computational cost is a concern, any of the 4 best-performing individual models (Attention UNet, nn-UNet, SegResNet and ResUNet2) may be deployed in a stand-alone format. The training time, similarity metrics and representative segmentation images are available in Supplementary Information (Tables 1, 2, 3, 7, 8, 9, 10 and Figs. 1, 9, 10, 11, 12).

Despite high performance on the test set, the bladder boundary is inaccurately predicted in images with a full bladder (Supplementary Fig. 13a) that fail to meet CLEAR criteria. Although the bladder is meant to be emptied before TVUS acquisition, this procedure is frequently not followed, creating variability in bladder position and size. For a small feature that is already challenging for the model to learn, such variability magnifies the difficulty. Consequently, the inferior portion of the bladder flap is often under-predicted, lowering the Dice metric.

Furthermore, cervical canal shape and size can vary significantly among patients, creating heterogeneous anatomical regions. In select patients, the mucus plug may be large and visible; in other patients, it may be small and indistinguishable from cervix tissue. This variation makes it difficult for the model to learn without larger, diverse datasets. The cervix may also present as extremely funneled in patients at high risk of sPTB, creating a ground truth segmentation with a large surface area. In contrast, some cervices are strictly closed at the histological internal os, rendering the cervical canal nearly indistinguishable on the TVUS image. In the event of extreme cervical funneling (Supplementary Fig. 13b), we found that the model may struggle to find the histological internal os. The low dataset representation of funneling may bias the model to under-predict large cervical funnel shapes.

If the placenta is located near the internal os (Supplementary Fig. 13c), the placenta is often mistaken for posterior cervix, likely due to similar echogenicity and texture. If the image violates CLEAR criteria because the cervix is small relative to the field of view (Supplementary Fig. 13d), the cervix may be over-predicted or misplaced. Although bladder predictions were less reliable, the inclusion of the bladder class likely improves the overall performance by providing a highly echogenic landmark with an anatomically prescribed location near the anterior/superior boundary of the cervix. Similarly, the cervical canal class may later inform the shape and size of a funnel or cervical mucus plug in the TVUS image.

Although the algorithm successfully reports CL, its accuracy is intrinsically limited by segmentation prediction and image quality; any segmentation errors will propagate to CL measurements. However, the limitations in bladder and cervical canal segmentation do not greatly affect cervical shape analysis. Rather, the anterior and posterior cervical classes are more vital for measuring cervical features, evidenced by successful CL reproduction with minimal post-processing. Poor image quality had a greater influence on CL measurements. Higher CL errors were observed for images (Supplementary Fig. 7) that had large amounts of shadowing near the external os where highly echogenic regions, insufficient gel/probe contact, or defective transducer elements interfered with signal propagation. In low-quality images (Supplementary Fig. 7) where the internal os is not clearly visualized, the algorithm struggled to infer the location of the internal os and anterior/posterior cervical boundary. As with all machine learning models, outputs are limited by the quality of data fed to the model. This underscores and necessitates efforts towards automated quality metrics.

Future work will focus on improving segmentation of small bladder and cervical canal classes. One approach is using a customized Dice loss function that more heavily weighs these classes. Larger, more diverse datasets should be introduced to learn small features subject to large patient-to-patient variations. This can be achieved by generating synthetic data from the original dataset using generative models such as diffusion models²⁸ or cycle-GAN²⁹. Such domain adaptation can expand the pool of images for cervix and bladder shapes, capturing different bladder fullness, cervical funneling, and mucus plug thickness. Even if the model’s performance improves with synthetic data, it remains worthwhile to introduce more curated medical data to include additional cervix phenotypes, such as more images of short cervices, funneled cervices, and low-lying placentas; this is expected to improve model generalizability during inference. Across all of these images, the average Hausdorff distance can also be leveraged to better capture the performance of these small features, while limiting the sensitivity to outliers. Current and future work also includes enhanced post-processing techniques and refined geometric feature extraction, coupled with the introduction of larger, more diverse TVUS datasets linked to clinical outcomes.
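One way such a class-weighted Dice loss could look is sketched below in NumPy. The weighting scheme (a weighted mean of per-class soft Dice losses), the specific weights, the smoothing constant, and the three-class toy image are assumptions made for illustration; the source does not specify the form of the customized loss.

```python
import numpy as np

def weighted_dice_loss(probs, target_onehot, class_weights, eps=1e-6):
    """Weighted mean of per-class soft Dice losses.

    probs and target_onehot have shape (C, H, W); class_weights has length C.
    """
    inter = (probs * target_onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + target_onehot.sum(axis=(1, 2))
    dice_per_class = (2.0 * inter + eps) / (denom + eps)
    w = np.asarray(class_weights, dtype=float)
    return float((w * (1.0 - dice_per_class)).sum() / w.sum())

def one_hot(labels, num_classes):
    return np.stack([(labels == c) for c in range(num_classes)]).astype(float)

# Toy 4x4 label map: class 1 is a medium region, class 2 a small structure.
labels = np.zeros((4, 4), dtype=int)
labels[0, :] = 1
labels[3, :2] = 2
target = one_hot(labels, 3)

# Hypothetical prediction that misses the small class entirely.
missed = labels.copy()
missed[labels == 2] = 0
pred = one_hot(missed, 3)

uniform = weighted_dice_loss(pred, target, class_weights=[1, 1, 1])
upweighted = weighted_dice_loss(pred, target, class_weights=[1, 1, 3])
print(round(uniform, 3), round(upweighted, 3))
```

Raising the weight of the small class increases the penalty for missing it, which is the behavior a customized loss of this kind is meant to encourage during training.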

Following rigorous testing and validation, the immediate clinical impact of this work would be to measure CL in real time on a US scanner. In the longer term, this model could be deployed on US machines to measure cervical shape features and provide a clinical risk score of sPTB. Our platform is designed to accommodate future integration of additional TVUS-derived features including cervical diameter, cervical curvature, AUCA, lower uterine segment (LUS) thickness, and closed cervical area. This has broad applications for understanding patient-specific maternal geometry and implications for timing of delivery through predictive machine learning models and geometrically informed finite element analysis simulations of pregnancy. This technology may reveal new biomarkers signaling structural changes leading to birth, thereby improving the prediction of birth timing. Identifying these changes could guide targeted sPTB therapies.

Currently, TVUS-CL is the only clinical imaging biomarker of sPTB risk. Although automated CL measurement algorithms are being developed, their low PPV highlights the need for additional biomarkers that capture the cervix’s complex 3D biomechanics. Moreover, relying solely on 2D measurement cannot sufficiently capture the complex 3D biomechanics of cervical preparation for delivery. Our novel segmentation tool labels the entire cervix, enabling extraction of multiple geometric features to support generation of comprehensive computational models of the entire cervix and LUS^9,30, thereby enabling more personalized, biomechanically-informed decisions about delivery timing and targeted therapeutics. This segmentation tool holds promise in elucidating the pathways of sPTB, but more research is needed to fine-tune this model and ensure generalizability before wide deployment. Integrating such AI-based methods into clinical care could expand access to sPTB screening, particularly in underserved areas, but requires rigorous validation on larger datasets. Ultimately, these capabilities may facilitate in silico testing of interventions³⁰ and integrate seamlessly into existing clinical workflows.

In this work, we present a fully automated multi-class segmentation network to segment the pregnant cervical anatomy and nearby tissues on 2-dimensional transvaginal ultrasound images. This model was successfully deployed on our reserved test dataset as well as a newly introduced, out-of-distribution test dataset of pregnant patients at low risk of sPTB. Deploying this model in the clinical setting will further standardize CL measurements, removing observer variation with possible downstream effects to improve measurement sensitivity. Building upon these tools to obtain additional biomarkers will potentially improve both the understanding of biomechanical pathways leading to sPTB, as well as the prediction of sPTB itself.

Methods

CLEAR dataset images

Mirroring our previous work²¹, the Perinatal Quality Foundation (PQF), which hosted the CLEAR training program³¹, supplied 250 de-identified TVUS images, collected between 16 and 32 weeks gestation from various centers and ultrasound machines across the United States of America. As per the PQF privacy policy, candidates who participate in the CLEAR program accept that their information may be used in an aggregate, de-identified manner for research. In addition, there were no patient identifiers associated with any images submitted and reviewed. As such, this study was exempt from institutional review board approval. Images were graded based upon their adherence to 9 CLEAR criteria²¹, where a minimum score of 7 is required to pass. As in our previous work²¹, each image received a CLEAR score and a subset of images with scores 6–9 were used to train, validate and test the model. Ideally, all clinical TVUS scans would merit a perfect score, but a small subset of real-world data is expected to fail CLEAR criteria due to human error, even after appropriate training. To account for this and improve the model’s ability to generalize, a small subset of grade 6 images was included in the dataset, as these images still meet over half of the CLEAR criteria but fail to pass certification. Since the provided TVUS images were anonymized, no pregnancy outcome information is available for this training data, and it is assumed that patients do not have repeat images in the dataset. Further inspection of the images, as depicted in Fig. 7, reveals that clinically short cervices (CL < 2.5 cm) and cervical funneling are present in roughly 15% and 12% of images, respectively.

CLEAR dataset labels

For training labels, a CLEAR-certified sonographer and 2 clinicians provided annotations using the segmentation software Labelbox (https://labelbox.com/). During review and label generation, expert maskers were permitted to skip an image if the quality was too poor to distinguish the anatomical regions of interest (exclusion criteria in Fig. 7). Of the 250 original images, 4 images were excluded from the dataset during expert review leaving 174, 50, and 22 images in the grade 9, 8 and 6 groups, respectively. Experts were tasked with segmenting these images into 5 regions (background, bladder, anterior cervix + LUS, posterior cervix, and cervical canal + potential space) as shown in the segmentation label anatomy key of Fig. 2. Fleiss’ kappa coefficient was calculated to determine agreement among experts. Across all 246 labeled images in the dataset, the Fleiss’ kappa coefficient was 0.87, indicating high agreement between experts. To generate ground truth labels for training, a majority choice voting system was used (described and illustrated in our previous work)²¹. If at least 2 out of 3 experts labeled a pixel with a given class, then that pixel was set to true for that given class in the GT label.
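The majority choice rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `majority_vote`, the array layout, and the toy labels are assumptions, and the handling of pixels where all 3 experts disagree (left unassigned here) is not specified in the text.

```python
import numpy as np

NUM_CLASSES = 5  # background + 4 anatomical classes, per the text


def majority_vote(expert_masks: np.ndarray, num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Return a boolean one-hot ground truth of shape (num_classes, H, W).

    expert_masks: integer array of shape (n_experts, H, W) holding class indices.
    A pixel is True for a class when at least 2 experts chose that class.
    """
    # votes[c, y, x] = number of experts who labeled pixel (y, x) as class c
    votes = np.stack([(expert_masks == c).sum(axis=0) for c in range(num_classes)])
    return votes >= 2


# Toy example (hypothetical labels): three experts labeling a 1 x 3 image
experts = np.array([
    [[0, 1, 2]],
    [[0, 1, 3]],
    [[0, 2, 4]],
])
gt = majority_vote(experts)
```

In the toy example the first pixel is unanimous, the second is decided 2 to 1, and the third has no majority, so no class is set to true for it.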

Out-of-distribution images

To further validate model performance and generalizability to a population at low-risk of sPTB, we obtained an out-of-distribution test dataset of 30 pregnant patients at Intermountain Health (IH, Provo, UT) to test our algorithm. This study was approved by the institutional review board at IH (#1050495), and each subject provided written consent. Images were collected between 22 and 25 weeks’ gestational age. One subject was removed from analysis due to sPTB, leaving 9 (31%) nulliparous and 20 (69%) multiparous participants. Of these, n = 1 cervix was clinically short (CL < 2.5 cm).

Out-of-distribution labels

Labels for the out-of-distribution images were similarly generated, with the exception that only 1 clinician provided annotations, forgoing the need for ground truth majority choice voting. The use of 1 expert was justified by the high inter-rater agreement in the training dataset.

Data pre-processing

To remove the CL calipers placed by sonographers, the cv2 inpainting³² package was utilized. The CLEAR dataset was divided into training, validation, and test sets using a 70:20:10 split. Each set had a random distribution of images, but CLEAR scores were balanced within each set. Data augmentation techniques such as 180° rotations, random zoom, center crop, random Gaussian noise, Gaussian blur, and random contrast adjustments were applied only to the training set²¹.
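One way to obtain a 70:20:10 split with balanced CLEAR scores is to shuffle and split each score group separately. The sketch below uses only the Python standard library. The function name `stratified_split`, the rounding rule, and the seed are assumptions, and the group sizes are the ones reported in the CLEAR dataset labels section.

```python
import random
from collections import defaultdict


def stratified_split(scores, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split image indices into train/val/test lists, balancing CLEAR scores.

    scores: list of CLEAR scores, one per image (list index = image id).
    """
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for idx, score in enumerate(scores):
        by_score[score].append(idx)

    train, val, test = [], [], []
    for score in sorted(by_score):
        idxs = by_score[score]
        rng.shuffle(idxs)  # random assignment within each score group
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Group sizes from the text: 174, 50, and 22 images in the grade 9, 8 and 6 groups
scores = [9] * 174 + [8] * 50 + [6] * 22
train, val, test = stratified_split(scores)
```

Because each group is rounded independently, the realized proportions are only approximately 70:20:10.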

AI technical field overview

To build a standard AI-based model, the model is first trained by inputting a set of TVUS images and corresponding expert labels (manual segmentation of anatomy). During training, the model is exposed to labeled images to identify patterns and features, and it iteratively learns by adjusting a set of internal variables, called weights (or parameters). Settings that govern the training process, such as the learning rate and dropout, are called hyperparameters and are chosen by the practitioner. Weight updates and hyperparameter tuning together drive model performance toward the best possible output (predicted segmentation of anatomy). As is customary, the training data is used to update the weights of the model during each epoch, and the validation dataset is used to evaluate model performance after each epoch during model training. A separate test set may be, and in this case was, introduced to further evaluate the model’s ability to generalize to never-before-seen data.

Model architecture

The MONAI library (https://monai.io/) was used to implement the following segmentation model architectures: SegResNet, UNet, Residual UNet, nn-UNet, Attention UNet and Transformer UNet. For all model architectures, image/mask pairs were resized to 256 × 256 pixels, and the mask was one-hot encoded before training. The images were converted to grayscale, and pixel values were normalized between 0 and 1, providing a 1-channel input to the network. The model computed a 5-channel output corresponding to background and the 4 classes depicted in Fig. 2.
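The input and label formatting described above can be illustrated with NumPy. This sketch assumes the image/mask pair has already been resized to 256 × 256 and that normalization is a division by 255. The text only states that pixel values were normalized between 0 and 1, so the exact scheme, like the function name `prepare_pair`, is an assumption.

```python
import numpy as np


def prepare_pair(image_u8: np.ndarray, mask: np.ndarray, num_classes: int = 5):
    """Format one grayscale image and its label mask for a 1-in / 5-out network.

    image_u8: (H, W) uint8 grayscale image; mask: (H, W) integer class indices.
    Returns a (1, H, W) float32 image in [0, 1] and a (num_classes, H, W)
    one-hot float32 mask.
    """
    image = image_u8.astype(np.float32) / 255.0            # scale to [0, 1] (assumed scheme)
    image = image[np.newaxis, ...]                         # add the single channel axis
    one_hot = np.eye(num_classes, dtype=np.float32)[mask]  # (H, W, C)
    one_hot = np.moveaxis(one_hot, -1, 0)                  # (C, H, W)
    return image, one_hot


# Synthetic stand-ins for a resized TVUS image and its label mask
rng = np.random.default_rng(0)
image_u8 = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
mask = rng.integers(0, 5, size=(256, 256))
image, one_hot = prepare_pair(image_u8, mask)
```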

Hyperparameter optimization

The multi-class SegResNet³³ and Transformer UNet (UNETR in MONAI)³⁴ models were trained with varied dropout, maintaining all other default parameters. Both the multi-class Residual UNet³⁵ and multi-class Attention UNet³⁶ architectures were trained with 5 convolutional layers (corresponding to 16, 32, 64, 128, and 256 channels), and a stride length of 2. The multi-class nn-UNet (DynUNet in MONAI)³⁷ architecture was trained with a kernel size of 3 and a stride length of 2. The number of residual units was varied only for the multi-class Residual UNet architecture. For each model architecture, both Adam and SGD optimizers were considered. The learning rate was varied from 0.001 to 0.01 for each optimizer, as shown in Table 1. Dropout of 0.1–0.4 was introduced for each model to decrease over-fitting.
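A search over these settings can be enumerated as a simple grid. The sketch below is illustrative only: the exact values tried are listed in Table 1, so the two learning rates (the endpoints of the stated range) and the dropout step of 0.1 are assumptions.

```python
from itertools import product

optimizers = ["Adam", "SGD"]
learning_rates = [0.001, 0.01]   # endpoints of the stated range (assumed grid)
dropouts = [0.1, 0.2, 0.3, 0.4]  # assumed step of 0.1 within the stated 0.1-0.4 range

# One configuration per combination of optimizer, learning rate, and dropout
grid = [
    {"optimizer": opt, "lr": lr, "dropout": p}
    for opt, lr, p in product(optimizers, learning_rates, dropouts)
]
```

Each dictionary in `grid` would then parameterize one training run per architecture.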

Model training

During model training, Dice loss was fed through backpropagation to update model weights and Dice metric was monitored to assess model performance. An average Dice metric value was calculated for each epoch by averaging the class-specific Dice metric across every class except background. The model was allowed to run for 50 epochs during training, and early stopping was applied to monitor the validation loss with a patience of 5 epochs. The model checkpoint with the best average Dice metric on the validation set during training was saved. Predictions were generated by feeding inputs through the trained model, applying softmax activation along the class dimension and reporting the argmax value along the class dimension to determine the predicted class of each pixel in an image. For each prediction image, the Dice metric was analyzed for individual classes, with the anterior and posterior cervix classes considered the most important for shape analysis of the cervix.
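The prediction step (softmax, then argmax along the class dimension) and the background-excluded average Dice metric can be written out explicitly. This is a NumPy sketch rather than the MONAI implementation used in the study. The function names are hypothetical, and skipping classes that are absent from both prediction and target is a convention chosen for this example.

```python
import numpy as np


def predict_classes(logits: np.ndarray) -> np.ndarray:
    """logits: (C, H, W). Apply softmax over the class axis, then take the argmax."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
    return probs.argmax(axis=0)


def mean_foreground_dice(pred: np.ndarray, target: np.ndarray, num_classes: int = 5) -> float:
    """Average the per-class Dice over classes 1..num_classes-1 (background = 0)."""
    scores = []
    for c in range(1, num_classes):
        p, t = pred == c, target == c
        denom = p.sum() + t.sum()
        if denom == 0:
            continue  # class absent from both masks: skipped (assumed convention)
        scores.append(2.0 * np.logical_and(p, t).sum() / denom)
    return float(np.mean(scores)) if scores else float("nan")


# Two-pixel example with three classes
logits = np.array([[[2.0, 0.1]], [[1.0, 3.0]], [[0.0, 0.2]]])
classes = predict_classes(logits)

# Class 1 Dice = 2*1/(1+2) = 2/3, class 2 Dice = 1, mean = 5/6
target = np.array([[0, 1, 1], [2, 2, 0]])
pred = np.array([[0, 1, 0], [2, 2, 0]])
score = mean_foreground_dice(pred, target)
```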

Model selection

Training identified the best-performing models for each architecture, iteratively checking the performance on the validation dataset after each training step. The predicted labels for the models under consideration were evaluated against ground truth using 3 similarity measures: two that compare the pixel overlap between one image and another (Dice metric, Jaccard index), and one that quantifies the degree of mismatch by assessing how far away one image representation is from another (Hausdorff distance).
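For a single class, the Jaccard index and Hausdorff distance can be computed directly from two binary masks. The sketch below is a brute-force NumPy illustration with hypothetical function names. It measures the Hausdorff distance over all foreground pixels rather than boundary pixels only, which gives the same value but is slower than library implementations.

```python
import numpy as np


def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two non-empty boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union)


def hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Hausdorff distance, in pixels, between two foreground pixel sets."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    # d[i, j] = Euclidean distance between pixel i of a and pixel j of b
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


# Ground truth: 2 x 2 square. Prediction: the same square plus one extra column.
gt_mask = np.zeros((5, 5), dtype=bool)
gt_mask[1:3, 1:3] = True
pred_mask = np.zeros((5, 5), dtype=bool)
pred_mask[1:3, 1:4] = True

j = jaccard(gt_mask, pred_mask)     # 4 / 6
hd = hausdorff(gt_mask, pred_mask)  # extra column is 1 pixel from the square
```

For this pair the Dice metric is 2·4/(4 + 6) = 0.8, consistent with the identity J = D/(2 − D) = 0.8/1.2 = 2/3.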

Cervical length feature extraction

In select images, such as the atypical cervix with a large cervical funnel shown in Fig. 13b, there are some disjointed regions and therefore multiple instances of the same class. Anatomically, this is an impossibility, and therefore a post-processing step is warranted to correct for small disconnected regions. All subsequent model analysis was performed on the raw segmentation predictions, but minimal post-processing steps were performed to remove these disjointed regions or “islands” from the segmentation masks before applying the cervical length algorithms. This was done by examining multiple instances of the same class and preserving only the largest instance of a given class, provided that the smaller instances are no larger than 25% of the area of the largest instance. To reassign these smaller instances, the post-processing step considered the most prominent class type bordering the region of interest by surveying the perimeter pixels of neighboring classes.
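The island-removal step can be sketched as a connected-component search followed by a vote among bordering pixels. This is a simplified illustration and not the released code. The 4-connectivity, the restriction to foreground classes, and all function names are assumptions, while the 25% area threshold follows the text.

```python
import numpy as np
from collections import Counter, deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity (assumed)


def connected_components(binary: np.ndarray) -> list:
    """Return a list of components, each a list of (row, col) pixels."""
    seen = np.zeros(binary.shape, dtype=bool)
    h, w = binary.shape
    components = []
    for r0, c0 in zip(*np.nonzero(binary)):
        if seen[r0, c0]:
            continue
        seen[r0, c0] = True
        queue, pixels = deque([(r0, c0)]), []
        while queue:  # breadth-first flood fill
            r, c = queue.popleft()
            pixels.append((r, c))
            for dr, dc in NEIGHBORS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and binary[rr, cc] and not seen[rr, cc]:
                    seen[rr, cc] = True
                    queue.append((rr, cc))
        components.append(pixels)
    return components


def remove_islands(mask: np.ndarray, num_classes: int = 5, max_ratio: float = 0.25) -> np.ndarray:
    """Reassign small islands of each foreground class to the dominant bordering class."""
    out = mask.copy()
    h, w = mask.shape
    for cls in range(1, num_classes):  # background (0) is left untouched here
        comps = sorted(connected_components(out == cls), key=len, reverse=True)
        if len(comps) < 2:
            continue
        largest = len(comps[0])
        for island in comps[1:]:
            if len(island) > max_ratio * largest:
                continue  # too large to be treated as an island
            border = Counter()
            for r, c in island:
                for dr, dc in NEIGHBORS:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w and out[rr, cc] != cls:
                        border[int(out[rr, cc])] += 1
            if border:
                out[tuple(zip(*island))] = border.most_common(1)[0][0]
    return out


# Toy mask: class 1 on top, class 2 below, plus a 1-pixel class-2 island inside class 1
toy = np.ones((6, 6), dtype=int)
toy[3:] = 2
toy[1, 1] = 2
cleaned = remove_islands(toy)
```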

We developed custom Python scripts to automatically measure CL from segmentation masks (Fig. 2), leveraging the geometry of the cervix and clinically recognized anatomical landmarks. If the cervical canal class label is present, the algorithm starts by finding the internal os with the following method: 1) The algorithm locates the superior-most boundary of the cervical canal + potential space class (shown in green). 2) These superior (or leftmost) points are fit to a line, and the image is rotated such that this line is oriented vertically. 3) The algorithm then counts the number of green points per column and calculates the derivative, which indicates how quickly the width of the cervical canal + potential space class changes. 4) The derivative is graphed lengthwise across the image, and the first point where the derivative plateaus below a preset threshold is taken as the internal os location. Alternatively, if the cervical canal + potential space class is not present in the prediction image, the internal os location is derived from the leftmost point with adjacent anterior and posterior cervical tissue. The external os is then identified as the rightmost point of adjacent anterior and posterior cervical tissue. The cervical trace is finally taken as the adjacent anterior and posterior tissue between the internal and external os (Fig. 6). If a mucus plug is visible in the image and is labeled as the cervical canal class, the vertical midpoint of each column is taken as the point along the cervical trace. Finally, the algorithm returns both a visual guidance tool, in which cervical length is traced atop the underlying TVUS image, and a numerical value for CL.
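Steps 3 and 4, and the midpoint-based trace, can be illustrated on a synthetic mask. The sketch below is a simplification under several assumptions: the mask is already rotated so that the canal runs left to right, the canal class index (`CANAL = 4`), the threshold of 1 pixel per column, and the pixel spacing are hypothetical, and the trace ends at the last canal column instead of at the external os as defined above. The authors' implementation is in the TVUS_CervicalLength repository.

```python
import numpy as np

CANAL = 4  # hypothetical class index for "cervical canal + potential space"


def internal_os_column(mask: np.ndarray, threshold: float = 1.0) -> int:
    """Return the first canal column where the canal width stops changing quickly.

    Counts canal pixels per column, differentiates along the columns, and
    returns the first canal column whose absolute derivative is below threshold.
    """
    width = (mask == CANAL).sum(axis=0).astype(float)
    cols = np.nonzero(width)[0]
    deriv = np.abs(np.gradient(width))
    for col in cols:
        if deriv[col] < threshold:
            return int(col)
    return int(cols[0])  # fallback: no plateau found


def trace_length(mask: np.ndarray, start_col: int, end_col: int, mm_per_px: float = 1.0) -> float:
    """Length of the polyline through the vertical midpoint of the canal in each column."""
    points = []
    for col in range(start_col, end_col + 1):
        rows = np.nonzero(mask[:, col] == CANAL)[0]
        if rows.size:
            points.append((col, rows.mean()))
    pts = np.asarray(points, dtype=float)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum() * mm_per_px)


# Synthetic mask: a funnel narrowing over columns 0-4, then a 1-pixel-wide canal to column 24
mask = np.zeros((20, 30), dtype=int)
for col, half in enumerate([5, 4, 3, 2, 1]):
    mask[10 - half:10 + half + 1, col] = CANAL
mask[10, 5:25] = CANAL

os_col = internal_os_column(mask)
length = trace_length(mask, os_col, 24)
```

Because the derivative is estimated with central differences, the detected column lands one pixel past the point where the funnel ends in this toy case. A real implementation would need to choose the threshold and smoothing with that in mind.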

Model validation from predicted CL vs. ground truth CL

To further validate the model and compare it against existing clinical standards, the model-predicted anatomy labels were fed into the CL extraction algorithm to return the automatically predicted CL, which was then compared against sonographer-reported CL. Quantitative comparisons in CL were made by evaluating percent error (PE) between sonographer-reported CL and algorithmically extracted CL. Similarly, Bland-Altman plots, which are commonly used to visualize the degree of agreement between the clinical gold standard and a new measurement technique, were used to identify possible systematic bias introduced by our algorithm.
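Both comparisons reduce to a few array operations. In the sketch below, the CL values are hypothetical and serve only to show the arithmetic. Taking PE as an absolute value relative to the sonographer-reported CL, and using the conventional mean difference ± 1.96 SD for the Bland-Altman limits of agreement, are assumptions about details the text does not spell out.

```python
import numpy as np


def percent_error(reference: np.ndarray, predicted: np.ndarray) -> np.ndarray:
    """Absolute percent error relative to the reference measurement."""
    return 100.0 * np.abs(predicted - reference) / reference


def bland_altman(reference: np.ndarray, predicted: np.ndarray):
    """Return (bias, lower limit, upper limit) as mean difference +/- 1.96 SD."""
    diff = predicted - reference
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return float(bias), float(bias - 1.96 * sd), float(bias + 1.96 * sd)


# Hypothetical CL values in mm, for illustration only
sonographer_cl = np.array([30.0, 35.0, 40.0, 25.0])
algorithm_cl = np.array([33.0, 35.0, 38.0, 26.0])

pe = percent_error(sonographer_cl, algorithm_cl)
bias, lower, upper = bland_altman(sonographer_cl, algorithm_cl)
```

A bias far from zero would indicate that the algorithm systematically over- or under-estimates CL relative to the sonographer.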

Statistical tests

Given the small size of the reserved test dataset, the performance metrics cannot be assumed to follow a normal distribution. Therefore, non-parametric statistical tests were used to test the null hypothesis that the performance metrics for each model were drawn from the same underlying distribution, with significance assessed at p < 0.05 and p < 0.01. A one-way paired Friedman test was used to detect differences between the performance across all models. The Friedman test indicated a difference between mean performance metrics across all model types. A paired multiple comparison Wilcoxon Signed-Rank test with Bonferroni corrections was used to compare the performance between each model in terms of Dice metric, Hausdorff distance, and Jaccard index.
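This testing sequence can be reproduced with SciPy, assuming it is installed. The sketch below uses synthetic Dice scores and hypothetical model names, and it applies the Bonferroni correction by multiplying each pairwise p-value by the number of comparisons. The text does not state which software performed the tests.

```python
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_images = 25  # arbitrary size for this illustration

# Synthetic per-image Dice scores for three hypothetical models on the same images
scores = {
    "model_a": rng.uniform(0.70, 0.80, n_images),
    "model_b": rng.uniform(0.75, 0.85, n_images),
    "model_c": rng.uniform(0.80, 0.90, n_images),
}

# Omnibus test: do the paired samples differ across models?
stat, p_friedman = friedmanchisquare(*scores.values())

# Pairwise Wilcoxon signed-rank tests with Bonferroni correction
pairs = list(combinations(scores, 2))
adjusted = {}
for a, b in pairs:
    _, p = wilcoxon(scores[a], scores[b])
    adjusted[(a, b)] = min(1.0, p * len(pairs))
```

The pairwise tests are typically interpreted only when the omnibus Friedman test rejects the null hypothesis.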

Hardware and software

All models were run on a single Tesla V100-32GB GPU. Model training was performed in Python 3.9, using PyTorch and Medical Open Network for Artificial Intelligence (MONAI, a library which provides domain-specific capabilities for medical imaging: https://monai.io/) packages.

Data availability

The CLEAR dataset was requested from Perinatal Quality Foundation (PQF), under a data use agreement, and is not publicly available. The Intermountain Health dataset contains PHI and cannot be made publicly available at this time.

Code availability

Code for training and evaluating model performance is available at https://github.com/cumcrad/MulticlassSegmentationTVUS and CL algorithm code is available at https://github.com/cumcrad/TVUS_CervicalLength.

References

Ely, D. M. & Driscoll, A. K. Infant Mortality in the United States, 2022: Data From the Period Linked Birth/Infant Death File. In National Vital Statistics Reports [Internet], https://doi.org/10.15620/cdc/157006 (National Center for Health Statistics (US), 2024).
Blencowe, H. et al. National, regional, and worldwide estimates of preterm birth rates in the year 2010 with time trends since 1990 for selected countries: a systematic analysis and implications. Lancet 379, 2162–2172 (2012).
Callaghan, W. M., MacDorman, M. F., Rasmussen, S. A., Qin, C. & Lackritz, E. M. The contribution of preterm birth to infant mortality rates in the United States. Pediatrics 118, 1566–1573 (2006).
Institute of Medicine (US) Committee on Understanding Premature Birth and Assuring Healthy Outcomes. Preterm Birth: Causes, Consequences, and Prevention. The National Academies Collection: Reports funded by National Institutes of Health (National Academies Press, 2007).
Kassabian, S., Fewer, S., Yamey, G. & Brindis, C. D. Building a global policy agenda to prioritize preterm birth: a qualitative analysis on factors shaping global health policymaking. Gates Open Res. 4, 65 (2020).
Myers, K. M. et al. The mechanical role of the cervix in pregnancy. J. Biomech. 48, 1511–1523 (2015).
Vink, J. & Feltovich, H. Cervical etiology of spontaneous preterm birth. Semin. Fetal Neonatal Med. 21, 106–112 (2016).
Louwagie, E. M. et al. The biomechanical evolution of the uterus and cervix and fetal growth in human pregnancy. npj Women’s Health 2, 33 (2024).
Louwagie, E. M. et al. Parametric Solid Models of the At-Term Uterus From Magnetic Resonance Images. J. Biomech. Eng. 146, 071008 (2024).
Westervelt, A. R. et al. A parameterized ultrasound-based finite element analysis of the mechanical environment of pregnancy. J. Biomech. Eng. 139, 051004 (2017).
Fernandez, M. et al. Investigating the mechanical function of the cervix during pregnancy using finite element models derived from high-resolution 3D MRI. Comp. Methods Biomech. Biomed. Eng. 19, 404–417 (2016).
Bathe, K.-J. Numerical methods in finite element analysis (Prentice-Hall, Englewood Cliffs, N.J.).
Kuusela, P. et al. Second trimester cervical length measurements with transvaginal ultrasound: A prospective observational agreement and reliability study. Acta Obstet. Gynecol. Scand. 99, 1476–1485 (2020).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, (eds Navab, N., Hornegger, J., Wells, W. M. & Frangi, A. F.) vol. 9351, 234–241 (Springer International Publishing, 2015).
Looney, P. et al. Fully automated, real-time 3D ultrasound segmentation to estimate first trimester placental volume using deep learning. JCI Insight 3, e120178 (2018).
Andrews, W. W. What is new in preterm birth prevention? Important recent articles. Obstet. Gynecol. 122, 390–392 (2013).
Qi, H., Collins, S. & Noble, A. Weakly Supervised Learning of Placental Ultrasound Images with Residual Networks. In Medical Image Understanding and Analysis, (eds Valdés Hernández, M. & González-Castro, V.) 723, 98–108, (Springer International Publishing, 2017).
Diniz, P. H. B., Yin, Y. & Collins, S. Deep learning strategies for ultrasound in pregnancy. Eur. Med. J. Reprod. Health 6, 73–80 (2020).
Włodarczyk, T. et al. Spontaneous Preterm Birth Prediction Using Convolutional Neural Networks. In Medical Ultrasound, and Preterm, Perinatal and Paediatric Image Analysis, Lecture Notes in Computer Science, (eds Hu, Y. et al.) 274–283, (Springer International Publishing, 2020).
Włodarczyk, T. et al. Machine learning methods for preterm birth prediction: a review. Electronics 10, 586 (2021).
Dagle, A. B. et al. Automated Segmentation of Cervical Anatomy to Interrogate Preterm Birth. In Perinatal, Preterm and Paediatric Image Analysis, Lecture Notes in Computer Science, (eds Licandro, R., Melbourne, A., Abaci Turk, E., Macgowan, C. & Hutter, J.) 48–59, (Springer Nature Switzerland, 2022).
Pegios, P. et al. Leveraging Shape and Spatial Information for Spontaneous Preterm Birth Prediction: In Proc 4th International Workshop of Advances in Simplifying Medical Ultrasound (ASMUS) vol. 14337, 57–67, (Springer Science and Business Media Deutschland GmbH, 2023).
Kwon, H. et al. Deep learning-based automated measurement of cervical length in transvaginal ultrasound images of pregnant women. IEEE. J. Biomed. Health Inform. 1–10, https://doi.org/10.1109/JBHI.2024.3433594 (2024).
Taha, A. A. & Hanbury, A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 15, 29 (2015).
Crum, W. R., Camara, O. & Hill, D. L. G. Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Trans. Med. Imaging 25, 1451–1461 (2006).
Gerig, G., Jomier, M. & Chakos, M. Valmet: A New Validation Tool for Assessing and Improving 3D Object Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2001, (eds Niessen, W. J. & Viergever, M. A.) 516–523, (Springer, 2001).
Zhang, D. & Lu, G. Review of shape representation and description techniques. Pattern Recognit. 37, 1–19 (2004).
Croitoru, F.-A., Hondru, V., Ionescu, R. T. & Shah, M. Diffusion Models in Vision: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 10850–10869, https://doi.org/10.1109/TPAMI.2023.3261988 (2023).
Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), 2242–2251, https://doi.org/10.1109/ICCV.2017.244 (2017).
Delagarza, A. et al. In-Silico Models of In-Vivo Cervical Stiffness Measurements for Improving Preterm Birth Prediction. In SB3C Conference Proceedings, 1294–1295 (Lake Geneva, 2024).
ISUOG Practice Guidelines: Role of ultrasound in the prediction of spontaneous preterm birth. CLEAR. https://www.isuog.org/static/d88e5dff-ced3-43ee-aa2229c2679b9484/ISUOG-Practice-Guidelines-ultrasoundin-preterm-birth.pdf#page=19&zoom=100,88,610.
Telea, A. An image inpainting technique based on the fast marching method. J. Graph. Tools 9, 23–34 (2004).
Myronenko, A. 3D MRI Brain Tumor Segmentation Using Autoencoder Regularization. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries (eds Crimi, A. et al.) 311–320 (Springer International Publishing, Cham, 2019).
Hatamizadeh, A. et al. UNETR: Transformers for 3D medical image segmentation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 1748–1758, (2022).
Kerfoot, E. et al. Left-Ventricle Quantification Using Residual U-Net. In Statistical Atlases and Computational Models of the Heart. Atrial Segmentation and LV Quantification Challenges, Lecture Notes in Computer Science, (eds Pop, M. et al.) 371–380, https://doi.org/10.1007/978-3-030-12029-0_40 (Springer International Publishing, 2019).
Schlemper, J. et al. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 53, 197–207 (2019).
Isensee, F. et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2021).

Acknowledgements

Images were provided by the Perinatal Quality Foundation and were collected as part of the CLEAR training program. We would also like to thank Chai-Ling Nhan-Chang, MD of Columbia University for her feedback and insight during the ideation of the segmentation labels for this project, Lindsey Carlson, PhD for her guidance and assistance with data acquisition, and Keri Johnson, RDMS for her assistance with data acquisition and annotation. This study was supported in part by the National Science Foundation Graduate Research Fellowship Grant DGE-2036197 to Alicia B. Dagle, by Columbia University SEAS Interdisciplinary Research Seed (SIRS) Funding, and by The Iris Fund. The funders had no role in study design, data collection, data analysis, data interpretation, or writing of the manuscript.

Author information

Authors and Affiliations

Department of Mechanical Engineering, Columbia University, New York, NY, USA
Alicia B. Dagle & Kristin M. Myers
Department of Medical Physics, Atlantic Health System, Morristown, NJ, USA
Yucheng Liu
Department of Computer Science, Columbia University, New York, NY, USA
Madeline Skeel & Gabriel G. Trigo
Department of Obstetrics and Gynecology, National Maternity Hospital and University College, Dublin, Ireland
David Crosby
Department of Obstetrics and Gynecology, North Memorial Health System, Robbinsdale, MN, USA
Helen Feltovich
Department of Obstetrics and Gynecology, Tufts Medical Center, Boston, MA, USA
Michael House
Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
Qi Yan
Department of Radiology, Columbia University Irving Medical Center, New York, NY, USA
Sachin Jambawalikar

Authors

Alicia B. Dagle
Yucheng Liu
Madeline Skeel
Gabriel G. Trigo
David Crosby
Helen Feltovich
Michael House
Qi Yan
Kristin M. Myers
Sachin Jambawalikar

Contributions

A.D., K.M., S.J., M.H. and H.F. conceptualized the study. H.F. curated the out-of-distribution dataset. A.D., K.M., S.J., M.H. and H.F. acquired funding. Y.L. contributed to initial model implementation. A.D. organized data collection and oversaw labeling tasks. M.H. and D.C. provided anatomy labels for the in-distribution dataset, and M.H. provided labels for the out-of-distribution dataset. A.D., M.S. and G.T. developed working prototypes for the cervical length algorithm. Q.Y. oversaw statistical methodology. A.D. trained the models, analyzed the results, and created visuals. K.M. and S.J. supervised the research. The original manuscript draft was prepared by A.D. All authors reviewed the manuscript.

Corresponding author

Correspondence to Sachin Jambawalikar.

Ethics declarations

Competing interests

A.D., Y.L., H.F., M.H., Q.Y., K.M. and S.J. have a pending patent application on this technology. All the other authors declare that they have no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Dagle, A.B., Liu, Y., Skeel, M. et al. Generating cervical anatomy labels using a deep ensemble multi-class segmentation model applied to transvaginal ultrasound images. npj Womens Health 3, 28 (2025). https://doi.org/10.1038/s44294-025-00075-x

Received: 04 November 2024
Accepted: 25 April 2025
Published: 10 May 2025
Version of record: 10 May 2025
DOI: https://doi.org/10.1038/s44294-025-00075-x