Introduction

Preterm birth (PTB), defined as delivery before 37 weeks of gestation, is the leading cause of perinatal death1,2 and a major contributor to long-term disabilities3, where earlier gestational age at birth corresponds to longer hospital stays, increased risk of long-term sequelae, and increased medical costs4. With persistently high global rates of PTB and 15 million premature births yearly, PTB remains a major public health problem with high emotional and financial burden2,5. Despite significant advances in prenatal and perinatal care, 80% of preterm birth cases are considered spontaneous PTB (sPTB). Unlike induced, medically-indicated delivery to address maternal or fetal complications, sPTB is defined by premature labor, dilation (cervical insufficiency) or rupture of fetal membranes leading to preterm delivery. A major challenge to the treatment and prevention of sPTB is the lack of accurate diagnostic methods.

The cervix is a complex, 3D structure6 that, in normal pregnancy, maintains the growing fetus in utero and safely remodels to allow for delivery at term. This process, though certainly driven by molecular processes, is fundamentally biomechanical. Premature cervical shortening, a common feature of sPTB7 that is captured by TVUS-CL assessment (Fig. 1), can be thought of as structural biomechanical “failure” of the tissue. Biomechanical models can further explain 3D tissue behaviors by determining how overall shape, volume, intrinsic material properties, and alignment between the cervix/uterus against the load of the growing fetus affect structural biomechanical performance8,9,10,11. These 3D biomechanical models, sometimes called digital twins, are powered by finite element analysis. Finite element analysis is a numerical computation method that discretizes 3D geometry and solves the equations of equilibrium to investigate tissue responses to mechanical forces given set boundary conditions and tissue material properties12. A large amount of robust, clinical data describing the shape and size of maternal anatomy is needed to leverage these digital twins and quantify cervical structural biomechanical performance in pregnancy.

Fig. 1: TVUS-measured CL for sPTB screening.

The clinical standard for assessing the risk of sPTB is CL measured from TVUS images. Here, a long, closed cervix (left) is shown next to a short, funneled cervix (right). The yellow dashed line measures the CL along the inner canal of the cervix. Based on cervical presentation, the patient with a long cervix (left) may be considered low risk of sPTB, whereas the patient with the short cervix (right) may be considered high risk of sPTB.

While our group has put forth digital twins of pregnancy9,11, these are inherently low-throughput and restricted by scarce labeled data. Because pregnancy is a highly protected condition, public datasets of images, let alone images with labeled maternal anatomy, are essentially nonexistent. This is dissimilar to parallel fields such as cardiac biomechanics, which have several highly annotated open source datasets in multiple image modalities and host collaborative data challenges to propel the field forward (https://www.cardiacatlas.org/). Instead, pregnancy biomechanics research currently relies on the generosity of clinicians who volunteer to provide time-consuming labels of ultrasounds, which are required to build these insightful 3D biomechanical models. If an artificial intelligence (AI) platform can provide fast, reliable measurements of maternal anatomy, this could revolutionize research, enabling more detailed finite element models of pregnancy and faster turnover of simulation results.

Labeling images with the level of detail necessary to extract relevant ultrasound biomarkers (measurements derived from US) is time-consuming, labor-intensive, and subject to inter-observer variation13. To combat this, convolutional neural networks (CNNs) are increasingly applied to image segmentation14. In pregnancy, CNNs have been used to segment the placenta and fetal biometry15,16,17,18, which is important for understanding fetal health, but the application of CNNs to the cervix remains somewhat limited. Wlodarczyk et al. used a single-class UNet to segment curves approximating cervical shape, from which CL and AUCA measurements were then extracted. This group further utilized traditional single-class UNet, and DeepLabV3 for cervical segmentations19 coupled with original TVUS images in PTB prediction models20. This work demonstrates the capability of a machine to learn single features, such as cervical length, and reproduce the clinical workflow linking cervical length to sPTB prediction. However, this research did not delve into multi-class networks and other more-complex architectures, and the segmentations provided were only designed to capture the part of the cervix along the cervical canal, not the outer boundaries of the cervical tissue which are important for cervical shape classification and 3D model development.

Our group has previously employed a deep learning framework to segment the entirety of anterior and posterior cervical tissue differentially using multi-class segmentation21. This work laid a foundation for a more complex, multi-class segmentation scheme, which can be used to characterize cervical shape in more detail. High label reproducibility (reported as Dice metric, where 1 is the optimal value) indicates that this work can ultimately be used to characterize the shape of the cervix for 3D modeling applications, as well as prediction models of sPTB21. Expanding upon our work, Pegios et al. labeled cervical outlines, trained a DTU-Net segmentation model to extract CL, and trained an additional SA-SonoNet classification model to predict PTB from CL and cervical outlines22. The findings from this research support our hypothesis that diverse biomechanical factors are important in understanding sPTB outcomes. However, the positive predictive value (PPV) of SA-Sononet is indistinguishable from that of TVUS-CL screening for the reported data; this indicates that more research is needed to understand the implications of these new cervical boundary features. Even though this model was trained on a large dataset, it is difficult to compare the prediction results to the existing standard of care (TVUS-CL screening, with different CL cutoffs depending on patient demographics) for sPTB because the parity and sPTB history of the patient population are not reported. The culmination of this work leaves a gap in the field to provide highly-accurate, whole-shape segmentations of cervical features that generalize to out-of-distribution data, and that can be used to derive cervical geometries as inputs to 3D digital twins of pregnancy. These models may ultimately inform machine learning (ML) based prediction models of sPTB, and we hypothesize that the biomechanical insight will push these models to outperform our existing clinical standards for sPTB screening.

The work presented here demonstrates multi-class segmentation as an automated tool to provide pixel-by-pixel predictions and identify boundaries between neighboring anatomical tissue regions. Specifically, we explored patient variations in cervical geometry during the second and third trimesters, and we developed a novel tool to segment the entire 2D cervical region from TVUS images into multiple anatomical classes, including anterior cervical tissue, posterior cervical tissue, and cervical canal space21. We improved on existing work by training and tuning additional model architectures, combining several architectures into an ensemble model, and leveraging the predicted masks to demonstrate how cervical features may be extracted in an explainable way, using predefined anatomical landmarks. In this work, CL is used as a representative example and an additional data point to demonstrate model accuracy, but predicting CL itself is not the final goal of our work. Rather than relying on deep-learning-based methods to measure CL as a singular feature of cervical anatomy and predictor of sPTB, this work highlights the importance of establishing and leveraging anatomical boundary conditions of the cervix to directly inform cervical measurements in an explainable fashion. This approach will enable future algorithms to extract structural features beyond CL that are likely to be biomechanically relevant to delivery10, and will be explored in future 3D simulations and prediction models of sPTB.

Results

Dataset overview

The Cervical Length Education and Review (CLEAR) dataset was divided into training, validation, and test sets using a 70:20:10 split. A separate out-of-distribution dataset was also used for testing. Images were labeled according to maternal anatomy, shown in Fig. 2. A more detailed description is available in the Methods section.

Fig. 2: CL model workflow.

The original, raw TVUS image (column 1) is fed to the segmentation model to generate a predicted mask of the labeled anatomy (column 2), which is then provided as input to the CL extraction algorithm. In an intermediate step, the internal and external os are identified from the segmentation mask to define the cervical trace (column 3). This trace is then overlaid on the original US image (column 4) to measure CL.

Model training and selection

All three similarity metrics (Dice, Jaccard, Hausdorff) indicated that SegResNet, Residual UNet, Attention UNet, and nn-UNet were the highest performing models. The hyperparameters found to optimize model performance are indicated by a preceding asterisk (*) in Table 1 with Dice metrics and Hausdorff distance reported for each individually optimized model in Supplementary Tables 1 and 2. Basic (or vanilla) UNet and transformer UNet also offered strong model performance, but had lower segmentation overlap scores. The Transformer UNet performed reasonably well, but the boundaries suffered from a pixelation-like quality (Supplementary Figs. 1 and 2). Each top-performing model (SegResNet, Residual UNet, Attention UNet, and nn-UNet) differed with statistical significance (adjusted p < 0.01) from the less well-performing models (basic UNet and transformer UNet). All comparisons were made with consistent results across Dice metric, Hausdorff distance, and Jaccard index. Hausdorff distance indicated a difference between basic UNet and transformer UNet (p < 0.01), whereas Dice metric and Jaccard index indicate no difference between the performance of basic UNet and transformer UNet.

Table 1 The depicted hyperparameter space was explored during model training on the CLEAR dataset

We further compare model performance on the reserved (CLEAR) test set by plotting Dice metric and Hausdorff distance for each class across all model types (Fig. 3). These models were plotted in descending order from left to right with respect to time required for training (Supplementary Table 3). The 4 best performing architectures (SegResNet, Attention UNet, nn-UNet, and Residual UNet) also had the lowest training time (Supplementary Table 3), indicating sufficiency of these less complex models. Our previous work explored cross validation of individual models with this framework to demonstrate that model improvements were consistent across random seeds21. More detailed visual and numerical results of individual model performance within the validation set are available in Supplementary Figs. 1 and 3.

Fig. 3: Model performance on CLEAR and out-of-distribution test data.

All models are compared using class-specific Dice metrics (left) and Hausdorff distances measured in pixels (right), averaged across all images in the respective dataset; the top panel shows the reserved test set from the original dataset (N = 26, CLEAR) and the bottom panel shows the out-of-distribution test set (N = 29, IH). Error bars indicate 1 standard deviation from the mean across images in the test set. Models are ordered from left to right in terms of descending time required for model training.

Out-of-distribution dataset

To interrogate generalizability, models were evaluated on the separate out-of-distribution cohort from IH, comparing performance using class-specific Dice metric (Fig. 3). As expected with application of a model to an out-of-distribution test dataset, all models experienced a small performance drop compared to the CLEAR reserved test set. The 4-unit Residual UNet (shorthand ResUNet4) likely over-fit the CLEAR dataset, evidenced by the large drop in segmentation performance. Since this demonstrates lack of generalizability, it was excluded from further analysis. The 4 best performing models maintained high Dice metrics of approximately 0.8 for anterior and posterior cervix classes. Of these, the 2-unit Residual UNet (shorthand ResUNet2) had the highest model performance on the out-of-distribution test dataset (Dice metrics: 0.81 and 0.85 on the anterior and posterior cervix, respectively), the nn-UNet (0.79 and 0.84) and Attention UNet (0.79 and 0.82) performed similarly well, and the SegResNet (0.76 and 0.80) performed slightly less well. More detailed class-specific Dice metrics and Hausdorff distances are available in Supplementary Tables 1 and 2.

Final model selection

Among the 4 best performing individual models, no single model outperformed the others on the reserved or out-of-distribution test sets. Therefore, an ensemble approach was used to leverage the strength of each model and mitigate pixel-wise segmentation errors of individual predictions, thereby improving overall performance and reducing risk of over-fitting to the training dataset. This method concatenates all 4 best-performing model outputs and employs pixel-wise voting to determine the final model output. Per majority voting, our ensemble model incorporated 3 out of the 4 best performing models. This demonstrated an improvement in the Dice metric compared to individual models (Supplementary Tables 1 and 4).
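The pixel-wise voting step can be illustrated with a minimal sketch. It assumes each model outputs a 2D mask of integer class labels; the label encoding and the fallback to background when no label wins a strict majority are illustrative assumptions, not details reported here.

```python
from collections import Counter

# Hypothetical label encoding (an assumption for illustration only):
# 0 = background, 1 = anterior cervix, 2 = posterior cervix,
# 3 = cervical canal, 4 = bladder


def majority_vote(masks, fallback=0):
    """Fuse equally sized 2D label masks by pixel-wise majority vote.

    A pixel keeps a label only if more than half of the models agree;
    otherwise it is assigned `fallback` (assumed here to be background).
    """
    n_models = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    fused = [[fallback] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            votes = Counter(mask[r][c] for mask in masks)
            label, count = votes.most_common(1)[0]
            if count * 2 > n_models:  # strict majority
                fused[r][c] = label
    return fused


# Three toy 2x3 predictions standing in for three segmentation models.
mask_a = [[0, 1, 1], [2, 2, 3]]
mask_b = [[0, 1, 2], [2, 2, 3]]
mask_c = [[0, 1, 1], [2, 0, 4]]
ensemble_mask = majority_vote([mask_a, mask_b, mask_c])
```

With three voters, a label needs at least two votes to be kept, so an isolated error made by one model is overruled by the other two.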

Attention UNet, nn-UNet and SegResNet were combined in an ensemble model that was used for final evaluation. Similar model performance was observed across all 4 combinations of 3 models (Supplementary Tables 4 and 5), but this combination achieved a higher Dice metric for the anterior cervix on the reserved test set. Hausdorff distance, which is more representative of performance for small segmentation classes, also indicates improved bladder performance (smaller Hausdorff distance) for the ensemble of Attention UNet, nn-UNet and SegResNet compared to other combinations (Supplementary Table 5). This ensemble model was thus used to generate predictions for the reserved and out-of-distribution test sets, both of which demonstrated that the model generalizes well to new data. When applied to the reserved test set, the model performed well across diverse cervical presentations such as cervices that were of average length/width, curved, linear, long, short/squat, funneled, and adjacent to a full bladder (Fig. 4). Across all reserved test set images, the model achieved a high Dice metric for the anterior and posterior cervix of roughly 0.93 and 0.91, respectively.

Fig. 4: Ensemble model segmentation on CLEAR test data.

Within the original reserved test set, ground truth and predictions from the combined model are shown, calculated by majority vote of Attention UNet, nn-UNet and SegResNet. The model segmented cervical tissue well across diverse cervical presentations including: (a) a large cervical funnel, (b) an average length/width cervix, (c) a curved cervix, (d) a linear cervix, (e) a long cervix, (f) a short/squat cervix, and (g) a cervix with an adjacent full bladder. Dice metric and Hausdorff distance (pixels) are reported in color-coded class values below each image pair.

Evaluation on the out-of-distribution dataset similarly indicated high model performance for the aforementioned diverse cervical shapes as well as in the presence of fetal anatomy near the internal os (Fig. 5). On this out-of-distribution dataset, the Dice metric dropped slightly to 0.80 and 0.85 for the anterior and posterior cervix class, respectively. However, visual inspection of the prediction images confirmed high model performance.

Fig. 5: Ensemble model segmentation on out-of-distribution test data.

Within the out-of-distribution test set, ground truth and predictions from the combined model are shown, calculated by majority vote of Attention UNet, nn-UNet and SegResNet. When evaluated on this previously unseen dataset, the model performed well across diverse cervical presentations including: (a) an average length/width cervix, (b) a short/squat cervix, (c) a long/curved cervix, (d) fetal anatomy placed near the internal os of the cervix, and (e) a full bladder pressing on the anterior cervix and lower uterine segment. Dice metric and Hausdorff distance (pixels) are reported in color-coded class values below each image pair.

Inter-operator metrics

To evaluate inter-operator variability, measures of similarity were calculated between the majority ground truth label and each expert label on the test set. These metrics were then averaged across all experts to derive inter-operator values (Supplementary Table 6). For the reserved (CLEAR) test set, the inter-operator Dice metric averaged across all classes except background was 0.82, with class-specific Dice metrics of 0.94 for both anterior and posterior cervix classes. When evaluated on the reserved (CLEAR) test set, the combined model architecture achieved a high Dice metric of 0.77 averaged across every class except the background, with class-specific Dice metrics of 0.93 and 0.91 for the anterior and posterior cervix class, respectively. The inter-operator and model Dice metrics are similar, indicating that the model performed only slightly below clinical expert agreement.

Cervical length

The proposed models accurately reproduce TVUS-CL (Fig. 6), with methods that leverage underlying geometry from the image inputs and predicted segmentation masks. Of the 29 patients in the out-of-distribution test dataset, 4 had anatomically improbable predicted segmentation labels (due to poor image quality) and were excluded from subsequent analysis (Supplementary Fig. 4). For the remaining 25 patients, CL was binned in 0.5 cm increments, and normal curves were fit to the resulting histograms for the algorithm and each sonographer and overlaid on the same graph (Supplementary Fig. 5). To further examine differences between algorithm-reported and sonographer-reported values, the percent error (PE) was plotted for each patient across the dataset (Supplementary Fig. 6). Most images have a positive PE, indicating that the algorithm-reported value is larger than the sonographer-reported value. Examples with relatively high absolute error (PE < −25% or PE > 25%) of CL measurements demonstrate the chain effect wherein CL measurements follow the underlying segmentation shape, which is determined by the TVUS image itself; a shadowing artifact or poor image quality is expected to create a poor segmentation mask, which in turn results in an unreliable CL measurement (Supplementary Fig. 7). Bland-Altman plots compare the CL measures from the algorithm against the expert measures (Supplementary Fig. 8), finding a mean bias of 0.14 cm.
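As a rough illustration of the two agreement measures used above, the sketch below computes a signed percent error and a Bland-Altman mean bias. The exact formulas are not given in this section, so the choice of the sonographer value as the PE denominator and the conventional mean ± 1.96 SD limits of agreement are assumptions; the CL values are made up for the example.

```python
from statistics import mean, stdev


def percent_error(algorithm_cm, expert_cm):
    """Signed percent error; positive when the algorithm value is larger.

    Assumes the expert (sonographer) value is the reference denominator.
    """
    return 100.0 * (algorithm_cm - expert_cm) / expert_cm


def bland_altman(algorithm, expert):
    """Return (mean bias, lower limit, upper limit) for paired measurements.

    Limits of agreement use the conventional mean +/- 1.96 SD of the
    paired differences (an assumption; only the mean bias is reported above).
    """
    diffs = [a - e for a, e in zip(algorithm, expert)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical CL values in cm, not taken from the study data.
algorithm_cl = [3.6, 4.1, 2.9, 3.3]
expert_cl = [3.5, 4.0, 2.5, 3.4]

errors = [percent_error(a, e) for a, e in zip(algorithm_cl, expert_cl)]
bias, lower, upper = bland_altman(algorithm_cl, expert_cl)
```

A positive bias under this sign convention means the algorithm tends to report a longer CL than the sonographer.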

Fig. 6: Algorithm-reported CL displayed on out-of-distribution test data.

The combined ensemble-based segmentation model and CL algorithm demonstrate nearly perfect CL measurement agreement between the algorithm and the clinical experts. Three representative examples show (a) the original TVUS image, (b) the CL prediction overlay in white with expert caliper label in dashed-green, and (c) the CL prediction overlay in white atop the predicted segmentation mask. In column (b), the algorithm and expert-reported CL measurements are reported on each image and the PE between the algorithm and trained expert CL is displayed in the bottom right-hand corner.

To test whether these CL values were drawn from the same distribution, a Wilcoxon signed rank test was performed with the null hypothesis that there is no difference between the average sonographer-reported and corresponding algorithm-reported CL value. The test failed to reject the null hypothesis, which is consistent with the algorithm and sonographer measurements being drawn from the same cervical length distribution. Visually, the algorithm and the experts had nearly perfect agreement, and the statistical test found no evidence of a difference between the values reported by the algorithm and by the experts.
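For readers unfamiliar with the paired test, the following is a minimal sketch of a two-sided Wilcoxon signed rank test. It uses the large-sample normal approximation without tie or continuity corrections, which is an assumption made for brevity; an exact test is preferable for small samples, and the measurements shown are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Zero differences are dropped; tied absolute differences share
    their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences (1-based), averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)

    # Under the null hypothesis, W has this mean and standard deviation.
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (statistic - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return statistic, p_value


# Hypothetical paired CL values in millimetres (integers avoid float ties).
algorithm_mm = [36, 41, 29, 33, 38, 45]
expert_mm = [35, 43, 26, 37, 33, 39]
statistic, p_value = wilcoxon_signed_rank(algorithm_mm, expert_mm)
```

A large p-value here means only that the data give no evidence of a systematic difference between the paired measurements; it does not by itself prove equivalence.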

Discussion

AI tools are rapidly being integrated into medical practice, creating new opportunities to leverage this technology in maternal and fetal health. We have developed an AI algorithm for mapping cervical shape to measure maternal anatomic features. Using CL as an example and an additional confirmation of model performance, this work describes an automated multi-class segmentation framework that labels cervical tissue in its entirety on TVUS images and automates CL measurement. Compared to prior single-class or cervical outline segmentation approaches19,22, our multi-class ensemble model segments the cervix in its entirety, expanding upon our previous experiments21. Compared to previous work21,23, this ensemble model achieves a similar, slightly elevated Dice metric of 0.93 and 0.91 on in-distribution data for the anterior and posterior cervix classes, respectively. Unlike previous work, this model was deployed on an out-of-distribution dataset for the first time and maintains high model performance with a Dice metric of 0.80 and 0.85 for the anterior and posterior cervix classes, respectively. Furthermore, our model was trained on diverse data from multiple institutions and ultrasound manufacturers (including Siemens, General Electric, Toshiba, and Philips), performing as well as human experts for CL measurement. The 0.14 cm mean bias in CL measurement suggests that expert readers and the algorithm can be used interchangeably (Supplementary Fig. 8). The small positive percent error pattern in CL prediction is expected, as expert measurements were taken as a series of line segments, whereas the algorithm follows inherently longer, curvilinear traces. This study is strengthened by diverse, multi-institution training data with known quality measures (CLEAR scores) and multiple expert labels to develop a segmentation model. Similarly, high performance on a separate clinical dataset, drawn from a different distribution, reinforces trust in model generalizability across new, multi-site, diverse demographic data.

Three metrics were used to evaluate the performance of the segmentation model. While all three similarity measurements have merit, they fall into two main categories: overlap metrics such as Dice and Jaccard (interrelated) and distance metrics such as Hausdorff distance (HD). Overlap metrics measure how many pixels are shared between the ground truth and the predicted image, but they are highly dependent on the shape and size of the structure, which is challenging in small, elongated structures that could be displaced by a few millimeters and have no overlapping pixels24,25. This explains lower Dice metrics for bladder and cervical canal versus anterior and posterior cervix classes. Distance metrics, by contrast, compare the surface distance between the ground truth and predicted image, explaining how close the masks are to each other, but they are highly sensitive to outliers24,26,27. These distance metrics are particularly important when evaluating small classes, such as the bladder, which is the highest performing (smallest HD value) class across the 4 individual models considered for ensemble approach. Standard Hausdorff distance specifies the minimum distance (or expansion) that needs to be applied to both sets (ground truth and predicted segmentation) such that the expansion contains all segmentation pixels for both original sets. A single boundary error can lead to large Hausdorff distances even when the majority of the profile is accurate. When possible, it is best to report both overlap and distance metrics, as they offer different insight into model performance and an important baseline against which to judge future models. In this particular use case and in line with the Medical Imaging and Data Resource Center guidelines (https://www.midrc.org/performance-metrics-decision-tree), our results suggest that Hausdorff distance did not reflect overall segmentation quality of the larger classes (anterior and posterior cervix) as reliably as Dice metric in our multi-label segmentation task. In the future, average Hausdorff distance may better capture performance of the model, as it is less sensitive to outliers than standard Hausdorff distance. In agreement with qualitative images, the ensemble model demonstrated slightly better Dice metric performance than individual models, a distinction not captured by standard Hausdorff distance.
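The contrast between overlap and distance metrics can be made concrete with a small sketch. It treats each class as a set of (row, column) pixel coordinates and computes distances over all labelled pixels in pixel units. The average Hausdorff variant shown (mean of the two directed mean distances) is one common definition and is an assumption here; both inputs are assumed non-empty for the distance metrics.

```python
import math


def dice(a, b):
    """Dice overlap between two pixel sets (1.0 is a perfect match)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard index (intersection over union) between two pixel sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def directed_distances(a, b):
    """Distance from every pixel in `a` to its nearest pixel in `b`."""
    return [min(math.dist(p, q) for q in b) for p in a]


def hausdorff(a, b):
    """Standard Hausdorff distance: the worst nearest-neighbour distance."""
    return max(max(directed_distances(a, b)), max(directed_distances(b, a)))


def average_hausdorff(a, b):
    """Mean of the two directed average distances (one common variant)."""
    d_ab = directed_distances(a, b)
    d_ba = directed_distances(b, a)
    return (sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)) / 2


# A thin, canal-like structure one pixel tall and five pixels long.
truth = {(2, c) for c in range(5)}
# Case 1: the same shape displaced by a single row (no shared pixels).
shifted = {(3, c) for c in range(5)}
# Case 2: a perfect prediction plus one stray outlier pixel.
outlier = truth | {(2, 9)}

shifted_scores = (dice(truth, shifted), hausdorff(truth, shifted))
outlier_scores = (dice(truth, outlier), hausdorff(truth, outlier),
                  average_hausdorff(truth, outlier))
```

In the first case Dice falls to zero although the prediction is only one pixel away, while in the second a single stray pixel drives the standard Hausdorff distance to 5 pixels even though Dice stays above 0.9; the average Hausdorff distance is much less affected by that outlier.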

Overall, the ensemble-based segmentation model accurately reproduced cervical geometry and CL measurements. Dice metric indicates that segmentation performance remains limited for small anatomical landmarks (bladder and cervical canal), while Hausdorff distance indicates the bladder performs as well as the larger cervix classes. Both Dice metric and Hausdorff distance reiterate that the small, often elongated, cervical canal class suffers a drop in segmentation performance. Although ensemble models are computationally complex, requiring either more computational power or time than individual models, the qualitative benefit can justify the computational expense. If computational cost is a concern, any of the individual models considered for the ensemble (Attention UNet, nn-UNet, SegResNet and ResUNet2) may be deployed in a stand-alone format. The training time, similarity metrics and representative segmentation images are available in Supplementary Information (Tables 1, 2, 3, 7, 8, 9, 10 and Figs. 1, 9, 10, 11, 12).

Despite high performance on the test set, the bladder boundary is inaccurately predicted in images with a full bladder (Supplementary Fig. 13a) that fail to meet CLEAR criteria. Although the bladder is meant to be emptied before TVUS acquisition, this procedure is frequently not followed, creating variability in bladder position and size. For a small feature that is already challenging for the model to learn, such variability magnifies the difficulty. Consequently, the inferior portion of the bladder flap is often under-predicted, lowering the Dice metric.

Furthermore, cervical canal shape and size can vary significantly among patients, creating heterogeneous anatomical regions. In select patients, the mucus plug may be large and visible; in other patients, it may be small and indistinguishable from cervix tissue. This variation makes it difficult for the model to learn without larger, diverse datasets. The cervix may also present extremely funneled in patients at high risk of sPTB, creating a ground truth segmentation with a large surface area. In contrast, some cervices are strictly closed at the histological internal os, rendering the cervical canal nearly indistinguishable on the TVUS image. In the event of extreme cervical funneling (Supplementary Fig. 13b), we found that the model may struggle to find the histological internal os. The low dataset representation of funneling may bias the model to under-predict large cervical funnel shapes.

If the placenta is located near the internal os (Supplementary Fig. 13c), it is often mistaken for posterior cervix, likely due to similar echogenicity and texture. If the image violates CLEAR criteria because the cervix is small relative to the field of view (Supplementary Fig. 13d), the cervix may be over-predicted or misplaced. Although bladder predictions were less reliable, the inclusion of the bladder class likely improves the overall performance by providing a highly echogenic landmark with an anatomically prescribed location near the anterior/superior boundary of the cervix. Similarly, the cervical canal class may later inform the shape and size of a funnel or cervical mucus plug in the TVUS image.

Although the algorithm successfully reports CL, its accuracy is intrinsically limited by segmentation prediction and image quality; any segmentation errors will propagate to CL measurements. However, the limitations in bladder and cervical canal segmentation do not greatly affect cervical shape analysis. Rather, the anterior and posterior cervical classes are more vital for measuring cervical features, evidenced by successful CL reproduction with minimal post-processing. Poor image quality had a greater influence on CL measurements. Higher CL errors were observed for images (Supplementary Fig. 7) that had large amounts of shadowing near the external os where highly echogenic regions, insufficient gel/probe contact, or defective transducer elements interfered with signal propagation. In low-quality images (Supplementary Fig. 7) where the internal os is not clearly visualized, the algorithm struggled to infer the location of the internal os and anterior/posterior cervical boundary. As with all machine learning models, outputs are limited by the quality of data fed to the model. This underscores and necessitates efforts towards automated quality metrics.

Future work will focus on improving segmentation of small bladder and cervical canal classes. One approach is using a customized Dice loss function that more heavily weighs these classes. Larger, more diverse datasets should be introduced to learn small features subject to large patient-to-patient variations. This can be achieved by generating synthetic data from the original dataset using generative models such as diffusion models28 or cycle-GAN29. Such domain adaptation can expand the pool of images for cervix and bladder shapes, capturing different bladder fullness, cervical funneling, and mucus plug thickness. Even if the model’s performance improves with synthetic data, it remains worthwhile to introduce more curated medical data to include additional cervix phenotypes, such as more images of short cervices, funneled cervices, and low-lying placentas; this is expected to improve model generalizability during inference. Across all of these images, the average Hausdorff distance can also be leveraged to better capture the performance of these small features, while limiting the sensitivity to outliers. Current and future work also includes enhanced post-processing techniques and refined geometric feature extraction, coupled with the introduction of larger, more diverse TVUS datasets linked to clinical outcomes.
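One way such a class-weighted Dice loss could look is sketched below in plain Python. This is an illustrative formulation only, not the loss used in this work: the class names, the weights, and the smoothing term `eps` are hypothetical, and a real training loop would implement the same arithmetic on tensors in a deep learning framework.

```python
def weighted_dice_loss(probs, targets, weights, eps=1e-6):
    """Class-weighted soft Dice loss.

    probs and targets map a class name to a flat list of per-pixel values
    (predicted probabilities and 0/1 ground truth, respectively).
    weights maps a class name to its relative importance (hypothetical).
    """
    weighted_dice = 0.0
    for name, weight in weights.items():
        p, t = probs[name], targets[name]
        intersection = sum(pi * ti for pi, ti in zip(p, t))
        denominator = sum(p) + sum(t)
        dice_c = (2 * intersection + eps) / (denominator + eps)
        weighted_dice += weight * dice_c
    return 1.0 - weighted_dice / sum(weights.values())


# Toy 4-pixel example: the large class is predicted perfectly,
# while the small canal class is missed entirely.
targets = {"anterior": [1, 1, 0, 0], "canal": [0, 0, 1, 1]}
probs = {"anterior": [1.0, 1.0, 0.0, 0.0], "canal": [0.0, 0.0, 0.0, 0.0]}

equal_loss = weighted_dice_loss(probs, targets, {"anterior": 1.0, "canal": 1.0})
upweighted_loss = weighted_dice_loss(probs, targets, {"anterior": 1.0, "canal": 3.0})
```

Raising the weight of the small class increases the penalty for missing it (roughly 0.75 versus 0.5 in this toy case), which is the intended effect of weighting these classes more heavily.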

Following rigorous testing and validation, the immediate clinical impact of this work would be to measure CL in real-time on an US scanner. In the longer term, this model could be deployed on US-machines to measure cervical shape features and provide a clinical risk score of sPTB. Our platform is designed to accommodate future integration of additional TVUS-derived features including cervical diameter, cervical curvature, AUCA, LUS thickness, and closed cervical area. This has broad applications for understanding patient-specific maternal geometry and implications for timing of delivery through predictive machine learning models and geometrically informed finite element analysis simulations of pregnancy. This technology may reveal new biomarkers signaling structural changes leading to birth, thereby improving the prediction of birth timing. Identifying these changes could guide targeted sPTB therapies.

Currently, TVUS-CL is the only clinical imaging biomarker of sPTB risk. Although automated CL measurement algorithms are being developed, their low PPV highlights the need for additional biomarkers. Moreover, a single 2D measurement cannot sufficiently capture the complex 3D biomechanics of cervical preparation for delivery. Our novel segmentation tool labels the entire cervix, enabling extraction of multiple geometric features to support generation of comprehensive computational models of the entire cervix and LUS9,30, thereby enabling more personalized, biomechanically-informed decisions about delivery timing and targeted therapeutics. This segmentation tool holds promise in elucidating the pathways of sPTB, but more research is needed to fine-tune this model and ensure generalizability before wide deployment. Integrating such AI-based methods into clinical care could expand access to sPTB screening, particularly in underserved areas, but requires rigorous validation on larger datasets. Ultimately, these capabilities may facilitate in silico testing of interventions30 and integrate seamlessly into existing clinical workflows.

In this work, we present a fully automated multi-class segmentation network to segment the pregnant cervical anatomy and nearby tissues on 2-dimensional transvaginal ultrasound images. This model was successfully deployed on our reserved test dataset as well as a newly introduced, out-of-distribution test dataset of pregnant patients at low risk of sPTB. Deploying this model in the clinical setting will further standardize CL measurements, removing observer variation with possible downstream effects to improve measurement sensitivity. Building upon these tools to obtain additional biomarkers will potentially improve both the understanding of biomechanical pathways leading to sPTB, as well as the prediction of sPTB itself.

Methods

CLEAR dataset images

Mirroring our previous work21, the Perinatal Quality Foundation (PQF), which hosted the CLEAR training program31, supplied 250 de-identified TVUS images, collected between 16 and 32 weeks gestation from various centers and ultrasound machines across the United States of America. As per the PQF privacy policy, candidates who participate in the CLEAR program accept that their information may be used in an aggregate, de-identified manner for research. In addition, there were no patient identifiers associated with any images submitted and reviewed. As such, this study was exempt from institutional review board approval. Images were graded based upon their adherence to 9 CLEAR criteria21, where a minimum score of 7 is required to pass. As in our previous work21, each image received a CLEAR score and a subset of images with scores 6–9 were used to train, validate and test the model. Ideally, all clinical TVUS scans would merit a perfect score, but a small subset of real-world data is expected to fail CLEAR criteria due to human error, even after appropriate training. To account for this and improve the model’s ability to generalize, a small subset of grade 6 images was included in the dataset, as these images still meet over half of the CLEAR criteria but fail to pass certification. Since the provided TVUS images were anonymized, no pregnancy outcome information is available for this training data, and it is assumed that patients do not have repeat images in the dataset. Further inspection of the images, as depicted in Fig. 7, reveals that clinically short cervices (CL < 2.5 cm) and cervical funneling are present in roughly 15% and 12% of images, respectively.

Fig. 7: Cohort flow chart.

The chart illustrates the quantity of data excluded from model training and testing. The population is further categorized by short cervical length (<2.5 cm) and the presence of cervical funneling (inclusive of grade 6, 8 and 9 images).

CLEAR dataset labels

For training labels, a CLEAR-certified sonographer and 2 clinicians provided annotations using the segmentation software Labelbox (https://labelbox.com/). During review and label generation, expert maskers were permitted to skip an image if the quality was too poor to distinguish the anatomical regions of interest (exclusion criteria in Fig. 7). Of the 250 original images, 4 images were excluded from the dataset during expert review leaving 174, 50, and 22 images in the grade 9, 8 and 6 groups, respectively. Experts were tasked with segmenting these images into 5 regions (background, bladder, anterior cervix + LUS, posterior cervix, and cervical canal + potential space) as shown in the segmentation label anatomy key of Fig. 2. Fleiss’ kappa coefficient was calculated to determine agreement among experts. Across all 246 labeled images in the dataset, the Fleiss’ kappa coefficient was 0.87, indicating high agreement between experts. To generate ground truth labels for training, a majority choice voting system was used (described and illustrated in our previous work)21. If at least 2 out of 3 experts labeled a pixel with a given class, then that pixel was set to true for that given class in the GT label.
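The majority-vote rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `majority_vote`, the integer class-map input, and the fallback to background (class 0) when all three raters disagree are assumptions.

```python
import numpy as np

def majority_vote(labels, n_classes=5, fallback=0):
    """labels: (n_raters, H, W) integer class maps -> (H, W) consensus map."""
    labels = np.asarray(labels)
    # votes[c, y, x] = number of raters who assigned class c to pixel (y, x)
    votes = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)])
    winner = votes.argmax(axis=0)
    # Keep the winner only where at least 2 raters agree (assumed fallback otherwise)
    agreed = votes.max(axis=0) >= 2
    return np.where(agreed, winner, fallback)
```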

Out-of-distribution images

To further validate model performance and generalizability to a population at low-risk of sPTB, we obtained an out-of-distribution test dataset of 30 pregnant patients at Intermountain Health (IH, Provo, UT) to test our algorithm. This study was approved by the institutional review board at IH (#1050495), and each subject provided written consent. Images were collected between 22 and 25 weeks’ gestational age. One subject was removed from analysis due to sPTB, leaving 9 (31%) nulliparous and 20 (69%) multiparous participants. Of these, n = 1 cervix was clinically short (CL < 2.5 cm).

Out-of-distribution labels

Labels for the out-of-distribution images were similarly generated, with the exception that only 1 clinician provided annotations, forgoing the need for ground truth majority choice voting. The use of 1 expert was justified by the high inter-rater agreement in the training dataset.

Data pre-processing

To remove the CL calipers placed by sonographers, the cv2 inpainting32 package was used. The CLEAR dataset was divided into training, validation, and test sets using a 70:20:10 split. Images were assigned to each set at random, but CLEAR scores were balanced within each set. Data augmentation techniques such as 180° rotations, random zoom, center crop, random Gaussian noise, Gaussian blur, and random contrast adjustments were applied only to the training set21.
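One way to obtain a random 70:20:10 split that stays balanced by CLEAR score is to split each score group separately. The sketch below is an assumption about how such a split could be implemented; the function name `stratified_split`, the seed, and the rounding scheme are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(scores, fractions=(0.7, 0.2, 0.1), seed=0):
    """scores: list of CLEAR scores, one per image. Returns three index lists."""
    by_score = defaultdict(list)
    for idx, s in enumerate(scores):
        by_score[s].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for s in sorted(by_score):
        idxs = by_score[s]
        rng.shuffle(idxs)  # random assignment within each score group
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```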

AI technical field overview

To build a standard AI-based model, the model is first trained by inputting a set of TVUS images and corresponding expert labels (manual segmentation of anatomy). During training, the model is exposed to labeled images to identify patterns and features, and the model iteratively learns by adjusting a set of internal variables, called parameters or weights. Training settings that are chosen beforehand, such as the learning rate and dropout, are called hyperparameters. Tuning both informs model performance until the model achieves the best possible output (predicted segmentation of anatomy). As is customary, the training data is used to update the parameters of the model during each epoch, and the validation dataset is used to evaluate model performance after each epoch during model training. A separate test set may be, and in this case was, introduced to further evaluate the model’s ability to generalize to never-before-seen data.

Model architecture

The MONAI library (https://monai.io/) was used to implement the following segmentation model architectures: SegResNet, UNet, Residual UNet, nn-UNet, Attention UNet and Transformer UNet. For all model architectures, image/mask pairs were resized to 256 × 256 pixels, and the mask was one-hot encoded before training. The images were converted to grayscale, and pixel values were normalized between 0 and 1, providing a 1-channel input to the network. The model computed a 5-channel output corresponding to background and the 4 classes depicted in Fig. 2.
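The input preparation described above (resize to 256 × 256, normalize to [0, 1], one-hot encode the mask) can be illustrated with plain NumPy. This is a sketch under assumptions: the helper names are hypothetical, nearest-neighbour resizing and min-max normalization are assumed choices, and the actual pipeline may have used MONAI transforms instead.

```python
import numpy as np

def resize_nearest(arr, size=256):
    """Nearest-neighbour resize of a 2D array to (size, size)."""
    h, w = arr.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return arr[rows][:, cols]

def prepare_pair(image, mask, n_classes=5, size=256):
    """Return a 1-channel image in [0, 1] and a one-hot mask with n_classes channels."""
    img = resize_nearest(image.astype(np.float32), size)
    span = img.max() - img.min()
    img = (img - img.min()) / span if span > 0 else np.zeros_like(img)
    m = resize_nearest(mask, size)
    # Compare the mask against each class index to build the one-hot channels
    onehot = (m[None, :, :] == np.arange(n_classes)[:, None, None]).astype(np.float32)
    return img[None, :, :], onehot
```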

Hyperparameter optimization

The multi-class SegResNet33 and Transformer UNet (UNETR in MONAI)34 models were trained with varied dropout, maintaining all other default parameters. Both the multi-class Residual UNet35 and multi-class Attention UNet36 architectures were trained with 5 convolutional layers (corresponding to 16, 32, 64, 128, and 256 channels), and a stride length of 2. The multi-class nn-UNet (DynUNet in MONAI)37 architecture was trained with a kernel size of 3 and a stride length of 2. The number of residual units was varied only for the multi-class Residual UNet architecture. For each model architecture, both Adam and SGD optimizers were considered. The learning rate was varied from 0.001 to 0.01 for each optimizer, as shown in Table 1. Dropout of 0.1–0.4 was introduced for each model to decrease over-fitting.

Model training

During model training, Dice loss was fed through backpropagation to update model weights and the Dice metric was monitored to assess model performance. An average Dice metric value was calculated for each epoch by averaging the class-specific Dice metric across every class except background. The model was allowed to run for 50 epochs during training, and early stopping was applied to monitor the validation loss with a patience of 5 epochs. The model checkpoint with the best average Dice metric on the validation set during training was saved. Predictions were generated by feeding inputs through the trained model, applying softmax activation along the class dimension and reporting the argmax value along the class dimension to determine the predicted class of each pixel in an image. For each prediction image, the Dice metric was analyzed for individual classes, with the anterior and posterior cervix classes considered the most important for shape analysis of the cervix.
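The prediction step (softmax, then argmax over the class dimension) and the background-excluded average Dice can be sketched in NumPy as follows. The function names are hypothetical, and skipping a class that is absent from both maps is an assumption rather than a documented choice.

```python
import numpy as np

def predict_classes(logits):
    """logits: (C, H, W) raw network outputs -> (H, W) predicted class map."""
    z = logits - logits.max(axis=0, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    return probs.argmax(axis=0)

def mean_foreground_dice(pred, truth, n_classes=5):
    """Average per-class Dice over classes 1..n_classes-1 (background = 0 is skipped)."""
    scores = []
    for c in range(1, n_classes):
        p, t = pred == c, truth == c
        denom = p.sum() + t.sum()
        if denom == 0:
            continue  # class absent from both maps; skipped here (an assumption)
        scores.append(2.0 * np.logical_and(p, t).sum() / denom)
    return float(np.mean(scores)) if scores else float("nan")
```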

Model selection

Training identified the best-performing models for each architecture, iteratively checking the performance on the validation dataset after each training step. The predicted labels for the models under consideration were evaluated against ground truth using 3 similarity measures which assess how similar one image is to another image by comparing the pixel overlap (Dice Metric, Jaccard Index) or the degree of mismatch by assessing how far away one image representation is from another (Hausdorff distance).
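For reference, the overlap-based Jaccard index and the distance-based Hausdorff distance can be computed for a single binary class mask as sketched below. This brute-force version is illustrative only: it uses all foreground pixels rather than boundary pixels (an assumption), is only practical for small masks, and is not the implementation used in this work.

```python
import numpy as np

def jaccard(a, b):
    """Intersection over union of two binary masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

def hausdorff(a, b):
    """Symmetric Hausdorff distance (in pixels) between two binary masks."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    if len(pa) == 0 or len(pb) == 0:
        return float("inf")
    # Pairwise Euclidean distances between every foreground pixel of a and of b
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    # Largest nearest-neighbour distance, taken in both directions
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```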

Cervical length feature extraction

In select images such as the atypical cervix with a large cervical funnel shown in Fig. 13b, there are some disjointed regions and therefore multiple instances of the same class. Anatomically, this is an impossibility and therefore a post-processing step is warranted to correct for small disconnected regions. All subsequent model analysis was performed on the raw segmentation predictions, but minimal post-processing steps were performed to remove these disjointed regions or “islands” from the segmentation masks before applying the cervical length algorithms. This was done by examining multiple instances of the same class and preserving only the largest instance of a given class, provided that the smaller instances are no larger than 25% of the area of the largest instance. To reassign these smaller instances, the post-processing step considered the most prominent class type bordering the region of interest by surveying the perimeter pixels of neighboring classes.
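A minimal sketch of this island-removal step is given below, assuming 4-connectivity and a class map with background labeled 0. The function names, the connectivity choice, and the breadth-first component search are illustrative assumptions; the original scripts may differ.

```python
from collections import Counter, deque
import numpy as np

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def components(mask):
    """Return the 4-connected components of a binary mask as lists of (row, col)."""
    seen = np.zeros(mask.shape, dtype=bool)
    comps = []
    for r, c in zip(*np.nonzero(mask)):
        if seen[r, c]:
            continue
        seen[r, c] = True
        queue, comp = deque([(r, c)]), []
        while queue:
            y, x = queue.popleft()
            comp.append((y, x))
            for dy, dx in NEIGHBORS:
                ny, nx = y + dy, x + dx
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        comps.append(comp)
    return comps

def remove_islands(label, n_classes=5, max_ratio=0.25):
    """Reassign small islands of each class to the most common bordering class."""
    out = label.copy()
    for cls in range(1, n_classes):
        comps = sorted(components(label == cls), key=len, reverse=True)
        for comp in comps[1:]:
            if len(comp) > max_ratio * len(comps[0]):
                continue  # too large to be treated as an island
            inside = set(comp)
            border = Counter()
            for y, x in comp:
                for dy, dx in NEIGHBORS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < label.shape[0] and 0 <= nx < label.shape[1]
                            and (ny, nx) not in inside):
                        border[int(label[ny, nx])] += 1
            if border:
                new_cls = border.most_common(1)[0][0]
                for y, x in comp:
                    out[y, x] = new_cls
    return out
```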

We developed custom Python scripts to automatically measure CL from segmentation masks (Fig. 2), leveraging the geometry of the cervix and clinically recognized anatomical landmarks. If the cervical canal class label is present, the algorithm starts by finding the internal os with the following method: 1) The algorithm locates the superiormost boundary of the cervical canal + potential space class (shown in green). 2) These superior (or leftmost) points are fit to a line, and the image is rotated such that this line is oriented vertically. 3) The algorithm then counts the number of green points per column and calculates the derivative, which indicates how quickly the width of the cervical canal + potential space class changes. 4) The derivative is graphed lengthwise across the image, and the first point where the derivative plateaus below a preset threshold is taken as the internal os location. Alternatively, if the cervical canal + potential space class is not present in the prediction image, the internal os location is derived from the leftmost point with adjacent anterior and posterior cervical tissue. The external os is then identified as the rightmost point of adjacent anterior and posterior cervical tissue. The cervical trace is finally taken as the adjacent anterior and posterior tissue between the internal and external os (Fig. 6). If a mucus plug is visible in the image and is labeled as the cervical canal class, the vertical midpoint of each column is taken as the point along the cervical trace. Finally, the model returns both a visual guidance tool where cervical length is traced atop the underlying TVUS image and a numerical value for CL.
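Steps 3 and 4 of the internal os search can be illustrated with a simplified sketch that assumes the image has already been rotated as in step 2. The function name `internal_os_column`, the threshold value, and the interpretation of "plateau" as `run` consecutive columns with a small width change are assumptions for illustration.

```python
import numpy as np

def internal_os_column(canal_mask, threshold=1.0, run=3):
    """Return the first column where the canal width stops changing quickly."""
    widths = canal_mask.sum(axis=0).astype(float)  # canal pixels per column
    cols = np.nonzero(widths)[0]
    if cols.size == 0:
        return None  # canal class absent; a different rule would apply
    start, stop = cols[0], cols[-1]
    # Absolute change in canal width between neighbouring columns
    deriv = np.abs(np.diff(widths[start:stop + 1]))
    for i in range(len(deriv) - run + 1):
        if np.all(deriv[i:i + run] < threshold):
            return int(start + i)
    return None
```

On a funnel-shaped canal whose width narrows and then stays constant, the returned column is the first column of the constant-width stretch.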

Model validation from predicted CL vs. ground truth CL

To further validate the model and compare it against existing clinical standards, we algorithmically extracted CL and compared this value against sonographer-reported CL. The model-predicted anatomy labels were fed into the CL extraction algorithm to return the automatically predicted CL. Quantitative comparisons in CL were made by evaluating percent error (PE) between sonographer-reported CL and algorithmically extracted CL. Similarly, Bland-Altman plots, which are commonly used to visualize the degree of agreement between the clinical gold standard and a new measurement technique, were used to identify possible systematic bias introduced by our algorithm.
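The two agreement measures can be written compactly as shown below. This sketch assumes PE is the absolute difference relative to the sonographer-reported value and that Bland-Altman limits of agreement are bias ± 1.96 standard deviations of the differences, which is the conventional definition; the function names are hypothetical.

```python
import numpy as np

def percent_error(reported, predicted):
    """Absolute percent error relative to the sonographer-reported CL (assumed form)."""
    reported, predicted = np.asarray(reported, float), np.asarray(predicted, float)
    return 100.0 * np.abs(predicted - reported) / reported

def bland_altman(reported, predicted):
    """Return (bias, lower limit, upper limit) using bias +/- 1.96 SD of the differences."""
    diff = np.asarray(predicted, float) - np.asarray(reported, float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation
    return float(bias), float(bias - 1.96 * sd), float(bias + 1.96 * sd)
```

A bias far from zero would indicate that the algorithm systematically over- or under-estimates CL relative to the sonographer.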

Statistical tests

Given the small size of the reserved test dataset, the performance metrics cannot be assumed to follow a normal distribution. Therefore, non-parametric statistical tests were used to test the null hypothesis (at significance levels of p < 0.05 and p < 0.01) that the performance metrics for each model were drawn from the same underlying distribution. A one-way paired Friedman test was used to detect differences between the performance across all models. The Friedman test indicated a difference between mean performance metrics across all model types. A paired multiple comparison Wilcoxon Signed-Rank test with Bonferroni corrections was used to compare the performance between each model in terms of Dice metric, Hausdorff distance, and Jaccard index.
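A possible implementation of this testing procedure with SciPy is sketched below. The use of `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` is an assumption (the source does not name the statistics package), as are the function name `compare_models` and the dictionary input format.

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def compare_models(scores, alpha=0.05):
    """scores: dict of model name -> per-image metric values (same image order)."""
    names = list(scores)
    # Omnibus test across all models (requires at least 3 models)
    _, p_friedman = friedmanchisquare(*(scores[n] for n in names))
    pairs = list(combinations(names, 2))
    results = {}
    for a, b in pairs:
        _, p = wilcoxon(scores[a], scores[b])  # paired signed-rank test
        p_adj = min(1.0, p * len(pairs))       # Bonferroni correction
        results[(a, b)] = (p_adj, p_adj < alpha)
    return p_friedman, results
```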

Hardware and software

All models were run on a single Tesla V100-32GB GPU. Model training was performed in Python 3.9, using PyTorch and Medical Open Network for Artificial Intelligence (MONAI, a library which provides domain-specific capabilities for medical imaging: https://monai.io/) packages.