Introduction

Induction of labor (IOL) is a common obstetric procedure that involves artificially initiating uterine contractions to start labor and delivery1. IOL rates are increasing worldwide, particularly in developed countries, accounting for 25% of deliveries in the UK and US2,3. IOL can end in a cesarean section (CS), although it does not increase the overall CS rate4. Efforts are being made to reduce unnecessary CS and achieve suitable rates, as recommended by the World Health Organization (WHO)5. Factors such as Bishop score, parity, previous CS, maternal body mass index and weight gain during pregnancy are known to play a crucial role in these rates6,7.

Obstetric ultrasound (US) is an essential imaging modality for fetal monitoring during pregnancy, which provides an economic and non-invasive way of assessing fetal organ development and growth8. Ultrasound assessment of fetal biometry and growth is performed by obtaining measures of the fetal head circumference (HC), biparietal diameter (BPD), abdominal circumference (AC), and femur length (FL)9,10. In clinical practice, fetal biometry, fetal weight estimation, and/or Doppler blood flow are frequently used as biomarkers for fetal growth disorder screening, performed following international guidelines11,12. Image acquisition follows protocols that produce comparable images in specific planes, so that they are useful for diagnosis, reduce intra- and interobserver variability and allow measurement of particular structures8.

Artificial Intelligence (AI) is the branch of computer science that focuses on creating systems capable of performing tasks that require some level of intelligence, such as learning, reasoning, problem-solving, perception, and decision-making10. Machine learning (ML), a subset of AI, uses algorithms that enable computers to learn from data and improve their performance without being explicitly programmed13. AI and ML are gaining popularity in healthcare due to their ability to analyze complex data structures and patterns, such as electronic health records and medical images, and create prediction models, ultimately improving individual health outcomes14. In the specific field of medical imaging, including US, ML has shown several advancements with the employment of deep learning (DL) models10,15. They excel in image recognition, classification, detection, and segmentation, surpassing human capabilities13,15,16. DL models can use supervised or unsupervised learning approaches. Supervised models use labeled data during training, followed by testing on unlabeled data. Unlike traditional software, which uses preset logic rules, DL models use raw data as input10,17. DL models are being proposed to support sonographers in US, overcoming issues such as subjectivity and interobserver variability. They can also reduce examination times and help train junior doctors. It is important to note, however, that AI is not a replacement for human healthcare professionals, but rather a decision-support tool18,19.

The success of DL models depends on convolutional neural networks (CNNs), which automatically extract patterns from images and learn to map input data (i.e., US images) to output data (a label, such as vaginal delivery (VD) or CS)15,18. Moreover, transfer learning can be used to adapt existing CNN architectures trained on large, labeled datasets (e.g., ImageNet), reducing the amount of data and computation needed to produce accurate models and thereby overcoming the scarcity of large volumes of labeled data sometimes observed in medical imaging13,17.

AI methodology has been used in obstetrics and gynecology to evaluate adnexal masses, endometrial cancer risk, pelvic organ function and breast lesions, and to predict fetal outcomes, improving prenatal diagnosis13,18. CNNs are used in obstetric US for fetal weight estimation by measuring fetal biometry, identification of normal and abnormal anatomy, and detection and localization of structures and standard planes. This has already led to clinical applications in fetal imaging, including echocardiography and neurosonography15,20. Kim et al.21 developed a DL algorithm for automated measurement of BPD and HC, improving localization of fetal head shapes and caliper placement in later gestational ages. Rizzo et al.22 developed DL software to automate fetal central nervous system assessment measurements, reducing examination time and reliance on fetal position and 2D-ultrasound expertise. Intrapartum ultrasound has been reported as a valuable tool for providing accurate and reproducible labor progression findings, including fetal head position, station, and flexion, crucial for labor management. In this context, Ghi et al.23 proposed a DL implementation for assessing fetal occiput position before vaginal delivery, achieving an accuracy of 90.4% in recognizing the fetal head position.

Several international societies have suggested that US evaluation may help predict delivery mode after IOL, using evaluations such as cervical length and posterior cervical angle measurements24. However, to our knowledge, no studies have considered using fetal biometry images for this purpose, although obstetricians rely on US fetal weight estimation to guide delivery counseling. Therefore, we hypothesized that DL could aid in mode of delivery prediction after IOL by analyzing maternal and fetal data obtained from electronic medical records (EMR) and fetal US biometry images. Although several maternal characteristics have been associated with an increased risk of CS25, this relationship has been harder to establish for fetal US features. Since US fetal biometry evaluation is influenced by experience, maternal habitus and fetal position, among other factors10, potential errors significantly impair our ability to detect anomalous fetal growth11. That is why the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) recommends that fetal biometry be only one component of fetal growth screening, and that a combined approach using other clinical, biological and/or imaging markers may also be applied11.

As such, the main objectives of this study are to develop and test ML and DL models for predicting mode of delivery (VD or CS) after IOL using tabular and US fetal imaging data in standard biometry planes. Secondary objectives are to create an ensemble of the best performing models and calculate and compare their diagnostic predictive accuracy.

Results

Tabular data

Between January 2018 and December 2021, 808 patients with singleton vertex pregnancies were included in our longitudinal retrospective study, with 563 (69.7%) deliveries ending in VD and 245 (30.3%) in unplanned CS. The participants’ average age was 32.2 ± 5.7 years [range 18–47]. Demographic features and maternal and neonatal outcomes are shown in Table 1. Comparison between the two delivery modes showed significant differences in age, height, body mass index (BMI) and parity, with more parous women in the VD group (37.8% vs 32.2%; p < 0.001). Women in the CS group were older, shorter, had a higher BMI and heavier babies at birth (p < 0.001). Other characteristics, such as gestational diabetes, 5-min Apgar scores ≤ 7, and neonatal intensive care unit admission rates, were similar between groups. Mean gestational age (GA) at the third trimester US was 30.9 ± 0.94 weeks [27–32 weeks]. The mean fetal biometry measures were, respectively: HC 290.7 ± 12.7 mm [245.8–326.6 mm], BPD 81.1 ± 4.0 mm [57.0–91.1 mm], AC 278.6 ± 15.1 mm [215.0–325.6 mm], FL 60.0 ± 2.9 mm [47.1–68.0 mm] and estimated fetal weight (EFW) 1848.8 ± 247.9 g [903.0–2610.0 g]. The recommended Hadlock formula was used to calculate EFW9. All these measurements were significantly different between the two groups, except for FL, which showed similar values (see Table 1).

Table 1 Demographic, induction of labour and delivery outcome data for women who delivered vaginally vs. unplanned caesarean.

Mean GA at IOL was similar between groups (39.9 vs 40.1 weeks). Dinoprostone was more frequently used in the CS group (66.1%), whereas misoprostol was more frequently used in the VD group (49.6%). Also, 91.8% of pregnant women in the CS group presented significantly lower Bishop scores (≤ 3) before IOL. IOL indications did not differ between groups. Time to delivery was significantly longer in the CS group (20.1 versus 28.9 h). About a third of CS were due to non-reassuring fetal heart rate (30.2%), while the majority (67%) corresponded to “failed induction/labor dystocia” (see Table 1).

Figures 1 and 2 and Table 2 show the tabular data models’ performance in predicting CS likelihood. These models take into consideration maternal clinical data as well as fetal information provided by the third trimester US. The best performing model was selected for further interpretation due to its superior positive predictive value (PPV) and weighted F1-score, meaning that it has the best ratio between true positives (TP) and false positives (FP). The rationale for this choice is that a mode of delivery prediction model should detect as many CS as possible (true positives) while avoiding misclassifying a VD as a CS (false positives). All models showed good predictive performance, with F1-scores ranging from 0.59 to 0.74. The AdaBoost model presented high predictive power (weighted F1-score = 0.736 ± 0.024, PPV = 0.734 ± 0.024) and accurately predicted 86.7% of VD and 46.9% of CS, corresponding to a 13.3% FP rate, a 53.1% FN rate and an overall accuracy of 74.7% (201/269; see Fig. 2a). All results were obtained using cross-validation, which ensures that the model generalizes from training data to previously unseen test data while evaluating the whole dataset.

Figure 1
figure 1

The ROCs for prediction of mode of delivery for tabular data (a) and image-based data (b). ROC, receiver operating characteristic.

Figure 2
figure 2

Confusion matrices on (a) the best tabular data model AdaBoost and (b) the best image-based model of the femur, (c) head and (d) abdomen are shown. Confusion matrix depicting in reading order from left to right, top to bottom: true-negative, false-negative, false-positive and true-positive rates.

Table 2 Performance metrics of the DL models on the datasets.

Imaging data

Of the 808 pregnant women included in the tabular data, each contributed 3 third trimester US images (comprising the fetal head, abdomen and femur), totaling 2424 images. These were analyzed using a threefold cross-validation, comprising 1126 VD and 490 CS images for training and validation, and 563 VD and 245 CS images for testing the imaging-based models. Figures 1 and 2 and Table 2 present the imaging models’ performance in classifying VD vs CS. The true delivery outcome served as the ground truth for training and testing. Overall, the best DL model for fetal US images was Inception, based on the same rationale as previously explained for the AdaBoost model. The weighted F1-score and PPV for our test dataset were 0.594 ± 0.022 and 0.580 ± 0.027 for the femur (the best image model), and 0.590 ± 0.015 and 0.571 ± 0.025 for the abdomen, respectively. The head view’s weighted F1-score (0.587 ± 0.043) and PPV (0.565 ± 0.068) were the least helpful for mode of delivery prediction.

Ensemble models

Additionally, to test whether DL can improve mode of delivery prediction using multimodal imaging associated with tabular features, we implemented an ensemble of neural networks to classify mode of delivery and compared their performance measures (see Fig. 3 for further explanation). We explored this approach by applying both average voting and majority voting strategies, the latter providing the best results, as shown in Table 3. The first ensemble model gathered the best US models of the fetal head, abdomen and femur (image-based ensemble model), returning weak results in distinguishing VD vs CS, with a weighted F1-score of 0.584 ± 0.032 and a PPV of 0.585 ± 0.031. Marginally better results were shown by an ensemble model considering the previous three models and the AdaBoost model, providing a weighted F1-score of 0.628 ± 0.018 and a PPV of 0.675 ± 0.021 (see Table 3 and Fig. 4). The best ensemble was the final classification ensemble model, aggregating the best tabular model (AdaBoost) and the best US image model (Inception femur). It achieved a weighted F1-score of 0.689 ± 0.042 and a PPV of 0.693 ± 0.038 (Table 3 and Fig. 4). It accurately predicted 75.9% of VD and 51.9% of CS, corresponding to a 24.1% FP rate and a 48.1% FN rate, with an overall accuracy of 68.7% (184/268; see Fig. 5c). The confusion matrix and respective AUROC of the final classification ensemble model are displayed in Figs. 4 and 5.
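
The majority-voting and average-voting strategies described above can be sketched in a few lines. The sketch below is a minimal illustration with hypothetical per-model predictions, not the authors' implementation (which is available in their GitHub repository).

```python
import numpy as np

def majority_vote(predictions):
    """Combine binary predictions (0 = VD, 1 = CS) from several models
    by taking the most frequent label for each case."""
    votes = np.asarray(predictions)            # shape: (n_models, n_cases)
    return (2 * votes.sum(axis=0) > votes.shape[0]).astype(int)

def average_vote(probabilities, threshold=0.5):
    """Combine predicted CS probabilities by averaging, then thresholding."""
    probs = np.asarray(probabilities)          # shape: (n_models, n_cases)
    return (probs.mean(axis=0) >= threshold).astype(int)

# Hypothetical predictions from three models for five cases
preds = [[0, 1, 1, 0, 1],
         [0, 1, 0, 0, 1],
         [1, 1, 0, 0, 0]]
print(majority_vote(preds))  # -> [0 1 0 0 1]
```

With an odd number of models, majority voting always yields a decision; average voting additionally uses each model's confidence, which the majority strategy discards.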

Figure 3
figure 3

Process involved in the establishment of the ensemble models. The three image-based models (Inception head, abdomen and femur) were associated with the best tabular data model, AdaBoost, in three different ways. Green box: image-based model, using the CNN Inception models of the femur, abdomen and head; orange box: AdaBoost tabular data model with the Inception models of the femur, abdomen and head; blue box: the final classification model, which consists of the AdaBoost tabular data model and the Inception model of the femur, the ensemble model that provided the best metrics.

Table 3 Performance metrics of the best tabular, image model and ensemble models using majority voting and average strategies. The results are the averages and standard deviations of the threefold validation.
Figure 4
figure 4

The ROC curves for prediction of mode of delivery for the ensemble models and their comparison with the ROC curves of the AdaBoost and Inception femur models. ROC, receiver operating characteristic.

Figure 5
figure 5

Confusion matrices on the following ensemble models: (a) image-based model, (b) AdaBoost and Inception models of the femur, head and abdomen (majority voting) (c) the final classification model (majority voting). Confusion matrix depicting in reading order from left to right, top to bottom: true-negative, false-negative, false-positive and true-positive rates.

The best tabular data model (AdaBoost) provided an average accuracy improvement of 6.0% over the final classification ensemble model. However, concerning CS prediction, the final classification ensemble model correctly predicted 51.9% (vs 46.9%) of CS, with an FP rate of 24.1% (vs 13.3%), compared with the AdaBoost model. In other words, relative to the final classification ensemble model, the tabular data model missed 4 correct CS predictions (TP) while avoiding 20 unnecessary CS (FP) (see Figs. 2a and 5c).

Discussion

This study is the first to verify the feasibility of DL algorithms for the binary classification of mode of delivery after IOL using maternal and fetal electronic medical data and third trimester fetal US images. We developed ML models using tabular data and DL models for imaging data using transfer-learning methods. Our best-performing models were AdaBoost on tabular data, with a PPV of 0.734, and the DL model Inception evaluating femur US images, with a PPV of 0.580. Then, using ensemble-learning methodology, we developed various composite models, the best being based on AdaBoost and Inception US femur images, yielding a PPV of 0.693, approaching the metrics of our best tabular model.

Recent studies have used electronic medical information on maternal and fetal characteristics to construct prediction models for mode of delivery after IOL26,27,28. However, very few use ML for the same goal2,29. Several research studies have also explored third trimester US biometry planes for image segmentation30,31, image or plane classification8,15,16 and fetal biometry estimation32,33. US image classification has mainly been used for automatic fetal malformation detection18,34. However, to our knowledge, no study has yet reported the relation between third trimester US fetal plane imaging and mode of delivery outcomes after IOL.

In our study, maternal characteristics related to CS outcomes were compatible with literature findings (see Table 1). In our dataset, women submitted to unplanned CS were older, shorter, and had higher BMI and lower Bishop scores compared with the VD group6,26,35. Fetal US characteristics such as EFW and fetal biometry measures were also significantly larger for fetuses who underwent CS, which is also compatible with the literature36,37,38. However, FL showed no difference between groups. This is an intriguing finding, because our best US image model uses femur images (see Figs. 1b and 2b). The explanation may lie in prenatal predictors of increased fetal adipose deposition, namely in the fetal thigh, which have been found to be strong predictors of unplanned CS, compared with traditional fetal biometry and EFW39,40,41,42.

In fact, when analyzed individually, each DL image model underperformed, revealing the models’ difficulty in ascertaining which image features could aid in mode of delivery prediction (see Fig. 2 and Table 2). This was expected, for two main reasons: the first relates to the fact that DL models for object-detection and segmentation tasks are more accurate in identifying fetal standard biometry planes than classification models, because they can localize anatomical landmarks before classifying the plane, similar to human reasoning10; the other lies in understanding AI’s effectiveness in complementing clinical processes, since there is no study evaluating the accuracy of human assessment of fetal third trimester US planes in relation to CS, probably because such an association is empirically unlikely. Consequently, there is no practical way of evaluating whether our metrics are low or whether they could eventually supersede human evaluation.

Regarding evaluation metrics, our prediction model aims to counsel pregnant women undergoing IOL. Therefore, the main objective is to correctly advise those at high risk of CS, trying to reduce the psychological and monetary burden of IOL on these women, as well as to confidently initiate and continue an IOL in women with a high probability of VD. Hence, the aim is to correctly identify true CS (TP) and avoid performing a CS on women who would have a VD (FP). As such, the most useful metrics in our study are PPV, the ratio of TP predictions to the total number of predicted positive observations; accuracy, the proportion of correct predictions made by the model out of the total number of predictions; and sensitivity, defined as the ratio of TP predictions to all observations in the class13. This is why the F1-score works better than AUC in our study, also because imbalanced data can bias the latter43. Hence, the DL models that provided the overall best PPV and F1-scores were the Inception group (see Table 2), which were consequently chosen for the construction of the ensemble models.
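
For completeness, the metrics discussed here can be computed directly from predicted and true labels with scikit-learn; the labels below are illustrative only, not study data.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative labels: 1 = CS, 0 = VD
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

ppv = precision_score(y_true, y_pred)               # TP / (TP + FP)
sensitivity = recall_score(y_true, y_pred)          # TP / (TP + FN)
f1_weighted = f1_score(y_true, y_pred, average="weighted")
acc = accuracy_score(y_true, y_pred)                # correct / total
```

The `weighted` average accounts for class support when combining per-class F1-scores, which matters for an imbalanced VD/CS distribution such as the one studied here.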

The first ensemble model aggregated all Inception models (US images of the fetal head, femur and abdomen). Its performance showed a worse F1-score than the best Inception image model (femur) (0.584 vs 0.594), with a slightly superior PPV (0.585 vs 0.580), probably because it aggregated the lower scores of the head and abdomen Inception models. Therefore, the next ensemble model grouped all three Inception models and the AdaBoost model. The latter probably influenced this ensemble positively, with an F1-score and PPV of 0.628 (vs 0.584) and 0.675 (vs 0.585), respectively, compared with the image-based model. Since AI models can only account for information ‘seen’ during training, this model improved its performance by integrating imaging and electronic health record data18. Consequently, the last ensemble model, named the final classification model, gathered the best tabular ML model and the best image model. Its performance was similar to the AdaBoost model, retrieving an F1-score of 0.689 (vs 0.736) and a PPV of 0.693 (vs 0.734). However, on closer inspection of the confusion matrix, the final classification model correctly predicted 51.9% of CS, more than the 46.9% rate of the AdaBoost model. On the other hand, the FP rates were more favorable for the AdaBoost model, showing a 13.3% rate (vs 24.1% for the final classification model). This trade-off between TP and FP can be explained by the difference in specificity (0.867 vs 0.758) and sensitivity (0.746 vs 0.689) of the AdaBoost model over the final classification one. Hence, we could infer that using DL femur US image models could help increase TP diagnoses at the expense of a marginal increase in FP cases15. As such, the model could be a useful clinical screening tool to distinguish women who are clear candidates for VD from those who have an extremely high risk of CS, or those who would benefit from personalized mode of delivery planning.
However, as emphasized in recent literature, AI tools should be used as an adjunct to the decision-making process, and the choices of the obstetrician and the pregnant woman should prevail when counseling on mode of delivery19.

This study has several strengths. To our knowledge, we present a novel database, comprising 2424 images from 808 fetuses, annotated for mode of delivery classification tasks using ground-truth information. This contrasts with most databases using similar images, which focus on image segmentation and plane classification and do not provide information regarding mode of delivery8.

The dataset accurately represents a real clinical setting, being unbalanced and comprising images collected retrospectively by various operators using various US machines. We opted not to use oversampling methods, i.e., artificially increasing the representation of the minority class to balance the dataset13. This would enhance our models’ performance but would deviate from a real clinical scenario. Also, since our study used routine examination images suffering from speckle noise, low contrast, and variation in machines and settings, our models had to cope with their heterogeneity and complexity8,16. We argue that learning from diverse images enhances the models’ adaptability and applicability in real-world scenarios by identifying consistent patterns and features8,13,44.

Data augmentation and the use of clinical data alongside imaging data enhanced the robustness and flexibility of the final models16,17. Finally, our model was designed to be plug-and-play and user-friendly, with few restrictions, to deal with real-world clinical scenarios, allowing centers to upload deidentified images directly from workstations or hospitals to a cloud platform, with or without additional patient data17.

The study is not without limitations. It is retrospective and uses data from a single center. This, especially for class-imbalanced databases such as ours, may have affected model training and testing and subsequently influenced model metrics, with emphasis on the ROC curves10,18. Future developments may address this limitation by ensuring more CS images are available for successful binary classifier training. Also, owing to the retrospective data collection, our model could not account for clinical or imaging intrapartum variables such as fetal occiput position and engagement. The authors recognize the significance of these assessments, as supported by current research24.

The inclusion of numerous predictors relative to our sample size raises concerns about overfitting. Predictive accuracy may also not hold when the model is applied to other data14. Therefore, we emphasize the importance of external validation as our next step, to assess the constraints on generalization and the possibility of multisite deployment of our model17. Finally, the results suggest it may be worth exploring data fusion approaches that combine both streams of information, clinical and imaging data, into a single model.

In summary, this study proposed an ensemble AI model using US images of the fetal femur and maternal–fetal tabular data, yielding a relatively good performance. This is the first attempt to use this type of imaging data for mode of delivery prediction after IOL. The proposed model may become part of a promising tool in assisting mode of delivery counseling in clinical practice.

Materials and methods

Datasets

The dataset was retrospectively collected at the Obstetrics Department of University Hospital of Coimbra, a center with two sites (Obstetrics Department A and B), which are specialized maternal–fetal departments that manage thousands of births annually. Sample size was based on feasibility.

Tabular data included 2672 consecutive singleton vertex term pregnant women referred for IOL between January 2018 and December 2021. Other inclusion criteria were maternal age ≥ 18 years and a baseline Bishop score ≤ 6. Planned CS, antepartum fetal demise, major fetal anomalies, and preterm births were excluded from the analysis. EMR were analyzed and, to ensure data reliability, cases with no information on cervical examination at the time of admission were also excluded (n = 3). The final tabular dataset included 2434 deliveries.

The image dataset was collected based on the previous case selection, considering pregnant women attending our department for routine third trimester US evaluation. Images acquired during standard clinical practice were collected. Gestational age was computed from crown-rump length measurements on first trimester US45. Images were taken as part of the Portuguese screening program, which recommends that the third trimester US be performed between 30 weeks and 32 weeks and 6 days of gestation46. Therefore, we decided to include a range of gestational ages from the early third trimester (27 weeks) to 32 weeks and 6 days. Only third trimester US images were considered for our visual computational model, because first and second trimester US have specific goals that do not provide relevant information for mode of delivery planning. Of the 2434 subjects selected for tabular data, we excluded those who did not undergo the third trimester US at our institution. Of those who did, we excluded cases with missing US images, including only examinations that provided at least three US images per fetus (fetal head, abdomen and femur). This resulted in a final dataset of 808 deliveries (cases) and a total of 2424 US images.

Approval was obtained from the ethics committee of our center (protocol number CE-047/2022). Given the retrospective nature of the analysis, written informed consent was not required. Methods and results are reported in accordance with the TRIPOD guidelines47.

Data collection

Regarding tabular data collection, maternal age, gravidity, parity, BMI, height, GA, Bishop score, IOL indications, mode of delivery, CS indications, intrapartum complications, neonatal birth weight and neonatal outcomes were among the features examined2. Data were collected on admission and at the onset of the first stage of labor, after pelvic examination and assessment of both mother and fetus.

Ten different US machines from three manufacturers (Voluson E8 and Voluson P8, GE Healthcare, Zipf, Austria; Xario 200G, Xario 200, Aplio a550, Aplio i700, Aplio a, Aplio 400 and Aplio 500, Canon Medical (Toshiba), Netherlands; H540, Samsung) were used for the examinations. The percentage and absolute number of images from GE, Canon/Toshiba and Samsung ultrasound machines were 3.0% (n = 24), 96.9% (n = 783) and 0.1% (n = 1), respectively. Images were taken using a curved transducer with a frequency range of 3 to 7.5 MHz. Twelve examiners with significant experience (5–35 years) in obstetric US conducted the examinations according to ISUOG guidelines11. All images were stored in the original Digital Imaging and Communications in Medicine (DICOM) format in our Astraia database and were retrospectively collected. Collection complied with minimal quality requirements, i.e., images with an improper anatomical plane (badly acquired or cropped) were omitted.

Regarding IOL, the choice of cervical ripening methods varied according to WHO recommendations and Bishop scores. These included oxytocin infusion, prostaglandin analogues and extra‐amniotic balloon catheters1,48,49. Premature rupture of membranes was defined as membrane rupture at term before labor onset. Prolonged pregnancy was determined at ≥ 41 weeks48. The definition of labor was the presence of regular uterine contractions with cervical changes50. Our institution performs IOL according to ACOG and NICE recommendations51,52.

IOL indications were categorized into prolonged pregnancy, pregnancy hypertensive disorders (i.e., gestational hypertension; pre-eclampsia), gestational diabetes, pregnancy cholestasis, premature rupture of membranes, intrauterine growth restriction, and other fetal or maternal pathology (e.g.: thrombophilia)1.

The primary outcome was mode of delivery, with VD and CS as algorithm outputs. Secondary outcomes included IOL indication and method, time to delivery and maternal-neonatal outcomes53. Successful IOL was determined as VD after induction. CS indications were stratified between non-reassuring fetal heart rate and failed induction/labor dystocia2. IOL duration was the time from IOL initiation to delivery1. Failed induction referred to not reaching an active phase of labor within 48 h, with ruptured membranes and oxytocin for at least 12 h. Labor dystocia was defined as cessation of dilation or descent during labor1,54.

Data processing

All images were saved in PNG (Portable Network Graphics) format, without compression, to prevent quality loss. Each fetal subject’s head, femur and abdominal images were labeled with the appropriate classification, VD or CS. Every image was cropped to square proportions of 537 × 537 pixels, centered in the ultrasound window, and then downsampled to 80 × 80 pixels. This procedure ensured uniformity, comparability and compatibility of image dimensions for the DL techniques, which specifically call for square input images55. During the resizing process, all patient identifiers were removed by cropping the image header, hence avoiding ethical concerns. The margins of the original ultrasound image were also cropped to remove unnecessary information.
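
The cropping and downsampling steps can be sketched with plain NumPy; the nearest-neighbour downsampling below is a stand-in, since the text does not specify the resampling filter actually used.

```python
import numpy as np

def center_crop(img, size=537):
    """Crop a square of `size` pixels centered in the ultrasound window."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def downsample(img, out=80):
    """Nearest-neighbour downsampling of a square image to out x out pixels."""
    size = img.shape[0]
    idx = np.arange(out) * size // out          # sampled row/column indices
    return img[np.ix_(idx, idx)]
```

In practice an anti-aliasing resize (e.g. from an image library) would be preferred for such a large reduction; this sketch only illustrates the shape transformations involved.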

Most prospective studies use images without caliper overlays in their models8. However, we chose to maintain caliper overlays, since these were burned to the image and could not be removed without altering the original image. We are also supported by other authors performing retrospective analysis, which state that the presence of caliper measurements had no discernible effect on their model’s ability to predict the primary result17.

Study design and training and test sets

The ML models tested for tabular data were logistic regression (LR), multi-layer perceptron (MLP), random forest (RF), support vector machines (SVM), extreme gradient boosted trees (XGBoost) and AdaBoost classifiers56,57. The framework parameters of these models are the off-the-shelf scikit-learn defaults and are available at both https://scikit-learn.org/ and our GitHub (https://github.com/PugtgYosuky/ensemble-prediction-delivery-type). Although an exhaustive study is beyond the scope of this article, we plan a more in-depth analysis of the best hyperparameters in future work. Before modeling, missing data were handled by simple imputation. Numerical features were transformed with the RobustScaler method, which removes the median and scales data according to the interquartile range. Categorical features were one-hot encoded, resulting in binary features.
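
The preprocessing steps described (simple imputation, RobustScaler for numerical features, one-hot encoding for categorical ones) can be sketched as a scikit-learn pipeline. The column names and small DataFrame below are illustrative, not the study's actual feature set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

numeric = ["maternal_age", "bmi", "bishop_score"]   # illustrative names
categorical = ["parity_group", "iol_method"]        # illustrative names

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", RobustScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

df = pd.DataFrame({
    "maternal_age": [32, 28, np.nan, 40],
    "bmi": [24.0, 31.5, 27.2, np.nan],
    "bishop_score": [3, 5, 2, 1],
    "parity_group": ["nulliparous", "parous", "nulliparous", np.nan],
    "iol_method": ["dinoprostone", "misoprostol", "oxytocin", "dinoprostone"],
})
X = preprocess.fit_transform(df)                    # 3 scaled + 5 one-hot columns

# A full model chains preprocessing with a classifier, e.g. AdaBoost
model = Pipeline([("prep", preprocess), ("clf", AdaBoostClassifier())])
model.fit(df, [0, 1, 0, 1])                         # toy labels: 0 = VD, 1 = CS
```

Wrapping the preprocessing and classifier in one `Pipeline` ensures the imputation and scaling statistics are learned only from the training fold during cross-validation.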

Our source image dataset began with all fetal third trimester ultrasounds fitting the inclusion and exclusion criteria above (n = 808 studies), corresponding to 563 (69.7%) VD and 245 (30.3%) CS. Hence, the dataset exhibits class imbalance, similar to most real-world clinical scenarios8. We applied a stratified threefold cross-validation, similar to the work of Moon-Grady et al., whereby the full image dataset was divided into three folds, using two-thirds for training and one-third for testing. The validation set used 30% of the training data. There was no subject overlap between the training, validation and test sets, guaranteeing that no images from training cases were included in the test dataset. The models’ parameters were learned using the training set; the prediction error for hyperparameter tuning and model selection was estimated using the validation set. The test set evaluated the generalization error of each final model, to avoid potential overfitting to the training set17,20.
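
Because each fetus contributes three images, the split must be stratified at the subject level so that all views of one case fall in the same fold. A minimal sketch of this idea, using synthetic labels rather than the study data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n_subjects = 808
# synthetic outcome labels with roughly the study's class balance: 0 = VD, 1 = CS
labels = rng.choice([0, 1], size=n_subjects, p=[0.697, 0.303])

views = ("head", "abdomen", "femur")
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_subj, test_subj in skf.split(np.zeros(n_subjects), labels):
    # expand subject indices to image-level lists only after splitting,
    # so no image of a training case can leak into the test set
    train_images = [(s, v) for s in train_subj for v in views]
    test_images = [(s, v) for s in test_subj for v in views]
    assert set(train_subj).isdisjoint(test_subj)
```

A further 30% of each training fold would then be held out, at the subject level, for validation.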

Images from the training dataset were used to (1) train a VD versus CS classifier, (2) construct an ensemble model across the image views (fetal head, abdomen and femur), (3) compare it with the tabular dataset model and (4) build an ensemble model that combines the tabular clinical information with the imaging ensemble model.

Model architecture and training parameters of image classification models

A convolutional neural network (CNN) is a deep neural network architecture that excels at extracting features from unstructured data, such as medical images, automatically extracting the information crucial for the learning task44,55. We trained three traditional CNN architectures on our data: Inception58, ResNet-5059 and Xception60. This study used transfer learning, a method in which the parameters of a model fully trained on one large database are transferred to another task and fine-tuned17. All networks were pre-trained on the ImageNet Large Scale Visual Recognition Challenge dataset (http://www.image-net.org/) and then trained on our training data61. To perform the intended binary classification task (VD vs CS), the top fully connected layers were replaced by a single layer with one neuron.
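In Keras, this head replacement amounts to loading the backbone without its ImageNet classification layers and attaching a single sigmoid neuron. The sketch below uses ResNet-50 as one example; the input size and pooling choice are assumptions (pass `weights="imagenet"` to reproduce the pre-training, which triggers a weight download).

```python
# Sketch of the transfer-learning setup: pretrained backbone, top fully
# connected layers replaced by one sigmoid neuron for binary VD-vs-CS output.
# weights=None here only to keep the sketch offline-runnable; the study
# used ImageNet pre-training (weights="imagenet").
import tensorflow as tf

def build_classifier(input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.ResNet50(
        include_top=False,        # drop the 1000-class ImageNet head
        weights=None,             # "imagenet" in the study
        input_shape=input_shape,
        pooling="avg",            # global average pooling of the feature maps
    )
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(backbone.output)
    return tf.keras.Model(backbone.input, outputs)

model = build_classifier()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["AUC"])
```

The same pattern applies to the Inception and Xception backbones via `tf.keras.applications.InceptionV3` and `tf.keras.applications.Xception`, each swapped in with the identical one-neuron head.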

All networks were trained using the Adam optimizer12,13. Networks were given a maximum of 1500 epochs to train, with early stopping if the loss on the validation set did not decrease for 200 epochs13. The learning rate was 0.001 and the batch size was 3213. Additionally, a set of image data augmentation approaches was applied to the images in the training dataset. These are required for the model to anticipate changes in image capture conditions, such as fetal position, diverse locations of amniotic fluid and placental tissue, and different zoom and focus adjustments17,55. Augmentation consisted of random horizontal flips, rotations of up to 20° and random brightness adjustments of up to 20%15. Several randomly augmented images were produced for every image in the training set and added to it, while keeping the non-augmented version20.
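The training configuration and augmentations described above can be sketched with Keras preprocessing layers and callbacks. Parameters beyond those reported (e.g. the brightness jitter implementation) are assumptions.

```python
# Sketch of the reported augmentations (horizontal flip, rotation up to 20
# degrees, brightness jitter up to 20%) and the early-stopping policy
# (max 1500 epochs, patience 200 on validation loss, batch size 32).
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(20 / 360),   # factor of 2*pi -> up to ±20°
    tf.keras.layers.RandomBrightness(0.2),      # up to ±20% brightness shift
])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=200, restore_best_weights=True)

# Illustrative training call (train_ds / val_ds are assumed tf.data datasets):
# model.fit(train_ds.map(lambda x, y: (augment(x, training=True), y)),
#           validation_data=val_ds, epochs=1500, batch_size=32,
#           callbacks=[early_stop])
```

Applying the augmentation layers with `training=True` ensures the random transforms fire during training while leaving validation and test images untouched.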

After training, the networks were frozen and applied to the test images, producing a predicted confidence score between 0 and 1 for the CS outcome20. For both the VD and CS classifications, the network output yields likelihood percentages summing to 100%. The criterion for classifying a case as VD or CS was thus 50%, and the classification with the higher percentage was chosen as the outcome15. The top-performing models were recorded as the final output models (Table 2).

Composite diagnostic classification

Different CNN algorithms extract image information in different ways. Ensemble learning is a method that combines multiple algorithms to achieve better predictive performance than any individual algorithm62. We computed an ensemble model using our highest-performing individual models evaluating US images of the fetal head, abdomen and femur (image-based ensemble model; see Table 2 and Figs. 1 and 2)55. We tested two voting strategies based on the outcome probabilities of each model: mean voting and max voting. We found that max voting was the best strategy for our ensemble models (Table 3)55. For every model, a score of 50% and above was classified as CS, and below 50% as VD.
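The two voting strategies reduce to simple operations over the per-view probability scores. The sketch below illustrates them on hypothetical scores; the study's actual per-view models produce these probabilities.

```python
# Sketch of probability-level voting across the three view models (head,
# abdomen, femur): mean or max of the CS scores, thresholded at 0.5
# (>= 0.5 classified as CS = 1, otherwise VD = 0).
import numpy as np

def ensemble_predict(view_probs: np.ndarray, strategy: str = "max") -> np.ndarray:
    """view_probs: shape (n_views, n_cases), CS probability per view model."""
    if strategy == "max":
        combined = view_probs.max(axis=0)
    elif strategy == "mean":
        combined = view_probs.mean(axis=0)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return (combined >= 0.5).astype(int)

# head, abdomen and femur scores for three hypothetical cases
probs = np.array([[0.20, 0.55, 0.40],
                  [0.35, 0.30, 0.60],
                  [0.10, 0.45, 0.48]])
print(ensemble_predict(probs, "max"))    # -> [0 1 1]
print(ensemble_predict(probs, "mean"))   # -> [0 0 0]
```

As the example shows, max voting flags a case as CS whenever any single view is confident, which makes it more sensitive (and less specific) than mean voting.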

The image-based ensemble model was further compared with the tabular data classification model. Lastly, a final classification ensemble model was created by aggregating the probabilities of the image-based ensemble model and the best tabular data classification model. This model’s metrics were also compared with those of the previously described models.

Framework and training and prediction times

All models were implemented in Python using Keras (https://keras.io/, GitHub, 2015) with a TensorFlow (https://www.tensorflow.org/) backend. NumPy (https://numpy.org), Matplotlib (https://matplotlib.org) and scikit-learn (https://scikit-learn.org/stable/) were used to train and evaluate the models. Training was performed on a server with an AMD Ryzen 5 CPU, 32 GB of RAM and two NVIDIA RTX 3080 Ti 12 GB GPUs. Prediction times averaged 3 ms per image for classification on a standard laptop (2.6 GHz Intel Core, 16 GB of RAM).

Model evaluation

AUROC, F1 score, sensitivity, specificity, PPV and negative predictive value (NPV) were used for model performance assessment. PPV and F1 score were selected as the most appropriate measures for the study’s model to predict mode of delivery, but AUROC was displayed nevertheless because it is a familiar metric for model performance evaluation in the medical field63. The models were also evaluated using a confusion matrix portraying TP, FP, TN and FN55. In the matrix, each row denotes instances of the true label, and each column denotes instances of the predicted label.
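These metrics can be computed from predicted scores with scikit-learn; NPV and specificity are derived from the confusion matrix, since scikit-learn exposes no direct functions for them. The labels and scores below are illustrative only.

```python
# Sketch of the evaluation metrics listed above on illustrative predictions
# (1 = CS, 0 = VD), with NPV and specificity derived from the confusion matrix.
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.4, 0.1, 0.9, 0.2])
y_pred = (y_score >= 0.5).astype(int)        # 50% decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "AUROC": roc_auc_score(y_true, y_score),
    "F1": f1_score(y_true, y_pred),
    "sensitivity": recall_score(y_true, y_pred),   # TP / (TP + FN)
    "specificity": tn / (tn + fp),                 # TN / (TN + FP)
    "PPV": precision_score(y_true, y_pred),        # TP / (TP + FP)
    "NPV": tn / (tn + fn),                         # TN / (TN + FN)
}
```

Note that AUROC is computed from the continuous scores, while the remaining metrics depend on the 50% threshold and therefore on the class imbalance, which is why PPV and F1 were favored for this imbalanced dataset.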

Statistical methods

Categorical variables are presented as frequencies and percentages, and continuous variables as mean ± SD. The χ2 test was used to compare categorical variables, and continuous variables were compared using a two-sided Student’s t-test. The normal distribution of all continuous variables was assessed a priori using skewness, kurtosis, mean, standard deviation, and the histogram curve. A p-value < 0.05 was considered significant.
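These two comparisons map directly onto SciPy functions, as in the sketch below; the contingency table and group samples are hypothetical, not values from the study.

```python
# Sketch of the statistical comparisons: chi-square test on a 2x2
# contingency table of categorical counts, and a two-sided independent
# samples t-test on a continuous variable. All data are illustrative.
import numpy as np
from scipy import stats

# e.g. nulliparous/multiparous counts in VD vs CS groups (hypothetical)
table = np.array([[120, 60],
                  [40, 55]])
chi2, p_cat, dof, expected = stats.chi2_contingency(table)

rng = np.random.default_rng(0)
vd_values = rng.normal(30.0, 4.0, size=80)   # e.g. maternal age, VD group
cs_values = rng.normal(32.0, 4.0, size=40)   # e.g. maternal age, CS group
t_stat, p_cont = stats.ttest_ind(vd_values, cs_values)  # two-sided by default

significant_cat = p_cat < 0.05               # chi-square at alpha = 0.05
significant_cont = p_cont < 0.05             # t-test at alpha = 0.05
```

`stats.ttest_ind` assumes equal variances by default; Welch's correction (`equal_var=False`) is the usual alternative when that assumption fails.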

Ethics declarations

The Ethics Committee of the University Hospital Centre of Coimbra reviewed and approved the study on 22 April 2022 (CE-047/2022). Requirement for written informed consent was waived due to the retrospective, de-identified nature of the patient data.