Introduction

Infertility, defined as the inability to conceive after at least 12 months of regular unprotected sex, has long been a global health problem1. It is estimated that one in six people of reproductive age worldwide will experience infertility in their lifetime1. In China, infertility has increased markedly over the past two decades2, and the infertile proportion is currently approximately one quarter3. Available evidence suggests that infertility significantly impairs quality of life and weakens partnerships compared with couples without infertility distress4,5,6. Infertility can also affect women's mental health, with higher levels of depression and anxiety7,8.

Assisted reproductive technology (ART) has evolved rapidly since its emergence in 1978, and its use to help infertile couples achieve pregnancy has led to the birth of more than 8 million newborns worldwide9. ART mainly consists of in vitro fertilization (IVF) with or without intracytoplasmic sperm injection (ICSI), embryo transfer, and frozen embryo transfer, of which IVF and/or ICSI are the recommended treatment options for couples with long-term unresolved fertility problems10. However, close to half of couples treated with IVF fail to get pregnant, even after multiple treatment cycles11. Live birth is therefore the most important outcome of ART treatment. Clinical prediction models for the live birth outcome of ART treatment that incorporate multiple patient characteristics can help couples form reasonable expectations about outcomes and costs, and can support consultation between physicians and couples regarding treatment decisions12. Currently, common predictors of live birth outcomes in patients treated by ART include demographic characteristics (maternal age, body mass index, ethnicity, etc.), clinical factors (cause of infertility, duration of infertility, type of infertility, etc.), and laboratory parameters (serum sex hormones, ovarian reserve, number of oocytes collected, sperm motility and morphology, etc.)13,14.

A review of prediction models for live birth outcomes of ART showed that currently available models generally suffer from methodological or study design limitations, such as the use of statistically inefficient random split-sample validation, unclear reporting of missing values, reporting of discrimination only, and the inclusion of women treated with IVF only15,16,17,18. Although the prediction model developed by Dhillon et al. had high reporting quality, it was derived from a UK population, and its applicability to other populations remains unclear15,19. Another review noted that only one prognostic prediction study for live births was at low risk of bias, but it included only couples treated with ICSI20,21.

Against this background, we aimed to develop and internally validate a prognostic prediction model for live birth using easily obtainable demographic characteristics and clinical features at the beginning of IVF in a large, representative sample of Chinese patients. Recently, several machine learning (ML) algorithms suitable for classification outcomes, such as random forest, extreme gradient boosting, and light gradient boosting, have been used extensively to construct clinical prediction models22,23,24. We therefore developed models using both traditional regression and these ML algorithms in order to choose the optimal one.

Materials and methods

Data source and study sample

Participants were recruited between January 2015 and December 2022 from couples who accepted ART treatment at the Second Affiliated Hospital of Kunming Medical University in Yunnan Province, southwest China. Our database contains data on all treatment cycles for 13,620 patients who initiated the first and subsequent IVF with ICSI treatment. Patients were excluded if they (1) initiated ART before the study period; (2) restarted ART after a live birth; (3) had missing vital information; or (4) were lost to follow-up for at least one year. Finally, 11,486 couples were included in the analysis. The detailed patient selection process is illustrated in Fig. 1.

Fig. 1 Analysis workflow.

The study was approved by the ethics committee of the Second Affiliated Hospital of Kunming Medical University. Because of the retrospective nature of the study, the committee waived the requirement for informed consent. We confirm that all methods were carried out in accordance with relevant guidelines and regulations. The study is reported in accordance with TRIPOD + AI, the extension and update of the original TRIPOD-2015 guideline (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis)25.

Definitions of outcome and candidate predictors

The primary outcome of interest in this study was whether a live birth occurred during a single ART cycle. A range of patient variables was extracted as candidate predictors based on previous studies and clinical practice considerations, including demographic characteristics (couple’s age, ethnicity, maternal body mass index), treatment-related information (duration of infertility, type of infertility, cause of infertility, previous ART cycles, insemination method, starting dosage of gonadotropins (Gn), duration of Gn, and total dosage of Gn), and laboratory test results (basal follicle-stimulating hormone (FSH), estradiol (E2), and luteinizing hormone (LH); progressive and non-progressive sperm motility; and E2, LH, and progesterone (P) on the day of human chorionic gonadotropin administration, hereafter HCG day). Candidate predictors were measured at the time of receiving ART treatment.

Statistical analysis

First, we performed descriptive analyses of the study subjects. Continuous variables were described according to their distribution: mean with standard deviation (SD) for normally distributed variables and median with interquartile range (IQR) for non-normally distributed variables. Differences between groups for continuous variables were tested by t-test or rank-sum test. Categorical variables were described as frequency (proportion), and differences between or among groups were compared using the Chi-squared test. Consistent with previous ART studies, we used the same age-stratification criteria (≤ 35 years, 35–39 years, and ≥ 40 years)26,27. We converted the remaining continuous variables into categorical variables based on quartiles. Univariate and multivariate logistic regressions (LR) were fitted to measure the crude and adjusted associations between the candidate predictors and the live birth outcome.
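The quartile-based categorization described above can be sketched as follows. This is a minimal illustration only, not the code used in the study (the analysis was done in R). The function name `quartile_levels`, the inclusive quantile method, the rule that a value equal to a cut point falls into the upper level, and the example durations are all assumptions.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_levels(values):
    """Map each value to a level 1..4 using P25, P50 and P75 as cut points."""
    # Three cut points: P25, P50, P75 (inclusive method is an assumption).
    cuts = quantiles(values, n=4, method="inclusive")
    # bisect_right places a value equal to a cut point in the upper level
    # (an assumption; the text does not specify boundary handling).
    return [bisect_right(cuts, v) + 1 for v in values]


# Hypothetical durations of infertility in years (not study data).
durations = [1, 2, 2, 3, 4, 5, 6, 8, 10, 12]
levels = quartile_levels(durations)
print(list(zip(durations, levels)))
```

Each continuous predictor would be recoded this way before entering the logistic regressions, so that every coefficient compares one quartile level with a reference level.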

Three machine learning algorithms (random forest, RF; extreme gradient boosting, XGBoost; light gradient boosting machine, LightGBM) were used to further identify the most important predictors of the live birth outcome among the candidate predictors retained by multivariate LR: variables that ranked among the top 6 in at least 2 of the 3 algorithms were chosen. For the chosen predictors, receiver operating characteristic (ROC) curves were used to determine their optimal cut-off values with regard to the live birth outcome.
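The selection rule (top 6 in at least 2 of the 3 algorithms) amounts to a simple vote over importance rankings. The sketch below illustrates that rule under stated assumptions: the function `select_by_vote` is hypothetical, and the rankings use placeholder variable names A to H rather than any importance values from the study.

```python
from collections import Counter


def select_by_vote(rankings, top_k=6, min_votes=2):
    """Return variables ranked in the top_k by at least min_votes algorithms."""
    votes = Counter()
    for ranked in rankings.values():
        # Each algorithm votes once for each of its top_k variables.
        votes.update(ranked[:top_k])
    return sorted(v for v, n in votes.items() if n >= min_votes)


# Hypothetical rankings, most important first (placeholders, not study results).
rankings = {
    "RF": list("ABCDEFGH"),
    "XGBoost": list("BACDGHEF"),
    "LightGBM": list("ABHGCDEF"),
}
selected = select_by_vote(rankings)
print(selected)
```

Requiring agreement between algorithms reduces the chance that a predictor is kept only because of one algorithm's particular importance measure.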

We used the optimal cut-offs to dichotomize all chosen predictors and then included them in the prediction model using four different algorithms (LR, RF, XGBoost, LightGBM). A random split sample is the simplest internal validation approach, but it is sub-optimal because it discards sample information and decreases statistical power28. For this reason, we used two other internal validation approaches (tenfold cross-validation and bootstrap with 500 resamples), both of which are recommended by the TRIPOD guideline25. The area under the receiver operating characteristic curve (AUROC) was used to assess discrimination, with values closer to 1 indicating greater ability to discriminate live birth outcomes29. The Brier score was used to assess calibration, with values closer to 0 indicating closer agreement between the predicted and observed probabilities of the outcome30.
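To make the two performance measures concrete, the sketch below computes AUROC as the proportion of (live birth, no live birth) pairs in which the live birth case receives the higher predicted probability, and the Brier score as the mean squared difference between predicted probability and outcome. It also shows a simple percentile bootstrap with 500 resamples. All names and data here are hypothetical, and the bootstrap is a simplification: it resamples fixed predictions and does not refit the model in each resample, as a full bootstrap validation would.

```python
import random


def auroc(y_true, y_prob):
    """Share of positive/negative pairs ranked correctly; ties count as 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def bootstrap_ci(y_true, y_prob, metric, n_boot=500, seed=0):
    """Percentile 95% interval of a metric over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when a resample has one class only
        stats.append(metric(ys, [y_prob[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Hypothetical outcomes (1 = live birth) and predicted probabilities.
y = [1, 0, 1, 0, 0, 1]
p = [0.8, 0.3, 0.6, 0.6, 0.1, 0.4]
auc_value = auroc(y, p)
brier_value = brier(y, p)
ci_low, ci_high = bootstrap_ci(y, p, auroc)
print(auc_value, brier_value, ci_low, ci_high)
```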

Statistical significance was set at a two-tailed p < 0.05, except that p < 0.10 was used for univariate logistic regressions when screening for all possible covariates. All data analysis was done in R software (Version: 4.4.0, Vienna, Austria).

Results

General characteristics of study subjects

As shown in Fig. 1, a total of 13,620 couples were initially included between 2015 and 2022. After data cleaning, we excluded participants with incomplete data (1682/13,620) and those lost to follow-up (452/11,938). The final analysis was based on 11,486 couples with complete information. Altogether, 3097 couples achieved a live birth, giving a live birth rate of 26.96% (95% CI 26.15%-27.79%).
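The sample flow and the live birth rate can be checked with a few lines of arithmetic. The text does not state which method produced the reported confidence interval, so the Wilson score interval used below is an assumption, and its bounds may differ from the reported ones in the last digit. The function name `wilson_ci` is illustrative.

```python
from math import sqrt


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method, see lead-in)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts taken from the text above.
initial = 13620
after_incomplete = initial - 1682   # excluded for incomplete data
final_n = after_incomplete - 452    # excluded for loss to follow-up
live_births = 3097

rate = live_births / final_n
low, high = wilson_ci(live_births, final_n)
print(final_n, round(rate * 100, 2), round(low * 100, 2), round(high * 100, 2))
```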

Among all study subjects, husbands were 34.61 ± 5.85 years old and wives were 33.18 ± 5.20 years old. Most individuals were of Han ethnicity (74.04% of husbands and 70.08% of wives), and the average duration of infertility was 4.32 ± 3.39 years. The differences between couples with and without live births were statistically significant for age of husband and wife, maternal BMI, duration of infertility, and previous ART cycles. Among the ovulation induction and laboratory indicators, there were statistically significant differences in all features except total dosage of Gn, basal E2, and non-progressive sperm motility (Table 1).

Table 1 General characteristics of study subjects.

Association between factors and live birth

To initially explore the impact of quantitative variables on live births, we classified age as a categorical variable according to the recommended thresholds, whereas the other quantitative variables were classified into four levels based on their quartiles: very low (< P25), low (P25–P50), moderate (P50–P75), and high (> P75). After fitting univariate binary logistic regressions, we included statistically significant variables (p < 0.10, the screening threshold specified in the statistical analysis) in further multivariate analyses. The results showed that maternal age and BMI, duration of infertility, previous ART cycles, progressive sperm motility, duration of Gn, total dosage of Gn, basal FSH, E2 on HCG day, and LH on HCG day were significantly associated with live birth (Table 2).

Table 2 Univariate and multivariate logistic regression fitting results on associated factors of live birth.

Machine learning results

We incorporated the variables retained by the multivariate analysis into three different machine learning algorithms (RF, XGBoost and LightGBM). Seven indicators were identified as the most important in all three algorithms: maternal age, duration of infertility, basal FSH, progressive sperm motility, and E2, LH and P on HCG day (Fig. 2). With the exception of duration of infertility, we used ROC curves to identify the optimal cut-off values for predicting the live birth outcome for the remaining six quantitative variables. The cut-offs were 36.97 years for maternal age, 5.57 mIU/mL for basal FSH, 33.52% for progressive sperm motility, 7227.50 pg/mL for E2 on HCG day, 3.04 mIU/mL for LH on HCG day, and 1.33 ng/mL for P on HCG day (Fig. 3).
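One common way to obtain an "optimal" ROC cut-off is to maximize the Youden index (sensitivity + specificity − 1). The text does not state which criterion was used, so the sketch below is an assumption for illustration, with a hypothetical function name and hypothetical ages and outcomes. It then dichotomizes the predictor at the chosen cut-off, as was done before model fitting.

```python
def youden_cutoff(values, outcomes):
    """Return (cutoff, J) maximizing Youden's J.

    Decision rule assumed here: value <= cutoff predicts outcome 1.
    """
    pos = sum(outcomes)
    neg = len(outcomes) - pos
    best_cut, best_j = None, -1.0
    for c in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, outcomes) if v <= c and y == 1)
        tn = sum(1 for v, y in zip(values, outcomes) if v > c and y == 0)
        j = tp / pos + tn / neg - 1  # sensitivity + specificity - 1
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j


# Hypothetical maternal ages and outcomes (1 = live birth); not study data.
ages = [28, 30, 31, 33, 35, 36, 38, 40, 42, 44]
births = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

cutoff, j = youden_cutoff(ages, births)
# Dichotomize the predictor at the chosen cut-off (1 = at or below cut-off).
age_low = [int(a <= cutoff) for a in ages]
print(cutoff, round(j, 2), age_low)
```

For predictors where higher values favour live birth, the comparison in the decision rule would be reversed.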

Fig. 2 Importance of screened variables in different algorithms.

Fig. 3 Receiver operating characteristic (ROC) curves of screened variables by ML algorithms.

Finally, we built predictive models using only the seven variables mentioned above with logistic regression and three different machine learning algorithms (RF, XGBoost, LightGBM). Both cross-validation and bootstrap methods indicated that LR and RF had the best model performance. Specifically, LR yielded an AUROC of 0.671 (95% CI 0.630–0.713) and a Brier score of 0.183 (95% CI 0.170–0.196) for cross-validation, and an AUROC of 0.671 (95% CI 0.662–0.683) and a Brier score of 0.183 (95% CI 0.179–0.187) for bootstrap. RF had similar discrimination and calibration performance, followed by XGBoost and LightGBM (Table 3). Standardized regression coefficients suggest that among the 7 included indicators, maternal age showed the strongest association with the live birth outcome, followed by P on HCG day and E2 on HCG day, whereas basal FSH was the weakest predictor (see Supplementary material, Table S1).

Table 3 Performance of different machine learning algorithms by using internal validation.

Discussion

In this study, we screened for potential predictors among easily obtained demographic and clinical indicators for live birth in a large sample of Chinese patients who received ART treatment. Based on statistical models and multiple machine learning algorithms, we have identified 7 promising indicators in predicting live birth outcome among ART patients: maternal age, duration of infertility, basal FSH, progressive sperm motility, and E2, LH and P on HCG day. The predictive models based on the 7 identified indicators provided fair and robust prediction accuracy, irrespective of different algorithms. The major findings of our study are expected to provide useful information in helping clinicians better triage patients at the baseline for upcoming ART treatments.

Among the 7 indicators that we identified, maternal age had the strongest association with live birth outcomes, followed by P on HCG day, E2 on HCG day, LH on HCG day, duration of infertility, and progressive sperm motility, with basal FSH showing the weakest influence. It is not surprising that maternal age is the strongest predictor of live birth, given that female fertility declines rapidly with aging, especially after the age of 37 years31. This is attributed to the decline in the number of oocytes and to age-related poor embryo quality31.

A higher level of P or E2 on HCG day was also significantly related to a lower probability of live birth. It is hypothesized that an elevated follicular-phase P concentration, produced by the growth of multiple follicles under ovarian stimulation, may contribute to changes in the endometrium, leading to embryo-endometrial asynchrony, which may adversely affect implantation and reduce the chance of live birth32. However, the role of estradiol levels on HCG day in pregnancy probability is still controversial. A meta-analysis indicated that there was insufficient evidence of an association between high E2 levels and pregnancy probability33. In contrast, previous studies have found that high E2 levels on HCG day were significantly predictive of lower live birth rates for couples undergoing frozen embryo transfer34,35. In recent years, basal FSH has been recognized as a predictor of live birth outcome after IVF treatment13. A higher level of basal FSH has also been connected to poor ovarian response36. In this study, we included basal FSH in the final prediction model because it significantly improved prediction accuracy; however, unlike maternal age and P and E2 on HCG day, its association with the live birth outcome was generally weak. These inconsistencies between our study and the sparse evidence currently available warrant further investigation.

During development of the prediction model, we initially included previous IVF history in an attempt to adjust for its influence on live birth outcomes. However, previous IVF history had only a negligible influence on live birth and was subsequently eliminated. The general predictive performance of our models, as measured by AUROC, was similar to that of comparable models16,17,18. Among all the prediction models that we fitted using the ML algorithms, the RF model outperformed the others. However, its performance was similar to that of the LR model in both discrimination and calibration. As the LR model is a widely used generalized linear model that is much easier to fit, it should be preferred over more complicated ML models.

Our study used a sufficiently large sample of IVF patients to develop predictive models for live birth outcomes, and screened a large group of easily obtained baseline predictors. The similar predictive accuracy of the models fitted by different algorithms partly supports the robustness of the identified factors. Nevertheless, the present study has some limitations that should be noted. Firstly, the overall discrimination of the predictive models was not high (AUROC of around 0.67), which suggests that other important predictors are yet to be identified. For instance, serum anti-müllerian hormone (AMH), which reflects ovarian reserve, has been identified as the most important predictor of live birth-related outcomes of ART treatment in existing prediction models37. Embryo quality is also considered a valuable predictor14. However, because the data were unavailable, we could not include these important variables in our current prediction models. Secondly, we only screened for baseline indicators that are predictive of live birth outcomes for IVF patients. Since the period from treatment inception to live birth is long, it would be interesting to investigate the role of time-varying factors in IVF outcomes by using dynamic prediction models. Finally, the study sample was derived from a single medical institution using a retrospective study design; therefore, information bias and selection bias could not be avoided. Multicenter, prospective studies should be done in the future to externally validate our major findings.

In summary, we constructed prognostic prediction models for the live birth outcome in couples undergoing IVF, with or without ICSI treatment, using logistic regression and machine learning algorithms. The models resulting from different approaches yielded similar predictive performance, and the logistic regression model was considered to have the best performance and is recommended for further validation. Future studies with a longitudinal design that incorporate more meaningful indicators are warranted to validate and improve the prediction accuracy of the current models.