Introduction

Assessing ovarian reserve is crucial for determining female reproductive potential1, formulating controlled ovarian stimulation (COS) protocols2,3, predicting menopause risk4, and diagnosing premature ovarian insufficiency5. Common evaluation methods include biochemical tests and ultrasound exams, such as basal follicle-stimulating hormone, inhibin B, anti-Müllerian hormone (AMH), the clomiphene citrate challenge test, antral follicle count (AFC), ovarian volume, and ovarian blood supply6,7,8,9,10,11,12,13,14. AMH and AFC are ideal for evaluating the ovarian reserve and predicting COS outcomes15,16,17. They both have a high accuracy in the prediction of ovarian response category18. For predicting high and low response to ovarian stimulation, use of either AFC or AMH is recommended19.

Limitations of AMH include the absence of an international standard for concentration evaluation, high costs, and time-consuming procedures, necessitating more stable and automated detection methods20,21. AFC is advantageous due to its non-invasive nature, ease of performance, and ability to separately evaluate each ovary15. Transvaginal ultrasound provides a comprehensive view of the structure and potential lesions of the uterus and bilateral adnexal areas22. However, manual follicle counting lacks standardization and is affected by inter-observer and inter-equipment variability16,21,23.

Three-dimensional ultrasound and intelligent analysis software can shorten transvaginal ultrasound scanning duration, allow offline AFC data analysis, and enhance inter-observer reliability24,25. However, these methods may inaccurately assess follicles averaging 2–10 mm in diameter26. Additionally, the high cost of equipment and software limits widespread use27,28. Poor image quality more adversely affects three-dimensional ultrasound than manual methods24.

We hypothesized that artificial intelligence could analyze static ultrasound images of bilateral ovaries to predict the ovarian reserve. We combined radiomics for high-throughput image feature quantification29 with deep convolutional neural networks for efficient feature learning30, integrating these methods into a unified system31,32.

Materials and methods

Patients and study design

Appropriate ethical standards were followed throughout the study. Owing to the retrospective nature of the study, the Ethics Committee of the Women and Children’s Hospital of Chongqing Medical University (IRB-2024021) waived the need to obtain informed consent. The study was carried out in accordance with the 1964 Declaration of Helsinki and the STROBE guidelines.

Data were retrospectively collected from 395 patients clinically diagnosed with infertility33, who underwent ovarian reserve assessment including AFC and AMH testing, at the Women and Children’s Hospital of Chongqing Medical University between October 2016 and August 2023. As illustrated in Fig. 1, patients from Center 1 (Qixinggang branch) were divided into a training cohort (n = 210) and an internal-test cohort (n = 91) to develop and evaluate the model. To ensure model robustness and assess real-world generalizability across different clinical settings, patients from Center 2 (Ranjiaba branch), which operates independently with different equipment and sonographers, were used as the external-test cohort (n = 94).

Fig. 1

Flowchart summarizing patient enrollment process and study cohorts. AFC, antral follicle count; AMH, anti-Müllerian hormone.

Inclusion criteria: (1) premenopausal women aged ≥ 18 years; (2) individuals with a sexual history who were willing to undergo transvaginal ultrasound and AMH testing; and (3) regular menstrual cycles. Exclusion criteria: (1) ovarian disease, previous ovarian surgery, radiotherapy, or chemotherapy; (2) prior ovarian stimulation; (3) functional cysts or follicles > 10 mm in diameter; and (4) poor-quality ultrasound images.

Additionally, considering the correlation between AFC and ovarian response after COS17,34, the participants were categorized based on their AFC values (≤ 5, 6–19, and ≥ 20), corresponding to poor, normal, and high ovarian responses, respectively1,15,35,36,37,38. AFC predictions were conducted for each of these categories.
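The cut-offs above can be written as a small lookup. The following Python sketch is illustrative only; the function name `afc_category` and the label strings are assumptions introduced here and are not part of the study code.

```python
def afc_category(afc: int) -> str:
    """Map an antral follicle count to the response category used in the text.

    Cut-offs follow the text: <= 5 poor, 6-19 normal, >= 20 high.
    The label strings are illustrative assumptions.
    """
    if afc < 0:
        raise ValueError("AFC cannot be negative")
    if afc <= 5:
        return "poor"
    if afc <= 19:
        return "normal"
    return "high"


if __name__ == "__main__":
    # Boundary values on both sides of each cut-off.
    for count in (0, 5, 6, 19, 20, 35):
        print(count, afc_category(count))
```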

AFC (dynamic scanning)—AFC3D

Transvaginal examinations were performed using ultrasound equipment, including the GE E8, GE S6, GE S8, GE E6 (GE Healthcare, USA), and the IE33, Q5 (Philips Company, USA). Antral follicles were counted manually after a comprehensive scan of both ovaries with the probe frequency set at 3–11 MHz, and the results were used as the AFC3D. All doctors who conducted the examinations had over 5 years of experience in gynecological ultrasound.

AFC (static image)—AFC2D

During the AFC3D assessment, fan-type scanning was performed in the long axis view of each ovary, and images were frozen at the section displaying the highest number of antral follicles. Subsequently, duplex images of both ovaries were stored in a Picture Archiving and Communication System (Fig. 2). A doctor with 5 years of experience in gynecological ultrasound examinations (Reader 1) counted the antral follicles in the duplex images, resulting in the AFC2D. Next, another doctor with 11 years of experience in gynecological ultrasound verified the counts (Reader 2). When discrepancies occurred, the counts from Reader 2 were considered the final decision.

Fig. 2

Static ultrasound images of the long axis of both ovaries with the most follicles. A: High risk of poor ovarian response (AFC3D ≤ 5). B: Normal ovarian reserve (6 ≤ AFC3D ≤ 19). C: High risk of high ovarian response (AFC3D ≥ 20). AFC, antral follicle count.

AMH test

Venous blood samples (10 mL) were collected on the day of the transvaginal ultrasound. The samples were incubated at room temperature for 1 h and then centrifuged at 3,000 r/min for 10 min using a centrifugation radius of 12 cm. The supernatant was subsequently transferred to a clean centrifuge tube and stored at -20 °C until analysis using an enzyme-linked immunosorbent assay kit (Roche Diagnostics International Ltd, Rotkreuz, Switzerland).

Image segmentation and pre-processing

Reader 2 segmented duplex ultrasound images of both ovaries using an open-source labeling tool (LabelMe; http://labelme.csail.mit.edu/Release3.0) to identify the regions of interest (ROIs). To enhance the visual contrast of these ROIs, histogram equalization was applied to adjust the gray-level distribution of the images. Subsequently, the pre-processed images were normalized and resized to 224 × 224 pixels for input into the deep-learning (DL) models. Specifically, bicubic interpolation was used to resize the images and nearest-neighbor interpolation was used for the masks; the single-channel images were then expanded to three channels so that ImageNet-pretrained weights could be used. The entire image-processing workflow was executed using Python (version 3.12.1) and OpenCV (version 4.9.0).
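To make the contrast-enhancement and channel-expansion steps concrete, the sketch below implements the textbook cumulative-histogram mapping for an 8-bit gray-scale image in plain Python. It is not the study code, which used OpenCV, and it is not claimed to reproduce OpenCV output exactly; the function names and the tiny example image are assumptions chosen for illustration.

```python
def equalize_histogram(image, levels=256):
    """Equalize a gray-scale image given as a list of rows of ints in [0, levels)."""
    flat = [pixel for row in image for pixel in row]
    hist = [0] * levels
    for pixel in flat:
        hist[pixel] += 1

    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: nothing to stretch, return a copy.
        return [row[:] for row in image]

    # Map each level so that the occupied range spans 0..levels-1.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[pixel] for pixel in row] for row in image]


def to_three_channels(image):
    """Replicate a single-channel image into three identical channels."""
    return [[[pixel, pixel, pixel] for pixel in row] for row in image]


if __name__ == "__main__":
    low_contrast = [[10, 20, 30], [40, 50, 60]]  # hypothetical 2 x 3 image
    equalized = equalize_histogram(low_contrast)
    print(equalized)
    print(to_three_channels(equalized)[0][0])
```

Replicating the gray channel three times adds no information; it only matches the input shape expected by networks pretrained on color images.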

Development of machine learning-based radiomic model

Radiomic feature extraction

Radiomic features were extracted from each ovary’s ROI using Pyradiomics (version 3.0, https://pypi.org/project/pyradiomics/) per the Image Biomarker Standardization Initiative guidelines39. Original images underwent wavelet and Laplacian of Gaussian filtering with kernel sizes of three and five, respectively, to extract first-order and textural features. Bin width parameters were set using the Freedman–Diaconis rule. For each feature, the mean of the bilateral ovarian values was standardized using z-score normalization to serve as the final radiomic feature. Details are presented in Online Resources 1 and 2.
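The two numerical steps in this paragraph can be sketched as follows. The Freedman–Diaconis rule sets the bin width to 2 × IQR / n^(1/3), and the z-score rescales the bilateral mean of a feature across patients. The function names, the quartile convention (the "inclusive" method of Python's `statistics.quantiles`), and the use of the population standard deviation are assumptions made for this illustration; the source does not specify them.

```python
import statistics


def freedman_diaconis_bin_width(values):
    """Bin width h = 2 * IQR / n^(1/3) for a list of numeric values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    # Assumption: quartiles computed with the 'inclusive' interpolation method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2.0 * (q3 - q1) / n ** (1.0 / 3.0)


def zscore_bilateral_mean(left, right):
    """Average one feature over both ovaries per patient, then z-score across patients."""
    means = [(l + r) / 2.0 for l, r in zip(left, right)]
    mu = statistics.fmean(means)
    sigma = statistics.pstdev(means)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("feature is constant across patients")
    return [(m - mu) / sigma for m in means]


if __name__ == "__main__":
    gray_levels = [1, 2, 3, 4, 5, 6, 7, 8]      # hypothetical intensities
    print(freedman_diaconis_bin_width(gray_levels))
    left_ovary = [1.0, 3.0, 5.0]                # hypothetical feature values
    right_ovary = [3.0, 5.0, 7.0]
    print(zscore_bilateral_mean(left_ovary, right_ovary))
```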

Feature selection

The stability of features was evaluated using the intraclass correlation coefficient (ICC). Features with ICC ≥ 0.75 were considered highly stable and retained for further analysis. The selection of relevant radiomic features involved three steps: (1) Univariate linear regression was used for feature selection in the regression tasks (AFC3D and AMH), while the Kruskal–Wallis test was applied for the three-class classification task to identify features with significant group differences; (2) redundant features with Pearson correlation coefficients > 0.90 were removed to eliminate collinearity; and (3) the most relevant radiomic features in the training cohort were identified using the least absolute shrinkage and selection operator (LASSO) method. The optimal parameter λ was determined through 10-fold cross-validation with 5,000 iterations. Coefficients for each feature were calculated based on the optimal λ, and features with non-zero coefficients were selected for model development.
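Step (2) above, the removal of collinear features, can be sketched with a greedy filter. The source does not state which member of a highly correlated pair was retained, so the rule used here (keep the feature encountered first) is an assumption, as are the function names and the toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(features, threshold=0.90):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.

    `features` maps feature name -> list of values (one per patient).
    Assumption: the first feature of a correlated pair is the one retained.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    toy = {
        "feature_a": [1, 2, 3, 4, 5],
        "feature_b": [2, 4, 6, 8, 10.5],  # nearly a multiple of feature_a
        "feature_c": [5, 1, 4, 2, 3],
    }
    print(drop_redundant(toy))
```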

Model development

Multivariate linear regression models were created to quantitatively predict AFC3D and AMH levels. For the three-class classification of AFC3D, five classifiers were developed: support vector machine, random forest, logistic regression, decision tree, and XGBoost. Optimal parameters for each model were identified using five-fold cross-validation combined with a grid search strategy within the training cohort. The final model was chosen based on a comparative analysis of each classifier’s predictive performance in the internal-test cohort.

Development of DL and combined (rad-DL) models

Pre-processed bilateral ovarian ROI images were individually input into ResNet18. Deep features from the final convolutional layer were concatenated and processed through two fully connected layers with 256 hidden neurons each. Data augmentation techniques, including random horizontal and vertical flips, 0–30° random rotation, and gray-scale adjustments between 0.5 and 1.5, were applied to enhance model generalization. For quantitative prediction, the mean square error (MSE) loss function, the Adam optimizer, and a cosine-annealing schedule with an initial learning rate of 1e-3 were used. The cross-entropy loss function was used for classification tasks. The entire process was executed using PyTorch (https://pytorch.org/) and an RTX 4090 GPU (Nvidia GeForce).
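For readers unfamiliar with cosine annealing, the learning rate follows half a cosine wave from its initial value down to a minimum. The sketch below evaluates the standard closed form; the initial rate of 1e-3 comes from the text, whereas the minimum rate of 0 and the 100-epoch horizon are assumptions, since the source does not report them.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, lr_max=1e-3, lr_min=0.0):
    """Learning rate at `epoch` under a single cosine-annealing cycle.

    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / total_epochs))
    """
    if total_epochs <= 0:
        raise ValueError("total_epochs must be positive")
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


if __name__ == "__main__":
    total = 100  # hypothetical number of epochs
    for epoch in (0, 25, 50, 75, 100):
        print(epoch, cosine_annealed_lr(epoch, total))
```

The rate decays slowly at first, fastest at the midpoint, and slowly again near the end of training.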

After ICC stability screening, radiomic features were combined with deep features and input into a two-layer fully connected network with 512 hidden neurons. The Rad-DL model was trained using the same training strategy and parameters described above. To provide visual explanations of the model’s decisions, we applied the Grad-CAM40 technique to the last convolutional layer. The study design is depicted in Fig. 3.

Fig. 3

The workflow of this study. Bilateral ovaries were contoured manually on the duplex static images with the largest number of antral follicles in the long axis, and the ROI enclosing the ovary was cropped as the predictive model input. The deep features and the radiomic features were separately extracted. The deep learning, radiomic, and Rad-DL models were developed for quantitative prediction of AFC3D and AMH levels, as well as sequential three-class classification prediction of AFC3D. Model evaluation was performed using ROC curves, calibration curves, and DCA. Additionally, Grad-CAM was employed to visualize the interpretability of deep learning. AFC, antral follicle count; AMH, anti-Müllerian hormone; DCA, decision curve analysis; DL, deep learning; Grad-CAM, Gradient-weighted class activation map; Rad, radiomics; ROC, receiver operating characteristic curve; ROI, region of interest.

Model evaluation

Evaluation indicators for quantitative predictions included R², mean absolute error (MAE), MSE, median absolute error (MedAE), and explained variance (EV). Discrimination of the classification models was evaluated using the area under the curve (AUC) of the receiver operating characteristic curve, accuracy (ACC), the confusion matrix, sensitivity, specificity, and the F1 score. Additionally, calibration curves were used to assess model calibration, and decision curve analysis (DCA) was performed to evaluate the clinical utility of the prediction models by calculating the net benefit at different threshold probabilities.
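The five regression indicators have simple closed forms, sketched below in plain Python using their standard definitions. The function name and the example values are assumptions for illustration and are not data from the study.

```python
import statistics


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, MSE, MedAE and explained variance (EV) as a dict."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(r) for r in residuals]

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true is constant; R^2 and EV are undefined")
    ss_res = sum(r ** 2 for r in residuals)

    # EV uses the variance of the residuals, so a constant bias is not penalized.
    mean_res = sum(residuals) / n
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    var_true = ss_tot / n

    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs_errors) / n,
        "MSE": ss_res / n,
        "MedAE": statistics.median(abs_errors),
        "EV": 1.0 - var_res / var_true,
    }


if __name__ == "__main__":
    observed = [3.0, -0.5, 2.0, 7.0]   # hypothetical values
    predicted = [2.5, 0.0, 2.0, 8.0]
    for name, value in regression_metrics(observed, predicted).items():
        print(f"{name}: {value:.4f}")
```

R² and EV coincide when the residuals have zero mean; a gap between them indicates systematic over- or under-prediction.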

Statistical analysis

Continuous data are presented as mean ± standard deviation or median and interquartile range, based on the Shapiro–Wilk test for normality. Spearman correlation analysis was used to assess correlations between continuous variables. The Wilcoxon signed-rank test was used for single-sample tests. The Kruskal–Wallis test was used to analyze differences among the three groups of sequential categorical variables, and the significance values were adjusted for multiple testing using Bonferroni correction. Bland–Altman plots were used to verify the agreement between methods. Categorical variables are described as frequency and rate, and a two-sided P < 0.05 indicated statistical significance.

Results

Study population characteristics

The baseline characteristics for the quantitative predictions are presented in Table 1. The AFC3D was significantly greater than the AFC2D in the training, internal-test, and external-test cohorts (P < 0.05). The baseline characteristics for the sequential three-class classification prediction are presented in Table 2. The numbers of patients in the training, internal-test, and external-test cohorts were 62, 27, and 34 for AFC3D ≤ 5; 96, 41, and 45 for 6 ≤ AFC3D ≤ 19; and 52, 23, and 15 for AFC3D ≥ 20, respectively. Additionally, as AFC3D increased, patient age decreased progressively (P < 0.001), while AFC2D and AMH levels increased progressively (P < 0.001) across the three cohorts.

Table 1 Baseline characteristics for quantitative prediction in patients undergoing transvaginal ultrasound and AMH testing.
Table 2 Baseline characteristics for sequential classification prediction in patients undergoing transvaginal ultrasound and AMH testing.

Correlation and consistency analysis

As shown in Table 3, AFC2D and AFC3D were highly correlated in both branches (P < 0.001). Additionally, AMH levels were strongly correlated with AFC3D (P < 0.001). The Bland–Altman plots (Fig. 4) indicated that only a few cases fell outside the 95% limits of agreement: 4.3% (9 of 210) in the training cohort, 2.2% (2 of 91) in the internal-test cohort, and 3.2% (3 of 94) in the external-test cohort. These findings demonstrate good agreement between the AFC2D and AFC3D.
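The Bland–Altman quantities reported here (mean difference, 95% limits of agreement, and the number of cases outside them) can be computed as in the following sketch. The paired counts are hypothetical, the function name is an assumption, and the use of the sample standard deviation is a common convention rather than something stated in the source.

```python
import statistics


def bland_altman(method_a, method_b, z=1.96):
    """Return the mean difference, limits of agreement, and count of outliers."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # assumption: sample standard deviation
    lower, upper = bias - z * sd, bias + z * sd
    outside = sum(1 for d in diffs if d < lower or d > upper)
    return {"bias": bias, "lower": lower, "upper": upper, "outside": outside}


if __name__ == "__main__":
    afc_dynamic = [10, 12, 8, 15, 20]  # hypothetical dynamic-scan counts
    afc_static = [8, 11, 8, 12, 17]    # hypothetical static-image counts
    result = bland_altman(afc_dynamic, afc_static)
    print(result)
    print(f"{100 * result['outside'] / len(afc_dynamic):.1f}% outside the limits")
```

If the differences are approximately normally distributed, about 5% of cases are expected to fall outside the limits, which is the benchmark against which the reported 2.2–4.3% can be read.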

Table 3 Correlation analysis of case data in two branches of the hospital.
Fig. 4

Differences in AFC3D and AFC2D. Bland–Altman plots showing the mean (x-axis) and the difference (y-axis) in AFC between the methods. The full line marks the mean difference, and the dashed lines indicate the 95% limits of agreement (± 1.96 SD). A: training cohort; B: internal test cohort; C: external test cohort. AFC, antral follicle count; SD, standard deviation.

Radiomic feature extraction and selection

Overall, 674 quantitative radiomic features were extracted from the ultrasound images. After removing features with poor reproducibility and stability (ICC < 0.75), 437 radiomic features were retained for ANOVA and Pearson correlation analysis. Ultimately, the LASSO method identified 10 radiomic features for constructing a quantitative prediction model for AFC3D (Online Resource 3(a), (b), (c)), five radiomic features for the quantitative prediction model for AMH level (Online Resource 4(a), (b), (c)), and 10 radiomic features for the sequential three-class classification prediction of AFC3D (Online Resource 5(a), (b), (c)).

Establishment and evaluation of quantitative prediction models

Multiple linear regression was used as the regression algorithm of the radiomic model to quantitatively predict AFC3D and AMH levels. As presented in Table 4, in the training cohort, the R² values for the radiomics, DL, and Rad-DL models were 0.641, 0.993, and 0.976, respectively. In the internal-test and external-test cohorts, the R² values of the radiomic model were closely aligned, while those of the DL model declined markedly in the external-test cohort. Additionally, the R² values of the Rad-DL model in the internal-test and external-test cohorts were 0.743 and 0.583, respectively.

Table 4 Performance of radiomic, deep learning, and combined Rad-DL models in quantitative prediction of AFC3D.

In the quantitative prediction of the AMH levels, as presented in Table 5, in the training cohort, the R² values of the radiomic, DL, and Rad-DL models were 0.520, 0.950, and 0.962, respectively. In the internal-test and external-test cohorts, the R² values of the radiomic model were 0.323 and 0.420, respectively, while those of the DL model were 0.591 and 0.352, respectively, showing a marked decline in the external-test cohort. The R² values of the Rad-DL model in the internal-test and external-test cohorts were 0.550 and 0.509, respectively.

Table 5 Performance of radiomic, deep learning, and combined Rad-DL models in quantitative prediction of AMH.

Higher R² and EV values, along with lower MAE, MSE, and MedAE values, indicate better regression performance. As illustrated in Fig. 5, the radar charts based on each regression performance evaluation index indicate that the Rad-DL model exhibited the best overall performance in both prediction tasks.

Fig. 5

A comprehensive evaluation of the quantitative prediction of AFC3D and AMH. A: the quantitative prediction of AFC3D; B: the quantitative prediction of AMH. MAE: mean absolute error; MSE: mean square error; MedAE: median absolute error; EV: explained variance; AMH, anti-Müllerian hormone; Rad, radiomics; DL, deep learning.

Establishment and evaluation of prediction model performance for the sequential three-class classification

Based on the results of the training cohort’s five-fold cross-validation and grid search, five machine learning models were compared in the internal-test cohort. As presented in Table 6, random forest demonstrated the best overall predictive performance according to metrics including AUC, ACC, F1, sensitivity, and specificity. Therefore, a random forest was selected as the final model for internal-test and external-test tasks.
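For a three-class problem, sensitivity, specificity, and F1 are typically derived per class in a one-vs-rest fashion from the confusion matrix. The sketch below shows that calculation; the confusion matrix is hypothetical, the function names are assumptions, and the source does not state how per-class values were averaged.

```python
def one_vs_rest_metrics(cm, k):
    """Sensitivity, specificity and F1 for class k.

    `cm[i][j]` is the number of cases with true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # class k cases predicted as another class
    fp = sum(row[k] for row in cm) - tp   # other classes predicted as class k
    tn = total - tp - fn - fp
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0,
    }


def accuracy(cm):
    """Overall accuracy: correctly classified cases divided by all cases."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


if __name__ == "__main__":
    # Hypothetical counts; rows and columns ordered as poor, normal, high.
    confusion = [[5, 1, 0],
                 [2, 6, 1],
                 [0, 1, 4]]
    print("accuracy:", accuracy(confusion))
    for index, label in enumerate(("poor", "normal", "high")):
        print(label, one_vs_rest_metrics(confusion, index))
```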

Table 6 Classification results on the internal test cohort of five machine learning classifiers for model selection.

The AUC values of the radiomic, DL, and Rad-DL models across the AFC3D categories in the training, internal-test, and external-test cohorts are presented in Table 7 and Fig. 6(A–C). Detailed performance evaluation of the sequential three-class classification prediction for AFC3D is shown in Online Resource 6. According to the DeLong test, fusing radiomic features into the DL model numerically improved predictive performance in the internal-test and external-test cohorts, although the differences did not reach statistical significance (P = 0.15 and P = 0.08, respectively) (Table 7). The calibration curves of the three models showed good agreement between predicted and observed probabilities in all cohorts, and the MAE values of the three models were 0.008, 0.035, and 0.033 for the external-test cohort, respectively (Fig. 6(D–F) and Online Resource 7(A–F)). Furthermore, the DCA results indicated that the fusion model achieved a greater net benefit across most threshold probabilities than the radiomic model, DL model, treat-all strategy, and treat-none strategy (Fig. 6(G–I)).
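The net benefit plotted in a decision curve is, at threshold probability p_t, TP/n − (FP/n) × p_t/(1 − p_t); the treat-none strategy has a net benefit of zero by definition. The sketch below evaluates this standard formula for a binary outcome. Applying it to one class versus the rest, classifying a case as positive when its predicted probability is at least p_t, and the example data are all assumptions; the source does not describe how DCA was adapted to three classes.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of a model at one threshold probability (binary outcome, 0/1 labels)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every patient is classified as positive."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    weight = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * weight


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0]                # hypothetical: 1 = poor responder
    probabilities = [0.9, 0.4, 0.6, 0.2, 0.1]
    for pt in (0.1, 0.3, 0.5):
        print(pt, net_benefit(labels, probabilities, pt),
              net_benefit_treat_all(labels, pt))
```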

Table 7 Performance of radiomic, deep learning, and combined Rad-DL models in sequential classification prediction of AFC3D.
Fig. 6

The performance evaluation of the sequential three-class classification prediction for AFC3D. A–C: Micro-average AUC in ROC curves of the sequential three-class classification in the training cohort and the internal and external test cohorts; D–F: calibration curves of the sequential three-class classification in the external test cohorts; G–I: DCA curves of the sequential three-class classification in the training cohort and the internal and external test cohorts. AFC, antral follicle count; AUC, the area under the curve; DCA, decision curve analysis; DL, deep learning; Rad, radiomics; ROC, receiver operating characteristic curve.

Interpretability of the DL model

We used Grad-CAM to visualize which image regions drove the DL model’s predictions. Heatmaps and their superpositions on the ultrasound images were acquired from the external-test cohort (Fig. 7). The anechoic regions of the antral follicles showed the strongest activation, suggesting that these regions contributed most to the DL model’s predictions.

Fig. 7

Two-dimensional gray-scale ultrasound images of the ovaries and the corresponding activation heat maps. The color change from red to blue corresponds to the degree of algorithm activation within the region. In the examples presented, the anechoic regions where the antral follicles are located show the most pronounced red coding, indicating that these regions have the strongest activation and the highest predictive value for the deep learning model.

Discussion

Radiomics and DL are extensively used in medical image recognition and analysis. Transvaginal ultrasound-guided manual counting of antral follicles is standard for assessing the ovarian reserve41. Advances in ultrasound and computerized analysis now enable more accurate evaluations by analyzing comprehensive bilateral ovarian images rather than just counting antral follicles. Studies on automating the measurement of antral follicle diameter, volume, and number using three-dimensional ultrasonographic DL28,42 demonstrated reduced time and improved inter-observer reliability24,25,43. Clinically, the focus is on assessing the ovarian reserve rather than exact follicle counts, making the use of expensive ultrasound equipment with three-dimensional automated volume imaging and counting22 impractical.

We selected duplex static images showing the most antral follicles along the long axes of both ovaries. Although static images displayed significantly fewer antral follicles than dynamic scans, the two methods were strongly correlated and showed good agreement. This indicates that the ovarian reserve information in static images is closely related to that obtained from dynamic scanning, suggesting that static images could potentially substitute for dynamic scanning, and it established the basis for the subsequent intelligent image analysis. The radiomics model demonstrated greater robustness than the DL model in quantitatively predicting AFC3D in both internal and external cohorts. Notably, the radiomics model outperformed the DL model on the external-test cohort, a difference partly attributable to the lower resolution of external ultrasound images, which can adversely affect DL performance44,45,46,47. This effect is especially pronounced for more complex tasks, such as AMH prediction, where instability in learned deep features can reduce generalization. To mitigate this and enhance robustness, we developed a hybrid Rad-DL approach that integrates handcrafted radiomics features with automatically learned deep features, enabling the model to leverage both stable, interpretable features and rich data-driven representations. As shown in our results, the Rad-DL model achieved the best overall performance across the three tasks: it performed comparably to radiomics in the simpler AFC regression task but outperformed both radiomics and DL in the more challenging AMH regression in the external-test cohort, where image quality was lower.

In the quantitative prediction of AMH, all three models underperformed relative to AFC3D prediction across all cohorts. Although AMH and AFC3D are strongly positively correlated and both are established ovarian-reserve biomarkers, they do not always concur, which largely explains this disparity. Zhang et al. reported that approximately 20% of infertile patients exhibited discordant AFC and AMH levels in clinical practice48. Moreover, in the AFC regression task, the three models demonstrated similar performance on the external-test cohort, likely because AFC is a morphological trait that is directly visible and measurable on ultrasound, making it a relatively straightforward learning problem for image-based models. In contrast, AMH lacks a direct visual correlate on ultrasound and must be inferred indirectly via morphological proxies such as AFC, which increases task complexity and reduces learnability from image data alone.

Regarding the models themselves, we implemented several strategies during development to mitigate overfitting, including early stopping, dropout, and cross-validation. Nevertheless, the high dimensionality of the data and the relatively limited sample size in the deep-learning tasks still contributed to overfitting. Moreover, quantitative regression problems are more prone to overfitting than classification tasks because they are sensitive to noise and must predict continuous values. By contrast, for the sequential three-class classification task, overfitting was much better controlled: the performance of the DL and Rad-DL models remained stable across both test cohorts. These results suggest that the overfitting-mitigation strategies were more effective for classification and that the observed generalization gap is within an acceptable range.

Accurate assessment of the ovarian reserve through AFC is crucial for identifying patients at high risk of poor or excessive ovarian response, enabling the selection of appropriate COS protocols49,50,51,52. However, counting antral follicles accurately is challenging due to factors such as operator experience22, examination equipment53, and patient conditions (e.g., obesity, intestinal gas interference). Consequently, a sequential three-class classification prediction was conducted following the quantitative prediction of the AFC3D. The three models showed strong discrimination in identifying high-risk patients with poor or excessive ovarian response but were less effective for those with a normal ovarian reserve. The primary clinical challenge is to prevent poor and excessive ovarian responses during COS. The proposed sequential three-class classification model offers a practical and efficient solution, particularly when AFC is challenging because of factors such as high-lying ovaries, bowel gas interference, obesity, and the presence of pathological lesions that may obscure follicle visualization. By streamlining the assessment workflow, the model can reduce both the sonographer’s workload and the ultrasound acquisition time. The static images for analysis can be obtained from the most common and widely used ultrasound devices. The analysis does not require the examination equipment to have additional computational capabilities, as it can be performed centrally in the cloud. Since there is no need for rigid dynamic scanning to count all antral follicles for accurate screening of patients with different ovarian responses, we believe this approach is particularly well-suited for clinical settings where high-end ultrasound equipment is lacking, operator experience is limited, or procedural standardization is weak.

The ultimate goal of our research is to elucidate the correlation between ultrasound imaging features and fertility treatment outcomes, including precision-driven gonadotropin dosing strategies, the yield of mature (metaphase II) oocytes, blastocyst formation rates, and, ultimately, clinical pregnancy rates. However, the focus of the current study was to validate the effectiveness of our new method compared with that of conventional AFC before accounting for confounding factors such as stimulation protocols, with the aim of demonstrating that this new method can be as effective as conventional AFC in predicting outcomes. Future studies should further explore the model’s applicability for long-term outcome prediction and its potential integration into personalized assisted reproductive technology protocols.

This study had some limitations. First, the image resolution of the external-test cohort was significantly compromised due to disparities in image acquisition cards between the two branches, which affected the model’s performance in quantitative prediction. Developing models that can be effectively applied to ultrasound images of varying quality is essential. Future research will use DL-based super-resolution reconstruction technology54,55 to enhance image resolution and establish an improved cohort for external model testing. Second, operator subjectivity influenced the selection of static images with the most antral follicles along the long axis of both ovaries, though overall image classification was less subjective than manual methods. Third, the applicability of the model is somewhat limited: only cases with regular menstrual cycles were included, so the model may not be suitable for patients with polycystic ovary syndrome, and cases with ovarian cysts or follicles larger than 10 mm in diameter were excluded. Lastly, the data comprised a moderate sample size from two branches of the same hospital. Expanding the sample size and increasing the number of participating centers is critical for enhancing prediction accuracy.

Conclusion

This study utilized radiomics and DL on static ultrasound images of the ovaries to predict AFC and AMH levels. The models can quantitatively estimate antral follicle counts and accurately identify patients with infertility who have abnormal ovarian reserves. This research lays a solid foundation for advancing beyond the conventional AFC method and exploring intelligent, non-invasive approaches for assessing the ovarian reserve.