Ultrasound radiomics and deep learning for predicting antral follicle count and anti-Müllerian hormone

Zhang, Jinwei; Liu, Shangqing; Liu, Shuang; Rong, Yue; Zhong, Chunyan; Ni, Dong; Ran, Suzhen

doi:10.1038/s41598-025-33010-w


Article
Open access
Published: 20 December 2025

Scientific Reports volume 16, Article number: 3115 (2026)

Abstract

To overcome inter-observer variability in conventional antral follicle count (AFC) assessment and AMH testing limitations, we developed an AI-powered framework using routine 2D ultrasound to standardize ovarian-reserve evaluation in assisted reproductive technology (ART). This multicenter retrospective study analyzed 395 women with infertility from two affiliated hospitals. The cohort was divided into training (n = 210), internal-test (n = 91), and external-test (n = 94) cohorts. We established three prediction models: radiomics model, 674 IBSI-compliant features; deep-learning model, ResNet50-based feature extraction; fusion model, hybrid approach combining both modalities. Model performance was validated against the manual AFC and serum AMH levels. Sequential classification categorized patients into low, moderate, or high ovarian-response risk groups. Strong correlation and consistency existed between routine 2D ultrasound image AFCs and three-dimensional dynamic-scan AFCs. The deep learning–radiomics fusion model displayed superior AFC prediction (R²=0.743 internal/0.583 external), surpassing the performance of single-modality models (radiomics: 0.586/0.572; deep learning: 0.737/0.541). For AMH prediction, the fusion model maintained generalizability (external R²=0.509 vs. 0.420 radiomics and 0.352 deep learning, p < 0.05). In ovarian-response stratification, the fusion model achieved an AUC of 0.881 (95%CI: 0.828–0.925), which was 8.0% higher than that of individual models, with 69.1% sensitivity and 84.6% specificity for identifying high-risk patients requiring stimulation-protocol modifications. The developed AI framework enables standardized ovarian-reserve evaluation using routine 2D ultrasound, effectively bridging imaging limitations by synergizing radiomics and deep learning. 
Meanwhile, the model achieves clinical applicability by enabling personalized ovarian-stimulation protocol optimization, demonstrating particular value in resource-limited clinical environments without requiring advanced imaging infrastructure.

Introduction

Assessing ovarian reserve is crucial for determining female reproductive potential¹, formulating controlled ovarian stimulation (COS) protocols^2,3, predicting menopause risk⁴, and diagnosing premature ovarian insufficiency⁵. Common evaluation methods include biochemical tests and ultrasound exams, such as basal follicle-stimulating hormone, inhibin B, anti-Müllerian hormone (AMH), the clomiphene citrate challenge test, antral follicle count (AFC), ovarian volume, and ovarian blood supply^{6,7,8,9,10,11,12,13,14}. AMH and AFC are ideal for evaluating the ovarian reserve and predicting COS outcomes^15,16,17. They both have a high accuracy in the prediction of ovarian response category¹⁸. For predicting high and low response to ovarian stimulation, use of either AFC or AMH is recommended¹⁹.

Limitations of AMH include the absence of an international standard for concentration evaluation, high costs, and time-consuming procedures, necessitating more stable and automated detection methods^20,21. AFC is advantageous due to its non-invasive nature, ease of performance, and ability to separately evaluate each ovary¹⁵. Transvaginal ultrasound provides a comprehensive view of the structure and potential lesions of the uterus and bilateral adnexal areas²². However, manual follicle counting lacks standardization and is affected by inter-observer and inter-equipment variability^16,21,23.

Three-dimensional ultrasound and intelligent analysis software can shorten transvaginal ultrasound scanning duration, allow offline AFC data analysis, and enhance inter-observer reliability^24,25. However, these methods may inaccurately assess follicles averaging 2–10 mm in diameter²⁶. Additionally, the high cost of equipment and software limits widespread use^27,28. Poor image quality more adversely affects three-dimensional ultrasound than manual methods²⁴.

We hypothesized that artificial intelligence could analyze static ultrasound images of bilateral ovaries to predict the ovarian reserve. We combined radiomics for high-throughput image feature quantification²⁹ with deep convolutional neural networks for efficient feature learning³⁰, integrating these methods into a unified system^31,32.

Materials and methods

Patients and study design

Appropriate ethical standards were followed throughout the study. Owing to the retrospective nature of the study, the Ethics Committee of the Women and Children’s Hospital of Chongqing Medical University (IRB-2024021) waived the need to obtain informed consent. The study was carried out in accordance with the 1964 Declaration of Helsinki and the STROBE guidelines.

Data were retrospectively collected from 395 patients clinically diagnosed with infertility³³, who underwent ovarian reserve assessment including AFC and AMH testing, at the Women and Children’s Hospital of Chongqing Medical University between October 2016 and August 2023. As illustrated in Fig. 1, patients from Center 1 (Qixinggang branch) were divided into a training cohort (n = 210) and an internal-test cohort (n = 91) to develop and evaluate the model. To ensure model robustness and assess real-world generalizability across different clinical settings, patients from Center 2 (Ranjiaba branch), which operates independently with different equipment and sonographers, were used as the external-test cohort (n = 94).

Inclusion criteria: (1) premenopausal women aged ≥ 18; (2) individuals with a sexual history willing to undergo transvaginal ultrasound and AMH testing; and (3) regular menstrual cycles. Exclusion criteria: (1) ovarian disease, previous ovarian surgery, radiotherapy, or chemotherapy; (2) prior ovarian stimulation; (3) functional cysts or follicles > 10 mm in diameter; and (4) poor-quality ultrasound images.

Additionally, considering the correlation between different ovarian responses after COS and AFC^17,34, the participants were categorized based on their AFC values (≤ 5, 6–19, and ≥ 20), corresponding to poor, normal, and high ovarian responses, respectively^{1,15,35,36,37,38}. AFC prediction was conducted for each of these categories.
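The AFC thresholds above amount to a simple lookup. The following is a minimal sketch of that mapping; the function name `ovarian_response_category` is hypothetical and is not taken from the study's code.

```python
def ovarian_response_category(afc):
    """Map an antral follicle count to the expected ovarian-response group.

    Thresholds follow the text: <= 5 poor, 6-19 normal, >= 20 high.
    """
    if afc < 0:
        raise ValueError("AFC cannot be negative")
    if afc <= 5:
        return "poor"
    if afc <= 19:
        return "normal"
    return "high"


# Example: label a handful of counts.
labels = [ovarian_response_category(n) for n in (3, 5, 6, 19, 20, 31)]
```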

AFC (dynamic scanning)—AFC_3D

Transvaginal examinations were performed using ultrasound equipment, including the GE E8, GE S6, GE S8, GE E6 (GE Healthcare, USA), and the IE33, Q5 (Philips Company, USA). Antral follicles were counted manually after a comprehensive scan of both ovaries with the probe frequency set at 3–11 MHz, and the results were used as the AFC_3D. All doctors who conducted the examinations had over 5 years of experience in gynecological ultrasound.

AFC (static image)—AFC_2D

During the AFC_3D assessment, fan-type scanning was performed in the long axis view of each ovary, and images were frozen at the section displaying the highest number of antral follicles. Subsequently, duplex images of both ovaries were stored in a Picture Archiving and Communication System (Fig. 2). A doctor with 5 years of experience in gynecological ultrasound examinations (Reader 1) counted the antral follicles in the duplex images, resulting in the AFC_2D. Next, another doctor with 11 years of experience in gynecological ultrasound verified the counts (Reader 2). When discrepancies occurred, the counts from Reader 2 were considered the final decision.

AMH test

Venous blood samples (10 mL) were collected on the day of the transvaginal ultrasound. The samples were incubated at room temperature for 1 h and then centrifuged at 3,000 r/min for 10 min using a centrifugation radius of 12 cm. The supernatant was subsequently transferred to a clean centrifuge tube and stored at −20 °C for analysis using an enzyme-linked immunosorbent assay kit (Roche Diagnostics International Ltd, Rotkreuz, Switzerland).
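For readers who work in relative centrifugal force rather than rotor speed, the stated speed and radius can be converted with the standard approximation RCF ≈ 1.118 × 10⁻⁵ × r (cm) × rpm². The sketch below applies it; the function name is illustrative, and the resulting value is derived here rather than reported by the authors.

```python
def relative_centrifugal_force(rpm, radius_cm):
    """Approximate RCF (in multiples of g) from rotor speed and radius."""
    return 1.118e-5 * radius_cm * rpm ** 2


# Settings given in the text: 3,000 r/min with a 12 cm radius.
rcf = relative_centrifugal_force(rpm=3000, radius_cm=12)
```

With these inputs the formula gives roughly 1,207 × g.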

Image segmentation and pre-processing

Reader 2 segmented duplex ultrasound images of both ovaries using an open-source labeling tool (LabelMe; http://labelme.csail.mit.edu/Release3.0) to identify the regions of interest (ROIs). To enhance the visual contrast of these ROIs, a histogram equalization technique was applied to adjust the gray level distribution of the images. Subsequently, the pre-processed images were normalized and resized to 224 × 224 pixels for input into the deep-learning (DL) models. Specifically, the bicubic sampling method was employed for the images, while the nearest neighbor method was used for the masks, expanding the images into three channels to leverage the pretrained weights of ImageNet. The entire image-processing workflow was executed using Python (version 3.12.1) and OpenCV (version 4.9.0).
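The study ran this workflow in OpenCV. As an illustration of what the steps do, the sketch below re-implements histogram equalization, nearest-neighbor mask resizing, and channel replication in plain Python on nested lists. It is not the authors' pipeline: the function names are hypothetical, bicubic image resizing is omitted, and the equalization formula is the common CDF-based one, which is assumed rather than confirmed by the source.

```python
def equalize_histogram(image, levels=256):
    """Spread gray levels so the cumulative histogram becomes roughly linear."""
    flat = [p for row in image for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = min(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in image]


def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize, which keeps mask labels unblended."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_three_channels(image):
    """Repeat the gray channel three times for ImageNet-pretrained networks."""
    return [[[p, p, p] for p in row] for row in image]


toy_image = [[50, 50], [100, 200]]
equalized = equalize_histogram(toy_image)
toy_mask = resize_nearest([[0, 1], [1, 0]], 4, 4)
```

Nearest-neighbor interpolation is the usual choice for masks because it never produces intermediate label values at region boundaries.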

Development of machine learning-based radiomic model

Radiomic feature extraction

Radiomic features were extracted from each ovary’s ROI using Pyradiomics (version 3.0, https://pypi.org/project/pyradiomics/) per the Image Biomarker Standardization Initiative guidelines³⁹. Original images underwent wavelet and Laplacian of Gaussian filtering with kernel sizes of three and five, respectively, to extract first-order and textural features. Bin width parameters were set using the Freedman–Diaconis rule. The features of the two ovaries were averaged, and the mean values were standardized using z-score normalization to serve as the final radiomic features. Details are presented in Online Resources 1 and 2.
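Two of the numerical steps in this paragraph can be written out directly. The Freedman–Diaconis rule sets the bin width to 2 × IQR / n^(1/3), and z-score normalization subtracts the mean and divides by the standard deviation. The sketch below is illustrative only: the helper names are hypothetical, the quartile convention is Python's default, and the use of the population standard deviation is an assumption, since the source does not state which variant was used.

```python
import statistics


def freedman_diaconis_bin_width(values):
    """Bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)


def bilateral_mean(left, right):
    """Average each feature over the two ovaries of one patient."""
    return [(a + b) / 2 for a, b in zip(left, right)]


def z_score(column):
    """Standardize one feature across patients (population SD assumed)."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]


width = freedman_diaconis_bin_width(list(range(1, 9)))
averaged = bilateral_mean([1.0, 4.0], [3.0, 8.0])
standardized = z_score([1.0, 2.0, 3.0])
```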

Feature selection

The stability of features was evaluated using the intraclass correlation coefficient (ICC). Features with ICC ≥ 0.75 were considered highly stable and retained for further analysis. The selection of relevant radiomic features involved three steps: (1) Univariate linear regression was used for feature selection in the regression tasks (AFC_3D and AMH), while the Kruskal–Wallis test was applied for the three-class classification task to identify features with significant group differences; (2) redundant features with Pearson correlation coefficients > 0.90 were removed to eliminate collinearity; and (3) the most relevant radiomic features in the training cohort were identified using the least absolute shrinkage and selection operator (LASSO) method. The optimal parameter λ was determined through 10-fold cross-validation with 5,000 iterations. Coefficients for each feature were calculated based on the optimal λ, and features with non-zero coefficients were selected for model development.
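Step (2) can be sketched as a greedy filter that keeps a feature only if its absolute Pearson correlation with every feature already kept is at most 0.90. The source does not say which member of a correlated pair was discarded, so keeping the first one encountered is an assumption here, and the names `pearson` and `drop_redundant` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(features, threshold=0.90):
    """Return feature names with pairwise |r| <= threshold (first one wins)."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


toy_features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],  # nearly a multiple of "a"
    "c": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_redundant(toy_features)
```

The surviving features would then be passed to LASSO, which is not shown here.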

Model development

Multivariate linear regression models were created to quantitatively predict AFC_3D and AMH levels. For the three-class classification of AFC_3D, five classifiers were developed: support vector machine, random forest, logistic regression, decision tree, and XGBoost. Optimal parameters for each model were identified using five-fold cross-validation combined with a grid search strategy within the training cohort. The final model was chosen based on a comparative analysis of each classifier’s predictive performance in the internal-test cohort.
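The tuning procedure can be illustrated with a generic sketch: every parameter combination is scored on five validation folds, and the combination with the highest mean score is kept. Everything named here is hypothetical, including the `fit_and_score` callback that stands in for training a classifier, and the interleaved, unshuffled fold assignment; the study's actual grids and fold construction are not given in the text.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def grid_search(param_grid, n_samples, fit_and_score, k=5):
    """Return the parameter set with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            fit_and_score(params, train, val)
            for train, val in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy scorer that peaks at depth 3; a real one would train and evaluate a model.
def toy_scorer(params, train_idx, val_idx):
    return -abs(params["depth"] - 3)


best, score = grid_search({"depth": [1, 3, 5]}, n_samples=210, fit_and_score=toy_scorer)
```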

Development of DL and combined (rad-DL) models

Pre-processed bilateral ovarian ROI images were individually input into ResNet18. Deep features from the final convolutional layer were concatenated and processed through two fully connected layers with 256 hidden neurons each. Data augmentation techniques, including random horizontal and vertical flips, 0–30° random rotation, and gray-scale adjustments between 0.5 and 1.5, were applied to enhance model generalization. For quantitative prediction, the mean squared error (MSE) loss function, Adam optimizer, and a cosine-annealing learning-rate schedule with an initial learning rate of 1e–3 were used. The cross-entropy loss function was used for classification tasks. The entire process was executed using PyTorch (https://pytorch.org/) and an RTX 4090 GPU (Nvidia GeForce).
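The cosine-annealing schedule mentioned above has a simple closed form: the learning rate falls from its initial value to a minimum along half a cosine period. The sketch below evaluates that form. The number of epochs and the minimum learning rate are not stated in the text, so the values used here are placeholders.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, initial_lr=1e-3, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine-annealing cycle."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + cosine)


# total_epochs=100 is a placeholder, not a value reported by the study.
schedule = [cosine_annealing_lr(e, total_epochs=100) for e in range(101)]
```

The rate starts at 1e-3, passes through half that value at the midpoint, and reaches the minimum at the final epoch.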

Post ICC stability screening, radiomic features were combined with deep features and input into a two-layer fully connected network with 512 hidden neurons. We developed the Rad-DL model using the specified training strategy and parameters. For enhanced visualization, we applied the Grad-CAM⁴⁰ technique to the last convolutional layer, offering visual explanations of the model’s decisions. The study design is depicted in Fig. 3.

Model evaluation

Evaluation indicators for quantitative predictions included R², mean absolute error (MAE), MSE, median absolute error (MedAE), and explained variance (EV). The classification prediction models were evaluated using the area under the curve (AUC) of the receiver operating characteristic curve, as well as accuracy (ACC), confusion matrix, sensitivity, specificity, and F1 score to evaluate model discrimination. Additionally, the calibration curve was utilized to assess model calibration, and decision curve analysis (DCA) was performed to evaluate the clinical utility of the prediction models by calculating the net benefit at different threshold probabilities.
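The five regression indicators have standard definitions, sketched below in plain Python. The function name is hypothetical. The toy example uses predictions with a constant offset to show how R² and explained variance differ: explained variance ignores a systematic bias, whereas R² penalizes it.

```python
import statistics


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE, MedAE, and explained variance (EV)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": ss_res / n,
        "MedAE": statistics.median(abs(e) for e in errors),
        "EV": 1 - statistics.pvariance(errors) / statistics.pvariance(y_true),
    }


# Predictions that are always one follicle too high.
metrics = regression_metrics([2, 4, 6, 8], [3, 5, 7, 9])
```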

Statistical analysis

Continuous data are presented as mean ± standard deviation or median and interquartile range, based on the Shapiro–Wilk test for normality. Spearman correlation analysis assessed correlations between continuous variables. The Wilcoxon signed-rank test was used for single-sample tests. The Kruskal–Wallis test analyzed differences among three groups of sequential categorical variables, and the significance values were adjusted for multiple testing using Bonferroni correction. Bland–Altman plots verified the agreement between methods. Categorical variables are described as frequency and rate, and a two-sided test with P<0.05 indicated statistical significance.
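The Bland–Altman agreement check can be sketched as follows: the bias is the mean paired difference, and the 95% limits of agreement are the bias ± 1.96 standard deviations of the differences. The function name and the toy counts are illustrative and are not study data.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return bias, lower and upper 95% limits of agreement, and outlier count."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    outside = sum(1 for d in diffs if d < lower or d > upper)
    return bias, lower, upper, outside


# Toy paired counts (for example, dynamic-scan vs. static-image AFC).
bias, lower, upper, outside = bland_altman([10, 12, 14, 16], [8, 11, 12, 13])
```

The fraction of cases outside the limits is then `outside` divided by the number of pairs.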

Results

Study population characteristics

The baseline characteristics of the quantitative predictions are presented in Table 1. The AFC_3D was significantly greater than the AFC_2D in the training, internal-test, and external-test cohorts (P < 0.05). The baseline characteristics of the sequential three-class classification prediction are presented in Table 2. The numbers of patients with AFC_3D ≤ 5, 6 ≤ AFC_3D ≤ 19, and AFC_3D ≥ 20 in the training, internal-test, and external-test cohorts were 62/27/34, 96/41/45, and 52/23/15, respectively. Additionally, as AFC_3D increased, patient age decreased progressively (P < 0.001), while AFC_2D and AMH levels increased progressively (P < 0.001) across the three cohorts.

Table 1 Baseline characteristics for quantitative prediction in patients undergoing transvaginal ultrasound and AMH testing.


Table 2 Baseline characteristics for sequential classification prediction in patients undergoing transvaginal ultrasound and AMH testing.


Correlation and consistency analysis

As shown in Table 3, AFC_2D and AFC_3D were highly correlated in both branches (P < 0.001). Additionally, AMH levels were strongly correlated with AFC_3D (P < 0.001). The Bland–Altman plots (Fig. 4) indicated that only a few cases fell outside the 95% limit of agreement in the training, internal-test, and external-test cohorts. Specifically, this accounted for 4.3% (9 of the 210), 2.2% (2 of the 91), and 3.2% (3 of the 94), respectively. These findings demonstrate good agreement between the AFC_2D and AFC_3D.

Table 3 Correlation analysis of case data in two branches of the hospital.


Radiomic feature extraction and selection

Overall, 674 quantitative radiomic features were extracted from the ultrasound images. After removing features with poor reproducibility (ICCs < 0.75) and stability, 437 radiomic features were selected for ANOVA and Pearson correlation analysis. Ultimately, the LASSO method identified 10 radiomic features for constructing a quantitative prediction model for AFC_3D (Online Resource 3(a), (b), (c)), five radiomic features for the quantitative prediction model for AMH level (Online Resource 4(a), (b), (c)), and 10 radiomic features for the sequential three-class classification prediction of AFC_3D (Online Resource 5(a), (b), (c)).

Establishment and evaluation of quantitative prediction models

Multiple linear regression served as the radiomic model for quantitatively predicting AFC_3D and AMH levels. As presented in Table 4, in the training cohort, the R² values for the radiomic, DL, and Rad-DL models were 0.641, 0.993, and 0.976, respectively. In the internal-test and external-test cohorts, the R² values of the radiomic model were closely aligned (0.586 and 0.572), whereas that of the DL model declined markedly in the external-test cohort (0.737 vs. 0.541). The R² values of the Rad-DL model in the internal-test and external-test cohorts were 0.743 and 0.583, respectively.

Table 4 Performance of radiomic, deep learning, and combined Rad-DL models in quantitative prediction of AFC_3D.


In the quantitative prediction of the AMH levels, as presented in Table 5, in the training cohort, the R² values of the radiomic, DL, and Rad-DL models were 0.520, 0.950, and 0.962, respectively. In the internal-test and external-test cohorts, the R² values of the radiomic model were 0.323 and 0.420, respectively, while those of the DL model were 0.591 and 0.352, respectively, showing a significant decline in the external-test cohort. The R² values of the Rad-DL model in the internal-test and external-test cohorts were 0.550 and 0.509, respectively.

Table 5 Performance of radiomic, deep learning, and combined Rad-DL models in quantitative prediction of AMH.


Higher R² and EV values, along with lower MAE, MSE, and MedAE values, indicate better regression performance. As illustrated in Fig. 5, the radar charts based on each regression performance evaluation index indicate that the Rad-DL model exhibited the best overall performance in both prediction tasks.

Establishment and evaluation of prediction model performance for the sequential three-class classification

Based on the results of the training cohort’s five-fold cross-validation and grid search, five machine learning models were compared in the internal-test cohort. As presented in Table 6, random forest demonstrated the best overall predictive performance according to metrics including AUC, ACC, F1, sensitivity, and specificity. Therefore, a random forest was selected as the final model for internal-test and external-test tasks.

Table 6 Classification results on the internal test cohort of five machine learning classifiers for model selection.


The AUC values of the radiomic, DL, and Rad-DL models across the AFC_3D categories in the training cohort, as well as the internal-test and external-test cohorts, are presented in Table 7; Fig. 6 (A–C). Detailed performance evaluation of the sequential three-class classification prediction for AFC_3D is shown in Online Resource 6. According to the results of the DeLong test, the fusion of radiomic features into DL improved the predictive performance of the model in the internal-test and external-test cohorts (P = 0.15 and P = 0.08, respectively) (Table 7). The calibration curves of the three models showed good predictive performances in all cohorts, and the MAE values of the three models were 0.008, 0.035, and 0.033 for the external-test cohort, respectively (Fig. 6(D–F) and Online Resource 7(A–F)). Furthermore, the DCA results indicated that the fusion model achieved a greater net benefit across most ranges of threshold probabilities than the radiomic model, DL model, treat-all strategy, and treat-none strategy (Fig. 6(G–I)).
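The net benefit plotted in a decision curve follows a standard definition: at threshold probability p_t, net benefit = TP/n − (FP/n) × p_t / (1 − p_t). The sketch below computes it for a model and for the treat-all reference. The function names and toy data are hypothetical, and a binary one-vs-rest view of each class is assumed, since the source does not describe how DCA was adapted to three classes.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat everyone regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy labels (1 = abnormal response) and predicted risks, not study data.
labels = [1, 1, 0, 0]
risks = [0.9, 0.6, 0.4, 0.1]
model_nb = net_benefit(labels, risks, threshold=0.5)
treat_all_nb = net_benefit_treat_all(labels, threshold=0.5)
```

The treat-none strategy has a net benefit of zero at every threshold, so a model is useful at a given threshold only if its curve lies above both references.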

Table 7 Performance of radiomic, deep learning, and combined Rad-DL models in sequential classification prediction of AFC_3D.


Interpretability of the DL model

We used Grad-CAM to visualize which image regions drove the DL model’s predictions. Heatmaps and their overlays on the ultrasound images were generated for the external-test cohort (Fig. 7). Activation was strongest over the anechoic regions of the antral follicles, indicating that these regions contributed most to the DL model’s predictions.

Discussion

Radiomics and DL are extensively used in medical image recognition and analysis. Transvaginal ultrasound-guided manual counting of antral follicles is standard for assessing the ovarian reserve⁴¹. Advances in ultrasound and computerized analysis now enable more accurate evaluations by analyzing comprehensive bilateral ovarian images rather than just counting antral follicles. Studies on automating the measurement of antral follicle diameter, volume, and number using three-dimensional ultrasonographic DL^28,42 demonstrated reduced time and improved inter-observer reliability^24,25,43. Clinically, the focus is on assessing the ovarian reserve rather than exact follicle counts, making the use of expensive ultrasound equipment with three-dimensional automated volume imaging and counting²² impractical.

We selected duplex static images showing the most antral follicles along the long axes of both ovaries. Although static images displayed significantly fewer antral follicles than dynamic scans, there was a strong correlation and consistency between the two imaging methods. This indicates that ovarian reserve information from static images is closely related to that from dynamic scanning, suggesting that static images could substitute for dynamic scanning. This established the basis for subsequent intelligent image analysis.

The radiomics model was more robust than the DL model in quantitatively predicting AFC_3D across the internal-test and external-test cohorts. Notably, the radiomics model outperformed the DL model on the external-test cohort, a difference partly attributable to the lower resolution of external ultrasound images, which can adversely affect DL performance^44,45,46,47. This effect is especially pronounced for more complex tasks, such as AMH prediction, where instability in learned deep features can reduce generalization. To mitigate this and enhance robustness, we developed a hybrid Rad-DL approach that integrates handcrafted radiomics features with automatically learned deep features, enabling the model to leverage both stable, interpretable features and rich data-driven representations. As shown in our results, the Rad-DL model achieved the best overall performance across the three tasks: it performed comparably to radiomics in the simpler AFC regression task but substantially outperformed both radiomics and DL in the more challenging AMH regression, particularly under lower image-quality conditions.

In the quantitative AMH prediction, all three models underperformed relative to the AFC_3D prediction across all cohorts. This disparity is largely attributable to the imperfect correlation between AMH and AFC_3D; although both are established ovarian-reserve biomarkers, they do not always concur. Zhang et al. reported that approximately 20% of infertile patients exhibited discordant AFC and AMH levels in clinical practice⁴⁸. Moreover, in the AFC regression task, the three models demonstrated similar performance on the external-test cohort, likely because AFC is a morphological trait that is directly visible and measurable on ultrasound, making it a relatively straightforward learning problem for image-based models. In contrast, AMH lacks a direct visual correlate on ultrasound and must be inferred indirectly via morphological proxies such as AFC, which increases task complexity and reduces learnability from image data alone.

Regarding the models themselves, we implemented several strategies during development to mitigate overfitting, including early stopping, dropout, and cross-validation. Nevertheless, the high dimensionality of the data and the relatively limited sample size in the deep-learning tasks make overfitting difficult to eliminate. Moreover, quantitative regression problems, because of their sensitivity to noise and the need to predict continuous values, are more prone to overfitting than classification tasks. By contrast, for the sequential three-class classification task, overfitting was better controlled: the performance of the DL and Rad-DL models remained stable across both test cohorts. These results suggest that the overfitting-mitigation strategies were more effective for classification, and the observed generalization gap is within an acceptable range.

Accurate assessment of the ovarian reserve through AFC is crucial for identifying patients at high risk of poor or excessive ovarian response, enabling the selection of appropriate COS protocols^49,50,51,52. However, counting antral follicles accurately is challenging due to factors like operator experience²², examination equipment⁵³, and patient conditions (e.g., obesity, intestinal gas interference). Consequently, a sequential three-class classification prediction was conducted following the quantitative prediction of the AFC_3D. The three models showed strong discrimination in identifying high-risk patients with poor or excessive ovarian response but were less effective for those with a normal ovarian reserve. The primary clinical challenge is to prevent poor and excessive ovarian responses during COS. The proposed sequential three-class classification model offers a practical and efficient solution, particularly when AFC is challenging because of factors such as high-positioned ovaries, bowel gas interference, obesity, and the presence of pathological lesions that may obscure follicle visualization. By streamlining the assessment workflow, the model significantly reduces both the sonographer’s workload and the ultrasound acquisition time. The static images for analysis can be obtained from the most common and widely used ultrasound devices. The analysis does not require the examination equipment to have additional computational capabilities, as it can be performed centrally in the cloud. Since there is no need for rigid dynamic scanning to count all antral follicles for accurate screening of patients with different ovarian responses, we believe this approach is particularly well-suited for clinical settings where high-end ultrasound equipment is lacking, operator experience is limited, or procedural standardization is weak.

The ultimate goal of our research is to elucidate the correlation between ultrasound imaging features and fertility treatment outcomes, including precision-driven gonadotropin dosing strategies, the yield of mature (M2) oocytes, blastocyst formation rates, and, ultimately, clinical pregnancy rates. However, the focus of the current study was to validate the effectiveness of our new method compared with that of conventional AFC before accounting for confounding factors such as stimulation protocols, with the aim of demonstrating that this new method can be as effective as conventional AFC in predicting outcomes. Future studies should further explore the model’s applicability for long-term outcome prediction and its potential integration into personalized assisted reproductive technology protocols.

This study had some limitations. First, the image resolution of the external-test cohort was significantly compromised due to disparities in image acquisition cards between the two branches, which affected the model’s performance in quantitative prediction. Developing models that can be effectively applied to ultrasound images of varying quality is essential. Future research will use DL-based super-resolution reconstruction technology^54,55 to enhance image resolution and establish an improved cohort for external model testing. Second, operator subjectivity influenced the selection of static images with the most antral follicles along the long axis of both ovaries, though overall image classification was less subjective than manual methods. Third, the applicability of the model is limited because only cases with regular menstrual cycles were included, so it is not suitable for patients with polycystic ovary syndrome, and because cases with ovarian cysts or follicles larger than 10 mm in diameter were excluded. Lastly, the data comprised a moderate sample size from two branches of the same hospital. Expanding the sample size and increasing the number of participating centers is critical for enhancing prediction accuracy.

Conclusion

This study utilized radiomics and DL on static ultrasound images of the ovaries to predict AFC and AMH levels. The models can quantitatively estimate antral follicle counts and accurately identify patients with infertility who have abnormal ovarian reserves. This research lays a solid foundation for advancing beyond the conventional AFC method and exploring intelligent, non-invasive approaches for assessing the ovarian reserve.

Data availability

The datasets generated or analyzed during the study are available from the corresponding author on reasonable request.

References

Ferraretti, A. P. et al. ESHRE consensus on the definition of poor response to ovarian stimulation for in vitro fertilization: the Bologna criteria. Hum. Reprod. 26, 1616–1624. https://doi.org/10.1093/humrep/der092 (2011).
Tsakos, E., Tolikas, A., Daniilidis, A. & Asimakopoulos, B. Predictive value of anti-Mullerian hormone, follicle-stimulating hormone and antral follicle count on the outcome of ovarian stimulation in women following GnRH-antagonist protocol for IVF/ET. Arch. Gynecol. Obstet. 290, 1249–1253. https://doi.org/10.1007/s00404-014-3332-3 (2014).
Practice Committee of American Society for Reproductive Medicine. Ovarian hyperstimulation syndrome. Fertil. Steril. 90 (Suppl.), S188–S193. https://doi.org/10.1016/j.fertnstert.2008.08.034 (2008).
Depmann, M. et al. Can we predict age at natural menopause using ovarian reserve tests or mother’s age at menopause? A systematic literature review. Menopause 23, 224–232. https://doi.org/10.1097/GME.0000000000000509 (2016).
Ke, H. et al. Landscape of pathogenic mutations in premature ovarian insufficiency. Nat. Med. 29, 483–492. https://doi.org/10.1038/s41591-022-02194-3 (2023).
Navot, D., Rosenwaks, Z. & Margalioth, E. J. Prognostic assessment of female fecundity. Lancet 2, 645–647. https://doi.org/10.1016/s0140-6736(87)92439-1 (1987).
Fanchin, R. et al. Exogenous follicle stimulating hormone ovarian reserve test (EFORT): a simple and reliable screening test for detecting poor responders in in-vitro fertilization. Hum. Reprod. 9, 1607–1611. https://doi.org/10.1093/oxfordjournals.humrep.a138760 (1994).
Broekmans, F. J., Kwee, J., Hendriks, D. J., Mol, B. W. & Lambalk, C. B. A systematic review of tests predicting ovarian reserve and IVF outcome. Hum. Reprod. Update. 12, 685–718. https://doi.org/10.1093/humupd/dml034 (2006).
Article CAS PubMed Google Scholar
Hall, J. E., Welt, C. K. & Cramer, D. W. Inhibin A and inhibin B reflect ovarian function in assisted reproduction but are less useful at predicting outcome. Hum. Reprod. 14, 409–415. https://doi.org/10.1093/humrep/14.2.409 (1999).
Article CAS PubMed Google Scholar
La Marca, A. et al. Anti-Mullerian hormone (AMH) as a predictive marker in assisted reproductive technology (ART). Hum. Reprod. Update. 16, 113–130. https://doi.org/10.1093/humupd/dmp036 (2010).
Article CAS PubMed Google Scholar
Steiner, A. Z. et al. Association between biomarkers of ovarian reserve and infertility among older women of reproductive age. JAMA 318, 1367–1376. https://doi.org/10.1001/jama.2017.14588 (2017).
Article CAS PubMed PubMed Central Google Scholar
Leonhardt, H., Gull, B., Stener-Victorin, E. & Hellström, M. Ovarian volume and antral follicle count assessed by MRI and transvaginal ultrasonography: a methodological study. Acta Radiol. 55, 248–256. https://doi.org/10.1177/0284185113495835 (2014).
Article PubMed Google Scholar
Kelsey, T. W. et al. Ovarian volume throughout life: a validated normative model. PLOS One. 8, e71465. https://doi.org/10.1371/journal.pone.0071465 (2013).
Article CAS PubMed PubMed Central ADS Google Scholar
Deb, S., Kannamannadiar, J., Campbell, B. K., Clewes, J. S. & Raine-Fenning, N. J. The interovarian variation in three-dimensional ultrasound markers of ovarian reserve in women undergoing baseline investigation for subfertility. Fertil. Steril. 95, 667–672. https://doi.org/10.1016/j.fertnstert.2010.09.031 (2011).
Article PubMed Google Scholar
Lima, M. L. et al. Assessment of ovarian reserve by antral follicle count in ovaries with endometrioma. Ultrasound Obstet. Gynecol. 46, 239–242. https://doi.org/10.1002/uog.14763 (2015).
Article CAS PubMed Google Scholar
Fleming, R., Seifer, D. B., Frattarelli, J. L. & Ruman, J. Assessing ovarian response: antral follicle count versus anti-Müllerian hormone. Reprod. Biomed. Online. 31, 486–496. https://doi.org/10.1016/j.rbmo.2015.06.015 (2015).
Article CAS PubMed Google Scholar
Farquhar, C. et al. Management of ovarian stimulation for IVF: narrative review of evidence provided for world health organization guidance. Reprod. Biomed. Online. 35, 3–16. https://doi.org/10.1016/j.rbmo.2017.03.024 (2017).
Article PubMed Google Scholar
Liu, Y., Pan, Z., Wu, Y., Song, J. & Chen, J. Comparison of anti-Müllerian hormone and antral follicle count in the prediction of ovarian response: a systematic review and meta-analysis. J. Ovarian Res. 16, 117. https://doi.org/10.1186/s13048-023-01202-5 (2023).
Article PubMed PubMed Central Google Scholar
The ESHRE Guideline Group on Ovarian Stimulation et al. Ovarian Stimulation for IVF/ICSI^†. European Society of Human Reproduction and Embryology updated 2025. (2020).
Dewailly, D. et al. The physiology and clinical utility of anti-Mullerian hormone in women. Hum. Reprod. Update. 20, 370–385. https://doi.org/10.1093/humupd/dmt062 (2014).
Article PubMed Google Scholar
Iliodromiti, S., Anderson, R. A. & Nelson, S. M. Technical and performance characteristics of anti-Müllerian hormone and antral follicle count as biomarkers of ovarian response. Hum. Reprod. Update. 21, 698–710. https://doi.org/10.1093/humupd/dmu062 (2015).
Article CAS PubMed Google Scholar
Coelho Neto, M. A. et al. Counting ovarian antral follicles by ultrasound: a practical guide. Ultrasound Obstet. Gynecol. 51, 10–20. https://doi.org/10.1002/uog.18945 (2018).
Article CAS PubMed Google Scholar
Practice Committee of the American Society for Reproductive Medicine. Testing and interpreting measures of ovarian reserve: a committee opinion. Fertil. Steril. 103, e9–e17. https://doi.org/10.1016/j.fertnstert.2014.12.093 (2015).
Article Google Scholar
Jayaprakasan, K., Walker, K. F., Clewes, J. S., Johnson, I. R. & Raine-Fenning, N. J. The interobserver reliability of off-line antral follicle counts made from stored three-dimensional ultrasound data: a comparative study of different measurement techniques. Ultrasound Obstet. Gynecol. 29, 335–341. https://doi.org/10.1002/uog.3913 (2007).
Article CAS PubMed Google Scholar
Jayaprakasan, K. et al. Does 3D ultrasound offer any advantage in the pretreatment assessment of ovarian reserve and prediction of outcome after assisted reproduction treatment? Hum. Reprod. 22, 1932–1941. https://doi.org/10.1093/humrep/dem104 (2007).
Article CAS PubMed Google Scholar
Martins, W. P. & Jokubkiene, L. Assessment of the functional ovarian reserve. In: (eds Guerriero, S., Martins, W. P. & Alcazar, J. L.) Managing Ultrasonography in Human Reproduction: A Practical Handbook. Springer International Publishing; :3–12. https://doi.org/10.1007/978-3-319-41037-1_1 (2017).
Rodriguez, A. et al. Learning curves in 3-dimensional sonographic follicle monitoring during controlled ovarian stimulation. J. Ultrasound Med. 33, 649–655. https://doi.org/10.7863/ultra.33.4.649 (2014).
Article PubMed Google Scholar
Rodríguez-Fuentes, A. et al. Volume-based follicular output rate improves prediction of the number of mature oocytes: a prospective comparative study. Fertil. Steril. 118, 885–892. https://doi.org/10.1016/j.fertnstert.2022.07.017 (2022).
Article PubMed Google Scholar
Gillies, R. J., Kinahan, P. E. & Hricak, H. Radiomics: images are more than pictures, they are data. Radiology 278, 563–577. https://doi.org/10.1148/radiol.2015151169 (2016).
Article PubMed Google Scholar
Alzubaidi, L. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data. 8, 53. https://doi.org/10.1186/s40537-021-00444-8 (2021).
Article PubMed PubMed Central Google Scholar
Song, D. et al. Using deep learning to predict microvascular invasion in hepatocellular carcinoma based on dynamic contrast-enhanced MRI combined with clinical parameters. J. Cancer Res. Clin. Oncol. 147, 3757–3767. https://doi.org/10.1007/s00432-021-03617-3 (2021).
Article CAS PubMed PubMed Central Google Scholar
Huang, B. et al. Deep semantic segmentation feature-based radiomics for the classification tasks in medical image analysis. IEEE J. Biomed. Health Inf. 25, 2655–2664. https://doi.org/10.1109/JBHI.2020.3043236 (2021).
Article Google Scholar
Practice Committee of the American Society for Reproductive Medicine. Electronic address: asrm@asrm.org, practice committee of the American society for reproductive Medicine. Fertility evaluation of infertile women: a committee opinion. Fertil. Steril. 116, 1255–1265. https://doi.org/10.1016/j.fertnstert.2021.08.038 (2021).
Article Google Scholar
Ng, E. H., Tang, O. S. & Ho, P. C. The significance of the number of antral follicles prior to stimulation in predicting ovarian responses in an IVF programme. Hum. Reprod. 15, 1937–1942. https://doi.org/10.1093/humrep/15.9.1937 (2000).
Article CAS PubMed Google Scholar
Hendriks, D. J., Mol, B-W-J., Bancsi, L. F. J. M. M., Te Velde, E. R. & Broekmans, F. J. M. Antral follicle count in the prediction of poor ovarian response and pregnancy after in vitro fertilization: a meta-analysis and comparison with basal follicle-stimulating hormone level. Fertil. Steril. 83, 291–301. https://doi.org/10.1016/j.fertnstert.2004.10.011 (2005).
Article PubMed Google Scholar
Polyzos, N. P., Tournaye, H., Guzman, L., Camus, M. & Nelson, S. M. Predictors of ovarian response in women treated with corifollitropin Alfa for in vitro fertilization/intracytoplasmic sperm injection. Fertil. Steril. 100, 430–437. https://doi.org/10.1016/j.fertnstert.2013.04.029 (2013).
Article CAS PubMed Google Scholar
Jayaprakasan, K. et al. Can quantitative three-dimensional power doppler angiography be used to predict ovarian hyperstimulation syndrome? Ultrasound Obstet. Gynecol. 33, 583–591. https://doi.org/10.1002/uog.6373 (2009).
Article CAS PubMed Google Scholar
Kollmann, M., Martins, W. P. & Raine-Fenning, N. Examining the ovaries by ultrasound for diagnosing hyperandrogenic anovulation: updating the threshold for newer machines. Fertil. Steril. 101, e25. https://doi.org/10.1016/j.fertnstert.2014.01.012 (2014).
Article PubMed Google Scholar
Zwanenburg, A. et al. The image biomarker standardization initiative: standardized quantitative radiomics for high-throughput image-based phenotyping. Radiology 295, 328–338. https://doi.org/10.1148/radiol.2020191145 (2020).
Article PubMed Google Scholar
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. IEEE Int. Conf. Comput. Vis. 618–626. https://doi.org/10.1109/ICCV.2017.74 (2017).
Chinè, A. et al. Low ovarian reserve and risk of miscarriage in pregnancies derived from assisted reproductive technology. Hum. Reprod. Open. 2023, hoad026. https://doi.org/10.1093/hropen/hoad026 (2023).
Article PubMed PubMed Central Google Scholar
Rodrigues, A. R. O. et al. Comparing two- and three-dimensional antral follicle count in patients with endometriosis. J. Med. Ultrasound. 30, 282–286. https://doi.org/10.4103/jmu.jmu_204_21 (2022).
Article PubMed PubMed Central Google Scholar
Scheffer, G. J. et al. Quantitative transvaginal two- and three-dimensional sonography of the ovaries: reproducibility of antral follicle counts. Ultrasound Obstet. Gynecol. 20, 270–275. https://doi.org/10.1046/j.1469-0705.2002.00787.x (2002).
Article CAS PubMed Google Scholar
Javed, H., El-Sappagh, S. & Abuhmed, T. Robustness in deep learning models for medical diagnostics: security and adversarial challenges towards robust AI applications. Artif. Intell. Rev. 58 (1), 12. https://doi.org/10.1007/s10462-024-11005-9 (2024).
Article Google Scholar
Sabottke, C. F. & Spieler, B. M. The effect of image resolution on deep learning in radiography. Radiol. Artif. Intell. 2, e190015. https://doi.org/10.1148/ryai.2019190015 (2020).
Article PubMed PubMed Central Google Scholar
Tang, S. et al. The effect of image resolution on convolutional neural networks in breast ultrasound. Heliyon 9, e19253. https://doi.org/10.1016/j.heliyon.2023.e19253 (2023).
Article PubMed PubMed Central Google Scholar
Thambawita, V. et al. Impact of image resolution on deep learning performance in endoscopy image classification: an experimental study using a large dataset of endoscopic images. Diagnostics (Basel). 11, 2183. https://doi.org/10.3390/diagnostics11122183 (2021).
Article PubMed Google Scholar
Zhang, Y. et al. Discordance between antral follicle counts and anti-Müllerian hormone levels in women undergoing in vitro fertilization. Reprod. Biol. Endocrinol. 17, 51. https://doi.org/10.1186/s12958-019-0497-4 (2019).
Article PubMed PubMed Central Google Scholar
Mathur, P., Kakwani, K., Diplav, K. S., Kudavelly, S. & Ga, R. Deep learning based quantification of ovary and follicles using 3D transvaginal ultrasound in assisted reproduction. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2020, 2109–2112. https://doi.org/10.1109/EMBC44109.2020.9176703 (2020).
Article PubMed Google Scholar
Broer, S. L. et al. Added value of ovarian reserve testing on patient characteristics in the prediction of ovarian response and ongoing pregnancy: an individual patient data approach. Hum. Reprod. Update. 19, 26–36. https://doi.org/10.1093/humupd/dms041 (2013).
Article PubMed Google Scholar
Racca, A., Drakopoulos, P., Neves, A. R. & Polyzos, N. P. Current therapeutic options for controlled ovarian stimulation in assisted reproductive technology. Drugs 80, 973–994. https://doi.org/10.1007/s40265-020-01324-w (2020).
Article PubMed Google Scholar
Chen, M-X. et al. An individualized recommendation for controlled ovary stimulation protocol in women who received the GnRH agonist long-acting protocol or the GnRH antagonist protocol: a retrospective cohort study. Front. Endocrinol. (Lausanne). 13, 899000. https://doi.org/10.3389/fendo.2022.899000 (2022).
Article PubMed Google Scholar
Jayaprakasan, K., Campbell, B., Hopkisson, J., Johnson, I. & Raine-Fenning, N. A prospective, comparative analysis of anti-Müllerian hormone, inhibin-B, and three-dimensional ultrasound determinants of ovarian reserve in the prediction of poor response to controlled ovarian stimulation. Fertil. Steril. 93, 855–864. https://doi.org/10.1016/j.fertnstert.2008.10.042 (2010).
Article PubMed Google Scholar
Hou, M., Zhou, L. & Sun, J. Deep-learning-based 3D super-resolution MRI radiomics model: superior predictive performance in preoperative T-staging of rectal cancer. Eur. Radiol. 33, 1–10. https://doi.org/10.1007/s00330-022-08952-8 (2023).
Article PubMed Google Scholar
Cammarasana, S., Nicolardi, P. & Patanè, G. Super-resolution of 2D ultrasound images and videos. Med. Biol. Eng. Comput. 61, 2511–2526. https://doi.org/10.1007/s11517-023-02818-x (2023).
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The authors thank all participants for their support of this study.

Author information

Jinwei Zhang and Shangqing Liu contributed equally to this work.

Authors and Affiliations

Chongqing Research Center for Prevention & Control of Maternal and Child Diseases and Public Health, Department of Ultrasound, Women and Children’s Hospital of Chongqing Medical University, No. 120 Longshan Road, Yubei District, Chongqing, 401147, China
Jinwei Zhang, Shuang Liu, Yue Rong, Chunyan Zhong & Suzhen Ran
College of Artificial Intelligence Medicine, Chongqing Medical University, Chongqing, China
Shangqing Liu
National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University, No. 3688, Nanhai Road, Nanshan District, Shenzhen, 518037, Guangdong, China
Dong Ni
Medical UltraSound Image Computing (MUSIC) Lab, Shenzhen University, No. 3688, Nanhai Road, Nanshan District, Shenzhen, 518037, Guangdong, China
Dong Ni

Contributions

Conceptualization: Jinwei Zhang, Shangqing Liu, Dong Ni; Data curation: Jinwei Zhang, Shangqing Liu, Suzhen Ran; Formal analysis: Jinwei Zhang, Shangqing Liu; Funding acquisition: Jinwei Zhang; Investigation: Shuang Liu, Yue Rong, Chunyan Zhong; Methodology: Jinwei Zhang, Shangqing Liu; Project administration: Suzhen Ran; Resources: Jinwei Zhang, Suzhen Ran; Software: Shangqing Liu, Dong Ni; Supervision: Suzhen Ran, Dong Ni; Validation: Shuang Liu, Yue Rong; Visualization: Shangqing Liu; Writing (original draft): Jinwei Zhang, Shangqing Liu; Writing (review and editing): all authors.

Corresponding authors

Correspondence to Dong Ni or Suzhen Ran.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval

Appropriate ethical standards were followed throughout the entire study. The Ethics Committee of the Women and Children’s Hospital of Chongqing Medical University approved this retrospective study (IRB-2024021). The requirement for informed consent was waived.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhang, J., Liu, S., Liu, S. et al. Ultrasound radiomics and deep learning for predicting antral follicle count and anti-Müllerian hormone. Sci Rep 16, 3115 (2026). https://doi.org/10.1038/s41598-025-33010-w

Received: 09 June 2025
Accepted: 15 December 2025
Published: 20 December 2025
Version of record: 23 January 2026
DOI: https://doi.org/10.1038/s41598-025-33010-w
