A machine learning approach for non-invasive PCOS diagnosis from ultrasound and clinical features

Agirsoy, Mehtap; Oehlschlaeger, Matthew A.

doi:10.1038/s41598-025-10453-9

Article
Open access
Published: 29 September 2025

Mehtap Agirsoy¹ &
Matthew A. Oehlschlaeger¹

Scientific Reports volume 15, Article number: 33638 (2025)

Abstract

This study investigates the use of machine learning (ML) algorithms to support faster and more accurate diagnosis of polycystic ovary syndrome (PCOS), with a focus on both predictive performance and clinical applicability. Multiple algorithms were evaluated, including Artificial Neural Networks (ANN), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGBoost). XGBoost consistently outperformed the other models and was selected for final development and validation. To align with the Rotterdam criteria, the dataset was structured into three feature categories: clinical, biochemical, and ultrasound (USG) data. The study explored various combinations of these feature subsets to identify the most efficient diagnostic pathways. Feature selection using the chi-square-based SelectKBest method revealed the top 10 predictive features, which were further validated through XGBoost’s internal feature importance, SHAP analysis, and expert clinical assessment. The final XGBoost model demonstrated robust performance across multiple feature combinations:
• Clinical + USG + AMH: AUC = 0.9947, Precision = 0.9553, F1 Score = 0.9553, Accuracy = 0.9553.
• Clinical + USG: AUC = 0.9852, Precision = 0.9583, F1 Score = 0.9388, Accuracy = 0.9384.
The most influential features included follicle count on both ovaries, weight gain, Anti-Müllerian Hormone (AMH) levels, hair growth, menstrual irregularity, fast food consumption, pimples, and hair loss. External validation was performed using a publicly available dataset containing 320 instances and 18 diagnostic features. The XGBoost model trained on the top-ranked features achieved perfect performance on the test set (AUC = 1.0, Precision = 1.0, F1 Score = 1.0, Accuracy = 1.0), though further validation is necessary to rule out overfitting or data leakage.
These findings suggest that combining clinical and ultrasound features enables highly accurate, non-invasive, and cost-effective PCOS diagnosis. This study demonstrates the potential of ML-driven tools to streamline clinical workflows, reduce reliance on invasive diagnostics, and support early intervention in women’s health.

Introduction

Polycystic ovary syndrome (PCOS) is an umbrella term for a complex hormonal and metabolic disorder that primarily affects women of reproductive age. PCOS affects an estimated 10-13% of women worldwide¹. Despite being a significant health condition, it remains understudied, often undiagnosed, and poorly understood due to the heterogeneity of its symptoms^2,3. Difficulty losing weight, irregular menstrual cycles, infertility, hormone imbalance, hirsutism, insulin resistance, depression, acne, cysts on ovaries, and type-2 diabetes are some of the key clinical features of PCOS^4,5.

In 2023, the Centre for Research Excellence in Women’s Health in Reproductive Life (CRE WHiRL) refined and updated the 2018 International Evidence-Based Guideline criteria for the diagnosis of PCOS⁶. This update builds upon the consensus-driven 2003 Rotterdam criteria. Diagnosis requires the presence of at least two of the following three criteria: (i) clinical hyperandrogenism, such as hirsutism, and/or biochemical hyperandrogenism measured through total or free testosterone levels; (ii) ovulatory dysfunction, including irregular menstrual cycles and/or anovulation; and (iii) polycystic ovaries observed via ultrasound or elevated anti-Müllerian hormone (AMH) levels^1,6. Because these criteria span clinical, biochemical, and imaging assessments, diagnosing PCOS remains a challenge for clinicians, and traditional diagnostic methods are often complex, time-consuming, and costly.
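The two-of-three rule described above can be written as a short predicate. The following is a minimal sketch in Python; the function and argument names are hypothetical, and each argument is assumed to be a clinician's yes/no judgment of the corresponding criterion rather than a raw measurement.

```python
def meets_two_of_three(hyperandrogenism: bool,
                       ovulatory_dysfunction: bool,
                       pcom_or_elevated_amh: bool) -> bool:
    """Return True when at least two of the three criteria are present."""
    # Booleans sum as 0/1, so this counts how many criteria are met.
    criteria = [hyperandrogenism, ovulatory_dysfunction, pcom_or_elevated_amh]
    return sum(criteria) >= 2


# Hypothetical patient: hirsutism and irregular cycles, normal ultrasound.
example = meets_two_of_three(True, True, False)
```

The sketch deliberately leaves out how each criterion is established, since that step is where most of the clinical effort lies.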

There remain substantial unmet needs in the diagnosis and treatment of PCOS, alongside notable gaps in the knowledge of healthcare professionals⁷. Gibson-Helm et al. report that obtaining a PCOS diagnosis often takes several months to years and requires consultations with multiple healthcare professionals due to its heterogeneous nature⁴. They also found that most women were dissatisfied with their diagnostic experience and the amount of information they received^4,8. Early diagnosis of PCOS is not only vital for improving women’s health outcomes but also has significant economic benefits⁹. Timely identification allows for effective management of the condition, helping to slow the progression of comorbidities and reduce long-term healthcare costs associated with complications like obesity, diabetes, and cardiovascular disease^4,10. Since quality of life is closely tied to the clinical features of PCOS, early intervention improves health outcomes while also reducing the financial burden on both patients and healthcare systems^8,9. Additionally, engaging women early in lifestyle management can prevent costly treatments for metabolic complications, making early diagnosis critical from both a health and economic perspective¹¹.

Machine learning (ML) presents a promising solution to many of the challenges in diagnosing and managing PCOS. By harnessing advanced data analysis, ML enables healthcare professionals to detect patterns, predict outcomes, and personalize treatment plans with greater precision^12,13. This technology facilitates early diagnosis, leading to timely interventions that can significantly improve patient outcomes and quality of life¹⁴. Additionally, ML empowers patients by providing deeper insights into their condition, promoting better self-management, and addressing existing gaps in knowledge and care while reducing the contribution of human error^13,14. Integrating ML into clinical practice not only enhances the accuracy and efficiency of PCOS diagnosis and treatment but also offers a cost-effective and time-efficient approach¹⁴.

Literature review

Zad et al. developed an ML model to predict PCOS risk in an outpatient population, aiming to facilitate earlier diagnosis¹⁵. Leveraging electronic health records from 30,601 women, their model successfully predicted PCOS prior to clinical diagnosis, achieving an AUC of 85% using gradient boosting¹⁵. The study evaluated several algorithms, including Random Forests (RF), Gradient-Boosted Trees (GBT), Support Vector Machines (SVM), and Logistic Regression (LR), with AUCs ranging from 77.4 to 85%, where GBT outperformed the others¹⁵. They concluded that elevated levels of luteinizing hormone (LH), low levels of follicle-stimulating hormone (FSH), obesity, and higher body mass index (BMI) significantly increase the likelihood of PCOS¹⁵.

A recent study by Panjwani et al. sought to predict the presence of PCOS using primary symptoms and general health indicators associated with lifestyle diseases¹⁶. The researchers combined data from cardiovascular disease (CVD) and PCOS datasets to form a new, integrated dataset¹⁶. Through feature extraction and refinement, they identified 12 critical attributes indicative of hormonal imbalances in women¹⁶. Unlike studies relying on complex hormonal testing, this approach focused on early-stage symptoms for PCOS prediction¹⁶. An ensemble learning model was developed by stacking seven base classifiers, with a deep learning model serving as the meta-classifier¹⁶. The WaO meta-heuristic algorithm was utilized to optimize the hyperparameters of the ensemble model¹⁶. The proposed WaOEL model achieved superior performance, reporting an accuracy of 92.8% and an AUC of 0.93, outperforming the RSOEL and CSOEL models¹⁶.

In another study by Chen et al.¹⁷, RNA-seq analysis was performed on granulosa cells, including 13 samples from healthy controls and 25 samples from women with PCOS¹⁷. The dataset was further enriched by integrating publicly available transcriptomic data. Two ML algorithms, SVM and XGBoost, were employed to identify and validate potential diagnostic biomarkers, as well as to investigate immune cell infiltration patterns associated with PCOS¹⁷. The validation results demonstrated promising diagnostic performance, with AUC values of 0.79 for SVM and 0.87 for XGBoost, suggesting their potential utility as diagnostic tools for PCOS¹⁷. In an additional study conducted in China by Chen et al.¹⁸, general clinical information, including gender, age, BMI, and hormone levels, was collected from 152 PCOS patients and 50 non-PCOS controls¹⁸. The objective was to identify serum biomarkers in PCOS patients using untargeted lipidomics combined with ensemble ML techniques¹⁸. Their workflow involved data preprocessing, statistical pre-screening, secondary screening via ensemble learning, biomarker validation, and the construction of a diagnostic panel¹⁸. The diagnostic panel achieved an AUC of 0.815 on the test set, with an accuracy of 0.74, specificity of 0.88, and sensitivity of 0.7¹⁹.

Ahmetasevic et al. used a larger dataset of 1,000 instances, containing key clinical parameters such as oligo-ovulation, anovulation, free testosterone, free androgen index (FAI), calculated bioavailable testosterone, androstenedione, dehydroepiandrosterone, ovarian volume, follicle count, and obesity to classify PCOS cases using an ANN model¹⁹. The developed system demonstrated promising results, achieving an accuracy of 96.1%, sensitivity of 96.8%, and specificity of 90%, underscoring the potential of ANN-based models in PCOS classification¹⁹. The system identified a total of 157 positive cases and 23 negative cases, which further supports its robust performance in differentiating PCOS and non-PCOS subjects¹⁹.

Lim et al. enrolled a total of 450 women with irregular menstrual cycles in their study, including 294 patients diagnosed with PCOS and 156 without PCOS²⁰. The aim of their study was to develop and validate an effective predictive model for PCOS and to investigate the correlation between clinical features in women with irregular menstruation and those with PCOS, utilizing pulse wave parameters and Traditional Chinese Medicine (TCM) clinical indices²⁰. Through Recursive Feature Elimination with Cross-Validation (RFECV), 31 key features were selected, comprising 12 pulse parameters and 19 TCM clinical indices²⁰. Models built using the combined pulse and TCM clinical data outperformed those using either dataset alone²⁰. Among the algorithms tested, the SVM model delivered the best predictive performance for PCOS, achieving an accuracy of 0.837, an AUC of 0.878, and an F1 score of 0.878²¹.

Several studies have utilized the same Kaggle dataset comprising 541 patient records to develop ML models for PCOS prediction. Elmannani et al. proposed a stacking ensemble model for early PCOS detection, achieving an AUC of 94%²². Similarly, Tiwari et al. applied an RF classifier to the same dataset and reported an accuracy of 92%²³. Rahman et al. developed a web-based ML framework using the Kaggle dataset and evaluated 13 classifiers, including AdaBoost (AB), RF, Decision Tree (DT), and LR. Both AB and RF achieved the highest accuracy of 94%, emphasizing the advantage of ensemble methods in PCOS prediction²³. Khanna et al. extended this approach by evaluating 15 classifiers and reported that their multi-stack ensemble model achieved the highest accuracy of 98%²⁵. Abu Abda et al. also utilized the Kaggle dataset for PCOS classification and found that the Linear SVM algorithm outperformed other models, achieving an accuracy of 91.6%²⁶. Ahmad et al. focused on leveraging deep learning for automated feature engineering using the same Kaggle dataset²⁶. They proposed three lightweight models based on LSTM, CNN, and a hybrid CNN-LSTM architecture. To address class imbalance, the SMOTE technique was applied during training²⁶. Their models reported accuracies of 92.04%, 96.59%, and 94.31%, with corresponding ROC-AUC values of 92.0%, 96.6%, and 94.3%, respectively²⁶.

Motivation

Existing studies have made significant strides in developing accurate ML models for PCOS diagnosis, underscoring the importance of improving diagnostic precision. However, the traditional diagnostic criteria remain complex and difficult to implement efficiently in clinical practice, leading to delays in timely interventions and increasing the physiological burden on patients. These criteria encompass a wide range of clinical, biomarker, and ultrasound data, making them time-consuming and expensive to apply in routine practice. There remains a gap in understanding how to simplify and streamline these diagnostic processes. Thus, it is worthwhile to explore the potential of ML models not only to enhance diagnostic accuracy but also to simplify the process.

This study aims to investigate the feasibility of streamlining the Rotterdam criteria by identifying the most critical diagnostic features that contribute to an accurate PCOS diagnosis. By structuring the dataset according to the Rotterdam framework and testing various combinations of clinical, biomarker, and ultrasound data, this study seeks to develop a faster, more accurate, and cost-effective diagnostic method, ultimately facilitating earlier detection and improving patient outcomes.

Although several existing studies report high accuracies in PCOS diagnosis using ML, many lack external validation, raising concerns about their generalizability. Reported performance metrics are often derived from the training set or a single test set without independent validation, potentially leading to overestimated accuracy. The absence of rigorous external validation limits the reliability of these models when applied to diverse populations or real-world clinical settings. Additionally, most studies do not adequately investigate the diagnostic significance of individual subsets within established criteria. The Rotterdam criteria, for instance, define PCOS based on the presence of two out of three conditions (oligo/anovulation, hyperandrogenism, and polycystic ovaries), yet many ML models treat the dataset without analyzing the relative contribution of these subsets. This oversight limits insights into which combinations of features are most predictive and how diagnostic efficiency can be improved.

Materials and methods

Dataset description

This study utilized a retrospective cohort design, drawing data from ten hospitals in India. The dataset, originally compiled by Kottarathil et al., includes clinical records from a total of 541 women of reproductive age who were evaluated for suspected PCOS, comprising 356 non-PCOS and 185 PCOS cases²⁷. The dataset encompasses 41 distinct features across clinical, biochemical, and ultrasonographic domains, providing a comprehensive multidimensional framework for model development. Detailed descriptions of the dataset for each category are provided in Supplementary Table 1.

The clinical variables consist of 26 features, including key demographic parameters such as patient age and BMI, as well as menstrual cycle characteristics (e.g., cycle length) and hyperandrogenic phenotypes (e.g., hirsutism). The biochemical profile includes 11 features, such as serum concentrations of crucial reproductive and metabolic hormones, including total testosterone, free testosterone, LH, FSH, and AMH, all of which play a vital role in the diagnostic assessment of PCOS. Ultrasonographic data comprises 5 features, derived from transabdominal pelvic ultrasound examinations, including measurements related to polycystic ovarian morphology (PCOM), specifically antral follicle count per ovary. These imaging markers align with the diagnostic thresholds established by the Rotterdam criteria.

Data preprocessing

Prior to the implementation of ML algorithms, rigorous data preprocessing was conducted to enhance data quality and ensure model compatibility:

Missing data imputation: The dataset exhibited a singular instance of missing data, which was addressed via median imputation. The median value for the respective feature was computed and substituted to preserve data distribution and minimize potential bias.
Feature scaling and normalization: All continuous numerical variables were standardized using z-score normalization. This transformation involved centering each feature to a mean of zero and scaling to a unit variance (standard deviation of one). Standardization was applied to mitigate the influence of disparate feature scales and to optimize algorithmic convergence, particularly for distance-based or gradient-sensitive models such as KNN, SVM, LR, and ANN; tree-based models such as XGBoost are largely insensitive to monotonic feature scaling.
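The two preprocessing steps above can be illustrated with a short standard-library sketch. The helper names and the toy column are hypothetical, and the sketch uses the population standard deviation, which is also the convention of scikit-learn's `StandardScaler`. It is meant as an illustration of the transformations, not as the code used in the study.

```python
from statistics import fmean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column with a single missing entry.
raw = [2.0, 4.0, None, 6.0, 8.0]
filled = impute_median(raw)   # the gap is filled with the median of 2, 4, 6, 8
scaled = zscore(filled)       # result has mean 0 and standard deviation 1
```

In a train/test workflow, the median, mean, and standard deviation would normally be estimated on the training portion only and then reused on held-out data.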

Model development

During the model development phase, a systematic evaluation of multiple supervised ML algorithms was conducted to identify the most effective classifier for the accurate diagnosis of PCOS. A comparative analysis was performed across a range of widely adopted classification models, including Artificial Neural Networks (ANN), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGBoost). The effectiveness of these predictive models has been confirmed in prior PCOS research, as outlined in the literature review. The flow configuration of the study is illustrated in Fig. 1.

ANNs are inspired by the structure of the human brain, consisting of multiple layers that learn complex, nonlinear patterns, though they require careful hyperparameter tuning and may overfit on small datasets²⁸. SVMs construct optimal hyperplanes to separate classes, performing well in both linear and nonlinear classification, particularly in high-dimensional settings²⁹. LR is a simple and interpretable linear model used for binary classification, effective for linearly separable data³⁰. Decision trees (DTs), known for their interpretability, recursively split the data based on feature values, though they are susceptible to overfitting without techniques such as pruning or ensemble learning³¹. Lastly, KNN is an instance-based method that classifies data points based on the majority vote of their k nearest neighbors, offering simplicity but often incurring high computational costs with larger datasets³².

As one of the most powerful ensemble learning techniques, gradient tree boosting is particularly well-suited for tackling high-dimensional and imbalanced clinical datasets³³. XGBoost, a highly optimized and scalable implementation of gradient boosting, offers significant advantages over alternative classifiers, including rapid computation due to advanced parallelization strategies and memory-efficient block structures³⁴. Compared to deep neural networks (DNNs), XGBoost also benefits from faster convergence, reduced computational burden, and simpler hyperparameter optimization, making it highly appropriate for this investigation³⁵.

XGBoost’s effectiveness stems from its ability to iteratively refine predictive accuracy through an ensemble of decision trees, where each subsequent tree focuses on minimizing residual errors from the prior iterations. This sequential learning approach facilitates the modeling of complex non-linear feature interactions and is particularly advantageous for structured datasets comprising clinical, biochemical, and imaging-derived parameters, such as those used in this study³⁶.
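The sequential idea described above, in which each new tree is fitted to the errors left by the trees before it, can be shown with a deliberately small sketch. This is not XGBoost: it assumes a squared-error loss, a single feature, and depth-one trees (stumps), and it omits the regularization and second-order gradient information that XGBoost adds. All names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    # Skipping the smallest value keeps both sides of the split non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        trees.append((threshold, left_value, right_value))
        preds = [p + learning_rate * (left_value if x < threshold else right_value)
                 for x, p in zip(xs, preds)]
    return base, trees, preds


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data: the target switches from 0 to 1 once the feature reaches 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
base, trees, preds = boost(xs, ys)
```

Each round shrinks the remaining error by a factor set by the learning rate, which is why a smaller learning rate typically needs more trees to reach the same fit.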

The comparative performance assessment was conducted on a withheld test dataset to ensure an unbiased evaluation of each model’s predictive capacity. Among the tested algorithms, XGBoost demonstrated superior performance, achieving the highest accuracy, sensitivity, specificity, and area under the curve (AUC) metrics (Table 2). Consequently, XGBoost was selected as the final predictive model. This choice aligns with existing literature, where gradient boosting frameworks have consistently been validated as highly effective for complex classification tasks³³. XGBoost, an ensemble learning algorithm based on gradient boosting, excels in handling structured data and capturing intricate feature interactions, making it particularly well-suited for PCOS prediction.

To ensure model robustness and mitigate the risk of overfitting, a stratified 10-fold cross-validation strategy was implemented across all models. To address class imbalances, the Synthetic Minority Over-sampling Technique (SMOTE) was applied, enhancing the model’s ability to generalize across different diagnostic cases. The dataset was then partitioned into an 80:20 split, with 80% allocated for model training and internal validation, while the remaining 20% was reserved exclusively for external performance testing. Hyperparameter tuning was conducted via an exhaustive grid search, optimizing critical parameters such as the number of trees, learning rate, maximum tree depth, and subsample ratio. Cross-validation results guided the selection of the final model configuration, ensuring balanced performance across key evaluation metrics while improving generalizability.
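The partitioning described above can be sketched with the standard library alone. The helper names, the random seed, and the round-robin fold assignment are assumptions made for illustration; the class counts come from the dataset description (356 non-PCOS, 185 PCOS). In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(indices, labels, k=10, seed=0):
    """Yield (fit_idx, val_idx) pairs whose folds keep roughly the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class out round-robin so every fold gets its share.
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    for i in range(k):
        fit_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit_idx, folds[i]


# 0 = non-PCOS, 1 = PCOS, using the class counts reported for the dataset.
labels = [0] * 356 + [1] * 185
train_idx, test_idx = stratified_split(labels)
cv_folds = list(stratified_folds(train_idx, labels))
```

A common precaution, not stated in the source, is to fit any oversampling step such as SMOTE on the fitting portion of each fold only, so that synthetic samples never reach validation or test data.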

Traditional ML methods were chosen over deep learning due to the dataset size, interpretability, and computational efficiency³⁷. Deep learning models typically require large datasets and significant computational resources, whereas traditional ML algorithms as used in this study offer interpretability and efficiency with smaller datasets³⁸. Additionally, explainability is crucial in medical applications, and traditional models provide clearer insights into feature importance, aiding clinical decision-making^38,39. Given the strong predictive performance achieved, deep learning was left for further ultrasound image processing studies, ensuring the current model remains practical, transparent, and easily deployable in real-world settings.

Figure 2 illustrates part of a decision tree within the XGBoost model used for classifying PCOS. The tree starts by splitting on Follicle No. (L) (number of follicles on the left ovary) < 0.227, where cases satisfying this condition (yes) are further divided based on hair growth < 1, leading to either a leaf node with a value of −0.013 or another split at Follicle No. (L) < 0.364, resulting in leaf nodes with values of 0.056 or −0.005. For cases where Follicle No. (L) ≥ 0.227 (no), the next split occurs at Follicle No. (L) < 0.5, leading to leaf nodes with values of 0.0164 or −0.0112. This tree clearly visualizes the model’s decision-making process, highlighting how different features such as follicle number and hair growth impact classification, providing valuable insights that could inform personalized care in clinical settings.

Feature selection

To identify the most relevant features for PCOS prediction, a multi-faceted feature selection approach was implemented. The chi-square-based SelectKBest method was applied to rank and select the most informative features, ensuring statistical relevance. Additionally, XGBoost’s built-in feature importance mechanism provided a data-driven assessment of feature contributions within the ensemble learning framework. To further enhance interpretability, SHAP (SHapley Additive exPlanations) was utilized to quantify feature impact at both global and instance levels. A SHAP summary plot was generated, offering a comprehensive visualization of the most influential predictors and their relative significance in classification.
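The chi-square ranking step can be sketched as follows. The sketch is intended to mirror the observed-versus-expected statistic that scikit-learn's `chi2` score function computes for `SelectKBest`; the function names and toy data are hypothetical. Because this statistic is defined for non-negative feature values, the sketch assumes it is applied to counts or otherwise non-negative inputs rather than to z-scored columns.

```python
def chi2_score(feature, labels):
    """Chi-square statistic between one non-negative feature and class labels."""
    total = sum(feature)
    if total == 0:
        return 0.0  # an all-zero feature cannot separate the classes
    n = len(labels)
    score = 0.0
    for cls in set(labels):
        # Observed: feature mass that falls in this class.
        observed = sum(v for v, y in zip(feature, labels) if y == cls)
        # Expected: the same mass spread in proportion to class frequency.
        expected = total * labels.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score


def select_k_best(columns, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi2_score(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: follicle counts track the label, blood group codes do not.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "follicle_count": [4, 5, 6, 12, 14, 15],
    "blood_group_code": [1, 2, 3, 1, 2, 3],
}
top = select_k_best(columns, labels, k=1)
```

A feature whose total is distributed across classes exactly in proportion to class frequency scores zero, which is why the blood group column ranks last in this toy example.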

The dataset was further structured according to the Rotterdam criteria, categorizing variables into clinical, biomarker, and ultrasound data subsets. Various combinations of these subsets were tested to determine their predictive efficacy and to evaluate the viability of the hypothesis that loosening the Rotterdam criteria could maintain or improve diagnostic performance.
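Enumerating the subset combinations is a small bookkeeping task, sketched below. The feature names inside each group are hypothetical placeholders rather than the dataset's actual column names, and the model training that would follow for each combination is left out.

```python
from itertools import combinations

# Hypothetical feature names grouped by category; the real dataset has more columns.
FEATURE_GROUPS = {
    "clinical": ["weight_gain", "hair_growth", "cycle_length"],
    "biomarker": ["amh", "lh", "fsh"],
    "ultrasound": ["follicle_no_left", "follicle_no_right"],
}


def subset_combinations(groups):
    """Yield (group names, merged feature list) for every non-empty combination."""
    names = list(groups)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            features = [f for name in combo for f in groups[name]]
            yield combo, features


experiments = list(subset_combinations(FEATURE_GROUPS))
```

Three groups give seven non-empty combinations, so each could be trained and scored under the same cross-validation scheme for a like-for-like comparison.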

Given the high-dimensional nature of the dataset, effective feature selection was crucial for optimizing model performance and mitigating overfitting. Instead of employing Recursive Feature Elimination (RFE), which systematically eliminates features in iterative steps, the integration of chi-square-based SelectKBest, XGBoost feature importance, and SHAP analysis provided a more robust, interpretable, and computationally efficient feature selection strategy. This hybrid approach ensured the retention of only the most predictive variables, enhancing diagnostic accuracy while reducing computational complexity.

Model evaluation

The comparative performance assessment of the models was conducted using multiple evaluation metrics to ensure a comprehensive analysis of predictive efficacy. The primary metric, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), quantified each model’s ability to distinguish between PCOS and non-PCOS cases. Additionally, accuracy, precision, recall (sensitivity), specificity, and F1-score were computed to provide a more nuanced evaluation of model performance.
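The metrics listed above all derive from the confusion matrix, except AUC, which depends on how the model ranks cases. A minimal standard-library sketch follows; the labels, scores, and the 0.5 decision threshold are toy assumptions, and the AUC is computed from its pairwise-ranking interpretation rather than by tracing the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # sensitivity
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (1 = PCOS).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Note that AUC is computed from the raw scores, so it does not change with the decision threshold, whereas the other metrics do.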

Among the tested models, XGBoost demonstrated superior classification performance, achieving the highest AUC-ROC, accuracy, precision, recall, and F1-score (Table 2). Its ability to handle nonlinear relationships and complex feature interactions contributed to its robust performance across different evaluation criteria. Consequently, XGBoost was selected as the final predictive model.

To enhance the model’s reliability and interpretability, a feature selection strategy was applied. The SelectKBest (chi-square) method, XGBoost’s built-in feature importance, and SHAP were utilized to identify and retain the most relevant features. Furthermore, the dataset was analyzed using clinical, ultrasound, and biomarker subsets to assess their individual and combined predictive contributions. This step also tested the feasibility of a more flexible interpretation of the Rotterdam criteria while maintaining or improving diagnostic accuracy.

To validate model robustness, confusion matrices were generated, enabling a detailed breakdown of true positive, true negative, false positive, and false negative rates. The precision-recall curve (PRC) and F1-score further reinforced XGBoost’s effectiveness in achieving a balance between precision and recall, minimizing false negatives while ensuring high classification confidence.

Statistical analysis

Descriptive statistics were computed for both PCOS and non-PCOS groups to summarize the central tendency and dispersion of numerical features. A t-test was performed to assess mean differences between the two groups, evaluating statistical significance across clinical, biochemical, and ultrasound measurements. This analysis provided crucial insights into distinguishing characteristics between PCOS and non-PCOS patients, aiding in model interpretation and validation.
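The group comparison can be illustrated with a two-sample t statistic. The source does not state which variant of the t-test was used; the sketch below assumes Welch's unequal-variance form, and the function name and toy values are hypothetical. It returns only the statistic and its degrees of freedom, since the p-value requires the t distribution, which is available in libraries such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`).

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error of each group mean (sample variance / n).
    va = variance(group_a) / na
    vb = variance(group_b) / nb
    t = (fmean(group_a) - fmean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Toy BMI-like values for a PCOS group and a non-PCOS group.
pcos = [30.0, 32.0, 34.0, 36.0, 38.0]
non_pcos = [22.0, 24.0, 26.0, 28.0, 30.0]
t_stat, dof = welch_t(pcos, non_pcos)
```

With equal group sizes and equal variances, as in this toy example, Welch's statistic coincides with the pooled-variance Student t statistic.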

All analyses were conducted using Python, with key libraries: Scikit-learn for data preprocessing, model training, and evaluation; XGBoost for gradient boosting; TensorFlow/Keras for ANN implementation; and Pandas and NumPy for data handling. Matplotlib and Seaborn were employed for data visualization, enabling comprehensive performance assessment and feature importance analysis.

Results and discussion

The dataset comprises 541 patient records with 33 continuous and 8 categorical features. Descriptive statistics for continuous features in the PCOS and non-PCOS groups are provided in Supplementary Table 2. The mean differences between PCOS and non-PCOS patients reveal notable variations in several clinical and biochemical features. Among these, weight gain (0.456), hair growth (0.442), and skin darkening (0.468) exhibited the highest mean differences, indicating strong associations with PCOS symptoms. Obesity and weight gain are strongly linked to PCOS due to insulin resistance and hormonal imbalances⁴⁰. Studies report that up to 80% of women with PCOS are overweight or obese, exacerbating both metabolic and reproductive symptoms⁴⁰. Excess hair growth (hirsutism) results from hyperandrogenism, a hallmark of PCOS, leading to increased testosterone levels that stimulate hair growth⁴¹. Skin darkening, commonly associated with acanthosis nigricans, is linked to insulin resistance, a prevalent feature in PCOS patients⁴². Other significant characteristics included the number of follicles in the left (0.247) and right (0.306) ovaries, further emphasizing the reproductive aspects of PCOS.

The number of antral follicles in both ovaries is a key diagnostic criterion for PCOS. The Rotterdam criteria define polycystic ovarian morphology as ≥ 12 antral follicles per ovary or increased ovarian volume⁶. Studies have found that follicle count is significantly higher in PCOS patients, with excessive follicular development disrupting normal ovulatory function⁴³. Cycle length (0.257) also showed a meaningful difference, reflecting altered menstrual patterns in affected individuals. PCOS patients often experience oligomenorrhea (irregular cycles) or amenorrhea due to anovulation. Research indicates that cycle lengths > 35 days are common in PCOS and correlate with hormonal imbalances, particularly elevated LH levels⁴⁴. Conversely, factors like age (−0.069), height (0.020), and blood group (0.020) displayed minimal differences, suggesting these variables may not be as relevant for distinguishing between the two groups. Studies suggest that age and height do not significantly differentiate PCOS patients from non-PCOS individuals, as PCOS is primarily an endocrine disorder rather than a structural growth abnormality⁴⁵. Blood group has not been identified as a determinant in PCOS pathology, and limited research suggests no strong correlation between PCOS and ABO blood group distribution⁴⁶. Overall, the findings highlight specific features that could serve as potential indicators in diagnosing and understanding the clinical implications of PCOS.

The t-test analysis in Table 1 provides deeper insights into the differences between PCOS and non-PCOS groups, highlighting several statistically significant variations across clinical, biochemical, and ultrasound features. Key variables such as age, weight, BMI, cycle length, hair growth, and follicle count exhibited significant differences, with p-values well below the conventional threshold of 0.05. Specifically, PCOS patients were significantly younger, with higher weight and BMI compared to their non-PCOS counterparts, aligning with existing literature indicating that PCOS often manifests during adolescence, leading to a younger average age at diagnosis⁴⁷. Additionally, the distinct patterns of excessive hair growth and increased follicle count in PCOS patients further underscore the key physiological traits associated with the condition.

While variables such as pulse rate and Vitamin D levels approached significance, suggesting potential areas for further research, other factors like height, blood group, and pregnancy status did not show significant differences between the two groups. Notably, Vitamin D deficiency is common among women with PCOS and has been linked to metabolic and hormonal imbalances⁴⁸. These findings highlight the importance of recognizing these clinical features in the diagnosis and management of PCOS, providing valuable insights to healthcare practitioners in their clinical decision-making.

Table 1 t-Test results for the dataset.

The feature correlation matrix (Supplementary Fig. 1) highlights the relationships between various clinical and biochemical variables, revealing strong correlations that align with established physiological patterns. Notably, the FSH/LH ratio and FSH exhibit a strong correlation (r = 0.97), consistent with studies indicating that an altered FSH/LH ratio is a hallmark of PCOS due to disrupted gonadotropin regulation⁴⁹. Similarly, BMI and height (r = 0.90) and waist and hip measurements (r = 0.87) show expected associations reflecting body composition metrics⁴⁹. The right and left follicle numbers also demonstrate a strong correlation (r = 0.80), reinforcing the role of ovarian morphology in PCOS diagnosis⁵⁰.

Beyond these, several features exhibit moderate positive correlations with the presence of PCOS, including right follicle number (r = 0.65), left follicle number (r = 0.60), skin darkening (r = 0.48), hair growth (r = 0.46), and weight gain (r = 0.40). These associations align with research linking hyperandrogenism and insulin resistance in PCOS to increased follicle count, acanthosis nigricans (skin darkening), and hirsutism (hair growth)^{40,41,42,43,44,45,47,50}. Outside these relationships, most features appear independent, suggesting minimal multicollinearity within the dataset. These findings reinforce known pathophysiological aspects of PCOS and highlight key clinical indicators that may aid in diagnosis and patient management.
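The correlation coefficients quoted above are Pearson's r; when one variable is the binary PCOS label, this is the point-biserial correlation. A minimal sketch of the computation follows, using hypothetical values rather than the study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical right-ovary follicle counts and PCOS labels (1 = PCOS).
right_follicles = [4, 6, 12, 15, 5, 14, 7, 13]
has_pcos = [0, 0, 1, 1, 0, 1, 0, 1]

r = pearson_r(right_follicles, has_pcos)
print(f"r = {r:.2f}")
```

Applying the same function to every pair of feature columns gives a correlation matrix, in which pairs with a large |r| are the candidates for multicollinearity.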

The SHAP summary plot (Fig. 3) illustrates the impact of various features on the model’s predictions for PCOS diagnosis. Key predictors such as follicle numbers in both ovaries, AMH levels, hair growth, and weight gain exert the most significant influence on PCOS classification. Higher values for these features (indicated in red) are strongly associated with a higher likelihood of predicting PCOS, whereas lower values (in blue) shift the model towards a non-PCOS diagnosis. These findings align with prior research establishing follicle count and AMH as reliable markers for PCOS, given their association with ovarian dysfunction and hyperandrogenism^{40,41,42,43,44,45}.

The plot also highlights variability in feature importance, with factors like FSH, LH, and age exhibiting mixed effects depending on their values. Notably, follicle numbers on both ovaries and AMH levels emerge as the most significant predictors, with the right ovary’s follicle count demonstrating the strongest influence. This variation between ovaries is expected, as the right ovary generally receives a more direct blood supply from the abdominal aorta, whereas the left ovary’s supply is routed via the renal vein, potentially influencing follicular development⁵¹.

These findings reinforce the third criterion of the Rotterdam classification for PCOS diagnosis, polycystic ovarian morphology or elevated AMH levels. Given that ultrasound and clinical assessments are non-invasive and cost-effective compared to biochemical testing, this raises an important question: Is it sufficient to rely solely on these methods for PCOS detection? While ultrasound and clinical features provide valuable diagnostic insights, integrating biochemical markers such as AMH and hormone levels may enhance accuracy, particularly in borderline or ambiguous cases.

Figure 4 presents a SHAP waterfall plot that illustrates the contribution of individual features to the model’s prediction for a high-confidence, correctly classified case of PCOS. The plot starts at the model’s base value (E[f(X)] = −0.628), the expected output across the training set, and sequentially adds each feature’s SHAP value to arrive at the final prediction score (f(x) = 5.292). This high score strongly favors the PCOS class, indicating a confident classification. The most influential features include the follicle count in the right ovary (+2.27) and left ovary (+1.20), both of which align with the diagnostic criterion of polycystic ovarian morphology⁵⁰. Other strong positive contributors include weight gain (+0.59), menstrual irregularity (+0.49), elevated anti-Müllerian hormone (AMH; +0.44), skin darkening (+0.38), and low vitamin D3 levels (+0.36), all of which are consistent with established clinical and biochemical indicators of PCOS^{40,41,42,43,44,45}. In contrast, the absence of hirsutism (−0.27) and relatively low prolactin levels (−0.34) contribute negatively to the prediction, reflecting features that are less pronounced in this particular phenotype. Collectively, these results demonstrate the model’s ability to synthesize diverse features into an interpretable and individualized prediction consistent with known PCOS diagnostic patterns. Notably, the relative importance and directionality of these features are in agreement with the SHAP summary plot shown in Fig. 3, reinforcing the consistency and clinical relevance of the model’s interpretability.
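The additive structure of the waterfall plot can be checked with simple arithmetic: the base value plus all SHAP values equals the final score. The sketch below uses only the numbers quoted for Fig. 4. The dictionary keys are shorthand labels rather than dataset column names, and the conversion to a probability assumes the score is a log-odds margin, which the text does not state.

```python
import math

base_value = -0.628   # E[f(X)] reported for Fig. 4
final_score = 5.292   # f(x) reported for Fig. 4

# SHAP values named in the text (keys are shorthand labels, not column names).
named_contributions = {
    "follicles_right": 2.27,
    "follicles_left": 1.20,
    "weight_gain": 0.59,
    "menstrual_irregularity": 0.49,
    "amh": 0.44,
    "skin_darkening": 0.38,
    "vitamin_d3": 0.36,
    "hirsutism_absent": -0.27,
    "prolactin": -0.34,
}

named_total = sum(named_contributions.values())

# Additivity: f(x) = E[f(X)] + sum of all SHAP values, so whatever the named
# features do not explain must come from features not listed in the text.
remaining = final_score - base_value - named_total

# ASSUMPTION: the score is a log-odds margin, so a sigmoid maps it to a probability.
probability = 1.0 / (1.0 + math.exp(-final_score))

print(f"named features: {named_total:+.2f}, other features: {remaining:+.2f}")
print(f"implied probability: {probability:.3f}")
```

Under that assumption, the nine named features account for most of the distance between the base value and the final score, with the remainder spread over the features that the text does not list.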

The performance of the ML models reveals differences in their generalization abilities. The ANN achieves perfect training metrics but experiences a drop in test performance, with an AUC of 0.91 and accuracy of 0.84, indicating a minor overfitting issue. The SVM shows strong performance during training, with an AUC of 0.97, and generalizes well to the test set, maintaining an AUC of 0.93 and accuracy of 0.87. LR also performs well, with high training AUC and precision, but its test performance slightly decreases, yielding an AUC of 0.92 and accuracy of 0.85. KNN shows good training results but struggles more with generalization, with a noticeable performance drop on the test set (AUC of 0.90 and accuracy of 0.83). Finally, XGBoost demonstrates the best overall performance, achieving perfect training results and maintaining high test metrics (AUC of 0.98, accuracy of 0.93), suggesting strong robustness and excellent generalization capabilities; it was therefore selected for further analysis. Table 2 presents the comparative performance metrics for both training and test sets, while Supplementary Fig. 2 and Supplementary Table 4 provide the corresponding confusion matrices and ROC-AUC results. In conclusion, while ANN and XGBoost perform the best in terms of training, XGBoost is the most reliable across both training and test data, followed by SVM, LR, and KNN, which experiences more significant overfitting.

Table 2 Comparative performance metrics for the training and test sets.

To assess whether ultrasound and clinical assessments alone are sufficient for PCOS detection, the XGBoost model was evaluated using different feature sets, examining the influence of clinical, biochemical, and ultrasound data on diagnosis.

In the first iteration, only ultrasound features were used, Fig. 5. While the model performed well on the training set (AUC = 0.9636, F1 = 0.8406, Accuracy = 0.8981), its performance declined on the test set (AUC = 0.8312, F1 = 0.6441, Accuracy = 0.8073). The confusion matrix for the training set showed 272 true positives (TP), 116 true negatives (TN), 15 false positives (FP), and 29 false negatives (FN). On the test set, the model identified 69 TP, 19 TN, 8 FP, and 13 FN.

These results suggest that while ultrasound features provide valuable insights, additional feature types are essential for improved model generalization and robust PCOS detection.

In the second iteration, both ultrasound and biochemical features were used, Fig. 6, resulting in a slightly higher AUC on the training set (0.9986), along with an F1 score of 0.9686 and an accuracy of 0.9792. However, the test set performance remained lower, with an AUC of 0.8364, an F1 score of 0.6349, and an accuracy of 0.7890. The confusion matrix for the training set showed 284 true positives (TP), 139 true negatives (TN), 3 false positives (FP), and 6 false negatives (FN), while the test set identified 66 TP, 20 TN, 11 FP, and 12 FN.

These results indicate that although incorporating biochemical data led to some improvement, the model continued to face challenges in generalizing to unseen data.

In the third iteration, the model was trained using clinical and biochemical features, Fig. 7. It achieved strong performance on the training set (AUC = 0.9997, F1 = 0.9897, Accuracy = 0.9931), but its performance on the test set remained lower (AUC = 0.8482, F1 = 0.5185, Accuracy = 0.7615). The training confusion matrix showed 285 true positives (TP), 144 true negatives (TN), 2 false positives (FP), and 1 false negative (FN), while the test set identified 69 TP, 14 TN, 8 FP, and 18 FN.

These results indicate that, without ultrasound features, the model struggled to generalize effectively to unseen data.

In the fourth iteration, the model was trained using ultrasound and clinical features (Fig. 8), yielding the best test set performance up to that point and supporting the hypothesis that these two feature groups may be sufficient for PCOS detection. The training set achieved an AUC of 0.9999, an F1 score of 0.9895, and an accuracy of 0.9931. Notably, the test set showed significant improvement, with an AUC of 0.9545, an F1 score of 0.8000, and an accuracy of 0.8899. The confusion matrices indicated 287 true positives (TP), 142 true negatives (TN), 0 false positives (FP), and 3 false negatives (FN) for the training set, while the test set identified 73 TP, 24 TN, 4 FP, and 8 FN.

These results demonstrate improved generalization to unseen data compared to previous iterations, highlighting the importance of combining ultrasound and clinical features for PCOS diagnosis.
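The accuracy and F1 values reported for the four feature-set iterations can be recomputed from the test-set confusion matrices quoted above. The sketch below uses the counts exactly as labeled in the text. One observation from the arithmetic, offered as a reading rather than a correction: the reported F1 scores are reproduced when the count labeled TN is treated as the positive class, which suggests that the F1 values refer to the less frequent class.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1(tp, fp, fn):
    """Harmonic mean of precision and recall for the class counted in tp."""
    return 2 * tp / (2 * tp + fp + fn)


# Test-set counts as labeled in the text: (TP, TN, FP, FN) per iteration.
test_counts = {
    1: (69, 19, 8, 13),
    2: (66, 20, 11, 12),
    3: (69, 14, 8, 18),
    4: (73, 24, 4, 8),
}

# (accuracy, F1) as reported in the text for the same iterations.
reported = {
    1: (0.8073, 0.6441),
    2: (0.7890, 0.6349),
    3: (0.7615, 0.5185),
    4: (0.8899, 0.8000),
}

results = {}
for iteration, (tp, tn, fp, fn) in test_counts.items():
    results[iteration] = {
        "accuracy": accuracy(tp, tn, fp, fn),
        "f1_as_labeled": f1(tp, fp, fn),
        # Treat the count labeled TN as the positive class instead.
        "f1_other_class": f1(tn, fp, fn),
    }
    print(iteration, {k: round(v, 4) for k, v in results[iteration].items()})
```

Accuracy is unaffected by which class is called positive, so it matches the reported figures under either reading.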

To further enhance model performance, hyperparameter optimization was applied while using the same subset of ultrasound and clinical features, Fig. 9. This refinement resulted in a perfect training set performance, with an AUC of 1.0, precision of 1.0, an F1 score of 1.0, and an accuracy of 1.0. While such results indicate that the model learned the training data exceptionally well, they also suggest the possibility of overfitting. On the test set, the optimized model demonstrated significant improvement over previous iterations, achieving an AUC of 0.9852, precision of 0.9583, an F1 score of 0.9388, and an accuracy of 0.9384. Compared to the best pre-optimization test performance (AUC = 0.9545, F1 = 0.8000, Accuracy = 0.8899), these results show a marked enhancement in generalization to unseen data. The notable increase in precision and F1 score suggests a more reliable classification of positive cases, reducing false positives and false negatives.

These findings confirm that optimizing hyperparameters can substantially improve model performance, reinforcing the effectiveness of combining ultrasound and clinical features for PCOS detection. However, the perfect training scores raise concerns about potential overfitting, suggesting that further validation on independent datasets would be beneficial to ensure robust real-world applicability.

Lastly, to refine model performance and identify the most influential predictors, feature selection was applied using the Chi-Square based SelectKBest, Fig. 10. This process identified the top features contributing to PCOS diagnosis, including skin darkening, hair growth, weight gain, menstrual cycle irregularities, fast food consumption, follicle count in both ovaries, pimples, hair loss, and AMH levels. These selected features were then used to retrain the XGBoost model.
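To make the scoring step concrete, the sketch below ranks features by a chi-square statistic in the spirit of chi-square-based SelectKBest: for each class, the observed sum of a non-negative feature is compared with the sum expected if the feature were independent of the label. This is a plain-Python illustration, not the authors' pipeline, and the feature names and values are hypothetical.

```python
def chi2_score(feature, labels):
    """Chi-square statistic between one non-negative feature and a binary label."""
    n = len(labels)
    total = sum(feature)
    score = 0.0
    for cls in (0, 1):
        observed = sum(f for f, y in zip(feature, labels) if y == cls)
        expected = total * sum(1 for y in labels if y == cls) / n
        if expected > 0:  # skip empty classes or all-zero features
            score += (observed - expected) ** 2 / expected
    return score


def select_k_best(columns, labels, k):
    """Return the names of the k highest-scoring feature columns."""
    ranked = sorted(columns, key=lambda name: chi2_score(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical binary features (1 = present) and labels (1 = PCOS).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "hair_growth": [1, 1, 1, 0, 0, 0, 0, 0],
    "fast_food": [1, 1, 0, 1, 1, 0, 1, 0],
    "blood_group_a": [1, 0, 1, 0, 1, 0, 1, 0],
}

top = select_k_best(columns, labels, k=2)
print(top)
```

Because the statistic depends on feature sums, it is only meaningful for non-negative inputs such as counts or yes/no indicators, which is worth keeping in mind when continuous measurements are scored alongside binary symptoms.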

Post-feature selection, the model maintained strong performance, achieving a training set AUC of 0.9947, precision of 0.9553, F1 score of 0.9553, and accuracy of 0.9553. The test set performance remained high, with an AUC of 0.9840, precision of 0.9333, an F1 score of 0.9459, and accuracy of 0.9452. These results indicate that the model not only retained its predictive strength but also achieved improved generalization, as evident from the consistently high AUC and F1 scores on the test set.

The top-ranked features align well with known clinical indicators of PCOS, reinforcing their relevance in diagnosis. Notably, follicle count in both ovaries remained an essential predictor, confirming the importance of ultrasound-based features. Meanwhile, clinical parameters such as skin darkening, hair growth, and weight gain emerged as key contributors, further emphasizing the role of metabolic and dermatological symptoms in PCOS assessment.

Overall, the feature selection process enhanced model interpretability while preserving high classification performance. By reducing the number of input variables, this approach also improves computational efficiency and reduces the risk of overfitting.

The iterative evaluation of different feature sets provided valuable insights into the influence of ultrasound, clinical, and biochemical data on PCOS diagnosis using the XGBoost model, Table 3. Detailed performance metrics, including sensitivity, specificity, and recall, are presented in Supplementary Table 3a. The performance trends observed across iterations highlight the strengths and limitations of each feature combination, ultimately leading to an optimized and interpretable model.

In the first iteration, using only ultrasound features, the model achieved relatively strong training performance (AUC = 0.9636, F1 = 0.8406, Accuracy = 0.8981). However, test set performance dropped significantly (AUC = 0.8312, F1 = 0.6441, Accuracy = 0.8073), suggesting that ultrasound features alone were insufficient for robust generalization. Supplementary Table 3b details the feature sets applied across model versions.

The second iteration, incorporating ultrasound and biomarker features, resulted in improved training performance (AUC = 0.9986, F1 = 0.9686, Accuracy = 0.9792) but showed only a marginal improvement in test AUC (0.8364) and a slight decline in test F1 score (0.6349). This indicated that while biochemical markers contributed to model learning, they did not significantly enhance test set generalization.

The third iteration, using clinical and biomarker features, exhibited the highest training performance (AUC = 0.9997, F1 = 0.9897, Accuracy = 0.9931), yet test performance remained relatively weak (AUC = 0.8482, F1 = 0.5185, Accuracy = 0.7615). The sharp discrepancy between training and test scores suggested overfitting, highlighting the necessity of ultrasound features for improved model generalization.

A significant breakthrough occurred in the fourth iteration, where the combination of ultrasound and clinical features led to the best test set performance observed up to that point (AUC = 0.9545, F1 = 0.8000, Accuracy = 0.8899). This iteration confirmed the importance of integrating ultrasound-based imaging with clinical symptoms for a more reliable and generalized PCOS diagnosis.

To further improve performance, hyperparameter optimization was applied in the fifth iteration, retaining ultrasound and clinical features. The resulting model exhibited perfect training set performance (AUC = 1.0000, F1 = 1.0000, Accuracy = 1.0000) and significantly improved test scores (AUC = 0.9852, F1 = 0.9388, Accuracy = 0.9384). These results demonstrated that fine-tuning model parameters effectively enhanced prediction accuracy and reliability on unseen data.

Finally, the sixth iteration involved feature selection using the Chi-Square based SelectKBest. The selected features, including skin darkening, hair growth, weight gain, menstrual cycle irregularities, follicle count on both ovaries, and AMH levels, improved interpretability while maintaining high predictive performance. The optimized model achieved an AUC of 0.9947, F1 score of 0.9553, and accuracy of 0.9553 on the test set, representing the best generalization observed throughout the iterations.

These findings emphasize that ultrasound and clinical data are the most valuable feature set for PCOS detection, and that adding AMH can further enhance diagnostic accuracy.
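One simple way to read the overfitting discussion for iterations 1 to 5 is to look at the gap between training and test AUC. The sketch below uses only the AUC values reported in the text for those iterations; the labels in the dictionary are shorthand descriptions.

```python
# (training AUC, test AUC) as reported in the text for iterations 1-5.
reported_auc = {
    "1 ultrasound": (0.9636, 0.8312),
    "2 ultrasound + biochemical": (0.9986, 0.8364),
    "3 clinical + biochemical": (0.9997, 0.8482),
    "4 ultrasound + clinical": (0.9999, 0.9545),
    "5 ultrasound + clinical, tuned": (1.0000, 0.9852),
}

# A large positive gap means the model fits the training data much better
# than it predicts unseen data.
gaps = {name: train - test for name, (train, test) in reported_auc.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")

smallest_gap = min(gaps, key=gaps.get)
largest_gap = max(gaps, key=gaps.get)
```

On these figures the gap is largest when ultrasound is paired with biochemical features and smallest after hyperparameter tuning of the ultrasound and clinical feature set.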

Table 3 Model metrics for each iteration.

Table 4 presents a comparison of classification results with existing studies. In the context of PCOS diagnosis, most studies in the literature emphasize accuracy as their primary performance metric, making it a widely accepted measure of model effectiveness. However, while accuracy is straightforward and provides an intuitive understanding of how often predictions are correct, it can be less informative in cases with imbalanced datasets, like those often encountered in medical diagnoses, where the number of positive cases (PCOS) is smaller relative to the negative cases.

AUC is particularly valuable in such imbalanced scenarios. It evaluates the model’s ability to distinguish between classes across various thresholds, offering a more nuanced view of diagnostic performance. In situations like PCOS, where accurately identifying true cases is crucial despite class imbalances, AUC can provide a clearer indication of how well the model differentiates between PCOS and non-PCOS cases. Thus, while accuracy gives an overview of overall correctness, AUC is more relevant for assessing the model’s discrimination capability.
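The difference between the two metrics can be shown on a small, hypothetical imbalanced sample. In the sketch below, AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and accuracy uses an assumed decision threshold of 0.5.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.5):
    """Fraction of cases classified correctly at the given threshold."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


# Hypothetical sample: 2 positive cases, 8 negative cases.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Model A gives every case the same low score (always predicts "negative").
constant = [0.1] * 10
# Model B ranks both positives above all negatives, but below the 0.5 threshold.
ranked = [0.45, 0.40, 0.30, 0.20, 0.10, 0.10, 0.05, 0.05, 0.02, 0.01]

for name, scores in (("constant", constant), ("ranked", ranked)):
    print(name, accuracy(scores, labels), auc(scores, labels))
```

Both models reach the same accuracy because neither crosses the threshold, yet only the second one separates the classes, which is what AUC captures.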

Table 4 Comparison of classification results with literature.

Considering these factors, prioritizing AUC in the evaluation of PCOS diagnostic models could yield more meaningful insights into their effectiveness, particularly when data imbalances are present. Additionally, it is essential to report both training and test results to ensure that the model generalizes well to new data, avoiding overfitting and ensuring robust performance. This approach supports a more balanced and thorough assessment of the model’s strengths and potential limitations.

In summary, integrating ultrasound features with clinical data significantly improved the model’s ability to predict PCOS, particularly on unseen test data. The highest test AUC scores were achieved when both ultrasound and partial clinical data were included, yielding an AUC of 0.9852. The addition of AMH further enhanced model performance, increasing the test AUC to 0.9947. In contrast, excluding ultrasound features led to a substantial drop in test performance, underscoring the critical role of ultrasound data in accurately diagnosing PCOS.

Limitations and further discussion

This study aimed to assess whether ultrasound and clinical assessments alone are sufficient for PCOS detection by evaluating various feature combinations. Given that ultrasound and clinical assessments are non-invasive and cost-effective compared to biochemical testing, understanding their standalone diagnostic power is crucial. Additionally, the study explored whether incorporating biochemical markers, such as AMH, could enhance diagnostic accuracy, particularly in borderline or ambiguous cases.

Through an iterative modeling process, key insights were gained:

Ultrasound features alone (Iteration 1) achieved strong training performance (AUC = 0.9636, Accuracy = 0.8981) but struggled with unseen test data (AUC = 0.8312, Accuracy = 0.8073), indicating that while ultrasound captures follicular patterns, it may not fully encapsulate all PCOS-related variations.
Adding biochemical features to ultrasound data (Iteration 2) improved training performance (AUC = 0.9986) but had minimal impact on test performance (AUC = 0.8364), suggesting that biochemical markers alone do not drastically enhance predictive power.
Using clinical and biochemical features (Iteration 3) resulted in high training performance (AUC = 0.9997) but suboptimal test performance (AUC = 0.8482), reinforcing that excluding ultrasound weakens generalization.
Combining ultrasound and clinical features (Iteration 4) provided the best non-optimized test performance (AUC = 0.9545), confirming their complementary diagnostic value.
Hyperparameter optimization (Iteration 5) significantly improved test results (AUC = 0.9852), demonstrating the importance of fine-tuning model parameters.
Feature selection (Iteration 6) retained the most relevant features (e.g., skin darkening, hair growth, weight gain, cycle regularity, follicle count, and AMH levels) while maintaining excellent test performance (AUC = 0.9947), improving model interpretability and clinical applicability.

The final XGBoost model achieved strong diagnostic performance; however, its accuracy was marginally lower than that of some models reported in the literature. This difference can be attributed to several important factors. First, our study prioritized clinical applicability and generalizability by selecting a reduced yet clinically meaningful set of features, rather than relying on all available variables. This approach reduces overfitting and enhances real-world usability, though it may marginally impact performance metrics. Second, we used a heterogeneous, real-world dataset encompassing a broad range of PCOS presentations, in contrast to prior studies that may have employed filtered or pre-balanced datasets that artificially inflate performance. Additionally, we performed external validation using a publicly available dataset to ensure robustness, a step often overlooked in high-performing models that report results based only on internal validation. These methodological choices reflect a deliberate trade-off: favoring transparency, reproducibility, and clinical relevance over overly optimistic accuracy figures.

To further reinforce feature robustness and interpretability, future work could incorporate additional analytical techniques. In this study, a hybrid feature selection pipeline was implemented, combining chi-square-based SelectKBest, XGBoost’s embedded feature importance, and SHAP analysis to identify impactful features. To complement these, permutation importance could be employed to quantify performance drops upon feature perturbation, providing a model-agnostic perspective on relevance. Additionally, Deep Feature Synthesis (DFS) offers a promising strategy to automatically construct high-order feature interactions, particularly valuable for structured tabular datasets. While DFS was not applied here due to dataset size limitations, its integration in future models could enhance representational power. SHAP analysis, central to both global and local interpretability in this study, proved especially effective in aligning model behavior with clinical reasoning, thereby improving trust and transparency in AI-assisted decision-making.
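Permutation importance, mentioned above as a possible complement, can be sketched in a few lines: shuffle one feature column, re-score the model, and record how much performance drops. The example below is a plain-Python illustration with a hypothetical stand-in model and made-up rows; it is not the model or data used in this study.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model output equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling the values of one column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + (v,) + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


def toy_model(row):
    """Hypothetical stand-in classifier: thresholds column 0, ignores column 1."""
    return int(row[0] >= 10)


# Hypothetical rows: (follicle_count, unrelated_flag); labels: 1 = PCOS.
rows = [(12, 1), (15, 0), (11, 1), (14, 0), (5, 1), (6, 0), (7, 1), (4, 0)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

imp_follicles = permutation_importance(toy_model, rows, labels, column=0)
imp_unrelated = permutation_importance(toy_model, rows, labels, column=1)
print(imp_follicles, imp_unrelated)
```

A feature the model never uses shows zero importance by construction, which is what makes the method a useful model-agnostic cross-check on SHAP and embedded importance rankings.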

External validation

External validation was conducted using a publicly available dataset from Figshare⁵² consisting of 320 instances, 160 labeled as PCOS and 160 as non-PCOS. The dataset includes 18 features covering various diagnostic criteria for PCOS, including BMI, menstrual irregularity, hirsutism, ovarian cyst characteristics, and serum hormone levels; Supplementary Table 2 gives the details of the dataset.

To evaluate the generalizability of the model, XGBoost was trained on this dataset, and Chi-Square based SelectKBest feature selection was applied to identify the most significant predictors. The top features selected were:

1. 1st Cyst Size Length (cm) and 1st Cyst Size Width (cm).
2. Number of Ovarian Cysts.
3. Serum FSH and Serum LH.
4. BMI, Menstrual Irregularity, and Blood Pressure.
5. Hirsutism.

The model achieved perfect scores (1.0) across all evaluation metrics (accuracy, precision, F1 score, and AUC) for both the training and test sets. While this suggests strong predictive performance, such results raise important concerns:

Potential Overfitting: Typically, perfect scores indicate that a model has memorized patterns rather than generalizing well to new data. However, since the test set also yielded perfect performance, this might not be a conventional case of overfitting.
Data Leakage: If any feature in the dataset is highly correlated with the target variable, the model may have learned direct mappings rather than meaningful patterns. It is essential to verify that no unintended information from the target variable has leaked into the features.
Similarity Between Training and Test Sets: If the test set closely resembles the training data, the model may not be facing truly novel cases, leading to inflated performance metrics.

To ensure the model’s reliability, a thorough review of the data pipeline is necessary to confirm the absence of leakage. Additionally, testing on an independent dataset from a different source would provide further validation of the model’s robustness for real-world PCOS diagnosis.
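Two of the checks suggested above can be sketched directly: looking for test rows that duplicate training rows, and looking for a single feature whose values split the classes perfectly. The helper names and the rows below are hypothetical and are not taken from the Figshare dataset.

```python
def duplicated_rows(train_rows, test_rows):
    """Return test rows that also appear verbatim in the training set."""
    seen = set(train_rows)
    return [row for row in test_rows if row in seen]


def separating_threshold(values, labels):
    """Return a cut-off that splits the two classes perfectly, or None."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if min(pos) > max(neg):
        return (min(pos) + max(neg)) / 2
    if max(pos) < min(neg):
        return (max(pos) + min(neg)) / 2
    return None


# Hypothetical rows: (number_of_cysts, bmi); labels: 1 = PCOS.
train = [(12, 27.5), (2, 29.0), (14, 23.1), (1, 24.3)]
train_labels = [1, 0, 1, 0]
test = [(12, 27.5), (3, 23.8)]

overlap = duplicated_rows(train, test)
cyst_cut = separating_threshold([r[0] for r in train], train_labels)
bmi_cut = separating_threshold([r[1] for r in train], train_labels)
print(overlap, cyst_cut, bmi_cut)
```

A feature that separates the classes on its own is not necessarily leakage, since it may simply encode a diagnostic criterion, but it is a natural starting point when every metric equals 1.0.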

Key findings

1. Ultrasound and clinical features alone provide a promising diagnostic pathway, achieving an AUC of 0.9852, Precision of 0.9583, F1 Score of 0.9388, and Accuracy of 0.9384, demonstrating their potential as non-invasive diagnostic markers.
2. Feature selection improves interpretability while maintaining strong predictive performance.
3. Hyperparameter tuning enhances generalizability, reducing misclassifications.
4. AUC proves to be a superior evaluation metric in imbalanced datasets, ensuring a more reliable assessment of model performance.
5. External validation resulted in perfect performance metrics (AUC = 1.0, Precision = 1.0, F1 Score = 1.0, Accuracy = 1.0), raising questions about potential overfitting or dataset bias, underscoring the need for further independent validation.

Considerations for real-world implementation

While the study demonstrates strong diagnostic potential, translating ML-based models into clinical practice presents several challenges:

Regulatory Requirements: Adoption in healthcare requires compliance with medical regulations, such as FDA (U.S.), CE (Europe), HIPAA (U.S.), and GDPR (EU). These frameworks ensure patient safety, data privacy, and ethical AI deployment.
Physician Adoption & Trust: Successful clinical integration depends on physician confidence in the model’s predictions. Providing explainability (e.g., SHAP, LIME) and decision-support tools could enhance adoption.
Barriers to Deployment:
- Dataset Variability: The model must generalize across diverse ethnic groups, ultrasound devices, and clinical environments.
- Data Accessibility: Limited access to large-scale, high-quality datasets remains a key challenge.
- Technical Integration: Embedding ML models into electronic health records (EHRs) and diagnostic software requires seamless interoperability.

To evaluate clinical utility beyond statistical performance, Decision Curve Analysis (DCA) was conducted. Following the interpretative framework by Ardakani et al., DCA curves were generated to examine net benefit across a range of threshold probabilities⁵³. As shown in Supplementary Fig. 3, the XGBoost model yielded a higher net benefit than both the “treat all” and “treat none” strategies across thresholds between 0.2 and 0.8, a range relevant for clinical uncertainty in PCOS diagnosis. This implies that the model could meaningfully support clinical decision-making within this probability range, particularly when clinicians are unsure whether to refer patients for further hormonal or ultrasound testing. Net benefit was highest when using ultrasound + clinical features, aligning with our findings of superior generalizability and accuracy in this feature set.
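For readers unfamiliar with DCA, the quantity plotted is the net benefit at a threshold probability p_t, commonly defined as TP/N − (FP/N) × p_t / (1 − p_t). The sketch below applies this standard definition to hypothetical predictions; it is not the authors' implementation, and the probabilities are made up.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of acting on every case with predicted probability >= threshold."""
    n = len(labels)
    tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of acting on every case regardless of the model."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical predicted probabilities and labels (1 = PCOS).
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

nb_model = net_benefit(probs, labels, threshold=0.5)
nb_all = net_benefit_treat_all(labels, threshold=0.5)
nb_none = 0.0  # "treat none" has zero net benefit by definition
print(nb_model, nb_all, nb_none)
```

Repeating the calculation over a grid of thresholds and plotting the three strategies yields a decision curve; the model is useful wherever its curve lies above both reference strategies.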

To improve clinician understanding and trust in AI-assisted diagnosis, YOLO-annotated ultrasound images should be integrated into clinical support tools. These images can highlight detected follicles using bounding boxes and overlay key diagnostic information such as follicle count and size distribution, directly extracted from object detection outputs. In future iterations, incorporating annotated transabdominal or transvaginal ultrasound scans, enhanced with YOLOv8-based follicle detection, classification predictions (e.g., “PCOS-likely”), and interpretability layers such as SHAP value overlays or Grad-CAM heatmaps, can help clinicians visualize how AI models arrive at diagnostic decisions. Displaying critical diagnostic features such as enlarged ovaries or excessive follicle clusters, along with model confidence scores and respective follicle diameters, can bridge the “black box” gap. These visual explanations are especially valuable in borderline or ambiguous cases, promoting transparency and aligning AI outputs with established clinical reasoning.

The role of external validation

To enhance credibility and generalizability, further validation on independent datasets and prospective clinical trials is essential. Future work should focus on:

1. Multi-center validation with diverse patient cohorts to assess model robustness.
2. Prospective clinical trials to evaluate real-world performance in live diagnostic settings.
3. Comparing ML-based diagnosis with physician assessments to measure its effectiveness in actual clinical decision-making.

Conclusion

This study highlights the potential of ML, specifically the XGBoost model, in advancing the diagnosis of PCOS through the integration of diverse feature sets. Among these, ultrasound characteristics, particularly follicle count, emerged as the most predictive features. When combined with clinical data, these features significantly enhanced diagnostic accuracy, offering a reliable and non-invasive alternative to traditional, often more invasive, biochemical assessments.

A key strength of this work is its systematic exploration of feature subgroups aimed at streamlining and optimizing the diagnostic process. By evaluating various combinations of ultrasound, clinical, and biochemical features, the study establishes a diagnostic framework that is both time-efficient and highly accurate, an essential consideration in real-world clinical environments where early intervention is critical.

Looking ahead, several steps are necessary to bridge the gap between research and clinical application. These include pursuing regulatory pathways for model deployment, developing a user-friendly clinical tool compatible with electronic health record (EHR) systems, and conducting prospective validation studies to assess long-term effectiveness and generalizability. Enhancing model explainability through individual prediction visualizations and interpretability frameworks will also be crucial for building clinician trust and ensuring transparency in decision-making.

In summary, this work contributes to the development of scalable, cost-effective, and non-invasive diagnostic tools for PCOS. With continued refinement and clinical validation, such models hold the potential to accelerate diagnosis, reduce healthcare burdens, and support more personalized care strategies, marking an important step forward in precision medicine for women’s health.

Data availability

The dataset used in this study, Kottarathil, P. Polycystic Ovary Syndrome (PCOS), is publicly available through Kaggle. The dataset includes clinical, biochemical, and ultrasound results of patients, and it was utilized for the purpose of diagnosing PCOS using machine learning models. It can be accessed at https://www.kaggle.com/datasets/prasoonkottarathil/polycystic-ovary-syndrome-pcos.

References

Teede, H. J. et al. Recommendations from the 2023 international Evidence-based guideline for the assessment and management of polycystic ovary syndrome. Fertil. Steril. 120, 767–793 (2023).
Stener-Victorin, E. et al. Polycystic ovary syndrome. Nat Rev. Dis. Primers 10, (2024).
Bahri Khomami, M. et al. Systematic review and meta-analysis of pregnancy outcomes in women with polycystic ovary syndrome. Nat Commun 15, (2024).
Gibson-Helm, M., Teede, H., Dunaif, A. & Dokras, A. Delayed diagnosis and a lack of information associated with dissatisfaction in women with polycystic ovary syndrome. J. Clin. Endocrinol. Metab. 102, 604–612 (2017). https://doi.org/10.1210/jc.2016-2963
Dokras, A. et al. Gaps in knowledge among physicians regarding diagnostic criteria and management of polycystic ovary syndrome. Fertil. Steril. 107, 1380–1386e1 (2017).
Teede, H. J. et al. Recommendations from the 2023 international Evidence-based guideline for the assessment and management of polycystic ovary syndrome. J. Clin. Endocrinol. Metab. 108, 2447–2469 (2023).
Christy, E. & Blanco, D. A. W.-B. F. Early diagnosis in polycystic ovary syndrome. Nurse Pract. 47, 18–24 (2022).
Sydora, B. C. et al. Challenges in diagnosis and health care in polycystic ovary syndrome in canada: a patient view to improve health care. BMC Womens Health 23, (2023).
Azziz, R., Marin, C., Hoq, L., Badamgarav, E. & Song, P. Health care-related economic burden of the polycystic ovary syndrome during the reproductive life span. J. Clin. Endocrinol. Metab. 90, 4650–4658 (2005). https://doi.org/10.1210/jc.2005-0628
2.0. www.tnpj.com. (2022).
Wadden, T. A., Tronieri, J. S. & Butryn, M. L. Lifestyle modification approaches for the treatment of obesity in adults. Am. Psychol. 75, 235–251 (2020). https://doi.org/10.1037/amp0000517
Richens, J. G., Lee, C. M. & Johri, S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun 11, 3923 (2020).
Ahsan, M. M., Luna, S. A. & Siddique, Z. Machine-Learning-Based Disease Diagnosis: A Comprehensive Review. Healthcare (Switzerland) 10, (2022). https://doi.org/10.3390/healthcare10030541
Dixon, D. et al. Unveiling the influence of AI predictive analytics on patient outcomes: A comprehensive narrative review. Cureus https://doi.org/10.7759/cureus.59954 (2024).
Zad, Z. et al. Predicting polycystic ovary syndrome with machine learning algorithms from electronic health records. Front. Endocrinol. (Lausanne) 15, 1298628 (2024).
Panjwani, B., Yadav, J., Mohan, V., Agarwal, N. & Agarwal, S. Optimized machine learning for the early detection of polycystic ovary syndrome in women. Sensors 25, (2025).
Chen, W., Miao, J., Chen, J. & Chen, J. Development of machine learning models for diagnostic biomarker identification and immune cell infiltration analysis in PCOS. J. Ovarian Res. 18, (2025).
Chen, J. Y. et al. Screening of serum biomarkers in patients with PCOS through lipid omics and ensemble machine learning. PLoS One 20, (2025).
Ahmetasevic, A. et al. Using Artificial Neural Network in Diagnosis of Polycystic Ovary Syndrome. in 2022 11th Mediterranean Conference on Embedded Computing, MECO 2022 (Institute of Electrical and Electronics Engineers Inc., 2022). https://doi.org/10.1109/MECO55406.2022.9797204
Lim, J. et al. Machine learning-based evaluation of application value of traditional Chinese medicine clinical index and pulse wave parameters in the diagnosis of polycystic ovary syndrome. Eur J. Integr. Med 64, (2023).
Elmannai, H. et al. Polycystic Ovary Syndrome Detection Machine Learning Model Based on Optimized Feature Selection and Explainable Artificial Intelligence. Diagnostics 13, (2023).
Tiwari, S. et al. SPOSDS: A smart polycystic ovary syndrome diagnostic system using machine learning. Expert Syst. Appl 203, (2022).
Rahman, M. M. et al. Empowering early detection: A web-based machine learning approach for PCOS prediction. Inform Med. Unlocked 47, (2024).
Khanna, V. V. et al. A distinctive explainable machine learning framework for detection of polycystic ovary syndrome. Applied Syst. Innovation 6, (2023).
Abu Adla, Y. A. et al. Automated Detection of Polycystic Ovary Syndrome Using Machine Learning Techniques. in International Conference on Advances in Biomedical Engineering, ICABME vol. 2021-October 208–212 (Institute of Electrical and Electronics Engineers Inc., 2021).
Ahmad, R., Maghrabi, L. A., Khaja, I. A., Maghrabi, L. A. & Ahmad, M. SMOTE-Based Automated PCOS Prediction Using Lightweight Deep Learning Models. Diagnostics 14, (2024).
Kottarathil, P. Polycystic Ovary Syndrome (PCOS). Kaggle (2020).
Nair, A., Devaser, V. & Arora, K. Machine Learning Applications in the Prediction of Polycystic Ovarian Syndrome. in Generative Artificial Intelligence for Biomedical and Smart Health Informatics 565–589 (Wiley, 2024). https://doi.org/10.1002/9781394280735.ch27
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
jrsssb_20_2_215.
Loh, W. Y. Classification and regression trees. Wiley Interdiscip Rev. Data Min. Knowl. Discov. 1, 14–23 (2011).
Cover, T. M. & Hart, P. E. Approximate formulas for the information transmitted by a discrete communication channel. IEEE Trans. Inf. Theory 24 (1952).
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 29, 1189–1232 (2001).
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining vol. 13-17-August-2016 785–794 (Association for Computing Machinery, 2016).
Ke, G. et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. https://github.com/Microsoft/LightGBM
Wang, R., Zhang, J., Shan, B., He, M. & Xu, J. XGBoost machine learning algorithm for prediction of outcome in aneurysmal subarachnoid hemorrhage. Neuropsychiatr Dis. Treat. 18, 659–667 (2022).
Ravi, D. et al. Deep learning for health informatics. IEEE J. Biomed. Health Inf. 21, 4–21 (2017).
Liu, Y., Chen, P. H. C., Krause, J. & Peng, L. How to Read Articles That Use Machine Learning: Users’ Guides to the Medical Literature. JAMA 322, 1806–1816 (2019). https://doi.org/10.1001/jama.2019.16489
Tjoa, E. & Guan, C. A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans. Neural Netw. Learn. Syst. 32, 4793–4813 (2021).
Barber, T. M., Hanson, P., Weickert, M. O. & Franks, S. Obesity and polycystic ovary syndrome: implications for pathogenesis and novel management strategies. Clin. Med. Insights Reprod. Health. 13, 117955811987404 (2019).
Carmina, E., Rosato, F., Jannì, A., Rizzo, M. & Longo, R. A. Relative prevalence of different androgen excess disorders in 950 women referred because of clinical hyperandrogenism. J. Clin. Endocrinol. Metab. 91, 2–6 (2006).
Dong, Z. et al. Associations of acanthosis nigricans with metabolic abnormalities in polycystic ovary syndrome women with normal body mass index. J. Dermatol. 40, 188–192 (2013).
Jonard, S. & Dewailly, D. The follicular excess in polycystic ovaries, due to intra-ovarian hyperandrogenism, may be the main culprit for the follicular arrest. Hum. Reprod. Update 10, 107–117 (2004). https://doi.org/10.1093/humupd/dmh010
Teede, H. J. et al. Recommendations from the international evidence-based guideline for the assessment and management of polycystic ovary syndrome. Fertil. Steril. 110, 364–379 (2018).
Azziz, R. et al. The androgen excess and PCOS society criteria for the polycystic ovary syndrome: the complete task force report. Fertil. Steril. 91, 456–488 (2009).
Dogan, O. Are ABO/Rh blood groups a risk factor for polycystic ovary syndrome? Medicine (United States) 102, e34944 (2023).
Meczekalski, B. et al. PCOS in Adolescents—Ongoing Riddles in Diagnosis and Treatment. Journal of Clinical Medicine vol. 12 Preprint at (2023). https://doi.org/10.3390/jcm12031221
Moieni, A. et al. Vitamin D levels and lipid profiles in patients with polycystic ovary syndrome. BMC Womens Health 24, (2024).
Zhu, M. et al. The waist-to-height ratio is a good predictor for insulin resistance in women with polycystic ovary syndrome. Front Endocrinol. (Lausanne) 15, (2024).
Kim, J. J., Hwang, K. R., Chae, S. J., Yoon, S. H. & Choi, Y. M. Impact of the newly recommended antral follicle count cutoff for polycystic ovary in adult women with polycystic ovary syndrome. Hum. Reprod. 35, 652–659 (2020).
IVIRMA Global. Difference Between Left Ovary and Right Ovary. (2022).
Zafar, S. PCOS data.xlsx. figshare (2024).
Abbasian Ardakani, A. et al. Interpretation of artificial intelligence models in healthcare: A pictorial guide for clinicians. J. Ultrasound Med. (2024). https://doi.org/10.1002/jum.16524

Author information

Authors and Affiliations

Department of Mechanical, Aerospace, and Nuclear Engineering, Rensselaer Polytechnic Institute, Troy, USA
Mehtap Agirsoy & Matthew A. Oehlschlaeger

Authors

Mehtap Agirsoy
Matthew A. Oehlschlaeger

Contributions

Mehtap Agirsoy: Data curation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft. Matthew A. Oehlschlaeger: Supervision, Writing – review & editing.

Corresponding author

Correspondence to Mehtap Agirsoy.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Agirsoy, M., Oehlschlaeger, M.A. A machine learning approach for non-invasive PCOS diagnosis from ultrasound and clinical features. Sci Rep 15, 33638 (2025). https://doi.org/10.1038/s41598-025-10453-9

Received: 05 December 2024
Accepted: 03 July 2025
Published: 29 September 2025
Version of record: 29 September 2025
DOI: https://doi.org/10.1038/s41598-025-10453-9
