Introduction

Azoospermia, the absence of sperm in ejaculate, affects approximately 1% of men during their peak reproductive years. It represents one of the most severe forms of male infertility1,2. Approximately 40% of azoospermia cases result from normal testicular spermatogenic function hindered by reproductive tract blockages3. The remaining cases involve insufficient or absent sperm production within the testes, categorizing the condition into obstructive (OA) and non-obstructive azoospermia (NOA). Distinguishing between OA and NOA is crucial for optimizing treatment strategies, as the two conditions require significantly different approaches. However, current diagnostic practices, particularly those in resource-limited settings, present several challenges.

In clinical practice, distinguishing between obstructive azoospermia (OA) and non-obstructive azoospermia (NOA) typically requires invasive procedures, such as testicular biopsies or sperm retrieval techniques, which may not be accessible in primary andrology or resource-limited settings. Although non-invasive methods, including hormonal assessments and ultrasound imaging, are commonly used, they often lack sufficient sensitivity and specificity to provide a definitive diagnosis. This underscores the urgent need for more accessible and reliable diagnostic tools, particularly in resource-constrained environments, to ensure timely and accurate diagnoses for all patients, regardless of geographic or economic barriers, and to guide optimal treatment decisions.

Treatment options for OA and NOA differ significantly. Microsurgical vasoepididymostomy (MVE) has emerged as a highly effective surgical intervention, leading to successful natural pregnancy in approximately 38% of treated patients4. For patients who remain unsuccessful in natural conception attempts, whose partner has limited ovarian reserve, or who opt against surgery, as well as for OA patients with obstructions at other sites, conception may still be possible through intracytoplasmic sperm injection using sperm obtained by percutaneous epididymal sperm aspiration (PESA) or percutaneous testicular sperm aspiration (TESA). In cases of NOA, microsurgical testicular sperm extraction (mTESE) or testicular sperm extraction (TESE) is an effective surgical approach to retrieve viable sperm1,5,6. Testicular histopathology is considered a crucial predictive factor for the success of sperm retrieval7,8.

However, testicular histopathology may only reflect the status of most seminiferous tubules in patients, as those with Sertoli cell-only syndrome (SCOS) can also obtain sperm through mTESE, thereby potentially predicting fertilization success prior to sperm retrieval surgery9. For NOA patients, the European Association of Urology guidelines recommend avoiding TESA or fine needle aspiration before mTESE or TESE, while the American Urological Association/American Society for Reproductive Medicine (AUA/ASRM) guidelines discourage pre-sperm retrieval diagnostic testicular biopsy1,10. In contrast, percutaneous testicular or epididymal sperm aspiration proves effective for OA patients, although outcomes are not guaranteed until sperm are observed. Therefore, avoiding diagnostic testicular biopsy in NOA patients is essential, as the procedure is often uninformative. This highlights the importance of identifying potential NOA cases among azoospermic patients. According to AUA/ASRM guidelines, azoospermia patients with a testicular longitudinal axis < 4.6 cm, FSH > 7.6 IU/l, and/or semen volume > 0.5/1.0 ml are more likely to be diagnosed with NOA10. Nonetheless, different studies employ various biomarkers and cutoff values11,12.

In recent years, machine learning has shown promise in clinical settings for improving diagnostic accuracy and predicting patient outcomes. However, predictive models for NOA are notably lacking, especially in primary healthcare institutions with limited diagnostic resources. This study seeks to apply machine learning techniques to basic male parameters for accurate identification of NOA among azoospermia patients. After developing a nomogram, we will perform comprehensive validation to assess its effectiveness. We expect that this approach will improve personalized diagnosis and treatment for patients with azoospermia.

Materials and methods

Study population and observational index

A retrospective study was conducted on 352 patients diagnosed with azoospermia, collected by the Department of Andrology and Sexual Medicine of the First Affiliated Hospital of Fujian Medical University from January 2020 to February 2024. This study received approval from the Medical Ethics Committee of the First Affiliated Hospital of Fujian Medical University (MRCTA, ECFAH of FMU [2019] 213 and [2020] 375) and was conducted in compliance with the principles outlined in the Declaration of Helsinki. Informed written consent was obtained from all participants.

Azoospermia was confirmed in all patients after more than three semen centrifugation procedures (3000 g, 15 min; Centrifuge 5425, Eppendorf, Hamburg, Germany), spaced at two-week intervals, yielded no detectable sperm13. The exclusion criteria were hypogonadotropic hypogonadism, age under 14 years, and incomplete clinical data. Detailed medical histories were obtained, including inquiries about orchitis, epididymitis, mumps, prior testicular trauma and surgery, and cryptorchidism status. Similarly, potential iatrogenic causes, such as the use of gonadotoxic medications, prior radiation exposure, and environmental exposure, were all taken into consideration. Patients with cryptorchidism had undergone orchiopexy at least six months prior to the study.

Testicular volume (TV) was determined using a Prader’s orchidometer by two experienced andrologists (Song-xi Tang and Hui-liang Zhou). Subsequently, the volumes of the left and right testes were averaged to calculate the mean testicular volume (MTV) for each patient. Ejaculate volume and semen pH were averaged from multiple assessments. The clinical assessments included measuring serum levels of prolactin, follicle-stimulating hormone (FSH), luteinizing hormone (LH), estradiol, testosterone (T), and inhibin B (INHB) between 8:00 a.m. and 10:00 a.m. Varicocele was diagnosed through physical examination conducted by two experienced andrologists (Hui-liang Zhou and Song-xi Tang) in conjunction with color Doppler ultrasound.

Patients initially diagnosed with NOA underwent karyotype analysis of peripheral blood and Y chromosome microdeletion analysis; whole-exome sequencing was deemed unnecessary. Patients being assessed for OA underwent evaluations utilizing color Doppler ultrasound (GE LOGIQ Fortis, GE Healthcare, Chicago, USA) or magnetic resonance imaging (Magnetom Prisma, Siemens, Munich, Germany) along with physical examinations to determine the site of obstruction.

Histopathological analysis of the testis

Histopathological examination of the testis involved embedding the testicular tissues in paraffin and serially sectioning them at a thickness of 4 μm for routine hematoxylin and eosin staining. The sections were observed under light microscopy (CKX31, Olympus, Tokyo, Japan). The condition characterized by the presence of only Sertoli cells was classified as SCOS. Maturation arrest (MA) was diagnosed when spermatogonia, primary spermatocytes, secondary spermatocytes, or spermatids were present without mature sperm. Hypospermatogenesis referred to testicular pathology where a small amount of mature sperm was observable. Diagnosis was based on the predominant histopathological pattern observed. Patients with testicular pathological conditions such as SCOS, MA, or hypospermatogenesis were classified as NOA, whereas specimens exhibiting normal or near-normal spermatogenesis were classified as OA14.

Statistical analysis

Statistical analysis was performed using SPSS software version 27.0 (IBM Corp., Armonk, NY, USA). Continuous variables were expressed as means ± standard deviation or medians with interquartile ranges, while categorical variables were presented as frequencies or percentages. The Shapiro-Wilk test was used to assess data distribution, and Levene’s test evaluated variance homogeneity. Non-normally distributed continuous variables were compared using the Mann-Whitney U test. Pearson’s chi-square and Fisher’s exact probability tests were used to compare rates between groups. Univariate and multivariate logistic regression analyses were conducted to identify independent risk factors. From the entire dataset, 70% was randomly selected for training the models, while the remaining 30% was allocated for testing.
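The 70/30 random split described above can be sketched as follows. This is a minimal illustration in Python, not the SPSS/R workflow used in the study; the function name `split_train_test`, the seed value, and the use of simple (non-stratified) shuffling are all assumptions made for the example.

```python
import random

def split_train_test(ids, train_fraction=0.7, seed=2024):
    """Randomly partition patient identifiers into training and test lists.

    The seed and the rounding rule are illustrative assumptions.
    """
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# 352 hypothetical patient indices (0..351), matching the cohort size
train_ids, test_ids = split_train_test(range(352))
print(len(train_ids), len(test_ids))
```

With this rounding rule the sketch assigns 246 and 106 cases, which is close to but not identical with the 244/108 split reported in the Results, so the authors' splitting procedure presumably differed in detail.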

The machine learning algorithms and nomogram were constructed within R version 4.2.3 (R Foundation for Statistical Computing, Vienna, Austria). We utilized nine distinct machine learning algorithms for modeling: Random Forest, Gradient Boosting Decision Trees (GBDT), XGBoost, LightGBM, Naive Bayes, Support Vector Machine (SVM), Logistic Regression, Decision Trees, and Neural Networks. The Random Forest model ensures reproducibility by optimizing the “mtry” hyperparameter using “tuneRF” and applying regularization through parameters such as “ntree = 500,” “nodesize = 5,” and “maxnodes = 30” to mitigate overfitting and enhance stability. The GBDT model optimizes hyperparameters via 5-fold cross-validation, tuning key parameters including “n.trees = 100,” “interaction.depth = 3,” “shrinkage = 0.05,” and “n.minobsinnode = 10” to balance complexity and reduce overfitting. Similarly, the XGBoost model employs 5-fold cross-validation with early stopping to mitigate overfitting. Parameters such as “max_depth = 6,” “eta = 0.1,” “subsample = 0.8,” “colsample_bytree = 0.8,” “min_child_weight = 1,” “lambda = 1,” and “alpha = 0” are tuned to manage complexity, with the optimal number of boosting rounds determined to ensure reproducibility. The LightGBM model follows a similar approach, using 5-fold cross-validation and early stopping (limited to 10 rounds) to prevent overfitting. Parameters such as “num_leaves = 31,” “learning_rate = 0.05,” “feature_fraction = 0.8,” “bagging_fraction = 0.8,” “bagging_freq = 5,” “lambda_l1 = 0.1,” “lambda_l2 = 0.1,” and “min_data_in_leaf = 20” are selected to balance model complexity and reproducibility by determining the optimal number of boosting rounds.
The Naive Bayes model is trained using 5-fold cross-validation, with hyperparameter tuning on the Laplace smoothing parameter (“laplace = 0, 0.5, 1”), kernel usage, and the adjustment factor (“adjust = 1”), ensuring reproducibility with consistent tuning parameters and cross-validation settings. The SVM model uses a radial basis function kernel, tuning hyperparameters such as “C” (cost) and “sigma” (gamma) through 5-fold cross-validation and grid search (“C = 0.1, 1, 10” and “sigma = 0.01, 0.05, 0.1”) to prevent overfitting and ensure reproducibility. The Logistic Regression classifier employs Lasso regularization (“alpha = 1”) and is trained via 5-fold cross-validation (“nfolds = 5”) to select the optimal regularization parameter (“lambda.min”), promoting model generalization and preventing overfitting through automatic hyperparameter tuning. Reproducible results are achieved based on a fixed dataset and consistent parameter settings. The Decision Tree model is tuned through 10-fold cross-validation, adjusting parameters such as “cp = 0.01,” “minsplit = 20,” and “maxdepth = 5” to prevent overfitting and ensure reproducibility by controlling complexity and optimizing generalization. Finally, the Neural Network model uses 10-fold cross-validation, tuning hyperparameters such as the number of hidden units (“size = 5, 10, 15”) and regularization strength (“decay = 0, 0.001, 0.01”) to prevent overfitting with L2 regularization, ensuring reproducibility through consistent parameter settings and cross-validation.
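To make the tuning procedure concrete, the sketch below shows how k-fold cross-validation combines with a hyperparameter grid, using the SVM grid quoted above (three cost values and three sigma values) and 5 folds. It is written in Python for illustration only; the study used R, and the helper name `k_fold_indices` and the contiguous-fold layout are assumptions.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) for k contiguous folds of near-equal size.

    Assumes the rows were shuffled beforehand.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)   # first `extra` folds get one more row
        valid_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

# SVM grid quoted in the text: 3 cost values x 3 sigma values
svm_grid = list(product([0.1, 1, 10], [0.01, 0.05, 0.1]))

folds = list(k_fold_indices(244, 5))   # 244 = training-set size reported in the Results
n_fits = len(svm_grid) * len(folds)    # one model fit per (setting, fold) pair
print(len(svm_grid), [len(valid) for _, valid in folds], n_fits)
```

Each candidate setting is scored on every held-out fold, and the setting with the best average score is retained, so a 9-point grid with 5 folds requires 45 model fits.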

The predictive performance was assessed using the area under the receiver operating characteristic (ROC) curve. Graphs were created using GraphPad Prism 10.0 (GraphPad Software Inc., San Diego, CA, USA). The calibration curve and decision curve analysis (DCA) were calculated using R software. Statistical significance was set at P < 0.05 for all two-tailed tests.
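For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen NOA case receives a higher predicted score than a randomly chosen OA case. The following Python sketch computes it directly from that definition; the function name `roc_auc` and the toy scores are illustrative assumptions, not study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (invented values): 1 = NOA, 0 = OA
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(roc_auc(toy_labels, toy_scores))  # 8 of 9 pairs ranked correctly
```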

Results

Basic information

A total of 352 patients with azoospermia were included in the study, comprising 152 (43.18%) with OA and 200 (56.82%) with NOA. Among the 152 OA patients, 48 (31.58%) had concomitant varicocele, including 34 with unilateral varicocele, 14 with bilateral varicocele, and 6 patients with severe varicocele on at least one side. Severe varicocele was defined through clinical physical examination as palpable varicocele, with ultrasound (GE LOGIQ Fortis, GE Healthcare, Chicago, USA) revealing a maximum inner diameter of at least 3.1 mm during quiet respiration, and reflux observed during the Valsalva maneuver lasting for at least 6 s.

Among the NOA patients, there were 30 cases of Klinefelter syndrome, 12 cases of Y chromosome microdeletion, 18 cases with prior surgery for bilateral cryptorchidism, 11 cases of acquired NOA, and 129 cases of idiopathic NOA. Of the 200 NOA patients, 73 (36.50%) were diagnosed with concomitant varicocele. Of these, 44 had unilateral varicocele, 29 had bilateral varicocele, and 9 exhibited severe varicocele on at least one side.

Machine learning methods analysis

The data were randomly divided into a training set and a validation set in a 7:3 ratio. The training set comprised 244 cases, with 141 cases categorized as NOA and 103 cases categorized as OA. The validation set included 108 cases, with 59 categorized as NOA and 49 as OA. Statistical analysis demonstrated no significant difference between the two sets (all P > 0.05; Supplementary Table 1).

After performing both univariate and multivariate logistic regression analyses on the training set, we observed that semen pH and FSH were positively correlated with NOA, while MTV and INHB were negatively correlated with NOA (Table 1). Subsequently, we conducted a binary logistic regression including only these four factors. This analysis revealed that semen pH (OR = 12.30, P = 0.019) and FSH (OR = 1.59, P < 0.001) were positively correlated with NOA. In contrast, MTV (OR = 0.73, P = 0.013) and INHB (OR = 0.98, P = 0.019) were negatively correlated with NOA.
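The odds ratios above can be interpreted with a short calculation. The Python sketch below assumes, as is conventional for logistic regression, that each OR refers to a one-unit increase in the predictor with the other predictors held fixed; the units (for example 1 IU/L for FSH) and the helper names are assumptions, since the text does not state them.

```python
import math

# Odds ratios reported in the text for the four-factor model
odds_ratios = {"semen_pH": 12.30, "FSH": 1.59, "MTV": 0.73, "INHB": 0.98}

def coefficient(odds_ratio):
    """Logistic regression coefficient (log-odds per unit) implied by an OR."""
    return math.log(odds_ratio)

def odds_multiplier(odds_ratio, delta):
    """Factor by which the odds of NOA change for a `delta`-unit increase."""
    return odds_ratio ** delta

for name, oratio in odds_ratios.items():
    direction = "raises" if coefficient(oratio) > 0 else "lowers"
    print(f"{name}: beta = {coefficient(oratio):+.3f} ({direction} the odds of NOA)")

# Under the per-unit assumption, a 5-unit rise in FSH multiplies the odds roughly tenfold
print(round(odds_multiplier(odds_ratios["FSH"], 5), 2))
```

Because odds ratios compound multiplicatively, an OR close to 1 per unit (such as 0.98 for INHB) can still correspond to a substantial effect across the clinically observed range of the marker.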

Table 1 Predictors of non-obstructive azoospermia in azoospermia patients via logistic regression analysis.

We further analyzed and optimized this model based on these four factors using nine machine learning methods: Random Forest, GBDT, XGBoost, LightGBM, Naive Bayes Classifier, SVM, Logistic Classifier, Decision Trees, and Neural Networks. The performance of each machine learning model was compared using AUC, sensitivity, and specificity (Table 2). Among the nine methods, GBDT demonstrated the highest performance in model validation, with an area under the curve (AUC) of 0.974, reflecting its superior classification capabilities. In contrast, Random Forest had the lowest AUC at 0.953, indicating comparatively weaker performance. DeLong’s test further revealed that a significant difference in AUC values was found only between GBDT and Random Forest (P < 0.05), while the AUC values of the other seven models showed no statistically significant differences compared to GBDT (P > 0.05).

Table 2 Comparison of the performance of nine machine learning methods.

Establishment and validation of the nomogram model

Following this, a comprehensive multivariate analysis led to the development of a novel nomogram model (Fig. 1). This model incorporates four independent variables to predict NOA in azoospermia patients. Notably, it identifies FSH as the primary predictor among these variables.

Fig. 1

The construction of the nomogram in predicting non-obstructive azoospermia in azoospermia patients.

The ROC curve shows that the AUC is 0.984 in the training set and 0.976 in the validation set, demonstrating strong predictive performance of the nomogram model (Fig. 2a-b). Moreover, the calibration curve of the nomogram exhibited favorable alignment between predicted and observed outcomes, and the P-values of the Hosmer-Lemeshow test were greater than 0.05, indicating good model fit (Fig. 2c-d). Furthermore, the DCA underscored the enhanced net benefit associated with employing the predictive model in azoospermia patients (Fig. 2e-f).
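The net benefit plotted in a decision curve follows the standard definition: true positives per patient minus false positives per patient, with the latter weighted by the odds of the chosen threshold probability. The Python sketch below illustrates this calculation; the function name `net_benefit` and the toy inputs are assumptions, and the study's own DCA was computed in R.

```python
def net_benefit(labels, probabilities, threshold):
    """Net benefit of classifying as NOA when predicted probability >= threshold.

    NB = TP/n - FP/n * threshold / (1 - threshold)
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

# Toy example (invented values): 1 = NOA, 0 = OA
toy_labels = [1, 1, 0, 0]
toy_probs = [0.9, 0.6, 0.7, 0.2]
print(net_benefit(toy_labels, toy_probs, threshold=0.5))
```

A decision curve repeats this calculation over a range of thresholds and compares the model with the "treat all" and "treat none" strategies; the model is useful where its curve lies above both.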

Fig. 2

The evaluation of the nomogram in predicting non-obstructive azoospermia in azoospermia patients.

The cut-off points were determined for several biomarkers: FSH at 7.50 IU/L (AUC = 0.96, Sensitivity = 91.50%, Specificity = 93.20%); INHB at 43.45 pg/ml (AUC = 0.95, Sensitivity = 100.0%, Specificity = 82.30%); MTV at 9.92 ml (AUC = 0.91, Sensitivity = 99.00%, Specificity = 75.90%); and semen pH at 6.95 (AUC = 0.71, Sensitivity = 97.90%, Specificity = 42.7%).
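The text does not state how these cut-off points were chosen. One common criterion is to maximize Youden's J (sensitivity + specificity - 1) along the ROC curve, and the Python sketch below illustrates that approach under this assumption. The function names and the toy values are hypothetical and do not reproduce the study data.

```python
def sensitivity_specificity(labels, values, cutoff, higher_is_positive=True):
    """Sensitivity and specificity of the rule 'value beyond cutoff => NOA'.

    Assumes labels contain at least one 1 (NOA) and one 0 (OA).
    """
    tp = fn = tn = fp = 0
    for y, v in zip(labels, values):
        predicted = v > cutoff if higher_is_positive else v < cutoff
        if y == 1:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    return tp / (tp + fn), tn / (tn + fp)

def youden_cutoff(labels, values, higher_is_positive=True):
    """Return the midpoint between observed values that maximizes Youden's J."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def youden_j(cutoff):
        sens, spec = sensitivity_specificity(labels, values, cutoff, higher_is_positive)
        return sens + spec - 1

    return max(candidates, key=youden_j)

# Toy FSH-like values (invented): higher values suggest NOA (label 1)
toy_labels = [0, 0, 0, 1, 1, 1]
toy_fsh = [2.0, 4.0, 6.0, 10.0, 14.0, 22.0]
print(youden_cutoff(toy_labels, toy_fsh))

# For markers where lower values suggest NOA (e.g. INHB, MTV), flip the direction
toy_mtv = [14.0, 18.0, 16.0, 5.0, 8.0, 9.0]
print(youden_cutoff(toy_labels, toy_mtv, higher_is_positive=False))
```

The `higher_is_positive` flag matters here because FSH and semen pH point toward NOA when elevated, whereas INHB and MTV point toward NOA when reduced.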

In the validation set, NOA prediction rates were computed using the cutoff values of the four indicators: INHB < 43.45 pg/ml yielded 100.00% (48/48), FSH > 7.5 IU/L yielded 96.30% (52/54), MTV < 9.92 ml yielded 95.00% (38/40), and pH > 6.95 yielded 63.74% (58/91). The cutoff value of semen pH had a significantly lower NOA prediction rate in the validation set than each of the other three biomarkers (all P < 0.05), whereas the differences among the other three were not statistically significant (all P > 0.05).
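The prediction rates above can be reproduced from the reported counts. The Python sketch below reads each fraction as the number of NOA cases among validation-set patients meeting the cut-off, that is, a positive predictive value; this reading is an interpretation, since the text does not define the denominators explicitly.

```python
# (NOA cases meeting the cut-off, all patients meeting the cut-off), as reported
counts = {
    "INHB < 43.45 pg/ml": (48, 48),
    "FSH > 7.5 IU/L": (52, 54),
    "MTV < 9.92 ml": (38, 40),
    "pH > 6.95": (58, 91),
}

# Prediction rate in percent for each rule
rates = {rule: 100 * noa / total for rule, (noa, total) in counts.items()}

for rule, rate in rates.items():
    print(f"{rule}: {rate:.2f}%")
```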

Discussion

Identifying potential NOA patients among azoospermic individuals is crucial for developing effective treatment strategies. Because diagnostic testicular biopsy is discouraged in azoospermic patients, predicting NOA from non-invasive clinical parameters is necessary, although it remains a significant challenge for andrologists. Our study employed nine machine learning algorithms, all of which exhibited robust predictive capabilities for NOA in the validation set, thereby affirming the validity of our model. Among the four basic male fertility indicators (FSH, INHB, MTV, and semen pH), FSH was found to be the most effective predictor.

FSH, a hormone secreted by the anterior pituitary gland, acts on Sertoli cells expressing FSH receptor. It is recognized for its role in stimulating mitotic and meiotic deoxyribonucleic acid synthesis in spermatogonia and preleptotene spermatocytes, and promoting cell survival through its interactions with Sertoli cells15. FSH serves as a key regulator of Sertoli cell numbers during testicular development, with the quantity of Sertoli cells produced in the perinatal period determining the germ cell count in adult males16. In adulthood, the size of the testes is influenced by circulating FSH levels, which correlate with the number of spermatogenic cells15,17. In previous studies, the critical values of FSH ranged from 5.9 to 10.895 IU/l11,14,18,19,20,21. However, we observed that the two papers from North America reported FSH critical values of 5.9 and 7.6 IU/l, which are the lowest among the six papers14,21. The remaining studies reported values higher than these two and all involved Asian populations11,18,19,20. Similarly, Regent et al.12 described a TV threshold of 12.4 ml for NOA and OA among 45 individuals of European descent. In contrast, the TV threshold derived from our training set of 244 Asian individuals was 9.92 ml. Shamohammadi et al.11 have noted differences in the thresholds for NOA and OA among different populations, suggesting that variations in testicular volume thresholds may stem from ethnic and racial differences. Andrologists should adjust their diagnostic criteria based on racial and regional differences when assessing azoospermic patients, highlighting the necessity of refining azoospermia parameters for each population group.

Other parameters contribute to our nomogram prediction model. INHB, secreted by Sertoli cells, exerts a direct inhibitory effect on FSH. Primarily functioning through negative feedback on the pituitary gland, INHB not only regulates FSH levels but also serves as an indicator of sperm production to some degree22. Prior research22,23,24 has thoroughly investigated the predictive role of INHB in the outcomes of NOA patients undergoing mTESE, and low levels of INHB and elevated FSH have been associated with reduced sperm count25,26. Zhang et al.20 described an INHB threshold of 41.5 pg/ml in a retrospective study of 70 Asian individuals. Following calculations for NOA predictions in our validation set, the INHB threshold from our study (43.45 pg/ml) showed no statistically significant difference from that threshold. This may be attributed to the nearly identical ethnic and regional composition of the populations included in both studies. Our study found that NOA patients often have a “normal” semen pH value, which has been rarely discussed in previous research. The majority of semen components are secretions originating from the seminal vesicle and prostate gland, with semen pH primarily reflecting the acid-base equilibrium between the alkaline secretion from the seminal vesicle and the acidic secretion from the prostate gland9,27. In individuals with NOA, where the seminal ducts remain unobstructed, ejaculate volume and semen pH typically align closely with normal values. However, in cases of OA, variations in ejaculate volume and semen pH are occasionally observed, depending on the site of the obstruction. In instances of bilateral ejaculatory duct obstruction, the presence of alkaline seminal vesicle fluid in the semen is minimal, consequently yielding a relatively low pH value.
While some patients with bilateral epididymal obstruction may exhibit normal ejaculate volume and pH values, patients with epididymal obstruction due to congenital developmental defects may also have seminal vesicle dysplasia28. According to our predictive model, semen pH, as a fundamental indicator, may offer some value in distinguishing certain types of azoospermia, potentially aiding andrologists in identifying such patients. However, we must acknowledge that, whether in terms of predictive efficacy or accuracy rates for NOA, semen pH performs less effectively than other parameters in the predictive model, particularly FSH.

Many other parameters hold potential in distinguishing OA and NOA. Particularly when FSH and INHB levels are at a borderline threshold, it becomes difficult to definitively categorize the patient as either OA or NOA. In addition to requiring high discriminatory sensitivity and specificity, the selection of biomarkers must also account for convenience, affordability, and accessibility. Jarow et al. found that semen volume is significantly higher in patients with testicular failure compared to those with OA29. Gao et al. also discovered that combining FSH with plasma leptin and serum leptin enhances the discriminatory power between OA and NOA more effectively than using FSH alone19. Foresta et al. observed fewer intratesticular vessels in NOA patients compared to both OA patients and normal populations using testicular color Doppler ultrasound30. Du et al. reported a significantly higher incidence of seminal duct abnormalities in OA patients (92.2% vs. 2.8%) through scrotal and transrectal ultrasonography of azoospermic patients31. Li et al. identified a significantly higher strain ratio in NOA patients compared to OA patients using real-time elastography; however, an effective evaluation of the predictive value of real-time elastography for NOA is still lacking32. One study evaluated the apparent diffusion coefficient (ADC) using magnetic resonance imaging and found that ADC values in the testes of OA patients were significantly lower than in NOA patients, but the predictive accuracy of this method was limited12. Hesham et al.'s prospective study revealed that besides ADC, the normalized apparent diffusion coefficient, choline, and lipids from diffusion-weighted magnetic resonance imaging and proton magnetic resonance spectroscopy effectively differentiate between OA and NOA patients. However, this method not only requires advanced medical facilities but also involves high costs and long waiting times33.
While some tsRNAs have shown potential in differentiating between OA and NOA, their sensitivity and specificity remain inferior to the new model we have developed34. Additionally, there are emerging biomarkers such as seminal plasma cysteine-rich secretory protein 1 (ref. 35) and seminal plasma testis-expressed protein 101 (ref. 36). Furthermore, Fietz et al. proposed that testis-specific proteins L-lactate dehydrogenase C chain, Phosphoglycerate kinase 2, Dipeptidase 3, as well as reproductive cell-enriched heat shock proteins Heat shock-related 70 kDa protein 2 and Heat shock 70 kDa protein 4L, hold promise as biomarkers for spermatogenic function based on seminal plasma proteomic analysis37. In general, there is still a long way to go in distinguishing between OA and NOA.

The limitations of this study are primarily attributed to several factors. Firstly, there is a lack of external validation within patient cohorts, which limits the generalizability and robustness of the predictive models. Secondly, although the classification method used for defining NOA and OA is the most widely accepted in the scientific community, there remains an ongoing debate regarding their precise definitions. Lastly, the model’s applicability is further restricted by the absence of sufficiently reliable and reproducible imaging techniques that could enhance its accuracy and convenience. These limitations highlight the need for future research, including multicenter investigations, to address these issues and improve the model’s broader applicability.

In conclusion, we utilized nine machine learning algorithms and fundamental male fertility parameters to predict NOA. After validation, our predictive model and nomogram demonstrated accurate identification of NOA. Moving forward, we aim to develop an online calculator that incorporates additional factors to further refine our model’s performance in personalized medicine.