Introduction

Prostate cancer (PCa) is the second most common cancer in males, with an estimated 1.4 million new cases in 2020 worldwide1. The current standard PCa screening methods include measuring the level of prostate-specific antigen (PSA) in serum and digital rectal examination (DRE)2. In particular, serum PSA testing allows early cancer detection that can affect patient outcome3. Still, the limited specificity of an elevated PSA level alone results in a high false positive rate, leading to unnecessary prostate biopsies and subsequent overtreatment4. Therefore, there is a critical need for a non-invasive, DRE-free screening platform that is highly accurate and minimizes false results for PCa diagnosis.

There has been a surge in research to identify PCa-specific biomarkers that can surpass the diagnostic performance of PSA5,6. Noncoding RNAs and fusion transcripts show measurable diagnostic signal in urine, tissue, and blood, indicating that molecular readouts can complement clinical assessment7,8,9,10. However, single-marker assays capture only one biological axis of cancers with heterogeneous origins and are typically sensitive to cohort differences, related benign conditions, and pre-analytical variation, thereby reducing stability across studies and clinical settings6,11,12. These limitations motivate the use of multivariate panels that integrate biologically complementary biomarkers with relevant clinical variables to improve accuracy and reproducibility beyond single analytes13,14,15. Clinical models composed of multiple biomarker panels could better accommodate pathway diversity, provide more reliable risk estimates, and reduce unnecessary biopsies by refining decisions in the PSA gray zone, where single markers often underperform14,16. Since prostate-derived RNAs are readily detected in DRE-free urine, often at elevated levels compared to those in blood, this matrix supports a clinically relevant and patient-friendly sampling method that preserves key molecular signatures14,17,18. Given this framework for diverse molecular signatures representative of disease states, machine learning (ML) offers a principled way to combine multiple RNA markers, capture interactions and nonlinearity, and generate individualized risk estimates suitable for real-world screening workflows11,19.

In this study, we developed a novel method called Prostate Cancer diagnosis using Sensitive and Sophisticated ML classifiers based on non-invasive urinary RNA biomarkers (PCASSO) to diagnose PCa using ML techniques and urinary RNA biomarkers from DRE-free whole urine samples. Six types of supervised ML classifiers were evaluated with 20 urinary RNA biomarkers (2 long noncoding RNAs (lncRNAs), 1 fusion gene mRNA and 17 miRNAs). Furthermore, this study aimed to evaluate the performance of the pre-trained ML classifier on PSA gray zone patients (PSA 3–10 ng/mL), for whom clinical decision-making is often ambiguous. Therefore, this study presents novel ML-based diagnostic models using DRE-free urinary RNA biomarkers and evaluates their performance through robust internal validation, providing a basis for future external validation studies.

Methods

Patient information and clinical data collection

This study was approved by the Institutional Review Board of Asan Medical Center (Seoul, Republic of Korea/IRB No. 2019-1312) and written informed consent was obtained from all patients. All clinical data and urine samples were prospectively collected from patients treated between 2019 and 2021 at Asan Medical Center. The final cohort included urine samples from 51 patients with benign prostatic hyperplasia (BPH) and 112 patients with histologically verified PCa. Clinicopathological characteristics are summarized in Table 1.

Table 1 Summary of patient characteristics.

To establish a definitive diagnostic standard for model training and evaluation, we defined the ground truth as follows: For the 98 patients (out of 112 PCa cases) who underwent radical prostatectomy (RP), the final histopathological report from the resected prostate tissue was used to confirm the definitive diagnosis. This approach provided a definitive diagnosis for all the patients in this subgroup, including cases where the initial biopsy data were not recorded but were subsequently confirmed by post-surgical pathology with pathological Gleason score and staging. Notably, 3 of these 98 RP patients were classified as pT0. These patients were retained in the PCa cohort as their initial biopsy-confirmed diagnosis (cT2, Gleason score 6) represented the ground truth at screening. This “vanishing cancer phenomenon” (pT0) has been reported for small, low-grade tumors potentially removed entirely by biopsy20. For the remaining 14 PCa patients who did not undergo RP (e.g., received radiation therapy or other treatments), the diagnosis confirmed by the initial prostate biopsy results was used instead. For the 51 patients with benign conditions, the diagnosis was also confirmed by initial prostate biopsy results.

This stratified diagnostic standard was established to evaluate the models against the most reliable clinical endpoint available for each patient subgroup, using post-prostatectomy histopathology as the gold standard where possible to minimize the known bias of biopsy-based diagnoses21.

Urine specimen collection

The protocols for the human urine assay were approved by the Ethics Committee of the Asan Medical Center, and the human urine assays were performed in accordance with the approved guidelines. All urine samples were collected without DRE. Urine was immediately processed and extraction of urinary RNA was conducted using the QIAamp® Circulating Nucleic Acid Kit (QIAGEN). Urinary RNA was synthesized into cDNA for mRNA and miRNA detection, using the QuantiTect® Reverse Transcription Kit and the miRCURY LNA™ Universal RT Kit (QIAGEN), respectively. The detailed experimental procedure is provided in Supplementary materials and methods.

Real-time quantitative PCR (qPCR)

qPCR was performed on synthesized cDNA from urinary RNA using the Stratagene Mx3000P (Agilent Technologies). ORA™ qPCR Green ROX L Mix (highQu) was used for amplification. miRCURY LNA™ miRNA PCR Assays (QIAGEN) were used as primers to quantify miRNA levels. The primer sequences are listed in Supplementary Table 1. Thermal cycling for lncRNA and mRNA detection started with a 15 min denaturation step at 95 °C, followed by 50 cycles of 15 s at 95 °C for denaturation and 60 s at 60 °C for annealing/extension. Thermal cycling for miRNA detection followed the manufacturer’s protocol.

Statistical analysis

Statistical analysis was conducted in R version 4.0.1 (http://www.R-project.org). Raw qPCR traces were processed using Stratagene Mx3000P software to determine the cycle threshold (Ct). The urinary RNA marker expression levels were normalized against the mean of their β-actin and 18S rRNA values as follows: $-\Delta Ct = \frac{Ct_{\beta\text{-actin}} + Ct_{\text{18S rRNA}}}{2} - Ct_{\text{target gene}}$. Normalized expression levels (−ΔCt) were compared between BPH and PCa using the Wilcoxon rank sum test, and p-values were adjusted with the Benjamini–Hochberg procedure. The diagnostic potential of each RNA marker was assessed using normalized expression (−ΔCt) in a receiver operating characteristic (ROC) curve analysis with the pROC package22. A bootstrap with 2000 replicates was applied to compute the measurement uncertainties in the ROC analysis22,23, and the ML decision margins were analyzed by the Kolmogorov–Smirnov test24.
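The normalization above can be illustrated with a minimal Python sketch. The function name and the Ct values are hypothetical and are not taken from the study; the calculation only restates the −ΔCt formula given in the text.

```python
from statistics import mean

def neg_delta_ct(ct_target, ct_beta_actin, ct_18s_rrna):
    """Return -dCt: mean Ct of the two reference genes minus the target Ct."""
    return mean([ct_beta_actin, ct_18s_rrna]) - ct_target

# Hypothetical Ct values for one urine sample (illustrative only).
sample = {"beta_actin": 20.0, "18S_rRNA": 12.0, "PCA3": 30.0}
value = neg_delta_ct(sample["PCA3"], sample["beta_actin"], sample["18S_rRNA"])
print(value)  # -14.0
```

Because a lower Ct corresponds to more template, a larger −ΔCt indicates higher expression of the target relative to the reference genes.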

Supervised ML classifiers

A customized ML program based on Python 3.8.5 and the open-source libraries Scikit Learn25 and Pandas26 was used to construct PCa/BPH classifiers. The supervised ML classifiers were as follows: C-support Vector Classification (SVC)27,28, Random Forest Classifier (RF)29, Gaussian Naïve Bayes Classifier (NB)30, Gradient Boosting Classifier (GB)31,32, and Multilayer Perceptron Classifier (MLP)33,34. Logistic Regression (LR)35 was used as a baseline. For the learning process, the class weight was set to ‘balanced’. Single biomarker-based ML analysis used default parameters, whereas for multi-biomarker cases, parameters were tuned via the GridSearchCV algorithm25. Normalized expression levels (−ΔCt) were min-max normalized on a per-marker basis, and patients were randomly divided into a training set (75%) and a test set (25%). Classifiers were trained on the training dataset and evaluated on the test dataset. It is well known that a single random split can yield unstable or overly optimistic performance estimates, particularly in small or unbalanced datasets36,37,38. To mitigate this source of bias and ensure more reliable generalization assessment, we incorporated additional cross-validation analyses. Two complementary strategies, specifically Stratified K-Fold and Leave-One-Out Cross-Validation (LOOCV), were employed to provide consistent and unbiased out-of-sample performance estimates39. Stratified K-Fold cross-validation maintains class balance across folds and offers robust evaluation for unbalanced datasets, whereas LOOCV iteratively evaluates each individual sample, maximizing data utilization and minimizing dependence on any single partition.
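As a dependency-free illustration of two steps described above, the following sketch shows per-marker min-max normalization and a stratified assignment of samples to K folds. It is not the authors' code: the study used scikit-learn (which provides MinMaxScaler and StratifiedKFold for these purposes), and the helper names, the seed, and the label ordering here are assumptions made for illustration. Only the class sizes (112 PCa, 51 BPH) and K = 51 come from the text.

```python
import random

def min_max_scale(values):
    """Rescale one marker's values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds class by class, keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin across folds
    return folds

# Class sizes from the cohort (1 = PCa, 0 = BPH); the ordering is arbitrary.
labels = [1] * 112 + [0] * 51
folds = stratified_folds(labels, k=51)
print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

With K = 51 and 51 BPH samples, each fold in this sketch holds exactly one BPH sample and two or three PCa samples, which is the sense in which stratification preserves the PCa-to-BPH ratio in every fold.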

Feature selection for multimarker models

Recursive feature elimination with cross-validation (RFECV) was used with GB as the estimator and LOOCV as the scoring method39,40. The RFECV procedure was performed with all 20 urinary RNA markers initially. RFECV then proceeded by removing the least-informative marker at each step according to the GB estimator, and recomputed the LOOCV score after each step with reduced number of markers. In LOOCV with n patients, the data are split into n folds, where each fold holds out one patient for testing and fits the model on the remaining n − 1, to reduce dependence on any single random split. Compared with a simple train/test split, LOOCV provides more robust and unbiased estimates by iteratively using all samples for both training and validation41,42. Iteration proceeded until only a single marker remained, and the subset that maximized the LOOCV score was selected (Fig. 2a). This procedure yielded a fixed 9-RNA biomarker subset (PCA3, TMPRSS2:ERG, hsa-miR-125b-5p, hsa-miR-141-5p, hsa-miR-17-3p, hsa-miR-24-3p, hsa-miR-30b-5p, hsa-miR-30c-5p, hsa-miR-31-5p), which was then used consistently for all subsequent multivariate analyses.
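The elimination loop described above can be sketched in plain Python as follows. This is a simplified stand-in, not the procedure used in the study: a nearest-centroid classifier replaces gradient boosting, the per-feature gap between class centroids replaces the GB feature importance, and the data, marker names, and function names are hypothetical. The overall structure (score the current subset by LOOCV, drop the least informative feature, repeat until none remain, keep the best-scoring subset) follows the text.

```python
def centroids(X, y):
    """Per-class mean vector of the feature rows in X."""
    out = {}
    for cls in set(y):
        rows = [x for x, lab in zip(X, y) if lab == cls]
        out[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return out

def predict(cents, x):
    """Label of the nearest class centroid (squared Euclidean distance)."""
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(cents[c], x)))

def loocv_accuracy(X, y):
    """Hold out each sample once, fit on the other n - 1, score the held-out one."""
    hits = 0
    for i in range(len(X)):
        cents = centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(cents, X[i]) == y[i]
    return hits / len(X)

def backward_elimination(X, y, names):
    """Drop the weakest feature each round; return the best-scoring subset.

    Assumes binary labels coded as 0 and 1.
    """
    keep = list(range(len(names)))
    best_score, best_subset = -1.0, []
    while keep:
        Xs = [[row[j] for j in keep] for row in X]
        score = loocv_accuracy(Xs, y)
        if score >= best_score:  # ties favour the smaller subset
            best_score, best_subset = score, [names[j] for j in keep]
        cents = centroids(Xs, y)
        # Stand-in importance: absolute gap between the two class centroids.
        gaps = [abs(cents[0][p] - cents[1][p]) for p in range(len(keep))]
        keep.pop(gaps.index(min(gaps)))
    return best_subset, best_score

# Toy data: marker_a separates the classes, marker_b is uninformative.
X = [[0.1, 0.5], [0.2, 0.4], [0.15, 0.6], [0.9, 0.5], [0.8, 0.6], [0.85, 0.4]]
y = [0, 0, 0, 1, 1, 1]
names = ["marker_a", "marker_b"]
subset, score = backward_elimination(X, y, names)
print(subset, score)  # ['marker_a'] 1.0
```

In this toy case the uninformative marker is eliminated first, and the single remaining marker is retained because it reaches the same LOOCV score with fewer features.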

Results

Diagnosis of PCa using urinary RNA markers

To develop a multiplexed qPCR-based test for PCa, we assessed 20 putative PCa urinary RNA biomarkers in a cohort of 112 patients with PCa and 51 patients with BPH. These 20 urinary RNAs were selected based on their biological roles in prostate cancer progression, prior evidence of diagnostic utility, and validation in our internal cohort. Canonical RNAs such as PCA3, MALAT1, and TMPRSS2:ERG were included9,43, along with 17 microRNAs that regulate androgen receptor signaling, PI3K/AKT/STAT3, EMT, and tumor suppressing pathways44,45. The detailed descriptions of target pathways, reported diagnostic performance, and internal cohort validation results for all 20 biomarkers are summarized in Supplementary Table 2. The two lncRNAs (MALAT1 and PCA3) and TMPRSS2:ERG fusion gene were up-regulated in PCa, but all miRNA biomarkers were down-regulated in PCa (Supplementary Fig. 1, Supplementary Table 3). ROC curves for individual urinary RNA markers for the diagnosis of PCa showed that AUCs for two lncRNAs and the TMPRSS2:ERG fusion gene were 0.78 and 0.74, respectively. AUCs for a subset of miRNAs were high with a maximum of 0.85 for hsa-miR-222-3p (Fig. 1a, Supplementary Table 4).
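For readers less familiar with the AUC values reported above, the sketch below computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen PCa sample has a higher score than a randomly chosen BPH sample, with ties counted as one half. The study itself used the pROC package in R; this function and its input values are illustrative assumptions only.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical normalized expression values for an up-regulated marker.
pca_scores = [2.0, 3.0, 1.0]
bph_scores = [0.5, 1.5]
result = auc(pca_scores, bph_scores)  # 5 of 6 pairs are ranked correctly
```

For a marker that is lower in PCa than in BPH, as reported here for the miRNAs, the sign of the score would be flipped before applying this function so that higher values still indicate PCa.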

Fig. 1

Characterization of candidate urinary RNA biomarkers of PCa. (a) AUC values for single urinary RNA biomarkers for the diagnosis of PCa. ROC curves were constructed from expression levels for single urinary RNA biomarkers (Supplementary Fig. 1). (b) ROC curves for single urinary RNA biomarkers with six different ML classifiers. Each ML model was trained on the training dataset and ROC curves were constructed on the test set. (c) AUC values for the ROC curves in (b). (d) Stratified-K-Fold cross-validation (K = 51) scores. This unbiased indicator maintains the ratio of PCa to BPH in each fold.

Single-biomarker analysis by ML

Prior to the multi-marker approach, we first constructed and tested the performance of ML classifiers using single urinary RNA biomarkers. The normalized expression levels for 20 urinary RNA markers from the final cohort (112 PCa and 51 BPH) were used to train and evaluate the ML classifiers. Six different ML algorithms (SVC, RF, LR, NB, GB, MLP), representative of diverse ML approaches including those often used in clinical models, were chosen for evaluation46,47. These ML classifiers were trained on the training dataset and the performance of each model was evaluated with the test dataset. Single biomarker-based classifiers showed variable performance depending on the choice of biomarker and ML algorithm (Fig. 1b,c, Supplementary Fig. 2a). An AUC reported from a single randomized train/test split can be over-optimistic, because it depends on the particular split instance36.

To correct this bias, we further performed cross-validation analysis using the Stratified-K-Fold and LOOCV37 (Fig. 1d). The maximum values of AUC and Stratified-K-Fold/LOOCV were 0.84 (hsa-miR-17-3p, model: SVC) and 0.864/0.859 (hsa-miR-125b-5p, model: SVC), respectively. We confirmed that ML classifier models can be successfully constructed using single biomarkers, and their performances compare favorably with previous models (Supplementary Table 5). However, this variability in performance also highlights the inherent limitations of a univariate approach, as a single marker is often insufficient to capture the full biological heterogeneity of a complex disease like prostate cancer6,11,12.

Multimarker analysis by ML

To overcome the reported limitations of univariate analyses, which capture only a single biological axis and are prone to variability across diverse pathological conditions, ML models with multiple markers were constructed and analyzed6,11,12,13,14,15. Multiple biomarker combinations have the potential to improve classification accuracy; however, the inclusion of insignificant terms in the model could negatively impact performance48. In particular, when applying machine learning to high-dimensional datasets, the outcome can be strongly dependent on both the feature combinations and the random data partitioning38,49. To address this issue, TRIPOD/PROBAST recommended resampling-based selection and validation rather than single-split workflows50,51. To identify the optimal combination of features among the 20 urinary RNA markers, we chose the RFECV algorithm to select informative biomarkers as an exploratory model with reduced bias while utilizing the complete experimental dataset40. RFECV was applied to the min-max normalized qPCR data, using GB as the estimator and LOOCV as the scoring method. For each round of selection step, RFECV removed a single least-informative marker by computing the LOOCV score with the GB estimator40. RFECV was initiated with all 20 biomarkers and proceeded until only one biomarker remained, with maximum LOOCV scores reported for the given number of features (Fig. 2a). This selection procedure yielded an optimized 9-RNA biomarker panel consisting of PCA3, TMPRSS2:ERG, hsa-miR-125b-5p, hsa-miR-141-5p, hsa-miR-17-3p, hsa-miR-24-3p, hsa-miR-30b-5p, hsa-miR-30c-5p, and hsa-miR-31-5p. This 9-RNA subset was then used consistently for all subsequent multivariate analyses. As a control, a 3-feature set composed of conventional PCa RNA markers (PCA3, MALAT1, and TMPRSS2:ERG fusion gene) was constructed and analyzed7,52,53,54.

Fig. 2

Multiple urinary RNA biomarker-based ML analysis for the diagnosis of PCa. (a) The optimal biomarker combination was selected through the RFECV algorithm with GB and LOOCV. The highest LOOCV score was observed when the number of features was 9. (b, c) ROC curves for multiple urinary RNA biomarker-based ML classifiers for the diagnosis of PCa. The number of features used for training is 3 in (b) and 9 in (c), respectively. The 3-feature combination consisting of MALAT1, PCA3, and TMPRSS2:ERG was introduced as a baseline for conventional biomarker combinations. The 9-feature combination obtained from the RFECV analysis in (a) consists of PCA3, TMPRSS2:ERG, hsa-miR-125b-5p, hsa-miR-141-5p, hsa-miR-17-3p, hsa-miR-24-3p, hsa-miR-30b-5p, hsa-miR-30c-5p, and hsa-miR-31-5p. (d) AUC values for the ROC curves in (b) and (c). (e) Stratified-K-Fold cross-validation (K = 51) scores for the trained ML classifiers in (b) and (c). (f, g) Feature importance was calculated using RF or GB for 3 features (f) and 9 features (g).

In our cohort, the multiplexed models achieved significantly higher AUC values compared with single biomarker-based models (Table 2, Fig. 2b–e, Supplementary Fig. 2b). Among all the classifiers tested, GB model with nine urinary RNA biomarkers demonstrated the best overall performance. While high AUC values may appear to indicate near-perfect discrimination, these values can be over-optimistic as a result of data overfitting or random data partitioning bias, particularly in small medical datasets. To ensure reliable evaluation of model performance and reduce this potential bias, we performed additional cross-validation analyses using Stratified K-Fold and LOOCV39,41,55. Both cross-validation methods provided consistent and robust internal validation scores (AUC 0.99, Stratified K-Fold 0.912, LOOCV 0.890), confirming that the GB model’s performance is expected to be stable. The bootstrap method was applied to compare ROC curves, indicating that the GB model with 9 markers could improve performance over all single biomarkers for non-ML ROC curves (hsa-miR-222-3p with AUC of 0.85; p < 0.00001; Supplementary Table 6) and ML models (GB model using hsa-miR-30c-5p with AUC of 0.81; p < 0.01; Supplementary Table 7). Feature importance analysis of the 9-biomarker combination using RF or GB indicated that miRNA markers contributed significantly with summed feature importance of 76.3% for RF and 80.1% for GB (Fig. 2f,g).

Table 2 ML training and test results summary.

The ML decisions for different algorithms and biomarker choices were further analyzed based on the predictor values (0 ≤ BPH < 0.5, 0.5 ≤ PCa ≤ 1) (Fig. 3a,b). Notably, LR was not included due to its limited performance and SVC was not included since it does not provide predictor values. The limited number of error cases typically appeared near the threshold value of 0.5, reflecting the low prediction certainty for these error cases. The decision margin significantly increased for 9-marker combinations compared to 3-marker combinations, except for RF (Fig. 3c). Together, GB and MLP with the 9-marker combination showed the highest decision margins and reached unanimity in 37 out of 41 cases, with only one common error case, indicating that performance could potentially be improved if multiple trained ML models are combined for decision-making.

Fig. 3

Diagnostic performances of multiple urinary RNA biomarker-based ML analysis. (a, b) Predictor values calculated as decision probability for the test set using four different ML algorithms (RF, NB, GB, MLP). Predictor values using the 3-feature combination (a) and the 9-feature combination (b). The predictor value close to 1 indicates that the patient is likely a PCa patient, while the predictor value close to 0 indicates that the patient is likely a BPH patient. Open circles: predictor values for BPH patients; filled circles: predictor values for PCa patients. (c) Decision margin based on predictor values for individual patients. The decision margin has a value equal to the distance from the threshold (0.5), and the sign is positive when the diagnosis is correct and negative when the diagnosis is incorrect. In the box-and-whiskers plot, the ends of the box mark the first quartile (25%) and third quartile (75%). The two whiskers extend from the first quartile to the smallest value and from the third quartile to the largest value. The median is shown with a dashed line. The Kolmogorov–Smirnov test was used to compare the distribution of decision margin values for the 3- and 9-feature combinations (ns > 0.05, ** p = 0.001–0.01, *** p = 0.001–0.0001).
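The decision margin defined in the caption above can be written as a short Python sketch. The function name and the example predictor values are hypothetical; the threshold of 0.5 and the sign convention (positive for a correct call, negative for an incorrect one) follow the text, with predictor values of 0.5 or higher read as PCa.

```python
def decision_margin(predictor, is_pca, threshold=0.5):
    """Distance from the threshold; positive if the call is correct, else negative."""
    predicted_pca = predictor >= threshold  # 0.5 <= PCa <= 1, 0 <= BPH < 0.5
    distance = abs(predictor - threshold)
    return distance if predicted_pca == is_pca else -distance

# Hypothetical predictor values paired with the true diagnosis.
examples = [(0.9, True), (0.2, True), (0.1, False), (0.7, False)]
margins = [decision_margin(p, truth) for p, truth in examples]
```

Here the first and third cases are confident correct calls (margins near +0.4), while the second and fourth are errors with negative margins (near −0.3 and −0.2). Errors close to the threshold therefore yield small negative values, matching the observation that misclassified cases tend to have low prediction certainty.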

Classification of PSA gray zone patients using ML classifiers

Although PSA is a key diagnostic tool, management of patients with borderline PSA is not straightforward56. Thus, we assessed the robustness of the ML classifiers by analyzing the subset of the cohort in the PSA gray zone (Fig. 4a–d, Supplementary Fig. 2c). The trained ML classifiers had higher AUC values than serum PSA (AUC: 0.59; Supplementary Fig. 3) and generally high cross-validation scores. The GB algorithm with 9 biomarkers showed the best performance, with an AUC of 1.00 and robust cross-validation scores of 0.885 for Stratified-K-Fold and 0.865 for LOOCV. Together, these results indicate that ML classifiers with multiple urinary RNA markers could be a useful strategy with further development.

Fig. 4

Diagnostic performances of multiple urinary RNA biomarker-based ML analysis on PSA gray zone patients. Among the patient cohort, 96 patients (24 BPH, 72 PCa) with serum PSA levels of 3–10 ng/mL were selected for analysis. (a, b) ROC curves for six different ML algorithms using the 3-feature combination (a) and the 9-feature combination (b). Each ML model was trained on the training dataset and ROC curves were constructed on the test set. (c) AUC values for the ROC curves in (a) and (b). (d) Stratified-K-Fold cross-validation (K = 24) scores calculated for the trained ML classifiers in (a) and (b).

Discussion

In this study, we developed a diagnostic model for PCa, termed PCASSO, combining sophisticated ML techniques and simple qPCR assays on urinary RNA biomarkers from DRE-free whole urine samples. To explore the potential of multimarker ML models, well-established urinary markers such as PCA3 and TMPRSS2:ERG as well as several miRNAs were assayed as inputs to ML algorithms. Although there are no commercially available miRNA signatures, many have been described in the literature with attractive characteristics, including high stability in blood and urine due to their association with proteins (such as Argonaute)57,58,59. To capture the highly complex relationship between inputs while minimizing potential over-optimization, the best feature combination was selected via the RFECV algorithm, resulting in a 9-biomarker combination that features both the well-known biomarkers and multiple miRNA species (Fig. 2a, Supplementary Table 8). The 9-biomarker ML model significantly improved predictive performance compared with any single RNA marker and with the combination of 3 conventional biomarkers7,52,53,54 (Table 2, Fig. 2b–e, Supplementary Fig. 2b). Among the different ML algorithms, GB showed the best performance in terms of AUC and cross-validation scores (AUC: 0.99, Stratified-K-Fold score: 0.912, LOOCV score: 0.890). In addition, feature importance analysis confirmed that miRNAs contribute to improved classifier accuracy and compare favorably with other well-established markers (Fig. 2f,g). The performance of ML classifiers was largely maintained when applied to the subset of the cohort with intermediate PSA levels (Fig. 4a–d, Supplementary Fig. 2c).

Our ML-based diagnostic classifier results compare favorably with the serum PSA test and with other early detection tests. First, the ML classifiers developed here need only a simple qPCR assay on DRE-free urinary RNA biomarkers and could therefore minimize patient discomfort, potentially without the high false positive rate known for the PSA test1,60. This feature may allow clinical application of ML models with further clinical validation and development in the future. As an example, we provide a potential workflow where all nine RNA markers from DRE-free urine are assayed concurrently, and the resulting ΔCt values are used as inputs to a pre-trained classifier to calculate the disease score for PCa (Supplementary Fig. 4). Second, our ML classifier showed diagnostic ability and robustness that compare favorably with other multivariable PCa diagnostic models. Logistic regression-based multivariable models typically reported relatively low specificity and/or sensitivity43,54,61 (Supplementary Table 5), whereas a manually constructed non-ML model required a time-consuming optimization process yet showed limited robustness44. Commercially available biomarker tests could be more effective than the PSA test alone, including 4Kscore (AUC: 0.82) and prostate health index (AUC: 0.70), which use blood samples, and PCA3 (AUC: 0.76) and SelectMDx (AUC: 0.86), which use post-DRE urine samples62. The PCASSO model with simple and cost-effective assays demonstrated robust classification accuracy for PCa upon biopsy, without resorting to prior clinical data, blood draw, or DRE (Supplementary Tables 9 and 10).

Further, it is well documented that prostate-derived RNAs are preferentially shed into urine63,64 such that our DRE-free urinary biomarker assay targets the medium with direct linkage and rich information on PCa. Meanwhile, prior studies showed that several urinary RNA biomarkers—most notably miR-375—are detectable in blood and are associated with metastatic disease and survival in advanced PCa/CRPC, suggesting potential prognostic utility65,66. Although challenges remain to elucidate the nature of differences in biomarker concentrations between urine and blood samples due to distinct compartments and different release/clearance kinetics, future work could explore paired urine-plasma samples for RNA biomarkers to evaluate the potential benefit of including blood marker levels alongside a urine-based diagnostic panel.

While we demonstrated that our ML classifiers showed robust performance in PCa diagnosis, further developments could be achieved by addressing certain limitations. First, our study is limited as an exploratory study using data from a single-country cohort of voluntarily participating Korean men (51 BPH, 112 PCa). Following TRIPOD-aligned guidance and methodological reviews, we chose cross-validation rather than a simple split-sample approach because it uses the dataset efficiently while providing realistic error estimates50,67. Two widely used and complementary cross-validation procedures, stratified K-fold and LOOCV, were employed to approximate out-of-sample behavior and minimize overfitting, in line with recommendations for medical AI model development41,55. Although the GB model maintained high discrimination power in both the overall cohort and the PSA gray zone patients, the cross-validation results indicated a slight decrease in the model performance metrics within the PSA gray zone. This difference suggests that diagnostic performance for patients with borderline PSA levels could be affected by patient sample distributions and limited sample size. Future extension of the current exploratory work in an independent, multi-center cohort with controlled age and PSA levels would provide additional evaluation metrics for potential clinical utility68. Second, methods to elucidate the decision-making process of ML algorithms are often lacking. The ML classifiers provided the predictor values as the probability of the algorithms’ predictions, helping to evaluate the performance of models and mostly reaching unanimity across multiple classifiers. Dissecting the reasons for misclassifications may provide crucial insight for model optimization. Third, no clinical parameters were considered in the ML classifiers. While our ML classifiers based solely on DRE-free urinary RNA markers showed robust performance, combining the models with novel imaging tests and other clinical parameters may further improve performance and robustness69. Non-invasive PCa screening via ML classifiers using DRE-free urine causes minimal side effects for patients and may open doors to new ways of screening and monitoring the disease70,71.