Abstract
Prostate cancer (PCa) diagnosis remains challenging due to overlapping clinical features with benign prostatic hyperplasia (BPH) and limitations of existing diagnostic tools like PSA tests, which yield high false-positive rates. This study investigates the potential of microRNA (miRNA) biomarkers, analyzed via reverse transcription polymerase chain reaction and machine learning (ML), to enhance diagnostic accuracy. miRNAs such as miR-21-5p, miR-141-3p, and miR-221-3p were identified as significant discriminators between PCa and BPH through a prospective cohort study. Whole blood miRNA profiling offered a robust systemic representation of disease states. A random forest ML model was trained on expression data, achieving notable performance metrics: an accuracy of 77.42%, AUC of 0.78 during verification, and 74.07% accuracy and 0.75 AUC in validation. The model’s use of miRNA expression ratios, such as miR-141-3p/miR-221-3p, demonstrated superior sensitivity and specificity over traditional PSA testing. Bioinformatics analysis confirmed the association of selected miRNAs with cancer pathways, including PD-L1/PD-1 checkpoint and androgen receptor signaling, validating the biological relevance of the findings. This novel integration of miRNA profiling and machine learning holds great potential for the clinical translation of miRNA-based non-invasive diagnostics, enhancing diagnostic precision. However, broader population studies and standardization of protocols are needed to ensure scalability and clinical applicability. This research provides a foundational framework for advancing miRNA-based diagnostics, bridging discovery and clinical implementation.
Introduction
MicroRNA (miRNA) biomarkers have emerged as critical players in disease diagnostics, particularly in cancer research. Known for their stability and differential expression in disease versus healthy subjects, miRNA biomarkers are promising candidates for early cancer diagnosis and monitoring1,2. The advent of high-throughput technologies like next-generation sequencing (NGS) has revolutionized miRNA biomarker discovery, providing comprehensive datasets that reveal miRNA expression profiles across various disease conditions3. Despite these advancements, the clinical application of miRNA biomarkers remains limited4,5, primarily due to the high cost and complexity of NGS, which makes it unsuitable for routine diagnostics in resource-limited settings6. Consequently, more affordable and accessible technologies such as reverse transcription polymerase chain reaction (RT-PCR) have become the primary tools for validating discovered biomarkers7,8. While PCR-based methods are sensitive for targeted miRNA quantification, they still exhibit relatively lower sensitivity compared to high-throughput approaches like NGS9. Moreover, the diagnostic application of miRNA biomarkers using RT-PCR-based methods faces significant challenges10 in terms of experimental design, internal controls, sample preparation, and data analysis11. A major concern is the variability introduced by differences in devices, methodologies, and sample handling. Such inconsistencies can result in data variations that do not accurately reflect the biological state of the patient but instead stem from technical artifacts. This lack of uniformity makes it difficult to establish a precise Ct-value cutoff for differentiation, which undermines the reliability of miRNA-based diagnostics in clinical settings. Variability and low sensitivity have hindered the translation of promising research findings into clinical trials and diagnostic applications.
Furthermore, the necessity for large-scale and prospective clinical studies for validation remains indispensable, posing significant resource and ethical challenges. Addressing these challenges is essential for improving diagnostic accuracy and enhancing the clinical utility of miRNA biomarkers.
Machine learning (ML) is an artificial intelligence (AI) subset that enables systems to learn from data without explicit programming. ML models use algorithms to find patterns in data, enabling them to make predictions or decisions. These models can be supervised, unsupervised, semi-supervised, or reinforcement learning-based. As they process more data, their performance improves, making ML valuable across various fields, including healthcare12.
Linear methods, such as linear regression, assume a direct relationship between inputs and outputs. These models can struggle when data shows high variance or complex relationships, leading to underfitting or overfitting. ML models, especially those like decision trees, random forests, and neural networks, are more flexible and equipped to handle variance13. They capture non-linear relationships and interactions between features more effectively than linear models. For example, ensemble techniques such as random forests use multiple decision trees to reduce variance by averaging predictions, making ML models robust in noisy and complex datasets where traditional linear models would fail14.
ML models are revolutionizing biological research by analyzing complex datasets like genomics, proteomics, and medical imaging. In biology, datasets often exhibit high dimensionality, non-linearity, and noise, where ML excels15. For instance, ML models have been used extensively in biomarker discovery, such as in cancer diagnostics, by analyzing RNA sequencing or microarray data16. Supervised learning models, which excel in classifying labeled data, nevertheless remain underutilized in the context of RT-PCR data. To the best of our knowledge, very few studies have used RT-PCR data with supervised learning to enhance diagnostic accuracy17,18. When trained on a large set of patient-specific RT-PCR data, ML models can capture patterns across multiple miRNA expressions that may go undetected by traditional analysis methods, and thereby improve sensitivity (the ability to accurately identify true positives) and specificity (the ability to accurately identify true negatives).
The aim of our study was to assess the expression levels of a miRNA panel in the blood of prostate cancer (PCa) patients compared to those with benign prostatic hyperplasia (BPH) and apply supervised ML algorithms to enhance the sensitivity and specificity of these biomarkers. By leveraging the strengths of ML alongside the cost-effectiveness and accessibility of RT-PCR, we aimed to not only improve biomarker panel performance but also bridge the gap between discovery research and clinical application.
The standard screening method for PCa is the serum prostate-specific antigen (PSA) test. However, while PSA levels often rise in prostate-related conditions, PSA is not specific to PCa19, leading to high false-positive rates, overdiagnosis, and unnecessary biopsies. The gold standard for distinguishing PCa from BPH remains a transrectal prostate biopsy, which is invasive and uncomfortable, causing many patients to avoid it20. Since PCa and BPH differ significantly in their life-threatening risk, treatment, and intervention strategies21, early differentiation is crucial for informed decision-making. Therefore, there is an urgent need for more accurate, non-invasive diagnostic tools to distinguish between these conditions.
The discovery of miRNAs in the plasma of PCa patients sparked significant interest in their potential as liquid biopsy biomarkers for PCa22. Numerous studies have identified and validated various miRNAs as potential biomarkers; however, the results have not been consistent23,24.
Based on previous studies, we selected a group of promising miRNAs for investigation in our study: miR-19b-3p, miR-21-5p, miR-93-5p, miR-141-3p, miR-221-3p, and miR-375-3p25,26,27,28,29. Although urine, plasma, and serum are commonly used samples in both the studies mentioned above and other related research, whole blood is also a good source of biomarkers with distinct advantages, including a higher miRNA yield, reduced susceptibility to technical errors, and a more comprehensive representation of the systemic profile30. In this study, we focused on miRNA profiling from whole blood using the RT-PCR method, conducted in three distinct phases with separate cohorts. In the first phase, the most significant miRNAs were identified from the discovery cohort. The verification phase involved generating RT-PCR data from the verification cohort to train the ML model. During the final validation phase, the ML model was evaluated using a dataset that had not been previously exposed to the model (Figs. 1 and 2). The results showed that combining ML with miRNA biomarkers significantly improved diagnostic efficiency compared to using individual miRNAs alone.
Workflow of the study design: (A) discovery cohort; (B) verification cohort; (C) validation cohort. Figure created using Biorender (https://biorender.com/).
Materials and methods
Study population
We conducted a prospective cohort study of 86 participants from November 2022 to August 2024 in collaboration with the Centre for Genetic Disorders and the Department of Urology at the Institute of Medical Sciences, Banaras Hindu University. This study was conducted in accordance with relevant guidelines and regulations (ICMR National Ethical Guidelines for Biomedical and Health Research). The experimental protocols were reviewed and approved by the Ethics Committee, Institute of Science, Banaras Hindu University (Registration No.: ECR/226/Indt/UP/2014/RR22) under Approval No: I.Sc./ECM-XIV/2022-23. Written informed consent was obtained from all participants and/or their legal guardians before sample collection and data analysis. The study examined treatment-naïve patients who were clinically suspected of having PCa and scheduled for a prostate biopsy based on digital rectal examination (DRE) findings and elevated PSA levels. Exclusion criteria were: (a) lack of complete medical records; (b) non-urothelial malignancies or other urological; (c) refusal to consent; and (d) prior androgen deprivation therapy (ADT) or radiation therapy. In the second phase, the collected samples were de-identified and then classified into two groups based on histopathological examination of 12-core prostate biopsies, resulting in PCa (n = 51) and BPH (n = 35) groups (Table 1). The BPH cohort includes five samples from patients who presented with similar LUTS (lower urinary tract symptoms) but were classified as PIN (Prostatic Intraepithelial Neoplasia) or histopathologically negative upon evaluation.
Blood collection, storage and RNA isolation
Peripheral venous blood was collected from each patient into EDTA-coated collection tubes prior to the prostate biopsy. Total RNA was extracted from 400 µL of blood using 750 µL of Trizol reagent (Cat No. 15596026). After the addition of 200 µL of chloroform, the sample separated into three distinct phases. The upper aqueous phase was carefully separated and combined with 500 µL of isopropanol to precipitate the RNA. The pellet was resuspended in 20 µL of DEPC-treated water. The concentration and quality of the extracted RNA were assessed using NanoDrop (Thermo Scientific, USA).
Reverse transcription and RT-PCR
The RevertAid First Strand cDNA Synthesis Kit (Catalog #K1622, Thermo Scientific, USA) was used for the reverse transcription of miRNA samples. miRNA-targeted stem-loop primers were used for reverse transcription (Supplementary Table 1). For quantitative PCR (RT-PCR), reactions were carried out in triplicate in the Applied Biosystems QuantStudio 6 Flex Real-Time PCR System (Thermo Scientific, USA) using Maxima SYBR Green/ROX RT-PCR Master mix (2X, Catalog #K0221). Delta Ct values were then calculated for each sample by subtracting the Ct value of the endogenous control, RNU6, from the Ct value of each target miRNA.
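The Delta Ct normalization described above can be sketched as follows. This is a minimal illustration, assuming the triplicate Ct values have already been averaged per sample; the Ct numbers, sample names, and the helper `delta_ct` are hypothetical and are not taken from the study.

```python
# Hypothetical mean Ct values per sample (averaged triplicates); not study data.
ct_values = {
    "sample_1": {"miR-21-5p": 24.0, "miR-141-3p": 30.5, "miR-221-3p": 27.5, "RNU6": 22.0},
    "sample_2": {"miR-21-5p": 26.5, "miR-141-3p": 29.0, "miR-221-3p": 28.5, "RNU6": 23.0},
}


def delta_ct(sample, reference="RNU6"):
    """Return Delta Ct = Ct(target) - Ct(reference) for every target in one sample."""
    ref_ct = sample[reference]
    return {name: ct - ref_ct for name, ct in sample.items() if name != reference}


# One Delta Ct dictionary per sample; the reference gene itself is dropped.
delta_ct_table = {sid: delta_ct(vals) for sid, vals in ct_values.items()}
```

A lower Delta Ct corresponds to higher expression of the target relative to RNU6, because fewer amplification cycles are needed to reach the detection threshold.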
Study design
Due to the limited availability of studies investigating miRNA expression in whole blood for PCa, we selected candidate miRNAs (miR-19b-3p, miR-21-5p, miR-93-5p, miR-141-3p, miR-221-3p, and miR-375-3p) based on existing literature on serum and plasma. These miRNAs have been consistently associated with PCa and key oncogenic pathways. Our study aims to validate their relevance in whole blood, which provides distinct advantages such as higher miRNA yield, reduced susceptibility to pre-analytical variability, and a more comprehensive representation of systemic disease states.
The study was designed in three cohorts: discovery, verification, and validation (Fig. 2). The discovery cohort comprised 20 samples, including BPH (n = 9) and PCa (n = 11) cases, and was used to analyze the expression patterns of the six selected miRNAs using RT-PCR. The Mann–Whitney U test was employed to identify the three most differentially expressed miRNAs among the selected group: miR-21-5p, miR-141-3p, and miR-221-3p (Supplementary Table 2). In the verification cohort (n = 59; inclusive of samples used in the discovery cohort), composed of BPH (n = 36) and PCa (n = 23) (Fig. 2), expression data for these three miRNAs were collected and used to build and train an ML model. Finally, the validation cohort (n = 27), consisting of BPH (n = 7), PIN (n = 5), and PCa (n = 15) samples, included cases that had not been previously exposed to the trained model. The aim of the validation cohort was to evaluate the performance of the trained ML model.
ML-model development
Preprocessing
In the preprocessing step, missing (NULL) values were imputed with the median value. All operations were performed on Delta Ct values.
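Median imputation of this kind can be sketched in a few lines. The example assumes the median is taken per feature (per column of Delta Ct values), which the text does not state explicitly; the values and the helper `impute_median` are hypothetical.

```python
from statistics import median


def impute_median(column):
    """Replace None entries with the median of the observed values in the column."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


# Hypothetical Delta Ct column with one missing measurement; not study data.
mir21_dct = [2.0, None, 3.5, 1.5, 4.0]
imputed = impute_median(mir21_dct)
```

When a model is later applied to a separate cohort, a common precaution is to reuse the median learned from the training data rather than recomputing it on the new samples.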
Statistical analysis
The Mann–Whitney U test, a nonparametric statistical method, was used for hypothesis testing to compare differences between two independent groups. Because it does not assume a normal distribution, it is a versatile analytical tool. We used the ROC curve to evaluate the predictive capability of individual features in distinguishing the BPH and PCa classes. The ROC (receiver operating characteristic) curve is a graphical representation of the performance of a binary model at different threshold settings: it plots the true positive rate (TPR) against the false positive rate (FPR) at each threshold level. The area under the ROC curve (AUC) quantifies the overall ability of the model to differentiate between the negative (BPH) and positive (PCa) classes.
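To make the ROC and AUC definitions above concrete, the following sketch builds the curve by sweeping a threshold over the scores and integrates it with the trapezoidal rule. The labels and scores are hypothetical, and the helper names `roc_points` and `auc` are illustrative only.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        # A sample is called positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = PCa, 0 = BPH) and classifier scores; not study data.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

An AUC near 0.5 means the score ranks a random PCa case above a random BPH case about half the time, which is the behavior of a random classifier.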
Decision curve analysis (DCA) was utilized to assess the clinical utility of the predictive models by quantifying their net benefit across a range of threshold probabilities. DCA compares the proposed model with alternative strategies, such as using PSA alone, treating all patients, or treating none.
The threshold probability represents the minimum predicted probability at which a patient would opt for further intervention. It allows for assessment of whether a model provides a higher benefit than treating all patients or none, and how it performs across decision thresholds relevant to clinical practice. The DCA was implemented using predicted probabilities from the trained models, and results were presented as net benefit curves across threshold probabilities ranging from 0.01 to 0.99.
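The net benefit underlying a decision curve can be sketched as below, using the commonly cited definition: true positives per patient minus false positives per patient weighted by the odds of the threshold probability. The labels, probabilities, and function names are hypothetical, and this is an illustration of the calculation rather than the code used in the study.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of intervening on patients whose predicted probability >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of the 'treat all' strategy; 'treat none' is 0 by definition."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = PCa) and predicted probabilities; not study data.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

thresholds = [t / 100 for t in range(1, 100)]  # 0.01 to 0.99
model_curve = [net_benefit(labels, probs, t) for t in thresholds]
treat_all_curve = [net_benefit_treat_all(labels, t) for t in thresholds]
```

Plotting `model_curve` and `treat_all_curve` against `thresholds`, together with a horizontal line at zero for "treat none", gives the usual decision curve layout.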
Feature selection
Each biomarker in the discovery cohort was assessed using the Mann–Whitney U test to identify the most significant ones based on their discriminative power between the two classes.
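A ranking of this kind can be sketched with SciPy's `mannwhitneyu`. The feature names (`miR-A`, `miR-B`, `miR-C`), the Delta Ct values, and the helper `rank_by_mann_whitney` are hypothetical; a two-sided alternative is assumed here because the text does not specify one.

```python
from scipy.stats import mannwhitneyu

# Hypothetical Delta Ct values per group; not study data.
delta_ct = {
    "miR-A": {"BPH": [5.1, 5.4, 5.0, 5.6, 5.3], "PCa": [3.0, 3.2, 2.8, 3.5, 3.1]},
    "miR-B": {"BPH": [4.0, 4.6, 3.9, 4.4, 4.2], "PCa": [4.1, 4.5, 3.8, 4.3, 4.7]},
    "miR-C": {"BPH": [6.0, 6.2, 5.2, 6.1, 5.9], "PCa": [5.0, 5.5, 4.9, 6.3, 5.1]},
}


def rank_by_mann_whitney(data, top_k=3):
    """Return the top_k (feature, p-value) pairs with the smallest p-values."""
    p_values = {
        name: mannwhitneyu(groups["BPH"], groups["PCa"], alternative="two-sided").pvalue
        for name, groups in data.items()
    }
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    return ranked[:top_k]


top = rank_by_mann_whitney(delta_ct, top_k=2)
```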
Machine learning
In this study, a random forest was employed as the classifier, combining an ensemble of decision trees to capture the complex patterns of RT-PCR data. No hyperparameter optimization was performed; to ensure the reproducibility of the proposed model, we used the default hyperparameter values from the scikit-learn library. The detailed modelling steps are presented in Supplementary Fig. s1. Each tree within the ensemble is constructed from bootstrapped samples and guided by a randomly selected subset of features, which helps to reduce the correlation between trees and enhances the overall diversity of the model. The decision trees vote independently on each classification, minimizing the influence of any single outlier and improving the robustness of the final classification. The randomness introduced in both sampling and feature selection mitigates overfitting, leading to improved stability and accuracy. Ultimately, the aggregate voting of these trees forms the final prediction in PCa classification. To assess the generalizability of the model and guard against overfitting, fivefold cross-validation was employed on the verification cohort. The dataset was randomly split into five equal subsets; the model was trained on four folds and tested on the remaining fold, iteratively. The final performance metrics were averaged across all five folds.
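The modelling and cross-validation procedure described above can be sketched with scikit-learn as follows. The data here are synthetic stand-ins with the same shape as a 59-sample, six-feature table; the fixed `random_state` values and the use of stratified folds are assumptions made for this illustration and are not stated in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)

# Synthetic stand-in for a 59-sample, 6-feature table; not study data.
n_samples, n_features = 59, 6
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # non-linear rule; 1 = PCa, 0 = BPH

# Default hyperparameters; only the seed is fixed (an assumption) for repeatability.
model = RandomForestClassifier(random_state=0)

# Five folds: train on four, test on the fifth, rotating through all folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

# Average each metric across the five folds.
mean_scores = {name: scores[f"test_{name}"].mean() for name in metrics}
```

Stratified folds keep the class proportions similar in every fold, which helps with small cohorts; plain `KFold` with shuffling would match a purely random split more literally.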
Bioinformatics analysis
To predict the target genes of the three miRNAs, we used the miRTarBase and miRBase databases31,32. Target gene lists from both sources were combined, and duplicates were removed. The resulting list is referred to as the miRNA Target Gene Panel (mTGP). This list was used for Gene Ontology-based functional analysis with the ShinyGO tool33. A PCa-specific gene panel was created using the GeneCards, KEGG, and STRING databases34. We employed the keyword “PCA” (Prostate Cancer) to retrieve gene entries from the databases to ensure the inclusion of genes associated specifically with PCa. All genes returned by the search were included in the PCa Gene Panel (PGP). To improve the reliability of the target gene results, we identified the genes shared by mTGP and PGP using the VennDiagram package of R software. These overlapping target genes were then used to study protein–protein interactions and to construct cancer context-specific networks using PINAv3.0.
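The merge, de-duplication, and overlap steps amount to basic set operations. The study performed the overlap with the VennDiagram package in R; the Python sketch below only illustrates the logic, and the gene identifiers are placeholders rather than real panel members.

```python
# Placeholder gene identifiers for illustration; not the study's actual panels.
source_one_targets = ["GENE_A", "GENE_B", "GENE_C", "GENE_A"]
source_two_targets = ["GENE_C", "GENE_D", "GENE_E"]
pca_gene_panel = {"GENE_A", "GENE_C", "GENE_D", "GENE_X"}

# Union of both target lists; converting to sets removes duplicates (mTGP).
mtgp = set(source_one_targets) | set(source_two_targets)

# Genes present in both the miRNA target panel and the PCa gene panel.
shared = sorted(mtgp & pca_gene_panel)
```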
Results
miRNA expression analysis in discovery cohort
In the discovery cohort, expression analysis of the six selected miRNAs was performed using RT-PCR in both the BPH and PCa groups. The Mann–Whitney U test showed that all six miRNAs had p values below 0.05, indicating that the selected miRNAs were significant. miR-21-5p, miR-221-3p, and miR-141-3p exhibited the most significant differences, with p values of 0.000224, 0.000153, and 0.004293, respectively, and were selected for further study in the verification cohort. Although the distribution pattern showed that the data were bimodal, we still selected these miRNAs for training because the random forest model is capable of handling non-linearity in the data. Additionally, miR-19b-3p, miR-93-5p, and miR-375-3p also displayed significance, with p values ranging from 0.004747 to 0.037008 (Fig. 3, Supplementary Table 2). These findings, illustrated by the boxplots of ΔCt values, highlight the potential of these miRNAs as diagnostic biomarkers for PCa.
ML model development on verification cohort
The verification cohort was used to generate expression data for the three miRNAs (miR-21-5p, miR-141-3p, miR-221-3p) using RT-PCR. Six features were created: the three individual miRNAs (miR-21-5p, miR-141-3p, miR-221-3p) and their ratio-based combinations (21_141, 141_221, and 221_21). The performance of each feature was assessed using ROC curve analysis (Supplementary Fig. s2). The resulting AUC scores ranged from 0.48 to 0.52, indicating that both the individual features and the ratios performed poorly when used as independent features and suggesting limited standalone utility in diagnostic predictions.
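The construction of the six-feature vector can be sketched as below. The text does not spell out how the ratios were computed, so this example assumes a plain quotient of Delta Ct values; the input values and the helper `ratio_features` are hypothetical.

```python
def ratio_features(dct, eps=1e-9):
    """Return the three Delta Ct values plus three pairwise ratio features."""
    # Feature name -> (numerator miRNA, denominator miRNA).
    pairs = {
        "21_141": ("miR-21-5p", "miR-141-3p"),
        "141_221": ("miR-141-3p", "miR-221-3p"),
        "221_21": ("miR-221-3p", "miR-21-5p"),
    }
    features = dict(dct)
    for name, (num, den) in pairs.items():
        denominator = dct[den]
        if abs(denominator) < eps:
            # A Delta Ct of zero would make the quotient undefined.
            raise ValueError(f"Delta Ct of {den} is too close to zero for a ratio")
        features[name] = dct[num] / denominator
    return features


# Hypothetical Delta Ct values for one sample; not study data.
sample = {"miR-21-5p": 2.0, "miR-141-3p": 8.0, "miR-221-3p": 4.0}
features = ratio_features(sample)
```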
In the verification cohort, a random forest model was applied using all features. The trained ML model showed an accuracy of 77.42%, precision of 86.21%, recall of 71.43%, and an F1 score of 78.12%, as illustrated by the confusion matrix (Fig. 4a). The ROC curve for the random forest model showed an AUC of 0.78 (Fig. 4b), compared with an AUC of 0.52 for PSA (Fig. 4b), which is similar to that of a random classifier.
(a) Confusion matrix for the model trained with differentially expressed miRNAs on the test set in the verification cohort; (b) ROC AUC graph of the proposed miRNA-based model and the PSA-based model; (c) Feature importance employed by the random forest model. 141_221 refers to the expression ratio of miR-141-3p/miR-221-3p, 21_141 refers to the expression ratio of miR-21-5p/miR-141-3p, and 221_21 refers to the expression ratio of miR-221-3p/miR-21-5p; (d) Decision curve analysis of the proposed model over PSA.
To validate the stability and generalizability of the proposed model, fivefold cross-validation was performed on the verification cohort. The cross-validated results demonstrated consistent performance across all folds, with an average accuracy of 81.67%, average precision of 83.28%, recall of 84%, F1 score of 83.54%, and AUC of 0.81. To rigorously evaluate the proposed method, baseline ML models were implemented using the same feature set for comparative analysis. The random forest model consistently demonstrated superior performance over these alternatives (Supplementary Table 3), further substantiating its suitability for the given classification task.
The ensemble model, being composed of multiple decision trees, assigns importance to features based on their ability to improve classification accuracy. Further analysis of the feature importance generated by the random forest classifier revealed the miRNAs and feature combinations that contributed most to the model’s predictive accuracy and are the key drivers behind its discriminative performance. As shown in the feature importance plot (Fig. 4c), the ratio of miR-141-3p to miR-221-3p was the most significant contributor, followed closely by the ratio of miR-21-5p to miR-141-3p. Further, decision curve analysis (Fig. 4d) shows that the proposed miRNA-based model provides greater clinical net benefit than the PSA-based model, especially in the threshold probability range between 0.2 and 0.7, which corresponds to realistic clinical decision making.
Validation of designed ML-model on validation cohort
To validate the performance of the trained model, we used an independent cohort that had not previously been exposed to the ML model. When expression data from this validation cohort were supplied to the trained model, it achieved an accuracy of 74.07%. It exhibited a precision of 71.43%, meaning that most of its positive predictions were true PCa cases, while maintaining a recall (sensitivity) of 76.92%, which reflects the model’s capacity to capture most of the true positive instances of PCa. Furthermore, the model achieved a specificity of 74.07% (Fig. 5a) and a ROC-AUC score of 0.75 (Fig. 5b).
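For reference, the metrics reported above are all derived from the four cells of a confusion matrix, as the following sketch shows. The counts passed in are hypothetical and do not reproduce the study's confusion matrix; the helper name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)       # share of positive calls that are correct
    recall = tp / (tp + fn)          # sensitivity: share of true cases detected
    specificity = tn / (tn + fp)     # share of negatives correctly ruled out
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts (positive class = PCa); not the study's results.
m = classification_metrics(tp=8, fp=2, tn=6, fn=4)
```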
Association of miRNA and their target genes with PCa pathogenesis
We predicted target genes for the selected miRNAs (miR-21-5p, miR-141-3p, and miR-221-3p), identifying 278 genes (mTGP panel). Using ShinyGO, we conducted Gene Ontology (GO) enrichment analysis of mTGP to explore key biological processes (BP), cellular components (CC), and molecular functions (MF), presenting the top 20 in each category. The BP analysis revealed that the genes are primarily involved in regulating metabolic processes related to nitrogen compounds, macromolecules, and nucleobase compounds (Fig. 6a). In the MF category, the genes were linked to chromatin binding, DNA binding, and kinase activity (Fig. 6c). The CC analysis showed strong associations with ribonucleoprotein complexes, including the RNA-induced silencing complex (RISC) and transcription regulator complexes (Fig. 6b). Additionally, KEGG pathway analysis highlighted enrichment in cancer-related pathways, such as miRNAs in cancer, TNF, JAK-STAT, MAPK, and Akt signaling, as well as growth hormone and prolactin synthesis—processes relevant to cancer pathogenesis (Fig. 6).
Gene ontology and KEGG enrichment analysis of target genes for miR-21-5p, miR-141-3p, miR-221-3p. (a) Biological processes: shows regulation of metabolic pathways crucial for cellular function and cancer progression. (b) Cellular component: shows associations with ribonucleoprotein complexes, including the RNA-induced silencing complex (RISC), and transcription regulator complexes. (c) Molecular function: reveals key roles in transcription factor binding and kinase activity. (d) KEGG Pathway analysis highlights enrichment in cancer-related signaling pathways.
To determine the association of the selected miRNAs with PCa progression, the identified target genes (mTGP) were screened against the PCa gene panel (PGP), revealing 32 shared genes. The chord diagram shows that all three miRNAs interacted with at least one gene from PGP (Fig. 7). The analysis highlighted miR-21-5p as targeting the most PCa-associated genes, followed by miR-141-3p and miR-221-3p, suggesting that these miRNAs are critical for PCa biology.
KEGG pathway analysis of the 15 genes (out of 32) that interacted with miR-21-5p showed involvement in general cancer-related pathways such as the PD-L1 and PD-1 checkpoint, prolactin, HIF-1, and NF-κB pathways, whereas combined analysis of the 17 genes interacting with miR-141-3p and miR-221-3p showed involvement of more PCa-specific pathways such as endocrine resistance (Supplementary Fig. s3).
Interestingly, the 32 identified genes are functionally interconnected within cancer-specific protein–protein interaction (PPI) networks, as constructed using PINAv3.0 (Supplementary Fig. s4). This analysis revealed seven hub genes—RASA1 (RAS P21 Protein Activator 1), RHOB (Ras Homolog Family Member B), CDKN1B (Cyclin Dependent Kinase Inhibitor 1B), EPHA2 (EPH Receptor A2), ARID1A (AT-Rich Interaction Domain 1A), OGT (O-GlcNAc transferase), and CBX4 (Chromobox 4)—all associated with tumorigenesis. Specifically, RASA1 and ARID1A were strongly linked to the androgen receptor (AR) pathway, which is frequently disrupted in PCa. Other notable interactions showed connections between key oncogenes, such as SMAD2 (SMAD Family Member 2), TNFAIP1(TNF alpha induced protein 1), and CBX8 (Chromobox homolog 8), with these hub genes (Supplementary Fig. s4). Expanding this analysis using STRING for the complete mTGP gene list identified additional critical hub genes, like STAT3, PTEN, CDKN1B (Cyclin Dependent Kinase Inhibitor 1B), and FOS, indicating roles of these proteins in systemic regulations (Supplementary Fig. s5). Overall, the miRNA targets exhibited extensive interactions with both PCa specific genes and generalised oncogenes, underscoring their significant role in the development and progression of the disease through target gene regulation. While these findings support a strong functional link between the selected miRNAs and key regulatory pathways, it is important to note that our analysis was conducted at the individual miRNA level and does not directly account for the miRNA ratios used in the machine learning model. Additionally, further functional validation is required to elucidate the role of these genes in PCa pathogenesis.
Discussion and conclusion
PCa diagnosis remains challenging, particularly in differentiating it from BPH, as both conditions share overlapping clinical features. The reliance on PSA tests has led to high false-positive rates, resulting in unnecessary biopsies and significant patient anxiety35,36. These limitations highlight the urgent need for reliable, non-invasive biomarkers capable of accurately distinguishing between PCa and BPH.
Integration of miRNA profiling and machine learning
miRNAs are promising candidates for non-invasive diagnostics due to their stability in circulation and ability to reflect tumor biology. Most studies on miRNA biomarkers for PCa have primarily examined serum and plasma37, with limited research available on whole blood. However, promising results have been reported for other cancers, such as breast, pancreatic, and lung cancers, using whole blood-based miRNA profiling. Our study took miRNAs identified in previous studies and evaluated their diagnostic potential in whole blood. Whole blood offers key advantages, including higher miRNA yield and robust systemic representation of disease states, making it a valuable biofluid for biomarker discovery. The complexity of whole blood, which contains miRNAs from multiple cellular sources, could introduce noise. However, the ensemble-based random forest method used in this study mitigates this challenge by handling non-linear relationships and reducing sensitivity to noise38. To improve standardization and reproducibility, future studies should systematically compare miRNA expression across different biofluids, ensuring consistency in diagnostic applications. Further, the fivefold cross-validation supports the generalizability of the proposed model, indicating that it can handle unseen data without overfitting. This study leverages a novel combination of miRNA profiling and ML to enhance diagnostic precision for PCa. While individual miRNAs, such as miR-21-5p, miR-141-3p, and miR-221-3p, have been implicated in PCa progression in previous studies39,40,41, the current work innovates by applying machine learning tools to miRNA expression profiles, showing that ML provides superior discriminatory power in distinguishing PCa from BPH by capturing synergistic effects that are overlooked in standalone miRNA analyses and linear models.
The random forest classifier was chosen in this study due to its ability to capture non-linear relationships and complex feature interactions, leading to an AUC-ROC score of 0.78. The results demonstrated that our machine learning model outperformed PSA, which suffers from high false-positive rates, achieving a higher AUC-ROC score and offering greater net benefit across a range of threshold probabilities in DCA. Unlike models reliant on fixed Ct-value thresholds, our ML approach dynamically adjusts to data variability, improving sensitivity and specificity. These findings suggest that miRNA-based diagnostics, when integrated with ML approaches, could provide a more accurate and clinically relevant tool for PCa detection and risk stratification, reducing unnecessary biopsies while maintaining high sensitivity.
Biological interpretation of miRNA findings
To address concerns regarding the “black box” nature of ML models, we incorporated feature importance rankings and bioinformatics analyses to validate the biological relevance of the key features identified by the model. The miRNA ratios miR-141-3p/miR-221-3p and miR-21-5p/miR-141-3p were confirmed as critical features for distinguishing PCa from BPH.
KEGG pathway enrichment analysis linked miR-21-5p to cancer-related pathways, including PD-L1/PD-1 checkpoint regulation, prolactin signaling, HIF-1, and NF-κB signaling, all of which play crucial roles in immune evasion, angiogenesis, and inflammation. In contrast, miR-141-3p and miR-221-3p were associated with androgen receptor (AR) signaling and endocrine resistance, which are critical pathways in hormone-sensitive and castration-resistant PCa. These findings suggest a potential regulatory role of these miRNAs in PCa progression, but further functional validation is required to confirm their direct involvement in tumor development and progression. Interestingly, target gene analysis revealed both oncogenes and tumor suppressors within the hub gene network, including: EPHA2, CBX8, STAT3, SMAD2 (context-dependent), TNFAIP1 as oncogenes and RASA1, RHOB, CDKN1B, ARID1A, OGT, CBX4, PTEN, FOS (context-dependent) as tumor suppressors. To better understand how these hub genes influence tumor progression and response to therapy, future research should focus on longitudinal expression studies and functional assays. The coexistence of both oncogenes and tumor suppressors in the hub gene network may initially appear counterintuitive. However, this reflects the complex regulatory interactions within cancer biology, where genes can have dual roles depending on cellular context, mutation status, and signaling interactions. These findings also emphasize the complex regulatory landscape of miRNA expression in PCa and suggest that miRNA profiling in exosomal fractions or immune cell subsets may provide deeper insights.
Limitations and future directions
While our model demonstrated the ability to generalize, an essential requirement for real-world clinical applications, several limitations must be acknowledged. The study’s findings are based on a limited cohort predominantly from a single population, necessitating further validation across diverse genetic, environmental, and clinical settings. Future work will focus on a large-scale, multi-center study to validate the model across diverse populations. Other models, such as XGBoost, support vector machines (SVM), and deep learning, could be explored for scalability and improved predictive power in larger datasets. However, deep learning methods typically require larger training datasets and extensive computational resources, which were beyond the scope of this study.
One of the major challenges in translating miRNA-based diagnostics into clinical practice is the lack of standardized protocols. Variations in sample processing methods, RT-PCR platforms, cut-off values, and reference genes can introduce inconsistencies that hinder cross-study validation. Standardization efforts, including unified Ct value normalization methods and consensus guidelines for miRNA biomarker validation, are crucial to improving reproducibility and clinical utility. Future work should focus on establishing standardized protocols for reproducibility across different platforms.
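The dependence on reference genes can be illustrated with the standard ΔCt calculation. The sketch below uses hypothetical Ct values and function names, and assumes near-100% amplification efficiency for every assay; it is not the normalization procedure used in this study. Under that assumption, a ratio of two target miRNAs measured in the same sample does not depend on the reference Ct, because the reference term cancels.

```python
def delta_ct(ct_target, ct_reference):
    """Delta Ct: target Ct minus reference Ct (a lower value means higher abundance)."""
    return ct_target - ct_reference

def relative_expression(ct_target, ct_reference):
    """Relative expression 2^(-delta Ct), assuming ~100% amplification efficiency."""
    return 2.0 ** (-delta_ct(ct_target, ct_reference))

def log2_ratio(ct_a, ct_b):
    """log2 of expression(a) / expression(b); the reference Ct cancels out."""
    return ct_b - ct_a

# Hypothetical Ct values for one sample (illustrative only).
ct = {"miR-141-3p": 27.0, "miR-221-3p": 25.0, "reference": 22.0}

# Ratio computed from reference-normalized expression values.
ratio_via_reference = (
    relative_expression(ct["miR-141-3p"], ct["reference"])
    / relative_expression(ct["miR-221-3p"], ct["reference"])
)

# Same ratio computed directly from the two target Ct values.
ratio_direct = 2.0 ** log2_ratio(ct["miR-141-3p"], ct["miR-221-3p"])
```

Both routes give the same ratio for these values, whichever reference Ct is chosen. Differences in amplification efficiency between assays or platforms would break this equality, which is one reason unified protocols still matter.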
To further strengthen the external validation of our findings, future studies should consider leveraging publicly available datasets. These datasets provide valuable large-scale transcriptomic data across diverse patient populations and could help assess the generalizability of our model. However, integrating such datasets presents challenges, as they often involve heterogeneous sample types (e.g., plasma, serum, urine) and different profiling platforms (e.g., RNA sequencing, microarrays, RT-PCR), leading to technical variability. Addressing these discrepancies would require robust normalization strategies and cross-platform data harmonization to ensure comparability with the current model. Research into computational approaches for cross-platform normalization would therefore be highly valuable.
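As a simple illustration of cross-platform harmonization, the sketch below standardizes a feature within each platform (z-scoring) so that platform-wide offsets and scale differences are removed before samples are pooled. This is only one basic option and was not evaluated in this study; the sample values, platform labels, and function name are hypothetical. It also assumes a comparable mix of cases and controls on each platform, since otherwise real biological differences would be removed along with the technical ones.

```python
from collections import defaultdict
from statistics import mean, pstdev

def zscore_within_platform(samples):
    """Standardize 'value' separately for each 'platform'; returns z-scores in input order."""
    by_platform = defaultdict(list)
    for s in samples:
        by_platform[s["platform"]].append(s["value"])
    # Per-platform mean and population standard deviation.
    stats = {p: (mean(v), pstdev(v)) for p, v in by_platform.items()}
    z = []
    for s in samples:
        mu, sd = stats[s["platform"]]
        z.append((s["value"] - mu) / sd if sd > 0 else 0.0)
    return z

# Hypothetical values of one feature measured on two platforms with different scales.
samples = [
    {"platform": "RT-PCR", "value": -2.0},
    {"platform": "RT-PCR", "value": 0.0},
    {"platform": "RT-PCR", "value": 2.0},
    {"platform": "RNA-seq", "value": 8.0},
    {"platform": "RNA-seq", "value": 10.0},
    {"platform": "RNA-seq", "value": 12.0},
]

z_scores = zscore_within_platform(samples)
```

After standardization, the two platforms yield matching z-scores for samples at the same relative position, although the raw values sit on different scales. More elaborate batch-correction methods would be needed when the platforms differ in more than offset and scale.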
The integration of miRNA-based ML models with multiparametric MRI (mpMRI) is also a promising avenue for enhancing PCa diagnosis and risk stratification. While mpMRI is widely used to assess prostate lesions and guide biopsies, its accuracy can be limited by inter-reader variability and false-positive findings. Combining molecular biomarkers, such as miRNA signatures, with radiological features (e.g., lesion morphology, diffusion-weighted imaging parameters) could improve diagnostic precision.
In conclusion, the integration of miRNA profiling with ML offers a promising approach for improving PCa diagnostics. By leveraging miRNA expression ratios and ensemble-based models, this study demonstrated enhanced diagnostic accuracy, surpassing traditional PSA-based approaches. The biological validation of key miRNA biomarkers supports their clinical potential, while model validation underscores the model’s reliability. Future research focused on large-scale validation, standardization, and multimodal integration will be crucial in advancing this approach toward clinical implementation.
Data availability
The data generated and analyzed in this study are available from the corresponding author upon reasonable request. To ensure privacy and comply with ethical guidelines, access to patient-related data is restricted. However, anonymized datasets and supplementary materials used in this study will be provided on request for research and verification purposes, subject to institutional approval and appropriate data-sharing agreements.
References
Condrat, C. E. et al. miRNAs as biomarkers in disease: Latest findings regarding their role in diagnosis and prognosis. Cells 9, 276 (2020).
Sempere, L. F., Azmi, A. S. & Moore, A. microRNA-based diagnostic and therapeutic applications in cancer medicine. WIREs RNA 12, e1662 (2021).
Khamina, K. et al. A microRNA next-generation-sequencing discovery assay (miND) for genome-scale analysis and absolute quantitation of circulating microRNA biomarkers. Int. J. Mol. Sci. 23, 1226 (2022).
Constâncio, V., Tavares, N. T., Henrique, R., Jerónimo, C. & Lobo, J. MiRNA biomarkers in cancers of the male reproductive system: Are we approaching clinical application? Andrology 11, 651–667 (2023).
Muñoz, J. P., Pérez-Moreno, P., Pérez, Y. & Calaf, G. M. The role of MicroRNAs in breast cancer and the challenges of their clinical application. Diagnostics 13, 3072 (2023).
Pritchard, C. C., Cheng, H. H. & Tewari, M. MicroRNA profiling: Approaches and considerations. Nat. Rev. Genet. 13, 358–369 (2012).
Seputra, K. P., Purnomo, B. B., Susianti, H., Kalim, H. & Purnomo, A. F. miRNA-21 serum evaluation in BPH, hormone sensitive prostate cancer, and castrate resistant prostate cancer: Attempt for diagnostic biomarker evaluation. Acta Inform. Med. 29, 266 (2021).
Li, C. et al. MiRNA-199a-3p in plasma as a potential diagnostic biomarker for gastric cancer. Ann. Surg. Oncol. 20, 397–405 (2013).
Chen, M. & Zhao, H. Next-generation sequencing in liquid biopsy: Cancer screening and early detection. Hum. Genom. 13, 34 (2019).
McDonald, J. S., Milosevic, D., Reddi, H. V., Grebe, S. K. & Algeciras-Schimnich, A. Analysis of circulating microRNA: Preanalytical and analytical challenges. Clin. Chem. 57, 833–840 (2011).
Pathak, A. K. et al. Development of a robust and generalizable algorithm “gQuant” for accurate normalizer gene selection in qRT-PCR analysis. Sci. Rep. 14, 18774 (2024).
Greener, J. G., Kandathil, S. M., Moffat, L. & Jones, D. T. A guide to machine learning for biologists. Nat. Rev. Mol. Cell Biol. 23, 40–55 (2022).
Jiang, T., Gradus, J. L. & Rosellini, A. J. Supervised machine learning: A brief primer. Behav. Ther. 51, 675–687 (2020).
Rigatti, S. J. Random forest. J. Insur. Med. N. Y. 47, 31–39 (2017).
Reel, P. S., Reel, S., Pearson, E., Trucco, E. & Jefferson, E. Using machine learning approaches for multi-omics data analysis: A review. Biotechnol. Adv. 49, 107739 (2021).
Bostanci, E. et al. Machine learning analysis of RNA-seq data for diagnostic and prognostic prediction of colon cancer. Sensors 23, 3080 (2023).
Sharova, E. et al. A circulating miRNA assay as a first-line test for prostate cancer screening. Br. J. Cancer 114, 1362–1366 (2016).
Pawelka, D. et al. Machine-learning-based analysis identifies miRNA expression profile for diagnosis and prediction of colorectal cancer: A preliminary study. Cancer Genom. Proteom. 19, 503–511 (2022).
Balk, S. P., Ko, Y.-J. & Bubley, G. J. Biology of prostate-specific antigen. J. Clin. Oncol. 21, 383–391 (2003).
McNally, C. J., Ruddock, M. W., Moore, T. & McKenna, D. J. Biomarkers that differentiate benign prostatic hyperplasia from prostate cancer: A literature review. Cancer Manag. Res. 12, 5225–5241 (2020).
Nickel, J. C. Inflammation and benign prostatic hyperplasia. Urol. Clin. N. Am. 35, 109–115 (2008).
Munteanu, V. C. et al. MiRNA-based inspired approach in diagnosis of prostate cancer. Medicina (Mex.) 56, 94 (2020).
Fabris, L. et al. The potential of microRNAs as prostate cancer biomarkers. Eur. Urol. 70, 312–322 (2016).
Jain, G. et al. Urinary extracellular vesicles miRNA—A new era of prostate cancer biomarkers. Front. Genet. 14, 1065757 (2023).
Yaman Agaoglu, F. et al. Investigation of miR-21, miR-141, and miR-221 in blood circulation of patients with prostate cancer. Tumour Biol. J. Int. Soc. Oncodev. Biol. Med. 32, 583–588 (2011).
Joković, S. M. et al. MiR-375 and miR-21 as potential biomarkers of prostate cancer: Comparison of matching samples of plasma and exosomes. Genes 13, 2320 (2022).
Porzycki, P., Ciszkowicz, E., Semik, M. & Tyrka, M. Combination of three miRNA (miR-141, miR-21, and miR-375) as potential diagnostic tool for prostate cancer recognition. Int. Urol. Nephrol. 50, 1619–1626 (2018).
Konoshenko, M. Y. et al. The panel of 12 cell-free microRNAs as potential biomarkers in prostate neoplasms. Diagnostics 10, 38 (2020).
Ciszkowicz, E., Porzycki, P., Semik, M., Kaznowska, E. & Tyrka, M. MiR-93/miR-375: Diagnostic potential, aggressiveness correlation and common target genes in prostate cancer. Int. J. Mol. Sci. 21, 5667 (2020).
Wen, W. et al. Identification of two microRNA signatures in whole blood as novel biomarkers for diagnosis of nasopharyngeal carcinoma. J. Transl. Med. 17, 186 (2019).
Huang, H.-Y. et al. miRTarBase 2020: Updates to the experimentally validated microRNA-target interaction database. Nucleic Acids Res. 48, D148–D154 (2020).
Griffiths-Jones, S., Saini, H. K., van Dongen, S. & Enright, A. J. miRBase: Tools for microRNA genomics. Nucleic Acids Res. 36, D154–D158 (2008).
Ge, S. X., Jung, D. & Yao, R. ShinyGO: A graphical gene-set enrichment tool for animals and plants. Bioinformatics 36, 2628–2629 (2020).
Safran, M. et al. GeneCards Version 3: The human gene integrator. Database J. Biol. Databases Curation 2010, 20 (2010).
Lumbreras, B. et al. Variables associated with false-positive PSA results: A cohort study with real-world data. Cancers 15, 261 (2022).
Kumar, L., Kumar, S. & Agarwal, S. To find the role of DNA damage marker 8-hydroxy 2-deoxy guanosine in patients of prostate cancer, benign prostatic hyperplasia and its association to other prognostic factors of prostate cancer. J. Integr. Nephrol. Androl. 4, 55 (2017).
Endzeliņš, E. et al. Diagnostic, prognostic and predictive value of cell-free miRNAs in prostate cancer: A systematic review. Mol. Cancer 15, 41 (2016).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Huang, S. et al. Downregulation of miR-141-3p promotes bone metastasis via activating NF-κB signaling in prostate cancer. J. Exp. Clin. Cancer Res. 36, 173 (2017).
Li, D. et al. Elevated miR-221-3p inhibits epithelial–mesenchymal transition and biochemical recurrence of prostate cancer via targeting KPNA2: An evidence-based and knowledge-guided strategy. BMC Cancer 25, 34 (2025).
Ribas, J. et al. miR-21: An androgen receptor-regulated microRNA that promotes hormone-dependent and hormone-independent prostate cancer growth. Cancer Res. 69, 7165–7169 (2009).
Funding
This research was supported by a grant from BIRAC-India under the AGC-JanCare scheme to MIRNOW; by the Institute of Eminence (IoE) Scheme at BHU through fellowship/grant support to Dr. Garima Jain, Banaras Hindu University; by Non-Net fellowship support to AP and SK; and by MIRNOW employee support to SS. We would also like to acknowledge the Centre for Genetic Disorders, BHU, for providing laboratory resources and facilities, as well as the Department of Urology, BHU, for facilitating sample collection. We extend our sincere gratitude to all collaborators and patients involved in this study.
Author information
Contributions
Research conceptualization and funding acquisition: Jain G; Supervision and validation: Jain G; Sample collection and/or data curation: Singh S and Kural S; miRNA analysis workflow: Singh S; Formal data analysis and interpretation: Singh S, Jain G, Pathak AK, Gupta M and Kural S; Visualization: Singh S, Pathak AK and Kural S; Writing-original draft: Singh S, Pathak AK and Jain G; Resources: Yadav M and Bhardwaj MG; Clinical advisory: Trivedi S, Kumar L and Yadav M; Model optimization assistance: Pathak AK and Gupta M; Writing-review and editing: Singh S, Pathak AK, Kural S, Jain G, Trivedi S and Das P.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Singh, S., Pathak, A.K., Kural, S. et al. Integrating miRNA profiling and machine learning for improved prostate cancer diagnosis. Sci Rep 15, 30477 (2025). https://doi.org/10.1038/s41598-025-99754-7