Introduction

Prostate cancer has become the second most common malignancy and the fifth leading cause of cancer-related death among men globally, with nearly 1.4 million new cases and 375,000 deaths annually [1]. In Western countries, the incidence of prostate cancer is higher due to early screening using prostate-specific antigen (PSA), and prostate cancer ranks first among urological malignancies in China [2, 3].

Current methods for early screening and diagnosis of prostate cancer include PSA testing, digital rectal examination, transrectal ultrasound, magnetic resonance imaging (MRI), and biopsy. Although PSA is a key biomarker, its limited sensitivity and specificity, its diagnostic gray zone (4–10 ng/ml), and its susceptibility to confounding factors undermine its reliability [4, 5]. Digital rectal examination relies heavily on subjective tactile interpretation, which increases the risk of missed diagnoses, whereas transrectal ultrasound is less sensitive than MRI for early-stage detection. Although its accuracy depends on the radiologist’s experience, MRI has emerged as the preferred imaging modality because of its superior diagnostic efficacy for cancer detection and grading [6].

The integration of artificial intelligence (AI) into medicine has advanced prostate cancer diagnostics by enabling rapid data synthesis and automated analysis of imaging features. In recent years, many studies have used artificial neural network-based texture analysis to extract texture features from prostate lesions, revealing the heterogeneity of prostate cancer tissue through the distribution of and changes in these features [7,8,9]. Several studies have shown that combining MRI with machine learning significantly improves diagnostic performance for prostate cancer [10, 11]. For example, Kwon et al. [12] used a random forest algorithm to construct a three-class model for prostate cancer lesion identification based on T2-weighted imaging (T2WI), diffusion-weighted imaging (DWI), apparent diffusion coefficient (ADC), and dynamic contrast-enhanced (DCE) sequences from 204 patients with 330 suspicious lesions, achieving an area under the curve (AUC) of 0.82. Zhang et al. [13] integrated a support vector machine (SVM) with k-nearest neighbors (KNN) and Zernike feature selection methods to discriminate prostate cancer lesions, training their model on MR images of 271 lesions and achieving an accuracy of over 80%. Ellmann et al. [14] used the XGBoost algorithm to analyze 195 prostate cancer lesions from 124 patients, obtaining an AUC of 0.913. These studies suggest that machine learning-based MRI has significant diagnostic value and clinical application potential for prostate cancer.

However, systematic evaluations of the diagnostic value of machine learning-based MRI for prostate cancer are currently lacking. We therefore conducted this systematic review and meta-analysis of studies assessing the diagnostic accuracy of machine learning-based MRI in detecting prostate cancer and clinically significant prostate cancer (csPCa, defined as Gleason score ≥7 vs. Gleason score <7), aiming to provide evidence-based insights for clinical practice.

Materials and methods

Search strategy

Following the PRISMA 2020 statement [15], we systematically searched four electronic databases (PubMed, Web of Science, Cochrane Library, and Embase) from inception to February 15, 2025. The search strategy was as follows: (“Magnetic Resonance Imaging” OR “MRI”) AND (“Prostate” OR “Prostate Cancer”) AND (“Machine Learning” OR “Radiomics”). Synonyms within each concept were connected using “OR,” and the concepts were connected using “AND.” Additional literature was identified by reviewing the reference lists of the included studies.
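The way the search terms are combined can be sketched in a few lines of Python. This is only an illustration: the grouping of synonyms into three concepts joined by "AND" is an assumption about how the terms were combined, the helper names `or_group` and `build_query` are hypothetical, and the output is a generic Boolean string rather than the exact syntax of any one database.

```python
def or_group(terms):
    """Join the synonyms for one concept with OR and wrap them in parentheses."""
    quoted = [f'"{term}"' for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(concepts):
    """Join the concept groups with AND so that every concept must be matched."""
    return " AND ".join(or_group(terms) for terms in concepts)


# Concept groups taken from the search terms listed above.
concepts = [
    ["Magnetic Resonance Imaging", "MRI"],
    ["Prostate", "Prostate Cancer"],
    ["Machine Learning", "Radiomics"],
]

query = build_query(concepts)
print(query)
```

Making the parentheses explicit avoids relying on each database's default operator precedence, which can differ between search interfaces.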

Eligibility

Inclusion Criteria: (1) Populations (P): participants with prostate cancer; (2) Interventions (I): MRI examination of the prostate in all participants; (3) Controls (C): healthy individuals or those without prostate cancer or other malignancies; (4) Outcomes (O): diagnostic results (benign/malignant and csPCa) presented in a 2 × 2 test performance table, with csPCa defined as Gleason score ≥7 [16]; (5) Study design (S): machine learning studies evaluating the diagnostic value of MRI for primary prostate cancer, published in peer-reviewed journals in Chinese or English.

Exclusion Criteria: (1) Non-human studies; (2) Conference articles, case reports, systematic reviews, etc.; (3) Insufficient outcome information for data analysis; (4) Duplicate reports; (5) Studies without full-text availability.

Literature screening and data extraction

Two researchers independently screened the literature according to the inclusion and exclusion criteria. Initial screening was performed by reading titles and abstracts, followed by full-text review for potentially eligible studies. Disagreements were resolved by consulting a third researcher. After literature screening, data extraction was performed independently by two researchers using a standardized data extraction form. Extracted information included study characteristics, demographic features of participants, true positives (TP), false positives (FP), true negatives (TN), false negatives (FN), and study duration.
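As a rough illustration of how an extracted 2 × 2 table translates into per-study accuracy measures, the sketch below stores the four counts in a small record and derives sensitivity and specificity from them. The class name `StudyRecord`, its fields, and the counts are hypothetical and are not taken from any included study or from the extraction form actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyRecord:
    """One row of a hypothetical extraction form holding the 2 x 2 counts."""
    study_id: str
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def sensitivity(self) -> float:
        # Proportion of diseased cases that the index test flags as positive.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of non-diseased cases that the index test flags as negative.
        return self.tn / (self.tn + self.fp)


# Made-up counts for illustration only.
example = StudyRecord("hypothetical-study", tp=80, fp=10, tn=90, fn=20)
print(f"sensitivity = {example.sensitivity():.2f}")
print(f"specificity = {example.specificity():.2f}")
```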

Quality assessment

The quality of diagnostic accuracy studies was assessed using the QUADAS-2 tool developed by Whiting et al. [17]. This tool evaluates studies in four domains: patient selection, index test, reference standard, and flow and timing. Risk of bias was categorized as “high risk,” “low risk,” or “unclear risk.”

Statistical analysis

Stata 16.0 software was used for statistical analysis. Pooled sensitivity, specificity, positive likelihood ratio (PLR), negative likelihood ratio (NLR), and their 95% confidence intervals (CIs) were calculated. Summary receiver operating characteristic (SROC) curves were plotted, and the AUC was calculated. Heterogeneity was assessed using the Q test and the I² statistic; I² < 50% or P > 0.1 indicated homogeneity, whereas I² ≥ 50% or P ≤ 0.1 indicated significant heterogeneity. Publication bias was assessed using funnel plot asymmetry, with P < 0.05 indicating significant publication bias.
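For readers less familiar with these quantities, the sketch below shows the textbook definitions of the likelihood ratios and of the I² statistic derived from Cochran's Q. It is a simplified illustration under stated assumptions, not the pooling model fitted in Stata: the function names are hypothetical, the input values are made up, and the pooled estimates reported in this study may be computed differently by the software.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """PLR = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)


def negative_likelihood_ratio(sensitivity, specificity):
    """NLR = (1 - sensitivity) / specificity."""
    return (1.0 - sensitivity) / specificity


def i_squared(q, n_studies):
    """I² (%) from Cochran's Q with n_studies - 1 degrees of freedom, floored at 0."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Made-up inputs for illustration only.
plr = positive_likelihood_ratio(0.80, 0.90)
nlr = negative_likelihood_ratio(0.80, 0.90)
heterogeneity = i_squared(q=20.0, n_studies=5)

print(f"PLR = {plr:.2f}, NLR = {nlr:.2f}, I2 = {heterogeneity:.1f}%")
```

With these definitions, I² expresses the share of the total variation across studies that exceeds what chance alone would produce, which is why values above the 50% threshold are read as significant heterogeneity.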

Results

Basic characteristics of included studies

A total of 1392 studies were identified through systematic database searches (Fig. 1). After excluding 928 duplicates and 247 irrelevant studies, 217 studies were reviewed in full. Ultimately, 12 studies were included in the meta-analysis [18,19,20,21,22,23,24,25,26,27,28,29].

Fig. 1

Literature selection flowchart.

Table 1 presents the basic characteristics of the included studies. The 12 studies were published between 2019 and 2023. Five studies were from China, two from the United States, and the remaining studies were from the United Kingdom (n = 1), Canada (n = 1), Italy (n = 1), Germany (n = 1), and New Zealand (n = 1). Of the 12 studies, 11 were retrospective and one was prospective in design; 11 were single-center studies and one was a multicenter study. The data collection period ranged from 2008 to 2019. A total of 3474 patients were included in the meta-analysis, with ages ranging from 30 to 90 years. All diagnoses of prostate cancer were confirmed by histopathological examination. Among the 12 studies, five assessed the diagnostic value of machine learning-based MRI for differentiating benign from malignant prostate lesions, while the remaining seven evaluated its diagnostic value for csPCa. Eleven studies used T2WI sequences for feature extraction and lesion segmentation, and seven used DWI sequences for feature extraction. Notably, all studies used at least two MRI sequences, highlighting the advantage of biparametric and multiparametric approaches in feature extraction and lesion identification for prostate MRI.

Table 1 Basic information of included studies.

Quality assessment of included studies

The quality of the included studies was assessed using the QUADAS-2 tool. The results showed that the included studies were of high quality with a low risk of bias (Figs. 2, 3).

Fig. 2

Risk of bias and applicability concerns graph.

Fig. 3

Risk of bias and applicability concerns summary.

Diagnostic value for benign/malignant prostate cancer

Five studies, involving 1398 participants, reported the diagnostic value of machine learning-based MRI for differentiating benign from malignant prostate lesions. The meta-analysis showed that the pooled sensitivity and specificity were 0.92 (95% CI: 0.83–0.97; Fig. 4) and 0.90 (95% CI: 0.68–0.97; Fig. 4), respectively, with a combined AUC of 0.96 (95% CI: 0.94–0.98; Fig. 5). The I² values for sensitivity and specificity were 95.68% and 97.52%, respectively, indicating significant heterogeneity. Additionally, the pooled PLR and NLR were 9.24 (95% CI: 2.36–36.12; Supplementary Fig. 1) and 0.08 (95% CI: 0.03–0.21; Supplementary Fig. 1), respectively.

Fig. 4

Summary of sensitivity and specificity of machine learning-based MRI in PCa diagnosis.

Fig. 5

Summary of AUC of machine learning-based MRI in PCa diagnosis.

Diagnostic value for csPCa

Seven studies, involving 2076 participants, reported the diagnostic value of machine learning-based MRI for csPCa. The meta-analysis showed that the pooled sensitivity and specificity for diagnosing csPCa were 0.83 (95% CI: 0.77–0.87; Fig. 6) and 0.73 (95% CI: 0.65–0.81; Fig. 6), respectively, with a combined AUC of 0.86 (95% CI: 0.83–0.89; Fig. 7). The I² values for sensitivity and specificity were 73.96% and 87.02%, respectively, indicating significant heterogeneity. Additionally, the pooled PLR and NLR were 3.12 (95% CI: 2.34–4.15; Supplementary Fig. 2) and 0.23 (95% CI: 0.18–0.30; Supplementary Fig. 2), respectively.
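To illustrate what likelihood ratios of this size mean for an individual patient, the sketch below converts a pre-test probability into a post-test probability using the odds form of Bayes' theorem. The PLR of 3.12 and NLR of 0.23 are the pooled csPCa values reported above; the pre-test probability of 0.30 is purely an assumed example value, and the function name is hypothetical.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


PRE_TEST = 0.30  # assumed prevalence of csPCa, for illustration only
PLR_CSPCA = 3.12  # pooled positive likelihood ratio reported above
NLR_CSPCA = 0.23  # pooled negative likelihood ratio reported above

after_positive = post_test_probability(PRE_TEST, PLR_CSPCA)
after_negative = post_test_probability(PRE_TEST, NLR_CSPCA)

print(f"Probability of csPCa after a positive result: {after_positive:.3f}")
print(f"Probability of csPCa after a negative result: {after_negative:.3f}")
```

Under this assumed pre-test probability, a positive model output raises the probability of csPCa and a negative output lowers it, but neither result is conclusive on its own, which is consistent with the moderate specificity reported for csPCa.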

Fig. 6

Summary of sensitivity and specificity of machine learning-based MRI in csPCa diagnosis.

Fig. 7

Summary of AUC of machine learning-based MRI in csPCa diagnosis.

Publication bias

Publication bias was assessed for studies on the diagnosis of both benign/malignant prostate cancer and csPCa using machine learning-based MRI imaging. The results showed no significant publication bias among the included studies, with all P values greater than 0.05 (Supplementary Figs. 3, 4).

Discussion

In recent years, machine learning-based medical imaging analysis has shown great potential in prostate cancer diagnosis, but its clinical application still requires higher levels of evidence for validation. To our knowledge, this systematic review and meta-analysis is the first to comprehensively evaluate the diagnostic performance of machine learning models based on multi-modal MRI for differentiating benign from malignant prostate lesions and for detecting csPCa. A total of 12 studies involving 3474 patients were included, with five studies focusing on benign/malignant diagnosis and seven on csPCa diagnostic accuracy. The results showed that machine learning models performed very well in benign/malignant diagnosis (combined AUC = 0.96), whereas their diagnostic efficacy for csPCa was lower (combined AUC = 0.86). These findings provide important evidence for the application of machine learning in prostate cancer diagnosis and also reveal the limitations of current studies and future directions for improvement.

Our study found that machine learning-based MRI models demonstrated very high diagnostic performance in differentiating benign from malignant prostate lesions (sensitivity: 0.92, specificity: 0.90, AUC = 0.96), exceeding the AUC of approximately 0.80 reported for the traditional PI-RADS scoring system [30]. This improvement may be attributed to the deep integration of multi-sequence MRI features (such as T2WI and DWI) by machine learning. For example, T2WI can clearly display the anatomical structure of the prostate, while DWI quantifies tissue cell density through the ADC. The combination of these two sequences can capture tumor heterogeneity more comprehensively [31, 32]. Although the AUC for csPCa diagnosis was clinically meaningful, the relatively low specificity (0.73) may reflect several issues. First, tumors with Gleason scores ≥7 exhibit diverse MRI appearances, especially intermediate-grade cancers (Gleason 3 + 4), which overlap with benign hyperplasia or inflammation on imaging [33]. Second, although 11 studies used T2WI, only 7 used DWI, and the application of dynamic contrast-enhanced (DCE) MRI or emerging technologies such as PSMA-PET fusion was limited. This scarcity may restrict the models’ ability to identify microinvasive lesions. Third, nine studies relied on manual segmentation, which may introduce interobserver variability, whereas automated segmentation, used in only three studies, requires further validation. These factors could also be sources of the heterogeneity identified in this study.

The widespread application of artificial intelligence in clinical practice still faces the core challenge of insufficient generalizability, primarily because medical data are heterogeneous [34]. Variations in acquisition parameters, differences in imaging equipment, and the absence of standardized protocols for cross-institutional imaging data often reduce diagnostic performance when artificial intelligence models are validated across centers. This data shift, which is driven by non-pathological features, has a particularly strong impact on radiomics-based machine learning models. Bleker et al. [34] found that a radiomics-based biparametric MRI model trained on single-center imaging data showed a substantial performance drop when validated on multicenter data (−28.1%, with AUC values of 0.594 vs. 0.816 for single-center validation, p = 0.002). In contrast, models trained on multicenter data did not show significant performance degradation between multicenter and single-center validation (−12%, AUC: 0.750 vs. 0.660, p = 0.114) and even outperformed single-center models on multicenter data (AUC: 0.750 vs. 0.594, p = 0.012). Notably, several studies using both internal and external datasets for validation have demonstrated that models built on deep learning frameworks adapt better to new data. These models maintain relatively stable diagnostic performance in both internal validation and external testing, suggesting that deep neural networks have advantages in autonomous feature extraction and adaptation to data distributions [35,36,37]. To improve the generalizability of models in multicenter studies, researchers have proposed using batch effect correction techniques to standardize cross-center data. This approach can reduce systematic biases caused by differences in scanning protocols and non-uniform feature standardization processes, thereby enhancing model robustness [38].
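The idea behind cross-center standardization can be sketched with a deliberately simple stand-in: rescaling each radiomic feature to zero mean and unit variance within each center. This is not the batch effect correction method used in the cited work, and established harmonization techniques are more elaborate; the function name, the center labels, and the feature values below are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_center(values, centers):
    """Z-score one feature within each center to remove center-level offset and scale."""
    grouped = defaultdict(list)
    for value, center in zip(values, centers):
        grouped[center].append(value)

    # Per-center mean and population standard deviation.
    stats = {c: (mean(vs), pstdev(vs)) for c, vs in grouped.items()}

    standardized = []
    for value, center in zip(values, centers):
        mu, sigma = stats[center]
        # A constant feature within a center carries no information; map it to 0.
        standardized.append((value - mu) / sigma if sigma > 0 else 0.0)
    return standardized


# Made-up feature values: center B reports the same pattern on a tenfold larger scale.
feature = [10.0, 12.0, 14.0, 100.0, 120.0, 140.0]
center = ["A", "A", "A", "B", "B", "B"]

harmonized = standardize_per_center(feature, center)
print([round(v, 3) for v in harmonized])
```

After rescaling, corresponding cases from the two hypothetical centers receive the same value, so a downstream model no longer sees the scanner-related scale difference. One caveat of this simple approach is that it also removes genuine differences in case mix between centers, which is one reason dedicated batch effect correction methods are preferred in practice.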

This study is the first to systematically analyze the diagnostic value of machine learning-based MRI for differentiating benign from malignant prostate lesions and for detecting clinically significant prostate cancer (csPCa). However, several limitations exist. First, despite our systematic search strategy, the number of studies meeting the inclusion criteria was relatively low, likely because few studies to date have used radiomics and machine learning for prostate cancer diagnosis. Second, most included studies were single-center and retrospective in design, which may introduce recall and confounding biases. Third, the findings should be interpreted with caution because of the inherent heterogeneity in machine learning methodologies across the included studies. Variations in algorithm architectures, preprocessing pipelines, and validation protocols may limit the generalizability of the pooled results. Future work should prioritize harmonized reporting standards to reduce the impact of algorithmic diversity and improve clinical translation. Finally, because of the limited number of included studies, this meta-analysis could not perform subgroup analyses based on different variables.

Conclusion

In summary, this study demonstrates that artificial intelligence models built on MRI data hold great potential and value for diagnosing prostate cancer. Although the included studies employed diverse methodologies, extracted MRI data from different centers, and used various algorithms to construct artificial intelligence models, our meta-analysis indicates that artificial intelligence models based on MRI significantly outperform radiologists using the PI-RADS scoring criteria in diagnosing prostate cancer. However, the evidence presented by these studies should be interpreted with caution, as multicenter evidence for artificial intelligence algorithms is still lacking.