Introduction

In 2020, bladder cancer (BCa) was the 10th most commonly diagnosed cancer worldwide, with approximately 573,000 new cases and 213,000 deaths1. Treatment options depend primarily on the stage of the disease at diagnosis. Approximately 25% of patients are initially diagnosed with muscle-invasive bladder cancer (MIBC), while the remaining cases are classified as non-muscle-invasive bladder cancer (NMIBC)2. Radical cystectomy (RC) is the standard treatment for MIBC and certain high-risk or very high-risk cases of NMIBC3. Research has shown that 9–49% of BCa cases were incorrectly staged, leading to some patients receiving inappropriate treatment4. Accurately identifying muscle-invasive states is therefore essential for the effective treatment and management of patients.

Currently, cystoscopy, computed tomography (CT), and magnetic resonance imaging (MRI) play a significant role in the preoperative diagnosis of BCa. A recent study indicated that it is possible to refer potential MIBC patients from cystoscopy to multi-parametric MRI (mpMRI) for staging instead of performing transurethral resection of bladder tumor (TURBT)5. MRI offers superior anatomical visualization with high spatial and contrast resolution, which gives it considerable potential for determining muscle-invasive states. Recently, various diagnostic methods based on mpMRI have emerged. In 2018, Panebianco et al. proposed the vesical imaging-reporting and data system (VI-RADS), which utilizes the size, location, number, and morphology of tumors on mpMRI to predict the risk of muscle invasion6.

Artificial intelligence (AI)-based research methods are becoming increasingly prevalent in the medical field. Among these methods, machine learning is being leveraged to enhance the diagnosis and treatment of various diseases. Commonly applied machine learning methods include deep learning (DL), support vector machines, random forests, and kernel methods7. DL offers significant advantages in diagnosing various diseases, such as liver fibrosis, pancreatic cancer, pulmonary nodules, and prostate cancer8,9,10. Moreover, extensive research has focused on predicting drug response and assisting in radiotherapy11,12,13. In urology, the preoperative prediction of MIBC has attracted considerable interest. The potential of DL to enhance diagnosis using cystoscopic and CT images has been explored14,15. Li et al. established a DL model based on MRI and compared it to the VI-RADS score; however, it lacked validation in a large sample and across multiple centers16. Thus, this study aims to develop a DL model based on MRI to predict MIBC and to conduct multicenter validation.

Materials and methods

Patient population

The study was approved by the Ethics Committees (2021-SR-409) and registered with ClinicalTrials.gov (NCT05096533). As shown in Fig. 1, patients who underwent preoperative MRI and surgery were enrolled between February 2012 and December 2023. Written informed consent to participate in the study was obtained from all patients. The inclusion and exclusion criteria are shown in the supplementary material.

Fig. 1

The workflow of population screening process and model design process in multi-center research.

Lesion assessment and preprocessing

All pathological specimens were obtained from RC or TURBT. Two senior pathologists, who were unaware of the clinical data, reviewed all specimens. All specimens from NMIBC patients were confirmed to contain bladder muscle. MRI examinations of the bladder were conducted with patients in the supine position. Patients were instructed to void 2 h before the examination and then to limit fluid intake and avoid further urination until the MRI was completed. Detailed MRI parameters are provided in Table S1.

To construct the models, radiologists selected slides containing tumors for evaluation. For patients with NMIBC, all slides containing tumors were selected for the training set and labeled as non-muscle-invasive. For patients with MIBC, all slides containing tumors were independently labeled as muscle-infiltrating or non-muscle-infiltrating by two radiologists. A third senior radiologist with 20 years of experience reviewed any disputed slides to make a final decision.

In addition, the radiologists evaluated the VI-RADS score for all cases. All evaluation procedures were independently performed by two experienced radiologists. A third radiologist reviewed any controversial results before the final conclusions were made. All radiologists were blinded to the patient’s pathology results. Furthermore, the diagnostic performance of the VI-RADS scores for MIBC was assessed using the AUC with cut-off scores of greater than 2 or 3. The images were then resized, histogram-equalized, and normalized.
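The preprocessing steps named above (resizing, histogram equalization, normalization) can be illustrated with a minimal NumPy sketch. The target size of 299 × 299 (the default Inception V3 input size), the nearest-neighbour resize, the global equalization, and the zero-mean, unit-variance normalization are all assumptions made for illustration; the study does not specify these details here, and the function names are hypothetical.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D array (stand-in for a library resize)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols[None, :]]

def equalize_histogram(img, n_bins=256):
    """Global histogram equalization; output intensities lie in [0, 1]."""
    flat = img.astype(np.float64).ravel()
    hist, bin_edges = np.histogram(flat, bins=n_bins)
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]
    # Map every pixel through the cumulative distribution of intensities.
    return np.interp(flat, bin_edges[:-1], cdf).reshape(img.shape)

def normalize(img):
    """Zero-mean, unit-variance normalization (assumed scheme)."""
    std = img.std()
    return (img - img.mean()) / (std if std > 0 else 1.0)

def preprocess(img, size=(299, 299)):
    """Resize, equalize, then normalize one 2D image (assumed order)."""
    return normalize(equalize_histogram(resize_nearest(img, *size)))
```

In practice a library routine with interpolation would likely replace `resize_nearest`; the sketch only shows how the three steps chain together.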

MIBC prediction model based on 2D Inception V3

As shown in Fig. 1b, automatic segmentation was performed by the Cascade Path Augmentation Unet (CPA-Unet) developed by our center17. The radiologists subsequently re-examined the segmented images. Based on the segmented T2WI images, the DL model comprised Inception V3, reconstruction blocks, and classification blocks, and classified MIBC and NMIBC through an end-to-end approach. Inception V3 was utilized as the main model to extract image features from the T2WI slices18. Additionally, a reconstruction block was incorporated within a multitask framework to support the primary classification task (Fig. 1c). Detailed information regarding model construction is described in Supplementary Methods.
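One common way to combine a classification task with an auxiliary reconstruction task is a weighted sum of the two losses. The sketch below illustrates that general idea only; the choice of binary cross-entropy, mean squared error, and the weight `recon_weight` are assumptions and are not taken from the study, whose actual construction is described in its Supplementary Methods.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    y = np.asarray(y_true, dtype=np.float64)
    p = np.clip(np.asarray(p_pred, dtype=np.float64), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def reconstruction_mse(image, reconstruction):
    """Mean squared error between the input image and its reconstruction."""
    diff = np.asarray(image, dtype=np.float64) - np.asarray(reconstruction, dtype=np.float64)
    return float(np.mean(diff ** 2))

def multitask_loss(y_true, p_pred, image, reconstruction, recon_weight=0.1):
    """Classification loss plus a weighted auxiliary reconstruction loss.

    recon_weight is a hypothetical hyperparameter chosen for illustration.
    """
    return (binary_cross_entropy(y_true, p_pred)
            + recon_weight * reconstruction_mse(image, reconstruction))
```

With a perfect reconstruction the auxiliary term vanishes and the total equals the classification loss, so the reconstruction branch acts only as a regularizing signal on the shared features.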

MRI images of each patient were input into the model slide by slide. The cut-off value was 0.5 (MIBC: probability value ≥ 0.5; NMIBC: probability value < 0.5). A patient was classified as MIBC if two or more slides were predicted as MIBC; otherwise, the patient was classified as NMIBC.
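The patient-level decision rule can be sketched in a few lines. The function and parameter names are hypothetical, and the sketch assumes the slide-level convention stated above (a probability value ≥ 0.5 counts as MIBC).

```python
def classify_patient(slide_probs, cutoff=0.5, min_positive_slides=2):
    """Aggregate slide-level probabilities into a patient-level label.

    A slide counts as MIBC when its probability is >= cutoff; the patient is
    labeled MIBC when at least min_positive_slides slides count as MIBC.
    """
    positive = sum(1 for p in slide_probs if p >= cutoff)
    return "MIBC" if positive >= min_positive_slides else "NMIBC"
```

For example, `classify_patient([0.91, 0.76, 0.12])` yields `"MIBC"`, whereas a single positive slide, as in `classify_patient([0.91, 0.20, 0.12])`, yields `"NMIBC"`.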

Statistical analysis

Statistical analysis was conducted using IBM SPSS Statistics (v26.0). Evaluation metrics included sensitivity, specificity, accuracy, and the area under the curve (AUC) of the receiver operating characteristic (ROC). P < 0.05 was deemed statistically significant, with a 95% confidence level. The Chi-square and ANOVA tests were used to analyze the differences between groups. The interobserver agreement between the two radiologists for classifying the MRI slides and evaluating the VI-RADS was determined using kappa (k) scores. The sample size was calculated by PASS software (tests for one-sample sensitivity and specificity, version 15, NCSS, LLC). The experiments were performed on an NVIDIA 2080Ti with 12 GB of memory, and all networks were implemented in TensorFlow with Python version 3.7.
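For readers unfamiliar with the kappa statistic, the following sketch computes unweighted Cohen's kappa for two readers from first principles. It is an illustration only: the study computed kappa in SPSS, and whether an unweighted or weighted variant was used for the ordinal VI-RADS scores is not stated, so the unweighted form here is an assumption.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two readers rating the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both readers must rate the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both readers.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the readers' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement when one label (here, non-muscle-invasive) dominates.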

Results

The characteristics of the patients

The clinicopathologic characteristics across the groups are summarized in Table 1. A total of 559 BCa patients were included, with 521 patients from our center and 38 patients from external centers. There were no statistically significant differences among groups regarding clinical characteristics. Based on the sensitivity and specificity of the DL model in the validation set, the required sample size calculated by PASS software was at least 135. Finally, a total of 164 patients in the internal test set and 38 patients in the external test set were included for testing.

Table 1 Clinical characteristics of patients across the groups.

The inter-reader agreement for slides evaluation and VI-RADS

As shown in Table S2, excellent agreement on slide evaluation was observed between the two readers (k = 0.869, P < 0.001). Table S3 demonstrated robust agreement between the two readers on VI-RADS scores in the validation set (k = 0.927, P < 0.001), the internal test set (k = 0.922, P < 0.001), and the external test set (k = 0.831, P < 0.001).

The performance metrics of the DL model and VI-RADS

The performance of the DL model is shown in Table 2 and Fig. 2a. The accuracy, sensitivity (SN), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV) for MIBC were 92.4% (61/66), 94.7% (18/19), 91.5% (43/47), 81.8% (18/22), and 97.7% (43/44), respectively, in the validation set. In the internal test set, the accuracy, SN, SP, PPV, and NPV were 92.1% (151/164), 86.8% (46/53), 94.6% (105/111), 88.5% (46/52), and 93.8% (105/112). In the external test set, these values were 81.6% (31/38), 57.1% (4/7), 87.1% (27/31), 50.0% (4/8), and 90.0% (27/30).
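The reported percentages follow directly from the confusion-matrix counts implied by the fractions above. The sketch below recomputes them; the function name is hypothetical, and the counts (true positives, false negatives, true negatives, false positives) are read off the reported numerators and denominators rather than taken from a separate table.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, which holds for the sets above.
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Counts implied by the reported fractions for each data set.
validation = diagnostic_metrics(tp=18, fn=1, tn=43, fp=4)
internal_test = diagnostic_metrics(tp=46, fn=7, tn=105, fp=6)
external_test = diagnostic_metrics(tp=4, fn=3, tn=27, fp=4)
```

Laying the counts out this way also makes the external-set result easier to interpret: with only 7 MIBC patients, each misclassified case shifts sensitivity by about 14 percentage points.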

Table 2 The performance of DL model and VI-RADS in predicting MIBC.
Fig. 2

The DL model results and typical cases. (a) The predictive performance of the deep learning model across various sets (validation set, internal test set and external test set). (b) The predictive performance of the deep learning model in different VI-RADS categories. (c–f) The predictive performance of the deep learning model, VI-RADS and the combined model across various sets. (g) The performance metrics of the DL model in different bladder anatomical locations. (h) The process of determining the muscle infiltration status of patients based on the deep learning model.

To further explore the differences between the DL model and VI-RADS, we compared the performance metrics of the model with those of VI-RADS (cut-off VI-RADS > 2 or 3). The performance metrics are summarized in Table 2 and Fig. 2c–f. There were no statistically significant differences in the AUCs between the DL model and VI-RADS in the validation set (0.931 vs. 0.976, P = 0.172) or the internal test set (0.907 vs. 0.930, P = 0.427). In addition, the combined model based on the DL model and VI-RADS score demonstrated better performance than either single model in the validation and internal test sets (P < 0.05). However, in the external test set, the combined model did not show an improvement in prediction performance over VI-RADS (P = 0.224).
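As background for the AUC comparison, the AUC equals the probability that a randomly chosen MIBC case receives a higher score than a randomly chosen NMIBC case, with ties counted as one half. The sketch below computes it from that definition, so it works both for continuous DL probabilities and for ordinal VI-RADS scores. The function name is hypothetical, and the significance test used to compare AUCs is not specified in the text, so none is shown.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for MIBC, 0 for NMIBC. Tied scores contribute 0.5 per pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of cases, which is acceptable for cohorts of this size; a rank-based implementation would be preferable for much larger data sets.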

The performance metrics of the DL model among VI-RADS

As shown in Table 3 and Fig. 2b, the accuracy of the DL model was 100% (21/21) in VI-RADS 1. Meanwhile, the accuracy, SN, SP, PPV and NPV were 93.5% (115/123), 100% (1/1), 93.4% (114/122), 11.1% (1/9) and 100% (114/114) in VI-RADS 2; 80.0% (48/60), 66.7% (14/21), 87.2% (34/39), 73.7% (14/19) and 82.9% (34/41) in VI-RADS 3; 90.3% (28/31), 91.7% (22/24), 85.7% (6/7), 95.7% (22/23), 75.0% (6/8) in VI-RADS 4. The accuracy, SN, and PPV were 93.9% (31/33), 93.9% (31/33), and 100% (31/31) in VI-RADS 5.

Table 3 The performance of DL model in predicting MIBC at different VI-RADS.

The performance metrics of the DL model in different bladder anatomical locations

As shown in Table S4 and Fig. 2g, the accuracy, SN, and SP were 63.2% (12/19), 71.4% (5/7), and 58.3% (7/12) in the bladder neck and trigone; 92.0% (138/150), 85.4% (35/41), and 94.5% (103/109) in the left and right walls; 88.2% (15/17), 81.8% (9/11), and 100.0% (6/6) in the dome; 94.6% (35/37), 80.0% (4/5), and 96.9% (31/32) in the posterior wall; 93.3% (14/15), 100.0% (6/6), and 88.9% (8/9) in the anterior wall; and 95.2% (20/21), 100.0% (4/4), and 94.1% (16/17) in multiple bladder walls. Additionally, in patients with tumors near the ureteral orifice, the accuracy, SN, SP, PPV, and NPV of the DL model in predicting MIBC were 71.4% (35/49), 75.0% (15/20), 69.0% (20/29), 62.5% (15/24), and 80.0% (20/25), respectively.

Analysis of typical cases

The process of determining the muscle infiltration status of patients based on the DL model is illustrated in Fig. 2h. The six tumor-containing MRI slides of the patient were input into the model. After classification, the predicted probability value for each slide was obtained. All six slides of this patient were predicted as MIBC (probability values > 0.5), consistent with the final pathological results.

Additionally, cases misclassified by the DL model were further analyzed, as shown in Table S5 and Fig. S1. The results indicated that the most common characteristics of misclassified cases were tumors located near the ureteral orifice (14/25) and tumors located on the dome or bladder neck (8/25).

Discussion

Currently, numerous studies focus on muscle-invasive states using imaging approaches, including VI-RADS score based on MRI, predictive models based on radiomic features, and DL models based on CT6,15,19. Considering the excellent anatomical visualization provided by MRI and the advanced artificial intelligence algorithms, we developed MRI-based DL models for MIBC diagnosis and conducted a multi-center clinical study.

Recently, the medical applications of radiomics based on machine learning have advanced significantly. By extracting radiomic features of the region of interest (ROI), a series of clinical prediction models with AUCs ranging from 0.798 to 0.986 have been developed19. However, most predictive models were created by manually outlining the ROI and then extracting and analyzing the features. This process is time-consuming and considerably diminishes diagnostic utility11. The DL algorithm can construct models by automatically segmenting the ROI and extracting optimal features intelligently and objectively11. Several studies15,20 explored the value of DL models in BCa, with AUCs ranging from 0.861 to 0.998, but all were retrospective CT-based studies. Furthermore, many studies suggested that MRI has greater potential in diagnosing the muscle-invasive states of BCa6,21. Building upon T2-weighted MRI (T2WI), we developed a DL model and executed a multi-center clinical study, achieving an accuracy exceeding 90% at our institution.

VI-RADS proposes standardized reporting criteria based on mpMRI, which have been validated and improved at many centers22. However, the score only indicates the probability of MIBC, and in the 4.4–25.7% of patients classified as VI-RADS 3, the muscle-invasive state could not be differentiated23,24,25,26. Additional analysis was conducted to compare the DL model with VI-RADS. The performance of our DL model was comparable to that of VI-RADS in both the validation and internal test sets (P > 0.05). Moreover, the combined model of the DL model and VI-RADS score demonstrated better performance than either single model in the validation and internal test sets. The preprocessing of multicenter medical imaging data remains an open challenge27. In our research, the performance of the DL model in the external test set was not ideal, showing results similar to those of other studies15,28.

We further explored the performance metrics of the DL model under various VI-RADS categories. Promisingly, the model achieved satisfactory performance metrics (accuracy > 90%) in VI-RADS 2, 4, and 5. In VI-RADS 3, the muscle-invasive states of BCa are ambiguous, posing a challenge for each center. Research has suggested that the characteristics of tumor morphology can serve as an effective tool for the preoperative detection of muscle invasion in VI-RADS 3 lesions27. However, the evaluation process was inherently subjective. To our knowledge, this is the first detailed report of the performance metrics of an AI model in VI-RADS 3. Notably, the DL model achieved an accuracy of 80% in this category, underscoring the potential of AI as a complementary tool in the diagnostic workflow for patients with VI-RADS 3.

Several critical factors still affect the accuracy of the model. Firstly, non-muscle-invasive tumors near the ureteric orifice were mostly misclassified as muscle-invasive tumors due to their encroachment into the ureteral orifice. Secondly, due to the limitations of 2D transverse images, non-muscle-invasive tumors located at the bladder neck and parietal wall were more likely to be misclassified as muscle-invasive tumors. Ongoing research aims to develop a 3D deep learning model, which could reduce the impact of numerous variables, such as tumor location and human factors. Finally, image quality also affects performance metrics.

Several limitations were identified in this research. First, the “black box” nature of DL means that its decision-making process is difficult to explain; illustrating DL principles through visualization techniques is therefore a promising direction for future research. Second, the external centers contributed a limited number of patients and exhibited disparities in the distribution of clinicopathologic characteristics compared to our center, introducing bias in the performance evaluation of the DL model in the external test set. Additionally, the MRI imaging quality at external centers deviated from ours, and the DL model was trained exclusively on cases sourced from our center, which influenced the performance metrics in the external test set. To tackle these challenges, future research should prioritize enhancing the diversity of the training dataset, potentially through the application of data augmentation techniques and the adoption of multi-center training strategies. Moreover, establishing a dedicated data pre-processing pipeline to address the discrepancies in multicenter data is essential27.

While most previous imaging-based studies were retrospective and single-center, we developed DL models based on MRI to diagnose MIBC and carried out multi-center clinical research. Our findings indicate that the DL model can accurately determine MIBC and provide valuable information for clinical decision-making.

Conclusion

A DL model based on T2WI and Inception V3 was constructed for predicting MIBC. The results indicated that the DL model achieved diagnostic performance comparable to that of VI-RADS in the validation and internal test sets and provided additional diagnostic value in the context of VI-RADS 3.