Introduction

Bladder cancer (BLCA) is the leading malignancy affecting the urinary tract and ranks as the world’s ninth most common cancer1. Urothelial cell carcinoma (UCC), which generally originates in the lower urinary tract such as the bladder and urethra but can also develop in upper urinary structures like the renal pelvis and ureters, is the predominant subtype of BLCA. Clinically, bladder cancer is classified into non-muscle-invasive bladder cancer (NMIBC) and muscle-invasive bladder cancer (MIBC) according to tumor penetration depth. NMIBC includes early-stage lesions limited to the mucosa or submucosa, such as carcinoma in situ (CIS; Tis), non-invasive papillary tumors (Ta), and tumors that extend into the lamina propria (T1)2. In comparison, MIBC, identified by detrusor muscle invasion and classified as T2-T4 disease, carries a greater likelihood of lymph node spread and distant metastasis3. This underscores the importance of early detection of metastatic MIBC to improve treatment strategies and patient outcomes.

MIBC is highly aggressive, with a strong tendency for both local and distant spread, greatly elevating mortality risk in affected individuals. Around 50% of MIBC cases show micrometastases at the time of diagnosis, primarily involving lymph nodes, lungs, liver, and bones4. Such metastatic occurrences are linked to a poor outlook, as the five-year survival rate for metastatic MIBC patients remains low, between 5% and 15%5. Additionally, 20–30% of those initially diagnosed with NMIBC experience disease progression to MIBC during the disease course6. The primary treatment for MIBC includes cisplatin-based neoadjuvant chemotherapy (NAC), followed by radical cystectomy (RC) and pelvic lymph node dissection (PLND)3. However, RC is a highly invasive surgery with notable perioperative risks, so a significant proportion of MIBC patients are considered ineligible for the procedure7,8. Furthermore, postoperative complications, especially those related to urinary diversion, can significantly diminish quality of life (QoL), making RC an unfavorable choice for some. Although recent urinary assays have enhanced early detection of bladder cancer, treatment options for MIBC remain limited, and patient prognosis often remains poor9. Trimodality therapy (TMT), which includes maximal transurethral resection of the bladder tumor (TURBT) followed by concurrent chemoradiotherapy, has emerged as a promising option for patients with T2N0M0 disease10. Previously, MIBC patients unfit for RC due to physical or psychological constraints would choose TMT, but it is generally limited to those who meet certain clinical criteria11. This highlights the urgent need to further investigate the risk factors influencing MIBC metastasis and to advance therapeutic strategies that enhance patient survival.

Recently, nomograms have gained popularity for prognosis prediction, though their sensitivity and specificity can be limited. Machine learning (ML), a branch of artificial intelligence, offers a practical alternative due to its strength in managing complex, non-linear relationships, making it especially effective for large datasets and unstructured information12. With advances in big data analytics, creating new tools to identify risk factors impacting metastasis and prognosis has become achievable. However, no ML-based model currently exists for predicting metastasis in MIBC, highlighting the need to develop and validate an accurate model. The Surveillance, Epidemiology, and End Results (SEER) database (https://seer.cancer.gov/) provides extensive clinicopathological data and follow-up records for cancer patients, making it an invaluable asset for developing and validating ML models in healthcare. Although ML is widely applied across numerous fields, there is still a significant lack of research dedicated to forecasting metastasis and prognosis in MIBC patients. This study seeks to fill this gap by creating and validating predictive models with various ML algorithms, offering clinicians personalized tools to more precisely evaluate metastasis risk and prognostic outcomes for individuals with MIBC.

Materials and methods

Data collection and patient demographics

Clinicopathological information of MIBC patients from 2000 to 2020 was sourced from the SEER database. Additionally, retrospective data from The First Affiliated Hospital of Henan University of Science and Technology (2010–2020) were gathered through its electronic medical records system. This study adhered to the Declaration of Helsinki and received approval from the Ethics Committee of The First Affiliated Hospital of Henan University of Science and Technology (protocol code: 2024-03-K0144, approved on 2024-03-20), with written informed consent obtained from all participants. Inclusion criteria specified T2-T4 MIBC patients with an initial diagnosis of bladder cancer (C67.0-C67.9) per the Third Edition of the International Classification of Diseases for Oncology (ICD-O-3) and patients over 18 years old. Exclusion criteria included cases missing follow-up information on survival and cause of death, survival of less than one month, lack of histological confirmation, absence of surgical data, and incomplete TNM staging or grading. A one-month landmark was chosen to exclude patients who died early, allowing a more accurate evaluation of the long-term effects of clinical factors among patients who survived to that time point. Metastasis was defined as the spread to distant sites at initial diagnosis.

We gathered patient data covering demographics (age, gender, race, marital status, household location, and income), disease attributes (pathological grade, TNM stage, tumor size, primary tumor location, pathology, and metastasis details), treatment information (surgery, lymph node surgery, lymph node biopsy, radiotherapy, chemotherapy), and follow-up outcomes (overall survival (OS) and cancer-specific survival (CSS) status, survival duration in months). Age and tumor size, both continuous variables, were transformed into categorical variables based on established clinical practices and prior research findings. Age was grouped into “<40,” “40–59,” “60–69,” “70–79,” and “>=80” categories13,14, while tumor size was categorized as “<=2 cm,” “2.1-4 cm,” “4.1-6 cm,” “>6 cm,” and “Unknown”15,16,17. Tumor size was measured based on imaging and pathology reports according to SEER database guidelines. Cases were classified as having “Distant Metastasis” if metastasis was identified in the brain, bone, liver, lung, or distant lymph nodes during follow-up, or if tumors were verified as M1 stage via pathology or imaging at diagnosis. Missing data from either the database or hospital records were categorized as “Unknown.” The minimum sample size required for the external validation cohort was calculated using Riley et al.’s formula18.
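The categorization of age and tumor size described above can be sketched as follows. This is a minimal illustration rather than the study's own preprocessing code: the function names are hypothetical, the labels use plain hyphens, and treating a missing tumor size as `None` is an assumption.

```python
def age_group(age_years):
    """Map age in years to the five age categories quoted in the text."""
    if age_years < 40:
        return "<40"
    if age_years < 60:
        return "40-59"
    if age_years < 70:
        return "60-69"
    if age_years < 80:
        return "70-79"
    return ">=80"


def tumor_size_group(size_cm):
    """Map tumor size in cm to the quoted categories.

    Assumption: a missing measurement is passed as None and is
    labelled "Unknown", mirroring how missing data are described.
    """
    if size_cm is None:
        return "Unknown"
    if size_cm <= 2:
        return "<=2 cm"
    if size_cm <= 4:
        return "2.1-4 cm"
    if size_cm <= 6:
        return "4.1-6 cm"
    return ">6 cm"
```

Keeping “Unknown” as its own level, instead of dropping or imputing those records, matches the handling of missing data stated in the text.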

Model development and validation for predicting distant metastasis

To identify predictive variables, we employed univariate and multivariate logistic regression analyses in the training cohort, retaining variables with P-values below 0.05 for subsequent feature selection. We then applied the recursive feature elimination (RFE) technique to perform feature selection based on six ML algorithms, namely categorical boosting (CatBoost), random forest (RF), support vector machine (SVM), extreme gradient boosting (XGB), decision tree (DT), and gradient boosting machine (GBM), with 10-fold cross-validation19. RFE operates by training an ML model and ranking features based on their predictive strength, iteratively evaluating and removing features until all are assessed and prioritized20. We applied the Robust Rank Aggregation (RRA) method to consolidate the importance ranks of variables across the six ML algorithms in the RFE process, yielding an integrated ranking of clinical variables21. The random seed was set to “123”. After feature selection, we developed the model using 13 ML algorithms, including CatBoost, RF, SVM, XGB, DT, GBM, k-nearest neighbor (KNN), logistic regression (LR), naive Bayes classifier (NBC), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), neural network (NNET), and generalized linear model (GLM), instead of building a mixed model, using the “mlr3” R package22. This approach enabled us to compare each algorithm’s performance to identify the most effective predictive model for distant metastasis. To address potential class imbalance, which could bias performance metrics, we used the Synthetic Minority Over-sampling Technique (SMOTE) during model training23. We refined the model development by applying nested resampling, which utilized a two-level k-fold cross-validation structure: an inner layer for hyperparameter optimization and an outer layer for performance estimation and model selection. Additionally, we conducted a random search with 1,000 evaluations within a 10-fold cross-validation framework, repeating the process five times for each model.
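To make the oversampling step concrete, the core idea of SMOTE can be sketched in a few lines. This is an illustrative simplification, not the implementation used in the study: it assumes numeric feature vectors and Euclidean distance, whereas the text does not state how categorical predictors were encoded. The function name and defaults are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=123):
    """Create n_new synthetic minority samples.

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by distance to the base sample.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

In a resampling workflow, oversampling of this kind is normally applied to the training portion of each fold only, so that synthetic cases never enter the data used for evaluation.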

Subsequently, metrics including area under the curve (AUC), area under the precision-recall curve (PRAUC), accuracy, sensitivity, specificity, precision, cross-entropy, Brier score, balanced accuracy (bacc), and F-beta score (beta = 1) were calculated to identify the top-performing ML model. For internal validation, we employed 10-fold cross-validation. Precision-recall curves (PRC) were generated to evaluate each model’s handling of imbalanced data, while calibration curves assessed the agreement between predicted and observed probabilities. Decision curve analysis (DCA) was used to gauge the clinical benefits of the ML model, implemented through the “runway” R package (https://github.com/ML4LHS/runway/). The best model was selected based on high AUC, high PRAUC, low Brier score, an accurate calibration curve, and high balanced accuracy and F-beta score (beta = 1). To determine each variable’s significance, we calculated its mean contribution to the AUC as a percentage of the full model using the “DALEX” R package24. SHapley Additive exPlanations (SHAP) values, visualized in a bee swarm plot via the “shapviz” R package (https://github.com/ModelOriented/shapviz)25, were applied to interpret the black-box ML model.
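Several of the metrics listed above follow directly from the confusion matrix and the predicted probabilities. The sketch below uses the standard textbook definitions; it is not the code behind the reported results. The 0.5 classification threshold and the function names are assumptions, and the metric helper assumes that each class is both present and predicted at least once.

```python
def confusion(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) for binary labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold-based metrics plus the Brier score."""
    tp, fp, tn, fn = confusion(y_true, y_prob, threshold)
    n = len(y_true)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sens,
        "specificity": spec,
        "precision": prec,
        "bacc": (sens + spec) / 2,
        "f1": 2 * prec * sens / (prec + sens),
        "brier": sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / n,
    }


def auc(y_true, y_prob):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at one threshold probability."""
    tp, fp, _, _ = confusion(y_true, y_prob, threshold)
    n = len(y_true)
    return tp / n - fp / n * threshold / (1 - threshold)
```

Plotting `net_benefit` over a range of threshold probabilities yields the decision curve. AUC and the Brier score do not depend on the classification threshold, which is one reason they are useful when classes are imbalanced.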

Integrating ML algorithms for prognosis prediction

Univariate and multivariate Cox analyses were conducted to identify clinical variables with P-values below 0.05 that could predict overall survival (OS). We applied 10 ML algorithms, including random survival forest (RSF), elastic network (Enet), Lasso, Ridge, stepwise Cox, CoxBoost, partial least squares regression for Cox (plsRcox), supervised principal components (SuperPC), GBM, and survival support vector machine (survival-SVM), to estimate prognosis in terms of OS for MIBC patients with distant metastasis. In total, 101 ML algorithm configurations were tested on the training set, using a leave-one-out cross-validation (LOOCV) framework to identify the optimal prognostic model. Models utilizing fewer than three clinical variables were excluded from consideration. The concordance index (C-index) for each ML configuration was then computed across training, internal validation, and external validation sets. The top five ML configurations, achieving the highest average C-index across the three datasets, were selected for subsequent assessment using k-fold cross-validation to address overfitting and ensure model stability and generalizability. Logarithmic loss, recall, and decision calibration were calculated to identify the most effective prognostic ML model using the “mlr3proba” R package26. Patient risk scores were derived through a linear combination for each prognostic ML configuration. The median risk score in the training set was used as a threshold, allowing us to categorize patients in the training, internal validation, and external validation sets into high- or low-risk groups. Kaplan-Meier (KM) survival analysis and log-rank testing were conducted with the “survival” and “survminer” R packages. Additionally, AUC, time-dependent ROC curves, calibration plots, and DCA were performed to evaluate the precision, discrimination capability, and clinical value of the optimal model.
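The two quantities central to this step, the concordance index and the median-based risk grouping, can be illustrated as below. The C-index follows Harrell's usual pairwise definition; the study's packages may handle ties differently. Assigning scores exactly equal to the median to the low-risk group is an assumption, and the function names are hypothetical.

```python
import statistics


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before subject j's follow-up time. It is
    concordant when subject i also has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


def risk_groups(train_risks, risks):
    """Label patients high or low risk using the training-set median."""
    cutoff = statistics.median(train_risks)
    return ["high" if r > cutoff else "low" for r in risks]
```

Taking the cutoff from the training set and reusing it unchanged for the validation sets keeps the validation cohorts from influencing their own stratification.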

Results

Demographic and clinical baseline characteristics

To develop a predictive model for distant metastasis, data from 43,951 MIBC patients in the SEER database and 125 MIBC patients from The First Affiliated Hospital of Henan University of Science and Technology (used as the external validation set) were collected. SEER patients were randomly divided into training and internal test sets with a 7:3 split. In the training cohort, 2,513 patients (8.17%) exhibited distant metastasis, while 28,253 patients (91.8%) did not. The internal test cohort included 1,048 patients (8.02%) with metastasis and 12,012 patients (92.0%) without it. Detailed clinical characteristics for both groups are provided in Table 1.

Table 1 Clinicopathological characteristics of MIBC patients in the training, internal validation and external validation cohorts.

To construct a prognostic model for predicting outcomes in MIBC patients with distant metastasis, patients from the SEER database with distant metastasis were similarly divided into training and internal test sets in a 7:3 ratio. An additional 100 MIBC patients with distant metastasis from The First Affiliated Hospital of Henan University of Science and Technology served as the external validation group. In the training set, 2,394 patients (95.5%) had died by the last follow-up, while 114 (4.55%) remained alive (Table 2). For the internal test set, 1,016 patients (96.5%) had died, and 37 (3.51%) were alive (Table 2). Median follow-up durations for the training, internal, and external validation sets were 7.00 [3.00;15.0] months, 7.00 [3.00;14.0] months, and 7.00 [2.00;14.0] months, respectively (Table 2). Figure 1 outlines the selection process for MIBC patients from the SEER database.

Table 2 Clinicopathological characteristics of MIBC patients with distant metastasis in the training, internal validation and external validation cohorts.
Fig. 1

The workflow diagram for study design and patient screening.

Feature selection for the predictive model

We began by calculating the correlation coefficients of baseline clinical variables and visualizing the results with a lollipop chart, which indicated that “N stage” had the strongest association with metastasis (Fig. 2A). Based on clinical expertise, 19 variables were selected for logistic regression analysis (Table 3), while variables with correlation coefficients above 0.6 were excluded (Supplementary Fig. 1A). Next, univariate and multivariate logistic regression analyses were conducted on the training set, identifying “Age,” “Primary Site,” “Histology,” “Grade,” “T Stage,” “N Stage,” “Surgery Type,” “Lymph Node Surgery,” “Lymph Node Biopsy,” “Surgery in Other Sites,” “Radiotherapy,” “Chemotherapy,” and “Tumor Size” as significant predictors for metastasis (P < 0.05, Table 3; Fig. 2C). Further correlation analysis between model variables and metastasis highlighted “N stage” as the most influential predictor (Fig. 2B). We applied RFE with six ML algorithms (GBM, SVM, RF, DT, XGB, and CatBoost) to identify the relevant clinical features (Fig. 2D–I). RFE analysis revealed that RF was the optimal model, retaining 11 variables and achieving the highest AUC of 0.844 (Fig. 2F). The RRA method was then used to create an overall ranking of each clinical variable across the six ML algorithms, with “Age” emerging as the most influential factor (Supplementary Table 1). Given the strong associations between “Surgery Type,” “Lymph Node Surgery,” “Lymph Node Biopsy,” and “Surgery in Other Sites,” only “Surgery Type” and “Surgery in Other Sites” were retained for further analysis (Supplementary Fig. 1A). Ultimately, nine clinical variables with a frequency greater than three in RRA analysis, indicating substantial importance in most ML selection methods, were included in the subsequent model development steps (Supplementary Table 1). In summary, we selected the final model variables through logistic regression analysis, RFE based on six ML algorithms, and the RRA algorithm, which supports the robustness of our feature selection process.
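The frequency criterion used in the last step can be illustrated with a simple count of how many selection methods retained each variable. This sketch is one plausible reading of “frequency greater than three” and is deliberately not the RRA algorithm itself, which scores rank positions statistically. The function name is hypothetical.

```python
from collections import Counter


def frequent_features(selected_by_model, min_count=3):
    """Keep features retained by more than min_count selection methods.

    selected_by_model maps a method name to the features it retained.
    Returns the kept features (most frequent first, then alphabetical)
    together with the full frequency table.
    """
    counts = Counter(
        feature
        for features in selected_by_model.values()
        for feature in set(features)
    )
    kept = sorted(
        (f for f, c in counts.items() if c > min_count),
        key=lambda f: (-counts[f], f),
    )
    return kept, counts
```

A frequency filter of this kind favors variables that several algorithms agree on, which is the sense in which the selection is described as robust.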

Fig. 2

The process of feature selection. (A) The correlation coefficients of the baseline characteristics with distant metastasis. (B) The heatmap of Spearman’s correlation analysis among the clinical variables and distant metastasis. The correlation index ranges from -1.0 to 1.0, with a brighter color indicating a stronger correlation. (C) The forest plot visualized the feature selection with multivariate logistic regression analysis. (D-I) Feature selection process with Recursive Feature Elimination (RFE) method based on six ML algorithms (GBM, SVM, RF, DT, XGB and CatBoost).

Table 3 Univariate and multivariate logistic regression analyses of MIBC patients for distant metastasis in the training cohort.

Design and assessment of the predictive model

To accurately predict metastasis, “Age,” “Chemotherapy,” “N Stage,” “Histology,” “Radiotherapy,” “Surgery in Other Sites,” “Surgery Type,” “Tumor Size,” and “T Stage” were selected through the feature selection process. Thirteen ML algorithms, including CatBoost, RF, SVM, XGB, DT, GBM, KNN, LR, NBC, LDA, QDA, NNET, and GLM, were employed to build the predictive model in the training set, with hyperparameters fine-tuned via 10-fold cross-validation and random search. Model performance was then evaluated in both internal and external validation sets. ROC curves (Figs. 3A, 4A and 5A) and AUC values (Figs. 3B, 4B and 5B) highlighted CatBoost as the leading model, achieving the highest AUC scores of 0.956 [0.933, 0.969], 0.882 [0.857, 0.919], and 0.839 [0.723, 0.936] across the training, internal test, and external test sets. To thoroughly evaluate model performance, we calculated accuracy, sensitivity, specificity, precision, cross-entropy, Brier scores, balanced accuracy (bacc), and F-beta score (beta = 1) across the 13 ML models. Results indicated that CatBoost was the most precise and reliable for predicting distant metastasis (Figs. 3C, 4C and 5C). CatBoost achieved outstanding metrics with an accuracy of 0.875 [0.854, 0.896], sensitivity of 0.869 [0.851, 0.889], specificity of 0.883 [0.823, 0.912], and precision of 0.917 [0.885, 0.944] (Fig. 3C). Random search during hyperparameter tuning identified optimal CatBoost settings: depth of 4, learning rate of 0.03927546, iterations of 669, and leaf regularization of 6.098357. Precision-recall curves (PRC) demonstrated CatBoost’s effectiveness in managing imbalanced data (Figs. 3D, 4D and 5D). Calibration curves demonstrated that the CatBoost model excelled in both fit and predictive accuracy (Figs. 3E, 4E and 5E). DCA indicated that CatBoost provided the greatest benefit for predicting distant metastasis (Figs. 3F, 4F and 5F). Ten-fold cross-validation further confirmed that CatBoost surpassed other ML algorithms (Supplementary Fig. 1B). Thus, CatBoost was identified as the most effective algorithm for building the ML model to predict distant metastasis, supporting clinical decision-making.
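The hyperparameter tuning mentioned above can be pictured as a random search over the four CatBoost settings that are reported. The sketch below is illustrative only: the search ranges are assumptions, since the text does not give them, and `score_fn` is a hypothetical callback that in practice would return a cross-validated metric such as mean AUC.

```python
import math
import random


def sample_config(rng):
    """Draw one candidate configuration (ranges are illustrative assumptions)."""
    return {
        "depth": rng.randint(2, 10),
        # Learning rate sampled on a log scale between 0.001 and 0.3.
        "learning_rate": math.exp(rng.uniform(math.log(0.001), math.log(0.3))),
        "iterations": rng.randint(100, 1000),
        "l2_leaf_reg": rng.uniform(1.0, 10.0),
    }


def random_search(score_fn, n_evals=1000, seed=123):
    """Evaluate n_evals random configurations and return the best one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, -math.inf
    for _ in range(n_evals):
        cfg = sample_config(rng)
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Because each configuration is drawn independently, values such as a learning rate of 0.03927546 arise naturally from continuous sampling and would not appear on a fixed grid.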

Fig. 3

Establishment and evaluation of the ML models in the training set. (A) ROC curves of different ML models in the training set. (B) AUC values of different ML models in the training set. (C) The performances of 13 ML models in terms of AUC, accuracy, sensitivity, specificity, precision, cross-entropy, Brier scores, Balanced Accuracy (bacc) and F1 Score in the training set. (D) PR curves of different ML models in the training set. (E) Calibration curves of different ML models in the training set. (F) DCA curves of different ML models in the training set. ML, machine learning; CAT, categorical boosting; LR, logistic regression; DT, decision tree; RF, random forest; XGB, extreme gradient boosting; GBM, gradient boosting machine; NB, Naive Bayes; LDA, linear discriminant analysis; QDA, quadratic discriminant analysis; NNET, neural network; GLMNET, generalized linear models with elastic net regularization; SVM, support vector machine; KNN, k-nearest neighbor.

Fig. 4

Evaluation of the ML models in the internal validation set. (A) ROC curves of different ML models in the internal validation set. (B) AUC values of different ML models in the internal validation set. (C) The performance of 13 ML models in terms of AUC, accuracy, sensitivity, specificity, precision, cross-entropy, Brier scores, Balanced Accuracy (bacc) and F1 Score in the internal validation set. (D) PR curves of different ML models in the internal validation set. (E) Calibration curves of different ML models in the internal validation set. (F) DCA curves of different ML models in the internal validation set. ML, machine learning; CAT, categorical boosting; LR, logistic regression; DT, decision tree; RF, random forest; XGB, extreme gradient boosting; GBM, gradient boosting machine; NB, Naive Bayes; LDA, linear discriminant analysis; QDA, quadratic discriminant analysis; NNET, neural network; GLMNET, generalized linear models with elastic net regularization; SVM, support vector machine; KNN, k-nearest neighbor.

Fig. 5

Evaluation of the ML models in the external validation set. (A) ROC curves of different ML models in the external validation set. (B) AUC values of different ML models in the external validation set. (C) The performance of 13 ML models in terms of AUC, accuracy, sensitivity, specificity, precision, cross-entropy, Brier scores, Balanced Accuracy (bacc) and F1 Score in the external validation set. (D) PR curves of different ML models in the external validation set. (E) Calibration curves of different ML models in the external validation set. (F) DCA curves of different ML models in the external validation set. ML, machine learning; CAT, categorical boosting; LR, logistic regression; DT, decision tree; RF, random forest; XGB, extreme gradient boosting; GBM, gradient boosting machine; NB, Naive Bayes; LDA, linear discriminant analysis; QDA, quadratic discriminant analysis; NNET, neural network; GLMNET, generalized linear models with elastic net regularization; SVM, support vector machine; KNN, k-nearest neighbor.

Model interpretation

To provide interpretability for the ML models, we calculated feature importance scores for each model, which were derived based on each algorithm’s intrinsic properties, showing that “N Stage” was the most influential risk factor for metastasis (Fig. 6A). We then applied the SHAP framework to further explain the top-performing CatBoost model. By evaluating the mean absolute SHAP values, it was evident that “Tumor size” had the greatest predictive power (Fig. 6B). Additionally, a bee swarm plot was used to illustrate the effect of each risk factor on metastasis (Fig. 6C). The y-axis indicates the strength of each risk factor, while the x-axis shows their impact on the model output, specifically the probability of metastasis, as quantified by SHAP values. The SHAP analysis revealed that factors such as absence of surgery, high N stage, advanced T stage, older age, lack of surgery in additional sites, and no radiotherapy are associated with a higher probability of metastasis (Fig. 6C). To demonstrate model interpretability, we examined two representative patients. SHAP values were employed to evaluate the influence of each feature on the model’s predictions. In this study, lower SHAP values corresponded to a reduced likelihood of metastasis, while higher values indicated an increased probability. We set the median SHAP value (0.0359) as the threshold for distinguishing between low and high metastasis risk. For example, the first patient, who did not experience metastasis, had a SHAP value and prediction score of -0.0143, suggesting a low metastasis risk (Fig. 6D). Conversely, the second patient, with confirmed metastasis, showed a SHAP value and prediction score of 0.0984, indicating a high probability of metastasis (Fig. 6E).
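The additive logic behind the per-patient explanations can be shown with an exact Shapley computation on a toy model. This is a didactic sketch, not the estimator used by the “shapviz” workflow: it enumerates every feature coalition, which is feasible only for a handful of features, and it replaces features outside a coalition with a fixed baseline value. Both are simplifying assumptions, and the function name is hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x.

    Features outside a coalition take their baseline value. The
    returned values sum to predict(x) - predict(baseline).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi
```

The per-feature contributions plus the baseline prediction reproduce the model output for that patient, which is what the individual explanation plots display.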

Fig. 6

ML model interpretation. (A) Importance ranking of features in 13 ML prediction algorithms. (B) The importance ranking of different variables according to the mean (|SHAP value|) in the optimal CatBoost model. (C) Bee swarm plot of SHAP values for the risk factors predicting metastasis in the optimal CatBoost model. The higher the SHAP value of a feature, the higher the patient’s risk of distant metastasis. Yellow indicates higher feature values. (D) SHAP value explanation for a representative sample without distant metastasis. (E) SHAP value explanation for a representative sample with distant metastasis.

Prognostic model for metastatic MIBC patients

To build an accurate model for predicting OS in MIBC patients with metastasis, we performed univariate and multivariate Cox analyses. These analyses identified “Histology,” “T Stage,” “N Stage,” “Surgery Type,” “Lymph Node Surgery,” “Chemotherapy,” “Tumor Size,” “Bone Metastasis,” “Brain Metastasis,” “Liver Metastasis,” and “Lung Metastasis” as independent prognostic factors for OS in metastatic MIBC (P < 0.05, Table 4). Using the clinical variables identified, we generated 101 prognostic ML combinations within a LOOCV framework. The C-index of each model was calculated across the training, internal validation, and external validation sets (Fig. 7A). Among the top five combinations with the highest C-index across all datasets, we assessed model performance through logarithmic loss, recall, and decision calibration, which highlighted the “RSF + Enet [alpha = 0.8]” model for its strong calibration and accuracy (Supplementary Fig. 1C). The optimal model was “RSF + Enet [alpha = 0.8],” employing RSF for feature selection (Fig. 7B) and an elastic net with alpha set to 0.8 for model construction (Fig. 7C), achieving the highest average C-index (0.679) across the three datasets (Fig. 7A). In this model, “Chemotherapy” was the most influential factor in both the RSF and Enet [alpha = 0.8] algorithms (Fig. 7B, C). Using risk scores calculated by Enet [alpha = 0.8], we set the median risk score in the training cohort as a threshold to classify patients into high- and low-risk groups. Kaplan-Meier (K-M) curves validated the model’s effectiveness in risk stratification, showing that low-risk patients had significantly longer OS than high-risk patients in the training, internal validation, and external validation cohorts (Fig. 7D). Additionally, ROC curves for 1-, 3-, and 5-year OS highlighted the model’s strong specificity (Fig. 7E). Time-dependent ROC curves showed that the “RSF + Enet [alpha = 0.8]” model consistently outperformed others across most time points, demonstrating superior discrimination and predictive power over traditional clinical variables (Fig. 7F). Calibration curves (Fig. 7G) and DCA curves (Fig. 7H) further highlighted the model’s accuracy and clinical value. Together, these evaluation metrics confirm that the “RSF + Enet [alpha = 0.8]” model provides a robust and superior tool for predicting OS in MIBC patients with metastasis, outperforming other ML models in reliability and predictive performance.
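The Kaplan-Meier curves used to compare the risk groups rest on the product-limit estimator, sketched below for a single group. This is a minimal illustration of the standard estimator, not the “survival” package implementation; the function names are hypothetical, and confidence intervals and the log-rank test are omitted.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up in months; events: 1 = death observed, 0 = censored.
    Returns a list of (time, survival probability) at each event time.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Everyone whose follow-up ends at t leaves the risk set.
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached within follow-up
```

Applying `kaplan_meier` separately to the high-risk and low-risk groups gives the two curves that are compared in each cohort.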

Table 4 Univariate and multivariate Cox regression analyses of MIBC patients with distant metastasis for overall survival in the training cohort.
Fig. 7

Establishment and validation of the prognostic model for MIBC patients with distant metastasis. (A) C-index of each of the 101 prognostic models built within a leave-one-out cross-validation framework. (B) Feature selection process by RSF algorithm. (C) Model construction by Enet [alpha = 0.8] algorithm and visualization of feature importance. (D) K-M curves of low-risk and high-risk groups divided by ML model in training, internal validation and external validation cohorts. (E) ROC curves of ML model in training, internal validation and external validation cohorts. (F) Time-dependent AUC values of ML model in training, internal validation and external validation cohorts. (G) Calibration curves of ML model in training, internal validation and external validation cohorts. (H) DCA curves of ML model in training, internal validation and external validation cohorts. Left: training cohort, Middle: internal validation cohort, Right: external validation cohort.

Discussion

Muscle-invasive bladder cancer (MIBC) is a particularly aggressive form of bladder cancer, known for its ability to invade the detrusor muscle. Once this barrier is breached, the risk of metastasis increases significantly, leading to a poorer outlook for patients. Metastasis typically spreads through lymphatic and vascular pathways, often reaching organs like the lungs, liver, and bones. The development of metastatic disease is associated with sharply reduced survival rates, as treatment options become increasingly limited and less effective27. Therefore, accurately predicting metastasis and identifying prognostic risk factors are crucial for managing MIBC. This study aimed to develop innovative ML models to predict metastasis at an early stage and to assess prognosis in MIBC patients with metastasis. By collecting clinical data on key variables and building ML models through a benchmark framework, we generated risk scores to aid in both metastasis prediction and prognosis estimation, supporting precise risk stratification. The model evaluates clinical and demographic factors to generate a risk score for metastasis and prognosis, assisting clinicians in decision-making and personalized treatment planning.

Previous research has identified various biomarkers, molecular subtypes, immune cell infiltration patterns, and protein expression levels as possible indicators of metastasis in MIBC. For example, combining CD8+ T cell infiltration with molecular subtype analysis has revealed that patients with the Basal-CD8+ T low subtype tend to have poorer OS and a higher risk of metastasis compared to other subtypes28. Matrix metalloproteinase-14 (MMP14), associated with the basement membrane, has emerged as a critical marker linked to poor prognosis and heightened metastatic risk in MIBC. Elevated MMP14 expression correlates with aggressive tumor behavior and reduced treatment response29. Additionally, recent studies using deep learning models have shown that lymphocytic inflammation within the tumor stroma may predict lymph node metastasis in MIBC, underscoring the significance of tumor microenvironment features in metastasis prediction30. These findings imply that integrating molecular, immune, and histological biomarkers could improve the precision of metastasis predictions, assisting clinicians in creating more targeted treatment strategies.

Our research emphasizes the use of clinical variables for predicting metastasis, noting that older MIBC patients typically experience higher rates of metastasis and mortality compared to younger patients. This is partially due to their reduced likelihood of undergoing aggressive treatments like NAC and RC31. Additionally, the T stage, reflecting the degree of tumor invasion into the bladder wall and nearby tissues, plays a crucial role in influencing the risk of distant metastasis in MIBC3. As the T stage advances from T2 (invasion of the muscle layer) to T4 (invasion into nearby organs or tissues), the risk of metastasis rises. Higher T stages, particularly T3 and T4, are linked to an increased likelihood of cancer spreading to distant organs, including the liver, lungs, and bones. This underscores the importance of precise staging for prognosis and guiding treatment choices. The N stage, indicating regional lymph node involvement, is a critical factor in assessing the risk of distant metastasis for MIBC patients. Those with lymph node metastasis (N1-N3) face a substantially higher chance of developing distant metastasis than patients without lymph node involvement (N0). This relationship between lymph node status and distant spread highlights the importance of precise lymph node staging to inform the selection of neoadjuvant or adjuvant therapies, ultimately aiming to improve patient outcomes3. These three clinical variables, namely age, T stage, and N stage, were not only significantly predictive of metastasis in our ML model but also informative in non-ML nomograms, confirming their substantial impact on clinical decision-making32.

The choice of surgical treatment for MIBC has a major impact on the risk of metastasis and long-term survival outcomes. RC combined with PLND is considered the gold standard, but this combination is not frequently performed in clinical practice. Research has indicated that RC and bladder-preserving trimodality therapy (TMT) could offer equivalent OS and CSS, particularly in T2N0M0 disease10. Although short-term results (1–2 years) show minimal differences between TMT and RC, RC provides better outcomes at 5-year and 10-year follow-ups, especially in lowering the risk of metastasis33,34. For patients unable to undergo RC due to health conditions or personal preference, bladder-preserving options like TMT remain feasible, although they may carry a greater risk of recurrence and metastasis35,36. Tumor size is also a crucial factor in determining prognosis and the likelihood of metastasis in MIBC. Larger tumors are linked to higher metastasis risks and poorer OS. Research has shown that patients with larger tumors are often at more advanced disease stages, which correlates with an increased chance of distant spread37. Additionally, tumor size, along with factors like pathological response to treatment, is frequently used to estimate long-term survival and recurrence risks in patients undergoing RC or NAC38.

The value of our study lies in its potential to enhance management and treatment strategies for patients with MIBC. By providing a dependable risk stratification tool, the model supports clinicians in making tailored and data-informed decisions for treatment. For example, patients deemed high-risk for metastasis could be prioritized for more intensive surgical options, supplemental therapies, and rigorous post-operative monitoring, which may contribute to improved survival rates. On the other hand, patients classified as low-risk could benefit from less aggressive treatments, reducing the likelihood of unnecessary side effects and complications linked to overtreatment. Additionally, the model’s predictions can help tailor adjuvant therapy choices, set appropriate follow-up intervals, and determine when extra lab tests are needed. Incorporating this predictive tool into clinical workflows is likely to enhance data-driven decision-making, improving both patient outcomes and resource efficiency. Furthermore, it can aid in standardizing care across providers and institutions, thereby minimizing variation in treatment practices and outcomes for MIBC patients.

Although developing an ML model for predicting metastasis is an innovative attempt, it should be noted that the model’s clinical utility depends on the interpretability and practicality of the selected variables39. One of the main strengths of our study is the use of interpretable ML algorithms, particularly through SHAP values and the DALEX framework, to pinpoint crucial factors affecting metastasis predictions. The CatBoost algorithm, a gradient boosting method built on symmetric decision trees (oblivious trees), delivers high accuracy with minimal parameter tuning and excels at handling categorical data40. CatBoost’s performance is on par with other advanced ML techniques, proving effective across various applications. However, the model’s “black-box” nature calls for interpretability, which we provided using SHAP summary plots and force plots. These visualization tools give clinicians a straightforward way to understand the primary factors driving the model’s predictions, making its insights more accessible for informed clinical decisions. Furthermore, incorporating advanced techniques, such as RFE with cross-validation (RFECV), GridSearchCV for hyperparameter tuning, and SMOTE oversampling to address class imbalance, has improved the model’s accuracy in predicting metastasis. With this ML model, clinicians can create individualized treatment plans, allowing for timely adjustments in therapy that enhance diagnostic and prognostic outcomes for MIBC patients.
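To illustrate the idea behind SMOTE oversampling, the following is a minimal sketch in plain Python and not the implementation used in this study. It generates each synthetic minority sample by interpolating between a minority-class sample and one of its k nearest minority-class neighbours. The function name `smote`, its parameters, and the numeric feature rows (loosely standing for age, T stage, and N stage) are hypothetical and chosen for illustration only; treating staged variables as continuous is a simplification.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples built from the minority-class rows.

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class rows: [age, T stage, N stage] (illustrative values).
metastatic = [[71.0, 3.0, 1.0], [68.0, 4.0, 2.0], [75.0, 3.0, 0.0], [80.0, 4.0, 1.0]]
new_rows = smote(metastatic, n_new=6, k=2)
```

Because every synthetic row is a convex combination of two real minority rows, its feature values stay within the range observed in the minority class. In practice, oversampling should be applied to the training folds only, so that synthetic samples never leak into the data used for evaluation.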

Although this study has several strengths, it also has some limitations. First, approximately 8–9% of MIBC patients in the SEER database developed metastasis, which is lower than anticipated based on general survival trends. The primary reason for this discrepancy lies in the limitations of the retrospective database. In retrospective studies, reporting of metastasis may be incomplete, particularly in patients who transition to palliative care. Although the SEER database contains systematic follow-up data, metastatic screening and documentation were not consistently applied across all cases. Some patients may have been lost to follow-up or received treatment at other institutions, which may have led to an underrepresentation of metastatic cases. We acknowledge that this limitation may impact the generalizability of our results, particularly when assessing overall prognosis and metastasis-related risks. Future studies should aim to include a more representative sample of metastatic patients, or explore ways to improve metastatic screening and follow-up processes to minimize such biases. Second, over 50% of MIBC patients in the SEER database received only ‘Local tumor destruction/excision,’ with a subset (17%) also undergoing radiotherapy. We acknowledge that this high proportion may influence overall dataset composition and outcomes, as these patients generally have a poor prognosis and shorter follow-up. However, we deliberately included these patients to ensure our study reflects real-world clinical practice, where certain patients, due to age, comorbidities, or treatment constraints, may receive limited interventions rather than standard radical treatment. Excluding them entirely could introduce selection bias and limit the generalizability of our findings. Nonetheless, we recognize that the inclusion of these patients may impact survival-related analyses and should be carefully considered when interpreting the results. Future studies may benefit from stratified analyses or alternative modeling approaches to account for the potential influence of treatment heterogeneity on outcomes.

In addition, while we determined the required sample size for external validation, the number of patients with complete follow-up data was limited, resulting in a smaller validation cohort than ideal. Acknowledging that larger sample sizes provide more reliable evaluations, we gathered the maximum feasible number of samples given current constraints. To mitigate this limitation, we used 10-fold cross-validation to obtain a more stable estimate of the model’s performance. Future studies will focus on enlarging the external validation cohort to further validate the model’s reliability. Moreover, differences in data collection methods across the hospitals contributing to the SEER database, along with the study’s retrospective design, led to some missing clinical variables. Additionally, essential clinicopathological details like imaging type and scope, comorbidities, inflammatory markers, genetic information, and occupational exposures were excluded due to their absence in the SEER database. Although we included a wide array of features to enhance predictive accuracy, this complexity might limit the model’s practicality in real-world clinical applications. Finally, the model has not yet been tested in clinical practice, highlighting the need for prospective, multicenter, large-scale validation studies to thoroughly evaluate its generalizability and suitability in real-world settings. Such future research will be essential for confirming the model’s effectiveness across varied patient groups and clinical contexts.
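The 10-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in this study: the helper name `k_fold_indices`, the shuffling step, the fixed seed, and the cohort size of 95 are assumptions made for the example. Each sample is held out exactly once, and the model would be refitted on the remaining nine folds each time.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out


# Hypothetical cohort of 95 patients split into 10 folds.
splits = list(k_fold_indices(95, k=10))
```

Averaging a performance metric over the ten held-out folds gives a less variable estimate than a single train/test split, although it does not replace validation on an independent external cohort.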

Conclusions

In this study, we identified key risk factors for predicting metastasis in MIBC patients and developed a CatBoost-based prediction model within an ML benchmark framework; this model outperformed both traditional clinical metrics and other ML models in accuracy and effectiveness. Additionally, we created an RSF + Enet [alpha = 0.8] prognostic model to reliably forecast outcomes for MIBC patients with metastasis. Our work provides a foundation for future advancements in metastasis prediction and prognosis assessment, aiding clinicians in making informed treatment decisions and personalizing therapy plans.