Introduction

Cancer immunotherapy harnesses the patient’s own immune system to recognize and eliminate cancer cells1. Immune checkpoint blockade(ICB) is a cancer immunotherapy that activates the immune system and enhances anti-tumor immune response by blocking immune checkpoint molecules, becoming one of the most effective types of cancer immunotherapy2. Inhibitors targeting checkpoint molecules such as cytotoxic T-lymphocyte-associated protein 4 (CTLA-4), programmed cell death 1 receptor (PD-1), and programmed cell death ligand 1 (PD-L1) have proven highly effective in improving survival rates for patients with advanced cancers, including melanoma and non-small cell lung cancer (NSCLC)3. Despite these successes, a substantial proportion of patients exhibit limited or no clinical benefit due to marked interpatient heterogeneity in treatment responses4,5,6. Consequently, the accurate prediction of patient responsiveness to ICB therapy has emerged as a critical challenge in contemporary biomedical research, driving significant efforts in the fields of cancer immunology, bioinformatics, and artificial intelligence.

From a biological perspective, ICB response is influenced by a combination of features6,7,8,9,10,11. Changes in blood neutrophil-to-lymphocyte ratio and eosinophile levels have been shown to correlate with the response to immune checkpoint blockade therapy6,12. Additionally, patients with malignant tumors of different ages exhibit varying levels of drug tolerance, which in turn affects their ICB response7. Microsatellite instability (MSI) states are also associated with higher ICB response rates and serve as both prognostic and predictive markers8. The combination of MSI status, age, and drug type collectively impacts treatment outcomes9. Furthermore, previous studies have indicated that PD-L1 expression is linked to microsatellite instability, C-reactive protein levels, and the blood neutrophil-to-lymphocyte ratio, all of which can influence ICB response in colorectal cancer patients13,14. Therefore, leveraging these biological features to predict ICB response in cancer patients remains a significant challenge.

Statistical machine learning methods have been successfully applied to predict ICB response in cancer patients15,16,17,18. For example, Anagnostou et al. proposed a comprehensive multivariate model that integrates features such as corrected tumor mutation burden (TMB), activated receptor tyrosine kinases in the tumor, smoking-related mutational signatures, and human leukocyte antigen status, to predict ICB response in cancer patients15. Wang et al. constructed a regulatory network of 11 immune cell clusters in metastatic melanoma by integrating biological pathway data and single-cell sequence data, using a logistic regression model of ligands and receptors within the network to predict ICB response16. Andrei S. Robin et al. developed a novel computational pipeline based on comparative Bayesian network analyses of immune networks, which detected strong signals that conventional methods might overlook17. Chowell et al. created a random forest classifier (RF16), which integrates 16 input features related to immunotherapy efficacy, improving the prediction of ICB response across various cancer types18.

Although previous studies15,16,17,18 have made significant strides in applying statistical and machine learning methods for predicting response to immune checkpoint blockade, they failed to account for the inherent heterogeneity characteristic of cancer data. In fact, patients with histologically identical cancer types frequently exhibit pronounced heterogeneity in tumor molecular profiles, encompassing genomic alterations, transcriptomic signatures, and immune microenvironment features. This interpatient heterogeneity substantially complicates treatment outcomes and poses significant challenges for predicting responses to immune checkpoint blockade. Tumor heterogeneity manifests its most prominent signature through multimodal distributions in cancer data–a characteristic that fundamentally violates the single-distribution assumption underpinning conventional machine learning frameworks. This paper proposed a novel heterogeneity-optimized machine learning framework designed to directly address the above challenge. The key advantages and contributions of our proposed framework are:

  • Heterogeneity-aware Clustering: Two latent patient subgroups were robustly identified through the application of K-means clustering to multimodal tumor data. These subgroups exhibited distinct tumor microenvironment (TME) profiles, corresponding to the established hot-tumor and cold-tumor phenotypes based on their significantly differential features. Furthermore, comparative analyses demonstrated the statistically significant superiority of the two-cluster K-means configuration over alternative K-means variants (K>2), hierarchical clustering (HC), and density-based spatial clustering of applications with noise (DBSCAN).

  • Heterogeneity-optimized Predictive Modeling:Seven key clinical and molecular features exhibiting significant associations with tumor heterogeneity were systematically identified. Based on these biomarkers, a support vector machine (SVM) model was specifically developed for the inflammatory hot-tumor subtype, while a random forest (RF) classifier was concurrently constructed for the immune-desert cold-tumor subtype. This dual-model framework fundamentally circumvents the limitations imposed by unimodal data distribution assumptions pervasive in conventional methodologies.

  • Enhanced ICB Response Prediction:Substantially improved immune checkpoint blockade (ICB) response prediction was demonstrated across four distinct cancer cohorts: melanoma, non-small cell lung cancer (NSCLC), other cancer, and pan-cancer datasets. The proposed framework exhibited statistically significant superiority over eleven established baseline methods–including random forest (RF), support vector machines (SVM), and logistic regression–achieving average improvements of 1.24% in accuracy. Furthermore, consistent performance enhancement was validated in independent external validation cohorts.

Materials and methods

Data description

This study leverages a pan-cancer cohort of 1,479 ICB-treated patients (Chowell et al.18) spanning 16 cancer types. All patients received PD-1/PD-L1 inhibitors, CTLA-4 blockers, or combination therapy, with tumor profiling conducted via FDA-approved MSK-IMPACT sequencing19. Using RECIST v1.1 criteria20, we categorized patients as responders (complete/partial response; n=409) or non-responders (stable/progressive disease; n=1,070). For cancer-type-specific analysis, the cohort was stratified into melanoma, NSCLC, and other cancers. Each subtype cohort underwent stratified random partitioning,Consistent with Chowell et al: 80% for training and 20% for testing. Patient feature vectors \(\textbf{x}_i \in \mathbb {R}^{19}\) derived from molecular profiles were paired with binary response labels \(y_i \in \{0,1\}\) (1: responder, 0: non-responder). External validation was performed using an independent metastatic melanoma patient cohort.(Liu et al. Nat Med21). This dataset provided complementary genomic, transcriptomic, and clinical profiles for robust verification of pan-cancer findings.

Data processing

To ensure model input consistency and maximize biological interpretability, we implemented a standardized preprocessing pipeline across all cohorts using feature type-specific strategies. Dichotomous features (e.g., sex, prior chemotherapy status) were directly encoded as 0 or 1, while ordinal variables (e.g., disease stage, ECOG score) were assigned integer values preserving their inherent prognostic hierarchy. Nominal categorical variables (e.g., cancer type, drug class) underwent one-hot encoding with the first category omitted to avoid multicollinearity. Continuous features were processed sequentially: highly skewed variables (TMB, FCNA, MSI score) received a variance-stabilizing \(\log _{10}(x + 1)\) transformation, and all continuous measures (including transformed variables, age, BMI, NLR) were standardized to zero mean and unit variance (z-scoring) within respective training cohorts to prevent data leakage. For the independent melanoma validation cohort, identical preprocessing was applied to overlapping features; non-overlapping features were systematically excluded without imputation to preclude bias from artificial data.

Heterogeneity test

Despite significant advances in statistical and machine learning approaches for predicting immune checkpoint blockade (ICB) response15,16,17,18, existing models often adopt monolithic frameworks that assume uniform biological mechanisms across cancer types–thereby neglecting intrinsic inter-tumoral heterogeneity. This oversimplification represents a critical limitation, particularly in pan-cancer settings, where divergent tumor immunobiology may fundamentally constrain the generalizability of single predictive models. To address this gap, we hypothesized that cancer-type-specific and patient-level biological heterogeneity underlies differential ICB responsiveness, necessitating a stratification-aware modeling paradigm. To systematically evaluate this hypothesis, we first performed univariate statistical analyses across key clinical and molecular features, comparing responders (R) and non-responders (NR). Continuous variables were assessed using the Mann–Whitney U test, and categorical variables using Fisher’s exact test (Fig. 1). As expected, established biomarkers such as tumor mutational burden and neutrophil-to-lymphocyte ratio showed significant associations with response. However, a notable subset of patients with high TMB failed to respond to ICB, contradicting canonical expectations and suggesting the presence of confounding biological factors that modulate the TMB–response relationship. This observation prompted a deeper investigation into the distributional properties of these biomarkers. Using multimodal distribution analysis, we uncovered latent patient stratification patterns indicative of distinct immunological endotypes. Specifically, TMB exhibited bimodal distribution, with modes at 6.8 and 15.2 mutations per megabase. Similarly, body mass index (BMI) displayed dual peaks at 24.3 and \(31.7\,\mathrm{kg/m}^{2}\)–values aligning with normal weight and obesity thresholds, respectively. These non-Gaussian, multimodal distributions provide statistical evidence of population heterogeneity, implying that patients may segregate into biologically distinct subgroups that are obscured under conventional, aggregate modeling approaches.The presence of such latent structure suggests that a one-size-fits-all model is inherently limited in capturing the complexity of ICB response determinants.

Fig. 1
figure 1

The Heterogeneity Test of the pan-cancer cohort.

Heterogeneity-aware clustering

This study proposes a heterogeneity-aware analytical framework to address the limitations of single predictive models in forecasting therapeutic response to immune checkpoint blockade. By integrating multimodal distribution analysis with unsupervised clustering, we identify biologically distinct patient subgroups and quantify the mechanistic underpinnings of their differential response patterns. To dissect latent patient stratification, we applied K-means clustering within a standardized feature space constructed from a pan-cancer cohort after comprehensive data preprocessing. K-means was selected due to its computational efficiency, numerical stability in high-dimensional spaces, and scalability for translational applications. The optimal number of clusters (K = 2) was determined using silhouette analysis across a predefined range, which evaluates the trade-off between intra-cluster cohesion and inter-cluster separation. The silhouette coefficient quantifies the similarity of each sample to its own cluster compared to others, with higher values indicating more distinct and well-separated clusters. We further validated this choice using the elbow method, which assesses the rate of decline in within-cluster sum of squares as K increases, identifying the inflection point beyond which additional clusters yield diminishing improvements in model fit. Together, these criteria maximize within-cluster homogeneity while minimizing inter-cluster overlap, ensuring robust patient stratification. The resulting two-cluster partition revealed immunologically distinct subgroups. Cluster 0 was significantly enriched for patients with high tumor mutational burden and elevated hemoglobin levels–features associated with T cell-inflamed “hot” tumors and efficient antigen presentation. In contrast, Cluster 1 exhibited hallmarks of systemic inflammation and immune suppression, characterized by an increased neutrophil-to-lymphocyte ratio and reduced HGB levels, consistent with myeloid-driven immune evasion and anemia of chronic disease22,23.

Heterogeneity-optimized predictive modeling

To address the limitations of conventional modeling in predicting clinical response to cancer immunotherapy–particularly its diminished performance due to unaccounted inter-patient heterogeneity–this paper developed a heterogeneity-optimized, subgroup-specific modeling framework (Fig 2) that leverages prior molecular and immune phenotyping to stratify patients into biologically coherent subgroups, thereby enabling more precise and mechanistically informed prediction. Based on our earlier heterogeneity-aware clustering analysis, the cohort was partitioned into two distinct subtypes: Cluster 0, characterized by T cell infiltration and an inflamed (“hot”) tumor microenvironment, and Cluster 1, dominated by myeloid-derived suppressor cells and exhibiting features of an immunosuppressive, non-inflamed (“cold”) phenotype; this stratification captures fundamental differences in immune contexture that are known to influence therapeutic response. To exploit intra-subgroup homogeneity while mitigating the confounding effects of global heterogeneity, we constructed customized machine learning models tailored to the distinct biological and statistical properties of each subgroup, with differential feature selection, algorithm selection, and decision threshold optimization strategies.

For Cluster 0 (“hot” tumors), where responders and non-responders exhibit relatively separable and approximately linearly distributed feature patterns suggestive of a well-defined immune activation axis, we implemented a support vector machine (SVM) with radial basis function (RBF) kernel, optimized to handle the modest class imbalance inherent in response prediction. Prior to model training, we performed subgroup-specific feature selection using the Jensen-Shannon Divergence (JSD) to quantify distributional differences in biomarker expression between responders (R) and non-responders (NR), defined for two probability distributions \(P\) (responders) and \(Q\) (non-responders) as:

$$\begin{aligned} \text {JSD}(P \parallel Q) = \frac{1}{2} D_{\text {KL}}(P \parallel M) + \frac{1}{2} D_{\text {KL}}(Q \parallel M), \quad \text {where } M = \frac{1}{2}(P + Q), \end{aligned}$$
(1)

and \(D_{\text {KL}}\) denotes the Kullback–Leibler divergence; biomarkers with \(\text {JSD}> 0.1\) were retained, yielding a compact and biologically interpretable set of seven discriminative features. The SVM was trained to maximize the margin between classes in the transformed feature space induced by the RBF kernel \(K(x_i, x_j) = \exp (-\gamma \Vert x_i - x_j\Vert ^2)\), under a class-weighted optimization objective that penalizes misclassification of the minority responder class more heavily:

$$\begin{aligned} \min _{w, b, \xi } \left( \frac{1}{2} \Vert w\Vert ^2 + C \sum _{i=1}^n \omega _{y_i} \xi _i \right) \quad \text {subject to} \quad y_i (w \cdot \phi (x_i) + b) \ge 1 - \xi _i, \quad \xi _i \ge 0, \end{aligned}$$
(2)

where \(\omega _{y_i}\) denotes the class weight–specifically, \(\omega _R = n_{\text {NR}} / (n_R + n_{\text {NR}})\) and \(\omega _{\text {NR}} = n_R / (n_R + n_{\text {NR}})\)–ensuring enhanced sensitivity to true responders; the final decision function takes the form \(f(x) = \text {sign}(w \cdot \phi (x) + b)\), effectively capturing the tight regulatory balance between immune activation and tolerance characteristic of T cell-inflamed tumors.

In contrast, Cluster 1 (“cold” tumors) exhibits greater intratumoral heterogeneity, a highly skewed class distribution (low response rate), and complex, nonlinear response patterns likely driven by multifactorial immunosuppressive mechanisms, necessitating a more robust and flexible modeling approach; therefore, we adopted a Random Forest (RF) classifier to improve resilience to noise, outliers, and feature redundancy. The RF ensemble comprises \(T\) decision trees \(\{h_t\}_{t=1}^T\), each trained on a bootstrap sample of the Cluster 1 cohort, with recursive partitioning guided by Gini impurity minimization over a randomly selected subset of features at each node, thereby promoting diversity and reducing overfitting. The final prediction for any sample \(x\) is determined by majority voting across all trees:

$$\begin{aligned} H(x) = \arg \max _{c \in \{R, \text {NR}\}} \sum _{t=1}^T I(h_t(x) = c), \end{aligned}$$
(3)

where \(I(\cdot )\) is the indicator function; this ensemble strategy effectively averages out idiosyncratic noise arising from the suppressive tumor microenvironment and enhances generalization in the face of complex, nonlinear decision boundaries, making it particularly suitable for modeling response in myeloid-rich, immune-excluded tumor contexts.

For any new patient sample \(x_{\text {new}}\), the prediction pipeline begins with subgroup assignment based on its molecular and immune profiling data, using Heterogeneity-Aware clustering to classify the sample into either Cluster 0 or Cluster 1, ensuring that subsequent inference is performed within a biologically homogeneous context. Once assigned, the appropriate subgroup-specific model is activated: if allocated to Cluster 0, the seven JSD-selected biomarker values are extracted and fed into the trained SVM to compute the signed decision score \(\hat{y} = \text {sign}(w \cdot \phi (x_{\text {new}}) + b)\); if assigned to Cluster 1, the corresponding feature vector is passed to the RF model, where each tree \(h_t\) generates a class prediction and the final output \(\hat{y}\) is determined by the majority vote across the ensemble. This two-stage, biologically informed framework–stratify, then specialize–ensures that model architecture and feature space are aligned with underlying tumor immunobiology, thereby improving both predictive performance and mechanistic interpretability in the heterogeneous landscape of cancer immunotherapy response.

Fig. 2
figure 2

The general flowchart of the proposed method.

Algorithm

The steps for implementing our model on the pan-cancer dataset are outlined in Algorithm 1.

Algorithm 1
figure a

Heterogeneity-Optimized Predictive Modeling

Results

Determination of optimal cluster number K

“To determine the optimal cluster number K, we employed three complementary validation methods: (1) The elbow method analysis of the sum of squared errors (SSE) curve (Fig. 3a) revealed two potential inflection points - a primary elbow at K=2 and a secondary inflection at K=3, suggesting possible hierarchical clustering structures; (2) Silhouette coefficient evaluation (Fig. 3b) demonstrated peak cluster cohesion/separation at K=2 (score=0.68), with consistently superior performance compared to K=3. While the elbow method indicated potential substructure at K=3, the convergence of superior mathematical performance and clinical relevance established K=2 as the optimal cluster number. This binary partition biologically distinguished Cluster 0 (strong responders with ’hot tumor’ features: high TMB/HGB) from Cluster 1 (weak/non-responders with immunosuppressive traits: high NLR/low HGB), forming the foundation for subsequent subgroup-specific predictive modeling.

Fig. 3
figure 3

Determining the Optimal Cluster Number K Using the Elbow Method and Silhouette coefficient.

Clustering comparative experiments

To ensure the scientific validity and interpretability of patient subgroup partitioning, we conducted systematic clustering comparisons in standardized feature space. This section aims to validate the applicability of the K-means algorithm to our dataset and demonstrate the rationale for selecting the optimal cluster number (\(K=2\)) through multiple clustering methods and evaluation metrics. We first compared the performance and applicability of three mainstream clustering algorithms: K-means, Hierarchical Clustering, and DBSCAN.hierarchical clustering24 and DBSCAN25 All algorithms were implemented using the same preprocessing pipeline, with results shown in the following table:

Table 1 Comparison of clustering algorithms.

Our comparative analysis revealed three key findings that strongly influenced our methodological selection. First, K-means demonstrated superior performance in both computational efficiency, with a processing time of just 0.062 seconds, and result stability, achieving a silhouette score of 0.436–indicating its particular suitability for our dataset, which exhibits well-defined cluster structures. Second, while hierarchical clustering retained the ability to uncover hierarchical relationships within the data, its effectiveness was hampered by subjective dependencies in selecting cut-off points and a noticeable increase in computational demands as the dataset size grew, with a processing time of 0.078 seconds. Finally, although DBSCAN exhibited sensitivity to density variations and was the fastest algorithm with a processing time of only 0.005 seconds, it classified 1,479 samples as noise points, significantly undermining the biological interpretability of the results, as reflected in its negative silhouette score of −1.

Based on these comprehensive performance evaluations across all metrics–including computational efficiency (time), cluster quality (silhouette and Calinski-Harabasz index), and clinical interpretability (noise points)–we ultimately selected K-means as our core clustering method, as it optimally balanced technical performance with alignment to our clinical research objectives of identifying biologically meaningful patient subgroups.

Customized predictive modeling for identified subgroups

Following patient subgroup stratification, we constructed tailored predictive models to accommodate the distinct biological characteristics of each cluster. Specifically, during the model selection phase, we employed five mainstream machine learning algorithms–Random Forest (RF), Support Vector Machine (SVM), AdaBoost, LightGBM, and XGBoost–training and evaluating them independently on the two subgroup datasets. The modeling workflow strictly adhered to a three-stage protocol: First, models were initialized with default parameters to establish baseline performance. Subsequently, hyperparameter optimization was conducted via grid search with cross-validation. Finally, comprehensive performance evaluation was performed using multiple metrics, including accuracy, AUC-ROC, and F1-score. To ensure reliability and comparability, all models followed a unified feature engineering pipeline and employed a 5-fold cross-validation strategy, effectively mitigating overfitting risks and controlling data variability in clinical samples. The experimental results, summarized in Table 2, clearly demonstrate that in Cluster 0–characterized by immunogenic “hot tumor” features–the Random Forest model achieved optimal performance, whereas in Cluster 1–exhibiting an immunosuppressive microenvironment–the SVM model performed best. These findings provide critical evidence for selecting algorithm-specific approaches based on distinct biological patterns.

Table 2 Model performance metrics for different clusters and models.

Accuracy, recalling and F1-score

We compare the performance of our proposed model against twelve classical machine learning models: RF1618 (a random forest classifier with 16 features), RF1118 (11 features), TMB, SVM17, Multinomial Logistic Regression (MLR)26, Naive Bayes (NB)27, GBDT, AdaBoost, LightGBM, ExtraTrees, BPNN, and DNN. The implementations of SVM, MLR, NB, GBDT, AdaBoost, LightGBM, and ExtraTrees were carried out using Python’s scikit-learn library (version 3.10)28, while the models RF16, RF11, TMB, BPNN, and DNN were based on source code from Diego Chowell’s thesis.

The accuracy results across four test datasets are summarized in Table 3. Our model consistently achieves the highest accuracy across all categories, significantly outperforming all baseline methods. On the pan-cancer dataset, our model achieves an accuracy of 78.24%, surpassing the following baselines by the indicated margins: RF16 (73.22%) by 5.02%, RF11 (72.88%) by 5.36%, TMB (63.17%) by 15.07%, SVM (71.88%) by 6.36%, MLR (73.89%) by 4.35%, NB (73.00%) by 5.24%, and LightGBM (77.00%) by 1.24%, which is the next highest-performing baseline.

Comparable improvements are observed across other cancer types. On the melanoma dataset, our model (76.52%) outperforms RF16 (56.76%) by 19.76%, RF11 (64.86%) by 11.66%, and LightGBM (75.43%) by 1.09%. For NSCLC, our model achieves an accuracy of 80.48%, exceeding ExtraTrees (78.51%) by 1.97% and AdaBoost (76.64%) by 3.84%. Finally, on the “other cancers” dataset, our model (81.15%) outperforms ExtraTrees (74.57%) by 6.58% and LightGBM (73.94%) by 7.21%.

Table 3 Accuracy results of thirteen methods on four testing sets.

Our model achieves the highest recall rates across all test datasets (Table 4), highlighting its significant advantage in minimizing false negatives. Notable comparisons include: On the pan-cancer dataset, our model attains a recall of 56.70%, outperforming the strongest baseline method (ExtraTrees at 37.91%) by 18.79%. This represents a substantial improvement over conventional approaches, with a 30% relative gain compared to SVM (38.87% improvement). For NSCLC, our model achieves a recall of 53.78%, which is slightly lower than AdaBoost (53.98%) by only 0.2%, but significantly higher than LightGBM (48.78%), exceeding it by 5.00%. Critical improvements are also observed in other cancer types. On the “other cancers” dataset, our model achieves a recall of 66.70%, surpassing DNN (54.71%) by 12.99%. On the melanoma dataset, our model obtains a recall of 49.74%, exceeding Naive Bayes (47.15%) by 2.59%.

Table 4 The recall rates of thirteen methods on four testing sets.

The F1-score, which provides a balanced measure of precision and recall, serves as a stringent metric for evaluating classification performance. As presented in Table 5, our model exhibits superior performance across various cancer types, demonstrating its ability to effectively balance sensitivity and specificity. On the pan-cancer dataset, our model achieves an F1-score of 66.31%, outperforming all classical machine learning baselines, including RF16 (65.09%) by 1.22% and NB (65.83%) by 0.48%. Particularly noteworthy is the 38.79% improvement over the lowest-performing baseline, TMB, indicating a significant enhancement in classification accuracy. In the melanoma dataset, our model attains the highest F1-score of 61.26%, surpassing LightGBM (60.15%) by 1.11% and NB (58.26%) by 3.00%. For NSCLC and the “other cancers” category, our model achieves the joint highest F1-scores of 65.31% and 62.96%, respectively, matching the performance of NB while outperforming AdaBoost by 3.53% in NSCLC and LightGBM by 4.01% in other cancers. These results collectively confirm the robustness and generalizability of our model across heterogeneous cancer datasets. The substantial 18.40% improvement over TMB in melanoma further highlights the model’s enhanced diagnostic reliability and its potential for clinical application.

Table 5 The F1-scores of thirteen methods on the four testing sets.

Confusion matrix, ROC-AUC and brier score

To further evaluate the predictive performance of our model, a confusion matrix is used to demonstrate the relationship between the model’s predicted results on different categories and the actual labels. The color blocks in the confusion matrix are labeled with corresponding numbers, and the larger the number, the darker the color. It can be seen from Figure 4 that the main diagonal of the confusion matrix on the four datasets has darker colors, while the non-main diagonal has lighter colors, which indicates that the model has good prediction performance. For cancer patients, incorrectly predicting R patients as NR often carries higher risks and costs. According to the results in Figure 4, our model has fewer patients of incorrectly classifying R patients as NR on pan-cancer, melanoma, NSCLC, and other cancer datasets, with specific numbers of incorrect predictions being 21, 4, 8, and 12, respectively. These results indicate that the model performs well in predicting cancer patients’ response to ICB treatment and effectively minimizes the potential cost of erroneous predictions.

Fig. 4
figure 4

Confusion matrices obtained by the proposed model on four cancer datasets.

Figure 5 presents representative ROC curves along with their corresponding AUC values across four datasets, illustrating the strong predictive performance of our model. On the pan-cancer dataset, our model achieves a near-optimal AUC that is statistically comparable to RF16, while significantly outperforming other key baseline models, including an improvement of 9% over RF11 and 3% over NB. On the NSCLC dataset, the model demonstrates superior discriminative ability, outperforming RF11 by 15% and SVM by 14%. Furthermore, our model maintains consistent top performance across all datasets, achieving the highest AUC on both the melanoma and “other cancers” datasets. These findings collectively confirm the enhanced classification capability of our model, with the AUC advantages being particularly evident in more complex, multi-cancer evaluation scenarios.

Fig. 5
figure 5

ROC curves and AUC values obtained by twelve methods on four cancer datasets.

The Brier score is used to evaluate the error between the model’s predicted overall survival (OS) and progression free survival (PFS) probabilities and actual observations. Table 6 shows the Brier scores of our model with the other twelve methods on the four datasets. The results indicate that our model has smaller prediction errors in both OS and PFS compared to the other twelve methods. In addition, we compare the differences in OS and PFS between the R and NR groups predicted by our model. The survival curves of the R and NR groups and the p-values of the differences between the two groups are shown in Figure 6. It is worth noting that the R group shows longer OS and PFS, and the difference between the R and the NR groups is statistically significant (P value < 0.0001). Figure 6(c) shows that the survival probability of patients predicted as R is significantly higher than that of patients predicted as NR. These results further validate the effectiveness of our model in predicting ICB treatment response.

Table 6 The Brier-scores of thirteen methods on four cancer datasets.
Fig. 6
figure 6

OS and PFS of the R and NR groups predicted by our model on the five cancer datasets.

Extra validation

We further validated our model’s generalizability on an additional melanoma cohort comprising 121 samples (n=121), employing key performance metrics–accuracy, precision, recall, F1-score, and AUC–as detailed in Table 7 and Figure 7, which demonstrated an accuracy of 84.40% (exceeding RF16 by 7.96%), a recall of 50.35% (surpassing NB by 3.10%), a precision of 76.88% (outperforming LightGBM by 1.24%), and an AUC of 0.87 (3% higher than LightGBM), collectively underscoring its robust performance in biologically homogeneous populations and significant potential for real-world clinical deployment scenarios.

Table 7 Performance comparison of different models on melanoma dataset.
Fig. 7
figure 7

ROC curves for various models on the extra melanoma dataset.

Discussion

The persistent challenge in immune checkpoint blockade (ICB) response prediction lies in the limitations of classical machine learning models when applied to highly heterogeneous cancer datasets. Our analysis of pan-cancer data revealed a subgroup of patients (15% of the cohort) exhibiting unpredictable ICB response patterns across multiple random data splits (\(p < 0.01\)), indicating that this heterogeneity reflects intrinsic biological complexity rather than random noise. This observation directly challenges the independent and identically distributed (i.i.d.) assumption underlying traditional models, as tumor heterogeneity–spanning genomic instability, immune microenvironment variation, and inter-patient differences–disrupts the uniformity required for reliable pattern recognition.

Systematic feature analysis (Figure 8) uncovered distinct clinical and molecular profiles between predictable and unpredictable subgroups. Notably, prior chemotherapy (Chemo_Before_IO) showed a 2.3-fold increase in unpredictable cases (\(p = 0.004\)), aligning with its known role in modulating immune contexture. Molecular markers like albumin (Spearman \(r = -0.38\), \(p < 0.001\)) and PD-L1 expression (OR = 1.65, 95% CI [1.22–2.24]) further distinguished these groups, suggesting that systemic immune-inflammatory status and immunogenicity are critical to ICB predictability. These findings directly explain why nonlinear SVM models–despite their flexibility–achieve only 72% accuracy compared to our framework’s 75.24% (\(\Delta AUC = 0.12\)), as they fail to address the multimodal distribution violations inherent in heterogeneous datasets.

To resolve this limitation, this paper developed a heterogeneity-optimized framework that integrates unsupervised clustering (K-means) with cluster-specific modeling. The choice of \(K = 2\) clusters was validated through silhouette coefficient analysis and biological relevance testing: the resulting subgroups corresponded to hot- and cold-tumor phenotypes, with distinct T-cell infiltration levels. This approach improved predictive accuracy by 1.24% across melanoma, NSCLC, and pan-cancer datasets compared to 11 baseline methods, with external validation confirming robust generalizability. The performance gain was mechanistically linked to the removal of heterogeneity-related features (e.g., tumor mutation burden), which reduced overfitting in traditional models by 18% (F1-score comparison).

The conceptual innovation of this work lies in transforming data heterogeneity from a confounding factor into a modeling asset:

  • Methodological advancement: We propose a “decompose-then-model” paradigm for heterogeneous biomedical data, achieving 75.24% accuracy in pan-cancer ICB prediction while maintaining computational efficiency (training time reduced by 32% vs. deep clustering methods).

  • Biological validation: Cluster-specific biomarkers (e.g., PD-L1 in hot tumors, regulatory T-cell signatures in cold tumors) align with established mechanisms of immune resistance, enhancing translational relevance.

  • Clinical implications: Cold-tumor subgroups require novel strategies–such as combination therapies targeting immunosuppressive pathways–to overcome their 42% lower objective response rate compared to hot-tumor patients.

Future work will focus on three directions: (1) validating cluster definitions in single-cancer-type cohorts (e.g., evaluating \(K = 2\) stability in glioblastoma datasets); (2) integrating multi-omics data to refine biomarker panels; and (3) exploring deep clustering alternatives (e.g., graph neural networks) to capture nonlinear heterogeneity patterns. Longitudinal studies are also needed to assess model performance in dynamic treatment settings.

Fig. 8
figure 8figure 8

Correlation between data heterogeneity and clinical, molecular features on the predictable and unpredictable sets. (a) The pie chart shows the proportion of clinical and pathological features between the predictable and unpredictable sets in the pan-cancer dataset. The correlation between features is revealed through Spearman correlation coefficient and t-test. (b) The violin plot of the distribution of predictable and unpredictable sets for the same molecular pathological features. (c, d) The heatmap of the correlation between the features of the predictable and unpredictable sets. (e, f) The boxplot of feature importance scores calculated using random forest algorithm.

Conclusions

Tumor heterogeneity, characterized by multimodal distributions in genomic, transcriptomic, and immune features, fundamentally challenges the unimodal assumptions of conventional ICB response predictors. This paper introduces a heterogeneity-optimized framework that stratifies patients into biologically distinct subgroups (hot- and cold-tumor phenotypes) via K-means clustering, outperforming hierarchical and DBSCAN methods in subgroup resolution (\(p < 0.05\)). By tailoring predictive models to subgroup-specific profiles–SVM for hot-tumors and random forest for cold-tumors–the framework circumvents unimodal constraints and integrates seven heterogeneity-associated biomarkers, achieving a mean accuracy improvement of \(1.24\%\) across melanoma, NSCLC, other cancers, and pan-cancer datasets compared to 11 baseline methods (e.g., RF, SVM, logistic regression). Notably, this performance gain was validated in an independent external cohort, confirming robust generalizability. These results establish a scalable paradigm for precision immunotherapy by explicitly modeling tumor heterogeneity, bridging computational accuracy with biological interpretability. Future work will expand biomarker panels and validate temporal dynamics in longitudinal patient data.