Introduction

Histopathology remains the gold standard for cancer diagnosis and treatment planning, yet its traditional manual analysis faces challenges of subjectivity and limited scalability1,2,3. Recent advances in artificial intelligence (AI) have revolutionized computational pathology, enabling the automated extraction of complex patterns from histopathology images with high accuracy and efficiency4,5,6,7,8. These advancements have paved the way for clinical applications such as automated cancer diagnosis, molecular biomarker prediction, and treatment response assessment9,10,11. A pivotal development in this field is the emergence of pathology foundation models12,13,14,15,16,17,18,19, which are pretrained on large-scale pathology datasets via self-supervised learning (SSL)12,13,14,15,16,18,20,21,22,23,24,25,26. These models act as powerful image feature extractors, acquiring transferable representations from whole slide image (WSI) patches and enabling accurate predictions for various clinical tasks by integrating multiple-instance learning (MIL) models5,27,28,29,30,31.

Despite these advancements, a critical limitation persists: features extracted by pathology foundation models exhibit significant site-specific signatures. These signatures, which stem from variations in tissue preparation, staining protocols, and scanning equipment across different clinical centers (e.g., hospitals or research institutions) where samples are processed (Fig. 1a), directly impede cross-domain generalization. Specifically, they foster shortcut learning32,33,34 (Fig. 1b), where models learn superficial site-specific patterns rather than the underlying pathological features essential for robust diagnosis. As a result, these models often perform well on in-domain (IND) testing data (i.e., data from the same submitting sites as the training data), but exhibit drastically reduced accuracy and unreliable predictions on out-of-domain (OOD) testing data (i.e., data from different sites with distinct site-specific patterns), compromising diagnostic reliability and posing a serious risk of misdiagnosis and incorrect treatment decisions in clinical practice. Although stain normalization10,35,36 has been proposed as a potential solution37, our findings demonstrate its ineffectiveness in eliminating complex site-specific artifacts. Adding to these challenges of cross-site generalization, recent studies have highlighted a concerning lack of demographic fairness in pathology foundation models35, arising from variations in tissue appearance across demographic groups and imbalances in training data. Therefore, effectively mitigating site-specific biases and demographic disparities is paramount for the responsible and reliable deployment of pathology foundation models in diverse clinical settings.

Fig. 1: Overview of the computational pathology challenge and the proposed FLEX framework.
figure 1

a The WSI processing pipeline highlights variables that contribute to site-specific signatures. b Conceptual illustration of shortcut learning, where models exploit spurious correlations from site-specific signatures instead of learning task-related biological features. This leads to high performance on IND data but poor generalization to OOD data from unseen sites. c Datasets and tasks used in this study. A large multi-center TCGA cohort is used for training and cross-validation, with two independent external cohorts (CPTAC and in-house NFH dataset) for zero-shot generalization testing. We address 16 diagnostic tasks across four major cancer types, including NSCLC, BRCA, STAD, and CRC, spanning morphology, molecular biomarker, and gene mutation prediction. d The proposed FLEX workflow. Patch features are extracted using a pre-trained pathology VLM. Guided by visual and textual domain knowledge, FLEX selectively suppresses site-specific and demographic signatures while amplifying task-relevant biological information. The enhanced features are then used by a MIL model for slide-level prediction, leading to improved generalizability and fairness. Performance is evaluated using AUROC, fairness metrics, and interpretability methods (UMAP, attention maps). Schematics in panels (a, b) and the organ icons in panel (c) were created with BioRender.com.

In this work, we introduce FLEX (Feature-Level Enhancement for Cross-domain generalization), an approach that leverages domain knowledge to address the intertwined challenges of cross-site generalization and demographic fairness in pathology foundation models. FLEX enhances the model’s focus on generic pathological patterns by employing an information bottleneck. This bottleneck is guided by visual and textual concepts enriched with domain-specific prior knowledge, and strategically aligns patch image features with task-specific textual concepts. These textual concepts are generated by the text encoder of pathology foundation models, and thus, originating from the text domain, they are inherently free from site-specific signatures and demographic biases present in the images. As a result, pathology images from diverse source sites and demographic groups, which exhibit distinct feature distributions, are effectively aligned to a unified feature space. This alignment enables the model to generalize across heterogeneous sites and demographic groups, while enhancing the discriminativeness and expressiveness of encoded patch features for specific tasks. To validate our approach, we first confirm the presence of confounding site-specific signatures and demographic biases in existing foundation models. We then assess FLEX across 16 diverse clinical tasks using a rigorous two-stage evaluation. This involves a Site-Preserved Monte Carlo Cross-Validation (SP-MCCV) within The Cancer Genome Atlas (TCGA) cohort and a zero-shot external validation on two independent cohorts: the public Clinical Proteomic Tumor Analysis Consortium (CPTAC; n = 1738) and our in-house Nanfang Hospital dataset (NFH; n = 5318). This comprehensive evaluation demonstrates three key strengths of FLEX. First, it yields an improvement in cross-domain generalization, with consistent performance gains on both held-out TCGA sites and the external CPTAC and NFH cohorts. Second, it promotes demographic fairness by reducing performance disparities across demographic groups. Third, FLEX shows practical versatility through its compatibility with various pathology vision-language models (VLMs) and downstream MIL architectures, and maintains effectiveness across different data scales. These findings establish FLEX as a promising solution for improving the generalizability and fairness of pathology AI. By addressing the critical barriers of domain shift and bias, it offers a robust pathway for the reliable, responsible, and equitable deployment of foundation models in clinical settings.

Results

We first systematically evaluated the persistence of site-specific signatures and demographic biases in representations from pathology foundation models, and their negative impact on cross-domain generalization and fairness. We then demonstrate that our proposed FLEX framework mitigates these issues. As outlined in Fig. 1d, FLEX leverages visual and textual prior knowledge to enhance the discriminability and robustness of patch features. Our results are presented in three parts. First, we show that FLEX improves cross-domain generalization within the TCGA cohort and in zero-shot validation on external CPTAC and in-house cohorts. Second, we assess its impact on demographic fairness, revealing a marked reduction in performance disparities across racial and ancestry groups. Finally, we establish the practical versatility of FLEX by confirming its compatibility with various foundation models and MIL architectures across different data scales. These findings highlight FLEX as a robust solution for developing more generalizable and equitable pathology AI models.

FLEX improves cross-domain generalization and facilitates accurate clinical usage of pathology foundation models

This section evaluates how site-specific patterns in feature representations extracted by pathology foundation models affect cross-site generalization, potentially limiting their reliability for clinical applications. TCGA, a cornerstone resource for cancer research, includes WSIs from multiple contributing sites, which introduces site-specific signatures. For instance, the non-small cell lung carcinoma (NSCLC) cohort comprises WSIs from 66 sites, breast carcinoma (BRCA) from 36 sites, stomach adenocarcinoma (STAD) from 22 sites, and colorectal adenocarcinoma (CRC) from 37 sites (Fig. 1c). To investigate this, we utilized UMAP38 to visualize patch features extracted by the CONCH18 image encoder from the BRCA cohort (Fig. 2a, middle). Patches were color-coded by the submitting site (left panel) and diagnostic class (right panel). The clear clustering by site revealed the presence of pervasive site-specific signatures, a finding that is robust to UMAP hyperparameter variations (Supplementary Fig. 1). These patterns serve as problematic shortcuts, enabling models to predict clinical labels without learning meaningful task-relevant features, thereby compromising cross-domain generalization, as illustrated in Fig. 1b.
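
For illustration, the short sketch below reproduces this kind of analysis: it computes a UMAP embedding of patch features together with the two metrics reported in Fig. 2a, using a simple k-nearest-neighbor implementation of the local inverse Simpson's index (LISI) and scikit-learn's silhouette score. The arrays and hyperparameters are illustrative stand-ins, not the study's exact configuration.

```python
import numpy as np
import umap  # umap-learn package
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def lisi(feats, labels, k=30):
    """Mean local inverse Simpson's index: the effective number of distinct
    labels (e.g., submitting sites) among each point's k nearest neighbors.
    Higher values indicate better mixing of sites in feature space."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(feats).kneighbors(feats)
    scores = []
    for nbrs in idx:
        _, counts = np.unique(labels[nbrs], return_counts=True)
        p = counts / counts.sum()
        scores.append(1.0 / np.sum(p ** 2))  # inverse Simpson index
    return float(np.mean(scores))

# Stand-in data; in practice `feats` holds VLM patch embeddings and the label
# arrays hold submitting-site codes and diagnostic classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512)).astype(np.float32)
site_labels = rng.integers(0, 6, size=1000)
class_labels = rng.integers(0, 2, size=1000)

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(feats)
print("2-D embedding for plotting:", embedding.shape)
print("site LISI:", lisi(feats, site_labels))
print("class Silhouette Score:", silhouette_score(feats, class_labels))
```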

Fig. 2: FLEX enhances model generalizability and diagnostic performance by mitigating site-specific signatures.
figure 2

a UMAP visualization of patch features from the TCGA-BRCA dataset (BRCA-TYPE task), encoded by a CONCH model. Features are colored by submitting site (left panel) or diagnostic class (right panel). Original features and those after Reinhard stain normalization show strong clustering by site and poor separation of diagnostic classes, confirmed by low site integration (LISI: 3.03 and 3.29) and class separability (Silhouette Score: 0.15 and 0.14) scores. Callouts highlight patches from the same site with different diagnoses clustering together, indicating that site-specific signals overpower biological ones. In contrast, FLEX-processed features show effective removal of site signatures (LISI: 4.18) and improved separation of diagnostic classes (Silhouette Score: 0.38). b Schematic of the site-preserved cross-validation setup, defining IND test sets and OOD test sets. c Performance comparison in terms of AUROC across three major task categories and two external datasets (CPTAC, NFH). Bars represent mean AUROC for the Original (O), Reinhard-normalized (R), and FLEX (F) models. P-values indicate the statistical significance of the improvement of FLEX over the best-performing baseline (O or R), calculated using a two-sided paired t-test on the results from 15 cross-validation folds across all tasks within each category (P = 2.688 × 10−5 for OOD and P = 0.003 for IND in morphology classification; P = 3.305 × 10−12 for OOD and P = 0.001 for IND in molecular biomarker prediction; P = 1.010 × 10−10 for OOD and P = 5.764 × 10−5 for IND in gene mutation prediction; P = 9.432 × 10−16 for CPTAC and P = 8.900 × 10−12 for NFH in external validation). d Radar plots comparing model performance (AUROC) on individual tasks for TCGA OOD testing (left) and IND testing (right). e Radar plot of model performance on external validation datasets (CPTAC and NFH). For (d) and (e), asterisks indicate the statistical significance of FLEX compared to the best-performing baseline for each specific task, determined by a two-sided paired t-test across 15 folds (*P  < 0.05, **P  < 0.01, ***P  < 0.001). f Qualitative examples of attention maps on representative WSIs for BRCA-TYPE and STAD-MSI tasks. The color bar indicates attention weight. WSI images were obtained from the TCGA database. Source data are provided as a Source Data file.

We first quantified this issue across 16 pathology tasks using an SP-MCCV strategy (see ‘Site-preserved Monte Carlo cross-validation’ in “Methods”) within TCGA, where test data were split into IND and OOD subsets based on submitting sites (Fig. 2b). Patch features were extracted using CONCH18 and aggregated with an ABMIL model27. The results showed a significant drop in area under the receiver operating characteristic curve (AUROC) performance for OOD data compared to IND data across all task categories (Fig. 2c). Specifically, the average AUROC for IND was 0.918, 0.754, and 0.651 for morphology, molecular biomarker, and gene mutation tasks, respectively. In contrast, the OOD AUROC dropped to 0.853, 0.721, and 0.628. This performance decline illustrates the substantial impact of domain shift, a challenge that persists even for state-of-the-art pathology foundation models.

A common method for addressing domain shift is stain normalization, such as the Reinhard method39,40,41, which aligns the color distribution of WSIs to a reference template. While this approach can be effective when domain shift is primarily due to color variation, it is often insufficient for more complex biases. As shown in Fig. 2a (right panel), features after Reinhard normalization remain strongly clustered by site. Consequently, this method yielded minimal and inconsistent performance gains, and in some cases, even reduced performance (Fig. 2d, Supplementary Table 1).

In contrast, FLEX is designed to address a broader spectrum of site-specific signatures, including not only staining variations but also subtle variations from tissue preparation and scanner artifacts, through its knowledge-guided information bottleneck. When applied within the TCGA cross-validation, FLEX significantly improved model generalizability for both IND and OOD testing (Fig. 2c). Specifically, the average OOD AUROC increased by 6.45%, 7.21%, and 4.94% for the three task categories, substantially narrowing the IND-OOD performance gap. UMAP visualizations for the BRCA-TYPE task (Fig. 2a, left panel) further confirmed that FLEX effectively suppressed site-specific signatures, forcing features to align based on their biological class rather than their site of origin. Detailed results across all 16 tasks (see ‘Dataset and clinical tasks description’ in “Methods” and Supplementary Table 2) show consistent performance improvements in both IND and OOD settings (Fig. 2d). Notably, FLEX achieved a 15.13% increase in OOD AUROC for the STAD-LAUREN task, reducing the IND-OOD gap from 20.59% to 8.64%. This highlights FLEX’s potential to address key challenges in cross-site generalization.

Furthermore, we conducted a zero-shot evaluation on two independent external cohorts: the public CPTAC dataset and our in-house NFH dataset (see Supplementary Tables 3, 4). Models trained exclusively on TCGA were applied directly to these unseen cohorts without any fine-tuning. These datasets introduce unique site-specific signatures distinct from TCGA, representing a true out-of-distribution challenge. Across all tasks and both cohorts, the baseline model’s performance was modest. However, applying FLEX resulted in a consistent and significant performance uplift, demonstrating its ability to generalize to new clinical environments. The average AUROC for the CPTAC cohort increased from 0.679 to 0.714 (P = 9.432 × 10−16) and for the NFH cohort from 0.733 to 0.771 (P = 8.900 × 10−12), as shown in Fig. 2c. The radar plot in Fig. 2e provides a comprehensive overview, showing that FLEX (pink line) consistently improves performance. Detailed results for each task are provided in Supplementary Fig. 2 and Supplementary Tables 5, 6. These analyses indicate that FLEX learns transferable biological representations rather than memorizing source-domain shortcuts.

To provide qualitative insights into FLEX’s mechanism, we visualized attention maps from the ABMIL models (Fig. 2f). Here, noisy regions are defined as visual information irrelevant to the diagnostic task, such as processing artifacts or task-agnostic biological structures (e.g., non-neoplastic stroma). For the BRCA-TYPE task (Fig. 2f, top panel), the baseline model’s attention is diffuse, scattered across tumor cells and distracting noisy elements. In contrast, the model with FLEX consistently focuses on critical, task-relevant features like the irregular glandular and ductal formations characteristic of Invasive Ductal Carcinoma (IDC) without being distracted by noisy foreground elements. This focused attention provides an intuitive explanation for the quantitative performance gains. Similarly, for the STAD-MSI task, the FLEX-enhanced model successfully identified potential Microsatellite Instability-High (MSI-H) regions, where tumor cells exhibit greater heterogeneity, whereas the baseline model failed to do so, leading to incorrect predictions. This suggests that FLEX enhances the model’s focus on salient pathological patterns while suppressing task-agnostic information.

FLEX effectively improves demographic fairness

Demographic biases within large-scale datasets like TCGA exacerbate fairness concerns, leading to significant performance disparities in deep learning models, particularly disadvantaging underrepresented demographic groups35. Because self-reported race may not fully capture complex biological heritage, we also stratified cohorts by genetic ancestry42, which provides a more biologically grounded measure. Our fairness analysis was conducted across both stratification methods in the TCGA dataset to ensure a comprehensive evaluation. Additionally, we performed a zero-shot fairness evaluation on the CPTAC dataset, which contains only self-reported race. As baseline experiments confirm (Fig. 3b, c), models trained on original features exhibit notable performance disparities across both stratification methods. Simple stain normalization, while useful for reducing inter-site staining variations, does not effectively address these demographic biases.

Fig. 3: Evaluation of FLEX’s effectiveness in improving demographic fairness.
figure 3

The top three rows (NSCLC-TYPE, BRCA-PR, CRC-BRAF) show fairness when evaluated across self-reported race and ancestry groups using results from a 15-fold SP-MCCV on the TCGA-NSCLC (n = 958), TCGA-BRCA (n = 937), and TCGA-CRC (n = 606) datasets, respectively. The bottom two rows (C-BRCA-TYPE, C-LUAD-EGFR) show results from models trained on TCGA datasets and externally validated on CPTAC-BRCA (n = 323) and CPTAC-LUAD (n = 815). For each cross-validation fold, metrics for each demographic-label group were estimated via bootstrapping (n = 500 replicates). a Fairness gap for five representative tasks. The AUROC gap ratio is the performance difference between the best- and worst-performing subgroups relative to the overall AUROC. Smaller dots represent individual folds (n = 15); large dots with error bars show the mean and 95% CI across folds. Marginal distributions show the distribution of AUROC gap ratios across folds. Presented P-values are from a two-sided Wilcoxon signed-rank test without adjustment, comparing FLEX to the Original method (P = 0.026 when evaluated on ancestry for NSCLC-TYPE; P = 0.018 when evaluated on self-reported race for BRCA-PR; P = 0.048 for C-BRCA-TYPE; P = 0.004 for C-LUAD-EGFR). b AUROC distribution across subgroups. Box plots show the distribution across n = 15 cross-validation folds. The center line represents the median (50th percentile), the box bounds represent the interquartile range (IQR; 25th to 75th percentiles), and the whiskers extend to data points within 1.5 × the IQR. c Violin plots of True Positive Rate (TPR) disparity, defined as the difference between a subgroup’s TPR and the overall TPR. The distributions are derived from results across n = 15 cross-validation folds. Within each violin, the box plot shows the median (center line), interquartile range (IQR), and whiskers extending to 1.5 × IQR. Distributions centered closer to the zero-disparity line indicate higher fairness. Presented P-values are from a two-sided paired-samples t-test without adjustment. The dashed line represents the overall TPR across the three methods. d Quantitative summary of TPR disparity, showing the Root Mean Square Error (RMSE) for Original (O), Reinhard (R), and FLEX (F). Lower values indicate better fairness. Source data are provided as a Source Data file.

By addressing this critical limitation through feature disentanglement, FLEX offers a significant improvement in both demographic fairness and model reliability. We evaluated this using a multi-faceted approach, assessing performance across both self-reported race and ancestry cohorts (Fig. 3). This improvement is also observed within individual clinical sites (Supplementary Fig. 3).

First, we examine the Fairness Gap (AUROC gap ratio)43 in Fig. 3a, which measures the absolute performance difference between the best- and worst-performing demographic subgroups relative to the overall AUROC. On average, FLEX consistently reduces the fairness gap across tasks. For instance, in the NSCLC-TYPE task, FLEX reduces the mean fairness gap for self-reported race (from 0.052 to 0.041) and ancestry (from 0.062 to 0.040). For the external CPTAC cohort, the fairness gaps for the C-BRCA-TYPE and C-LUAD-EGFR tasks decreased from 0.287 and 0.494 to 0.245 and 0.246, respectively. The scatter plots and marginal distributions in Fig. 3a show that FLEX’s results (pink dots) are more tightly clustered at lower gap values, indicating improved fairness. Even in cases like CRC-BRAF where an outlier widens the confidence interval, the central tendency of FLEX’s performance remains superior or comparable to the baselines.

For a more robust assessment of model reliability, we analyzed the True Positive Rate (TPR) disparity35 (Fig. 3c), which quantifies the difference between the TPR of a specific demographic group and the overall TPR. The violin plots for FLEX are consistently narrower and more centered around the zero-disparity line for both self-reported race and ancestry stratifications, demonstrating that the model applies its diagnostic capability more equitably. This is quantitatively supported by the TPR disparity RMSE values (Fig. 3d). In the challenging C-LUAD-EGFR task, for example, FLEX reduces the average RMSE from 0.140 to 0.085, indicating a substantial improvement in fairness. Similarly, the race-wise AUROC distributions in Fig. 3b show that FLEX yields tighter and more equitable performance clusters.
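
To make these two metrics concrete, the sketch below computes the AUROC gap ratio and the TPR-disparity RMSE from per-sample predictions and demographic group labels. The 0.5 decision threshold is an assumption for illustration, and each subgroup is assumed to contain both classes so that subgroup AUROCs are defined.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fairness_metrics(y_true, y_score, groups, threshold=0.5):
    """AUROC gap ratio: (best - worst subgroup AUROC) / overall AUROC.
    TPR-disparity RMSE: root mean square of (subgroup TPR - overall TPR)."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    y_pred = (y_score >= threshold).astype(int)  # illustrative threshold
    overall_auc = roc_auc_score(y_true, y_score)
    overall_tpr = y_pred[y_true == 1].mean()
    sub_aucs, tpr_gaps = [], []
    for g in np.unique(groups):
        m = groups == g  # assumes each subgroup contains both classes
        sub_aucs.append(roc_auc_score(y_true[m], y_score[m]))
        tpr_gaps.append(y_pred[m & (y_true == 1)].mean() - overall_tpr)
    gap_ratio = (max(sub_aucs) - min(sub_aucs)) / overall_auc
    tpr_rmse = float(np.sqrt(np.mean(np.square(tpr_gaps))))
    return gap_ratio, tpr_rmse
```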

FLEX is versatile across different pathology VLMs

To further evaluate the effectiveness and adaptability of FLEX across different pathology VLMs, we integrated it with three state-of-the-art pathology VLMs (CONCH18, PathGen-CLIP19, and QuiltNet17), which served as patch-level feature extractors. Using ABMIL27 as the downstream MIL framework, we tested performance on 16 pathology tasks, focusing on the OOD generalization setting.

The results show that FLEX consistently and significantly enhances performance across all three foundational VLMs (Fig. 4a). While the baseline performance varies between VLMs, applying FLEX delivers a robust improvement in nearly all scenarios. For CONCH, FLEX improved OOD performance in all 16 tasks, with significant improvement (P < 0.05) observed in 14 out of 16 tasks. For PathGen-CLIP, FLEX improved performance in all tasks, with significant improvement in 9 out of 16 tasks. For QuiltNet, performance improved in 14 out of 16 tasks, with significant improvement in 9 tasks. This broad compatibility underscores that FLEX is a model-agnostic solution for improving generalization. The varying magnitude of improvement may be attributed to differences in the expressivity and discriminability of the original patch features extracted by different pathology VLMs. Since FLEX relies on visual and textual knowledge as guidance to enhance the patch features, the quality of this guidance depends on the features extracted by the VLMs. Visual knowledge and textual knowledge are generated by encoding pilot patch images and text prompts with the image and text encoders of the pathology VLMs. If the VLMs lack expressivity and discriminability, the resulting visual and textual knowledge may be less informative and discriminative, limiting the performance improvement achieved by FLEX.

Fig. 4: Performance evaluation of FLEX when incorporated with different pathology VLMs.
figure 4

a Comparison of AUROC performance between original pathology vision-language models (VLMs) and their counterparts enhanced with FLEX. The 16 tasks span four datasets: TCGA-BRCA (n = 937), TCGA-NSCLC (n = 958), TCGA-STAD (n = 414), and TCGA-CRC (n = 606). Each box plot summarizes results from n = 15 independent cross-validation folds derived from SP-MCCV. Box plots display the median (center line), interquartile range (IQR; box limits from 25th to 75th percentiles), and whiskers extending to 1.5 × IQR; individual data points for each fold are overlaid. Indicated P-values were calculated using a two-sided paired-samples t-test with multiple hypothesis correction. C, P, and Q correspond to CONCH, PathGen-CLIP, and QuiltNet, respectively. b UMAP visualizations illustrating the effect of FLEX on the patch feature space for the STAD-EBV task. For each VLM, parallel subplots are colored by site (left) to visualize batch effect mitigation, and by EBV status (right) to visualize class separability. The LISI for site integration (higher is better) and the Silhouette Score for class separation (higher is better) provide quantitative evidence. The visualizations and scores demonstrate that FLEX reduces site-specific clustering while improving the discriminability of task-relevant classes. Source data are provided as a Source Data file.

We further analyzed the mechanism behind this improvement by visualizing the patch feature space for the STAD-EBV task using UMAP38 (Fig. 4b). For all three VLMs, the original features show strong clustering by site, with poor separation between EBV-positive and EBV-negative classes. This is confirmed by low Local Inverse Simpson’s Index (LISI) scores, which measure local data mixing by site, and Silhouette Scores, which measure class separability. After applying FLEX, the feature distributions become more uniform, with site-specific clusters dissolving. This is validated by a significant increase in the LISI score for all three VLMs (e.g., from 2.22 to 3.22 for CONCH), indicating better integration of features from different sites. Concurrently, the separation between EBV classes improves, reflected by a notable increase in the Silhouette Score (e.g., from 0.09 to 0.24 for CONCH). These results demonstrate that FLEX mitigates site-specific biases while enhancing the discriminability of task-relevant biological features, a dual benefit that holds across different foundational models.

FLEX is effective with different sizes of training data and flexible with different MIL models

To evaluate the scalability of FLEX with varying training data scales, we conducted additional experiments on the BRCA cohort. The experimental design is illustrated in Fig. 5a. The dataset was partitioned into five site-preserved outer folds, with one fold held out as the OOD test set in each of five outer loop iterations. Training sets of increasing size (1, 2, 3, and 4 folds) were constructed from the remaining four folds to create a learning curve. Three inner MCCV runs were performed for each scale to ensure stability. For this analysis, three tasks spanning different domains were selected: BRCA-TYPE, BRCA-HER2, and BRCA-PIK3CA. The corresponding training and OOD test set sizes are provided in Supplementary Table 7. The results shown in Fig. 5b demonstrate that FLEX consistently and significantly (P < 0.05) improves OOD performance across all tasks and data scales, underscoring that its benefits persist across both limited and larger training datasets.

Fig. 5: Performance of FLEX across training data scales and MIL models.
figure 5

a Schematic of the experimental design for evaluating performance at different training data scales. The procedure follows the SP-MCCV strategy, which involves 5 outer folds and 3 inner Monte Carlo Cross-Validation (MCCV) folds, for a total of 15 evaluation runs. Specifically for this experiment, the dataset is first partitioned into 5 site-preserved outer folds. In each of the 5 outer loop iterations, one fold is held out as the OOD test set. Training sets of increasing size (1, 2, 3, and 4 folds) are constructed from the remaining 4 folds. For each scale condition, 3 inner MCCV runs are performed, for a total of 15 evaluation runs per condition. b OOD AUROC comparison between FLEX and the baseline (Original) on three tasks from the TCGA-BRCA dataset (n = 875). Box plots show aggregated results from 15 runs for each training data scale, derived from the SP-MCCV strategy. The box plots display the median (center line), the first and third quartiles (Q1-Q3; the box), and whiskers extending to 1.5 × interquartile range. Individual data points from each run are overlaid. Presented P-values are from a two-sided paired-samples t-test without adjustment. c Dumbbell plot showing OOD AUROC after integrating FLEX with five state-of-the-art MIL models across 16 tasks. Datasets used include TCGA-BRCA (n = 937), TCGA-NSCLC (n = 958), TCGA-STAD (n = 414), and TCGA-CRC (n = 606). Each dumbbell connects the mean performance of an Original model to its FLEX-enhanced counterpart. Indicated P values were calculated using a two-sided paired-samples t-test with multiple hypothesis correction. Source data are provided as a Source Data file.

To further assess the flexibility of FLEX across MIL models, we integrated it with five state-of-the-art MIL architectures: ABMIL27, CLAM-SB5, ACMIL28, DTFD-MIL29, and ILRA-MIL30. As shown in Fig. 5c, FLEX consistently improved OOD performance across all models. The average OOD performance gains were 6.01%, 5.55%, 3.82%, 3.53%, and 6.06%, respectively. Additionally, FLEX significantly reduced the performance gap between IND and OOD testing. Specifically, the gap decreased from 0.034, 0.040, 0.041, 0.038, and 0.035 to 0.022, 0.024, 0.021, 0.025, and 0.020 for the corresponding MIL models. Detailed per-task performance metrics for all combinations of VLM and MIL models evaluated are provided in Supplementary Tables 8–23.

The performance improvement seen in Fig. 5c varies across the 16 pathology tasks due to three main factors: intrinsic task difficulty, the utility of feature refinement for a given task, and the complexity of the MIL model. First, intrinsic task difficulty sets the upper bound on performance; for example, morphology classification tasks consistently yield higher AUROCs than most gene mutation prediction tasks. Second, the magnitude of improvement from FLEX depends on task-specific conditions. In the STAD-LAUREN task, for instance, all MIL models show a substantial performance gain. This may be because the TCGA-STAD cohort is relatively small, causing baseline models to overfit to noisy features. FLEX’s core mechanism, refining features by filtering out task-irrelevant information, improves the signal-to-noise ratio, which is particularly beneficial when training data is limited. Third, performance gains are more pronounced with simpler MIL models like ABMIL and CLAM-SB compared to more complex architectures. We hypothesize this is related to the joint optimization of FLEX and the MIL model. A simpler MIL backbone may create a more stable learning environment, allowing gradients to effectively guide the feature refinement process within FLEX. In contrast, with more intricate MIL models, the increased number of parameters and interacting components can complicate convergence to an optimal state, potentially limiting the observable performance gain from FLEX.

Discussion

Foundation models in computational pathology have rapidly advanced, demonstrating impressive capabilities in histopathology image understanding. Their integration into weakly supervised computational pathology pipelines has yielded strong performance across diverse diagnostic tasks12,13. Despite this progress, significant challenges remain. These include limitations in cross-domain generalization, demographic fairness, and the practical adaptation of these powerful models for specific diagnostic objectives. Our study undertakes a comprehensive analysis and proposes an effective solution to address these critical challenges of cross-domain generalization and demographic fairness within pathology foundation models. Through a systematic investigation, we identified the detrimental impact of site-specific signatures and demographic biases inherent in pathology VLMs. This understanding directly informed the development of FLEX, an approach engineered to enhance both generalization capacity and fairness across a spectrum of diagnostic tasks.

Our investigation yielded key insights into the nature of these challenges. UMAP visualizations of patch features from pathology VLMs, coupled with the stark performance decline observed in OOD testing, clearly demonstrated the presence of site-specific signatures impeding cross-site generalization. Furthermore, our fairness gap analysis and evident disparities in TPR across demographic groups compellingly highlighted inherent biases within current pathology foundation models. Crucially, we demonstrated that standard stain normalization is insufficient to mitigate these deeper biases. It primarily addresses pixel-level color variations and fails to account for the broader spectrum of site-specific signatures, including subtle variations from tissue preparation and scanner artifacts. To overcome these shortcomings, we introduced FLEX, an approach leveraging both visual and textual prior knowledge to strategically suppress site-specific patterns and demographic bias while amplifying task-relevant feature discriminability, thereby facilitating effective task-specific adaptation. Visual concepts, consisting of representative patch images for target classes (Supplementary Fig. 4), guide the retrieval of task-relevant patches from WSIs, facilitating information bottleneck training. Complementing visual concepts, we incorporated domain expertise to create learnable textual concepts. These textual concepts guide an information bottleneck mechanism within FLEX. This mechanism enhances task-related information within patch features and aligns them to a unified feature space, filtering out extraneous and biasing information. This ultimately leads to significant improvements in both cross-site generalization and demographic fairness. The ablation study on visual prompts and textual prompt length is presented in Supplementary Fig. 5.

FLEX represents a significant step in adapting foundation models for computational pathology. It effectively mitigates the negative impacts of site-specific signatures and demographic biases while improving feature discriminability for downstream tasks. The core mechanism of FLEX uses an information bottleneck guided by textual pathological concepts to disentangle task-relevant features from spurious variations. Because these textual concepts are inherently free of site or demographic information, this process forces the model to discard extraneous signals, including complex site and demographic signatures. The efficacy of FLEX was validated through extensive evaluations across 16 pathology tasks, using various pathology VLMs and MIL frameworks. These consistent results, obtained via a site-preserved cross-validation strategy and supported by fairness analyses, underscore the method’s robustness. Beyond performance improvements, FLEX offers insights into the strategic use of multimodal prior knowledge for task-specific adaptation of foundation models, contributing to the development of more effective and equitable diagnostic tools.

While FLEX leverages the inherent alignment between the visual and textual embedding spaces of pathology VLMs, its reliance on this alignment currently limits compatibility with vision-only foundation models. Future work could explore cross-modal distillation techniques or a vision-only adaptation module to extend FLEX to a broader range of architectures. Moreover, FLEX is currently optimized for classification tasks, such as cancer subtyping and biomarker prediction, whereas complex diagnostic tasks like prognosis are crucial for personalized medicine and patient management; incorporating prognosis-specific textual prompts could guide the model toward features relevant to predicting patient outcomes. Further directions for expanding FLEX’s versatility include application to regression tasks, integration with pathology vision large language models, and improved interpretability and adaptability, including the quantitative validation of attention mechanisms against expert-annotated regions by pathologists, which would provide objective confirmation of the model’s focus. Such work will be crucial for translating these advancements into clinically impactful and trustworthy tools.

Methods

This retrospective study was conducted in compliance with all relevant ethical regulations. The research utilized publicly accessible data from the TCGA and the CPTAC cohorts, which were originally collected under approved protocols with informed consent from all participants. Information regarding participant compensation for these public cohorts was not available to the authors. The in-house NFH dataset was collected and used under the approval of the Medical Ethics Committee of NanFang Hospital of Southern Medical University (approval number: NFEC-2025-419), which granted a waiver of informed consent due to the retrospective nature of the study and the use of de-identified data, posing minimal risk to participants. No compensation was provided for the retrospective collection of NFH data. All data were handled with a strict commitment to patient privacy, fairness, and transparency.

The demographic characteristics of the patient cohorts, including sex and age, are detailed in the subsequent dataset descriptions. For all cohorts, participant sex and race were determined from self-reported information available in the associated clinical metadata, while genetic ancestry was derived from genomic data as established by the TCGA Pan-Cancer Atlas project42. Patient sex was not considered in the study design. Our fairness analysis focused primarily on self-reported race and genetic ancestry to directly address significant disparities previously reported in the literature for these groups in computational pathology.

Dataset and clinical tasks description

This study utilized datasets from the TCGA cohort, including data from breast invasive carcinoma (BRCA), non-small cell lung cancer (NSCLC, encompassing lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC)), stomach adenocarcinoma (STAD), and colorectal carcinoma (CRC, encompassing colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ)). The original WSIs were collected from the publicly available TCGA program (https://www.cancer.gov/ccg/research/genome-sequencing/tcga), and corresponding clinical, molecular, and gene mutation information was obtained from the cBioPortal database (https://www.cbioportal.org). In this study, we conducted 16 WSI analysis tasks to comprehensively evaluate the effectiveness of FLEX across different task types. These tasks include: three morphology classification tasks across three organs, five molecular biomarker status prediction tasks across two organs, and eight gene mutation prediction tasks across four organs. This diverse range of tasks ensures a robust assessment of FLEX in various computational pathology applications. A detailed breakdown of patient and slide counts for each task in the TCGA cohort is provided in Supplementary Table 2.

The demographic and clinical characteristics of the primary TCGA cohorts are as follows: the TCGA-BRCA cohort included 876 female (98.8%) and 11 male (1.2%) patients, with a median age of 58 years (range: 26–90). The TCGA-NSCLC cohort comprised 341 female (40.4%) and 503 male (59.6%) patients, with a median age of 67 years (range: 38–90). The TCGA-STAD cohort included 132 female (34.0%) and 256 male (66.0%) patients, with a median age of 67 years (range: 30–90). The TCGA-CRC cohort consisted of 275 female (48.2%) and 296 male (51.8%) patients, with a median age of 67 years (range: 31–90).

Morphology classification

For the morphology classification tasks, the objective was to classify histopathology images into distinct subtypes of the same cancer type:

  • BRCA-TYPE: This binary classification task involves classifying histopathology images from the TCGA-BRCA cohort into invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC)44.

  • NSCLC-TYPE: This two-class classification task categorizes histopathology images from the TCGA-NSCLC cohort into lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC)45.

  • STAD-LAUREN: This task focuses on the Lauren classification of histopathology images from the TCGA-STAD cohort, distinguishing between the intestinal type and diffuse type46.

Molecular biomarker status prediction

The molecular biomarker status prediction tasks aim to identify clinically relevant biomarker statuses of cancer based on histopathology images, leveraging specific morphological correlations:

  • BRCA-HER2, BRCA-ER, and BRCA-PR: These binary classification tasks predict the human epidermal growth factor receptor 2 (HER2), estrogen receptor (ER), and progesterone receptor (PR) statuses of breast cancer using histopathology images from the TCGA-BRCA cohort47,48,49.

  • STAD-EBV and STAD-MSI: These binary classification tasks predict the Epstein-Barr virus (EBV) and Microsatellite Instability (MSI) statuses of stomach cancer using histopathology images from the TCGA-STAD cohort50,51.

Gene mutation prediction

Gene mutation prediction tasks aim to predict the mutation status of specific genes in cancer using histopathology images. In this study, the following binary classification tasks were conducted:

  • BRCA-PIK3CA and BRCA-CDH1: These tasks predict the PIK3CA and CDH1 gene mutation statuses in breast cancer using histopathology images from the TCGA-BRCA cohort52,53.

  • LUAD-EGFR and LUAD-STK11: These tasks predict the EGFR and STK11 gene mutation statuses in lung adenocarcinoma using histopathology images from the TCGA-LUAD cohort54,55.

  • STAD-TP53 and STAD-MUC16: These tasks predict the TP53 and MUC16 gene mutation statuses in stomach adenocarcinoma using histopathology images from the TCGA-STAD cohort56,57.

  • CRC-BRAF and CRC-TP53: These tasks predict the BRAF and TP53 gene mutation statuses in colorectal carcinoma using histopathology images from the TCGA-CRC cohort58,59.

By targeting tasks of varying complexity and clinical significance, we aimed to comprehensively assess the effectiveness of FLEX across a diverse range of WSI analysis applications.

External validation datasets

To evaluate the zero-shot generalization capability of models trained on TCGA, we used two independent external cohorts for testing: the public CPTAC dataset and a large in-house dataset from NFH. These datasets introduce significant domain shifts, including different patient populations, slide preparation protocols, and scanning equipment, providing a stringent test of real-world applicability. Detailed statistics for the tasks performed on these external cohorts are available in Supplementary Table 3 for CPTAC and Supplementary Table 4 for NFH. The specific tasks for external validation were selected based on data availability and their overlap with the TCGA tasks.

CPTAC dataset

The CPTAC collections are a publicly available resource (https://portal.gdc.cancer.gov/projects/CPTAC-2, https://portal.gdc.cancer.gov/projects/CPTAC-3) that includes genomic and proteomic data along with corresponding WSIs. For our external validation, we selected a subset of tasks that mirror those in our primary TCGA analysis. The CPTAC patient demographics for the tasks used in this study were as follows: the CPTAC-BRCA cohort included 116 female patients with a median age of 59 years (range: 30–89). The CPTAC-NSCLC cohort included 82 female (30.9%) and 183 male (69.1%) patients with a median age of 65 years (range: 36–88). The CPTAC-COAD cohort consisted of 61 female (58.7%) and 43 male (41.3%) patients with a median age of 65 years (range: 35–93). The specific tasks covered three cancer types:

  • BRCA (C-BRCA-TYPE, C-BRCA-PIK3CA, C-BRCA-CDH1): Including subtype classification (IDC vs. ILC) and prediction of PIK3CA and CDH1 gene mutations using histopathology images from the CPTAC-BRCA cohort.

  • NSCLC (C-NSCLC-TYPE, C-LUAD-EGFR, C-LUAD-STK11): Including subtype classification between LUAD and LUSC, and prediction of EGFR and STK11 gene mutations in LUAD using histopathology images from the CPTAC-NSCLC cohort.

  • COAD (C-COAD-BRAF, C-COAD-TP53): Including prediction of BRAF and TP53 gene mutations in COAD using histopathology images from the CPTAC-COAD cohort.

Nanfang Hospital of Southern Medical University (NFH) in-house dataset

The NFH dataset is a large-scale, retrospectively collected in-house cohort from Nanfang Hospital of Southern Medical University, Guangzhou, China. This dataset provides a challenging test case due to its distinct technical and demographic characteristics compared to the Western-centric TCGA cohort. The demographic breakdown for the cohorts used in this study is as follows: the NFH-BRCA cohort consisted of 761 female (99.9%) and 1 male (0.1%) patient, with a median age of 52 years (range: 17–89). The NFH-NSCLC cohort included 910 female (46.7%) and 1037 male (53.3%) patients with a median age of 59 years (range: 15–87). The NFH-STAD cohort included 146 female (37.0%) and 249 male (63.0%) patients with a median age of 60 years (range: 20–89). The NFH-CRC cohort included 482 female (41.6%) and 678 male (58.4%) patients with a median age of 62 years (range: 21–93). The validation tasks on the NFH dataset cover four major cancer types:

  • BRCA (N-BRCA-TYPE, N-BRCA-HER2, N-BRCA-ER, N-BRCA-PR): Including subtype classification (IDC vs. ILC) and prediction of HER2, ER, and PR biomarker statuses using histopathology images from the NFH-BRCA cohort.

  • NSCLC (N-NSCLC-TYPE, N-LUAD-EGFR): Subtype classification between LUAD and LUSC and prediction of EGFR gene mutation status in LUAD using histopathology images from the NFH-NSCLC cohort.

  • STAD (N-STAD-LAUREN): Lauren classification between Intestinal and Diffuse type in STAD using histopathology images from the NFH-STAD cohort.

  • CRC (N-CRC-BRAF): Prediction of BRAF gene mutation status in CRC using histopathology images from the NFH-CRC cohort.

By evaluating model performance on these diverse external datasets without any fine-tuning, we assess the effectiveness of FLEX in overcoming dataset-specific biases and generalizing to unseen clinical environments.

Preprocessing of histopathology images

Segmentation and patching

We adopted the preprocessing steps outlined in CLAM5, which involve segmenting and cropping tissue regions from each WSI and extracting instance features from each patch using a feature extractor. Specifically, each WSI is first loaded into memory at a downsampled resolution (i.e., 32 × downscale) and converted from RGB to HSV color space to facilitate tissue segmentation. A binary mask for the foreground is then computed by thresholding the saturation channel after applying median blurring to smooth edges. Morphological closing operations are subsequently performed to fill small gaps and holes within the tissue regions. The segmented foreground tissue regions are further cropped into 512 × 512 patches from within the foreground contours at 20 × magnification for each slide.
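
The OpenCV sketch below illustrates this segmentation recipe on a downsampled thumbnail; the threshold and kernel sizes are illustrative defaults rather than the exact CLAM parameters used in this study.

```python
import cv2
import numpy as np

def segment_tissue(thumb_rgb, sat_thresh=8, blur_ksize=7, close_ksize=4):
    """CLAM-style tissue segmentation on a downsampled WSI thumbnail: HSV
    conversion, median-blurred saturation thresholding, then morphological
    closing. Parameter values are illustrative, not the study's exact ones."""
    hsv = cv2.cvtColor(thumb_rgb, cv2.COLOR_RGB2HSV)
    sat = cv2.medianBlur(hsv[:, :, 1], blur_ksize)            # smooth edges
    _, mask = cv2.threshold(sat, sat_thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((close_ksize, close_ksize), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)    # fill gaps and holes
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return mask, contours  # 512 x 512 patches are then cropped within the contours
```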

Feature extraction

Due to the potentially vast size of the patch bag for each WSI (e.g., exceeding 10,000 patches), each patch was transformed into a compact feature representation using the pretrained feature extractor neural network. In this study, the feature extractor was implemented using the frozen image encoder of pathology VLMs, including CONCH18, PathGen-CLIP19, and QuiltNet17.
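
A minimal sketch of this step is shown below, where `image_encoder` is a placeholder for the frozen visual backbone of any of the three VLMs; the exact preprocessing and call signature differ between implementations.

```python
import torch

@torch.no_grad()
def extract_features(patches, image_encoder, batch_size=256, device="cuda"):
    """Encode a bag of preprocessed patch tensors (N, 3, H, W) into a compact
    feature matrix (N, D) with a frozen VLM image encoder."""
    image_encoder.eval().to(device)
    feats = [image_encoder(patches[i:i + batch_size].to(device)).cpu()
             for i in range(0, patches.shape[0], batch_size)]
    return torch.cat(feats)  # one D-dimensional embedding per patch
```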

Stain normalization

For stain normalization, we employed the Reinhard method60 and the Macenko method39 to standardize the color distribution of the histopathology images. For implementation, we utilized the torch-staintools package for efficient normalization (https://github.com/CielAl/torch-staintools).
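
For reference, the Reinhard transfer reduces to matching per-channel statistics in LAB color space to a template. The sketch below is a minimal NumPy/scikit-image version of that idea, whereas the experiments used torch-staintools for efficient batched normalization.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def reinhard(image_rgb, ref_mean, ref_std):
    """Match the image's per-channel mean and standard deviation in LAB space
    to reference statistics (length-3 arrays precomputed from a template)."""
    lab = rgb2lab(image_rgb)
    mean = lab.reshape(-1, 3).mean(axis=0)
    std = lab.reshape(-1, 3).std(axis=0)
    matched = (lab - mean) / (std + 1e-8) * ref_std + ref_mean
    return (np.clip(lab2rgb(matched), 0, 1) * 255).astype(np.uint8)
```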

Pathology vision-language foundation models

With the advancement of large-scale pre-trained VLMs in computer vision and natural language processing, several pathology-specific VLMs have been developed to integrate histopathology images and textual concepts. Compared to vision-only foundation models, pathology VLMs better capture the intricate relationships between histopathology images and textual data. This alignment enhances the generalizability and robustness of image features22. The proposed FLEX is designed to seamlessly integrate with various pathology VLMs to improve the performance of WSI analysis tasks. In this study, we employed three state-of-the-art pathology VLMs:

CONCH18

CONCH is a visual-language foundation model pre-trained on over 1.17 million image-caption pairs, leveraging diverse histopathology images and biomedical text. It supports versatile applications in histopathology with minimal or no supervised fine-tuning, addressing label scarcity and enhancing adaptability across diseases and tasks. The CONCH model is publicly available on GitHub: https://GitHub.com/mahmoodlab/CONCH.

PathGen-CLIP19

PathGen-CLIP is a pathology-specific VLM trained using PathGen-1.6M19, a dataset containing 1.6 million high-quality image-caption pairs generated from large-scale WSI datasets such as TCGA. The PathGen-CLIP model is publicly available on GitHub: https://GitHub.com/PathGen-1-6M/PathGen-1.6M.

QuiltNet17

QuiltNet is a visual-language model for histopathology trained on Quilt-1M17, a dataset of 1 million image-caption pairs curated from sources such as YouTube, Twitter, and research papers. QuiltNet is publicly available on GitHub: https://GitHub.com/wisdomikezogwo/quilt1m.

Multiple instance learning methods

MIL61 is the predominant weakly supervised learning paradigm in computational pathology. In MIL, each WSI is treated as a bag of patch instances, and the models predict the bag-level label based on aggregated instance features. To demonstrate the adaptability and generalizability of FLEX, we integrated it with five state-of-the-art MIL methods in our experimental comparisons. The selected MIL methods are as follows:

ABMIL27

ABMIL models the Bernoulli distribution of a bag and parameterizes the aggregation operator using an attention mechanism. We used the original implementation available on GitHub: https://GitHub.com/AMLab-Amsterdam/AttentionDeepMIL.
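
A minimal sketch of this gated attention pooling is given below; dimensions are illustrative, and the linked repository remains the reference implementation.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Each patch feature receives a learned attention weight; the slide-level
    representation is the attention-weighted sum of patch features."""
    def __init__(self, feat_dim=512, hidden_dim=256, n_classes=2):
        super().__init__()
        self.V = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.U = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
        self.w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, bag):                        # bag: (N, feat_dim)
        attn = torch.softmax(self.w(self.V(bag) * self.U(bag)), dim=0)  # (N, 1)
        slide = (attn * bag).sum(dim=0)            # (feat_dim,)
        return self.classifier(slide), attn.squeeze(-1)
```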

ACMIL28

ACMIL mitigates overfitting by introducing multi-branch attention and stochastic Top-k instance masking, addressing attention value concentration, and enhancing the capture of diverse discriminative instances. The original implementation was used from GitHub: https://GitHub.com/dazhangyu123/ACMIL.

CLAM-SB5

CLAM-SB identifies diagnostically relevant sub-regions and refines the feature space through instance-level clustering of these regions. The implementation from the original GitHub repository was followed: https://GitHub.com/mahmoodlab/CLAM.

DTFD-MIL29

DTFD-MIL addresses WSI classification challenges with small sample sizes by using pseudo-bags to expand training data and a double-tier MIL framework for better feature utilization. The original implementation was used from GitHub: https://GitHub.com/hrzhang1123/DTFD-MIL.

ILRA-MIL30

ILRA-MIL employs a low-rank constraint to group similar patches and separates different ones, utilizing low-rank latent vectors for efficient global interaction modeling. The original GitHub implementation was used: https://GitHub.com/jinxixiang/low_rank_wsi.

Site-preserved Monte Carlo cross-validation

Histopathology images within the TCGA cohort originate from diverse source sites (e.g., hospitals, research institutions), often exhibiting site-specific signatures. These signatures can induce models to learn spurious correlations, hindering their ability to generalize to new, unseen sites. To rigorously evaluate the generalizability and effectiveness of FLEX and specifically assess performance on data from novel source sites, we implemented an SP-MCCV strategy. This approach is specifically designed to ensure models are tested on completely independent source sites not encountered during training. As illustrated in Fig. 6d, this nested SP-MCCV strategy is conceptually inspired by the Preserved-Site Cross-Validation method37 and includes outer and inner cross-validation loops:

Fig. 6: Overview of the proposed framework and experimental setup.
figure 6

a Visual knowledge generation pipeline. This is a one-time offline procedure during training. ABMIL identifies high-attention patches, followed by GPT-4o-based filtering and human expert verification to select representative patches. Class-specific visual concepts are generated by averaging features from representative patches using the VLM image encoder. b Pipeline for textual knowledge generation, also a one-time offline procedure. GPT-4o is prompted to create representative text prompts (both task-specific and task-agnostic). After expert verification, these prompts are enriched with learnable, data-driven contexts and encoded by the VLM text encoder to create the final textual concepts. c The architecture of FLEX. During inference, original patch features are extracted by the VLM image encoder; a variational encoder then generates parameters for a Gaussian distribution for each patch, from which enhanced features are sampled and aggregated by the MIL model. During training, a Visual Concept Guided Pilot Patch Selection module uses the pre-computed visual concept to select the top-k most relevant enhanced patches. These selected patches are then used in the Textual Concept Guided Feature Calibration process, where an InfoNCE loss aligns the features with the textual concepts by minimizing the distance to the corresponding class concept while maximizing their distance from other concepts. This process helps to optimize the variational encoder of FLEX. d Schematic of the SP-MCCV strategy. The dataset is partitioned into outer folds based on the clinical site to create distinct training and OOD test sets. Inner folds are then used to randomly split the training data for IND evaluation. Some illustrations were created with BioRender.com.

Outer loop: site-preserved partitioning

Our strategy treats each Tissue Source Site (TSS) as an independent entity, representing a unique source of protocols and potential biases. We do not group sites based on known factors (e.g., scanner type), as this would fail to capture the full complexity of site-specific artifacts and would represent a less stringent test of generalization. Instead, we partition the entire set of TSS codes (see https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes for the TSS code mapping) into \(K_s\) mutually exclusive site-folds using convex optimization to ensure that each site-fold contains a balanced proportion of patients for the diagnostic labels of interest37. In each outer loop of the cross-validation, one of these site-folds is designated as the IND dataset, while the remaining \(K_s-1\) site-folds are aggregated to form the OOD test set, simulating a deployment to unseen medical centers.

Inner fold: Monte Carlo cross-validation within IND data

Within each site-preserved outer loop, the IND dataset (composed of seen sites) was further subjected to Monte Carlo Cross-Validation (MCCV). Specifically, the IND data was randomly divided into training and IND testing subsets at a 7:3 ratio. The OOD dataset, drawn from the held-out site-folds, was kept entirely separate and used exclusively for the final OOD performance evaluation.

This nested SP-MCCV procedure guarantees that model training and validation occur on data originating from distinct, site-specific distributions. For the BRCA and NSCLC datasets, we employed \(K_s=5\) site-preserved splits, with each IND fold undergoing 3 MCCV iterations, resulting in a total of 15 evaluation folds. For datasets with fewer source sites and WSIs, we reduced \(K_s\) to 3 to maintain sufficient training data per fold, while increasing MCCV iterations to 5 per IND fold, also yielding a total of 15 folds. This adjustment ensures robust statistical evaluation across datasets of varying sizes and site diversity. In summary, this SP-MCCV strategy provides a rigorous and comprehensive evaluation of model performance, specifically across unseen source sites. By strictly separating training and testing data at the source site level, SP-MCCV allows for a robust and reliable assessment of FLEX’s generalizability and effectiveness in mitigating site-specific biases.
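
The sketch below outlines the full procedure with illustrative stand-in data. The greedy site-to-fold assignment is a simplified substitute for the convex optimization used in the study, and the 7:3 inner split follows the MCCV described above.

```python
import numpy as np

def site_preserved_folds(site_ids, labels, n_folds=5):
    """Assign whole sites to folds, largest site first, always to the currently
    smallest fold (a greedy stand-in for the convex optimization that balances
    diagnostic-label proportions across site-folds)."""
    site_ids, labels = np.asarray(site_ids), np.asarray(labels)
    n_classes = int(labels.max()) + 1
    counts = {s: np.bincount(labels[site_ids == s], minlength=n_classes)
              for s in np.unique(site_ids)}
    fold_counts, fold_of = np.zeros((n_folds, n_classes)), {}
    for s in sorted(counts, key=lambda s: -counts[s].sum()):   # largest first
        f = int(np.argmin(fold_counts.sum(axis=1)))
        fold_of[s] = f
        fold_counts[f] += counts[s]
    return np.array([fold_of[s] for s in site_ids])

# Stand-in data: 40 sites contributing unequal numbers of slides, two classes.
rng = np.random.default_rng(0)
site_ids = rng.integers(0, 40, size=900)
labels = rng.integers(0, 2, size=900)

folds = site_preserved_folds(site_ids, labels, n_folds=5)
for k in range(5):
    ind = np.where(folds == k)[0]            # IND pool: sites in fold k
    ood = np.where(folds != k)[0]            # OOD test set: all held-out sites
    for _ in range(3):                       # inner MCCV iterations
        perm = rng.permutation(ind)          # 7:3 split of the IND pool
        train, ind_test = perm[:int(0.7 * len(ind))], perm[int(0.7 * len(ind)):]
        # ...train on `train`, evaluate on `ind_test` (IND) and `ood` (OOD)
```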

FLEX architecture

As illustrated in Fig. 1d, the proposed FLEX is a flexible and adaptable module that can be seamlessly integrated into existing workflows for WSI analysis tasks. For clarity, we denote the input histopathology dataset as \({{{\mathcal{D}}}}=\{\left({{{\bf{P}}}},{{{\bf{y}}}}\right)\}\), where P = {p1, p2, …, pN} is a set of N patches cropped from the WSI and y is the corresponding label. The traditional WSI analysis pipeline comprises two stages: feature extraction and MIL-based classification. In the feature extraction stage, the image encoder Ei of a pathology VLM extracts patch-level features from the patches: \({{{\bf{X}}}}={{{{\rm{E}}}}}_{i}\left({{{\bf{P}}}}\right)\), where X = {x1, x2, …, xN} denotes the original patch-level features, each of dimension D. In the MIL-based classification stage, a MIL model aggregates the patch-level features into a comprehensive representation, upon which a classifier predicts the bag-level label \(\hat{{{{\bf{y}}}}}\): \(\hat{{{{\bf{y}}}}}={{{\rm{CLS}}}}\left({{{\rm{MIL}}}}\left({{{\bf{X}}}}\right)\right)\), where CLS and MIL denote the classifier and the MIL model, respectively. The objective of FLEX is to enhance the original patch-level features X by leveraging the knowledge encoded in visual concepts V and textual concepts T. This process filters out site-specific signatures and task-agnostic information, yielding enhanced features that are subsequently fed into the MIL model for generalizable and robust WSI analysis: \({{{\bf{Z}}}}={{{\bf{FLEX}}}}\left({{{\bf{X}}}},{{{\bf{V}}}},{{{\bf{T}}}}\right)\), \(\hat{{{{\bf{y}}}}}={{{\rm{CLS}}}}\left({{{\rm{MIL}}}}\left({{{\bf{Z}}}}\right)\right)\), where Z = {z1, z2, …, zN} represents the enhanced patch-level features, each of dimension D. Specifically, the visual concepts V consist of representative patch images for the target class and are used to retrieve important patches from the original WSI that may contain task-related information. The textual concepts T, comprising representative text prompts with learnable contexts, guide the feature enhancement process by filtering out site-specific signatures and task-agnostic information while preserving task-related information. In the following sections, we detail the design of FLEX and its integration with pathology VLMs and MIL models.
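The overall data flow can be summarized in a minimal PyTorch-style sketch, where `image_encoder`, `flex`, `mil`, and `cls` stand in for the frozen VLM image encoder, the FLEX module, the MIL aggregator, and the classifier; all names are illustrative rather than a released API.

```python
import torch

@torch.no_grad()
def extract_features(image_encoder, patches):
    """Stage 1: frozen VLM image encoder maps patches (N, 3, H, W) to X (N, D)."""
    return image_encoder(patches)

def predict_slide(image_encoder, flex, mil, cls, patches, V, T):
    X = extract_features(image_encoder, patches)  # original patch features
    Z = flex(X, V, T)        # FLEX: knowledge-guided feature enhancement
    z_bag = mil(Z)           # MIL aggregation to a slide-level feature
    return cls(z_bag)        # bag-level prediction y_hat
```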

Visual concept generation

The process of visual concept generation is illustrated in Fig. 6a. First, an ABMIL model was trained on the training data, and patches with high attention scores were randomly sampled from the training set. However, because of potential overfitting, these high-attention patches may not be representative of the target class. To mitigate this issue, a multimodal large language model (e.g., GPT-4o) was employed to filter out task-irrelevant and noisy patches, by prompting it with queries asking whether the input patch reliably indicates membership in the target class. This step served as a preliminary screening for representative patches. To further ensure task relevance, a human expert manually verified the selected patch candidates. Finally, the visual concept for each target class was generated by averaging the features of the selected patches, extracted with the image encoder of the pathology VLM.
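A minimal sketch of the final averaging step, assuming `verified_patches` holds the GPT-4o-filtered, expert-verified patches for one class and `image_encoder` is the frozen VLM image encoder (hypothetical names); the unit normalization is our assumption, convenient for the cosine-similarity use described later.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_concept(image_encoder, verified_patches):
    """verified_patches: (M, 3, H, W) patches that passed GPT-4o filtering
    and expert verification for one class (hypothetical input)."""
    feats = image_encoder(verified_patches)   # (M, D) patch features
    concept = feats.mean(dim=0)               # class-specific average
    return F.normalize(concept, dim=-1)       # unit norm (our assumption)
```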

Textual concept generation

The process of textual concept generation is illustrated in Fig. 6b. It involves creating task-specific and task-agnostic prompts that serve as positive and negative supervision signals, respectively. Recognizing the need for accuracy in medical applications, our methodology incorporates an expert validation protocol. First, prompt generation is constrained to simple, descriptive labels grounded in established oncologic terminology (e.g., “invasive ductal carcinoma”, “connective tissue”), which minimizes the risk of factual inaccuracies. Second, and most critically, we implemented a human-in-the-loop expert validation process: all textual concepts, whether initially derived from established class names or drafted with assistance from GPT-4o, were rigorously reviewed, edited for clinical precision, and formally approved by a board-certified pathologist. This expert oversight ensures that every prompt is medically accurate and relevant to the prediction task. The task-specific and task-agnostic prompts used for each cancer type are detailed in Supplementary Tables 24–29. With these controls, prompts were defined as follows. For tasks with clear class boundaries, such as BRCA subtyping, the task-specific prompts are the class names themselves, following previous studies18,62. For more complex tasks with vague class boundaries, high complexity, or limited data, however, class names alone may be insufficient. To address this, we use the same class names as task-specific prompts but attach learnable contexts, which adaptively capture task-specific information from the images during training and enhance the prompts' effectiveness. Task-agnostic prompts consist of general descriptions irrelevant to the task, such as the names of other tissue types (e.g., “connective tissue” or “normal tissue”); they guide the feature enhancement process to filter out task-agnostic information and are likewise attached to learnable contexts. Finally, textual concepts are generated by averaging the embeddings of these prompts (with their learnable contexts) produced by the VLM's text encoder. A CoOp-style sketch of this construction follows.
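The sketch below illustrates one prompt with a learnable context in the style of CoOp-like prompt tuning; it assumes the frozen text encoder can consume token embeddings directly, and all class and module names are illustrative, not the released API. In practice, several such prompt embeddings would be averaged to form one textual concept.

```python
import torch
import torch.nn as nn

class PromptWithContext(nn.Module):
    """One prompt = learnable context tokens + fixed class-name embeddings."""

    def __init__(self, text_encoder, class_token_emb, n_ctx=4):
        super().__init__()
        self.text_encoder = text_encoder                    # kept frozen
        self.register_buffer("class_emb", class_token_emb)  # (L, dim), fixed
        dim = class_token_emb.shape[-1]
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))  # learnable

    def forward(self):
        # Prepend the learnable context to the class-name token embeddings,
        # then encode the full sequence with the frozen text encoder.
        tokens = torch.cat([self.ctx, self.class_emb], dim=0)
        return self.text_encoder(tokens.unsqueeze(0)).squeeze(0)  # (D,)
```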

Knowledge-guided feature enhancement

The feature enhancement process, illustrated in Fig. 6c, is guided by visual and textual knowledge and comprises three main steps: visual concept-guided pilot patch selection, textual concept-guided feature calibration, and feature enhancement using a variational information bottleneck. To enhance the original patch-level features, we leverage the information bottleneck principle63,64, which filters out site-specific signatures and task-agnostic information while preserving task-related information. The information bottleneck principle seeks a compressed representation of the input that minimizes the mutual information between the input and the compressed representation while maximizing the mutual information between the compressed representation and the target output.

First, cropped patches from WSIs are processed through the frozen image encoder of the pathology VLMs to extract the original patch-level features X. For each original feature x and its corresponding enhanced feature z, we define the mutual information between them as:

$$I\left({{{\bf{x}}}},{{{\bf{z}}}}\right)=\int\,p({{{\bf{x}}}},{{{\bf{z}}}})\log \frac{p({{{\bf{z}}}}| {{{\bf{x}}}})}{p({{{\bf{z}}}})}{{{\rm{d}}}}{{{\bf{x}}}}{{{\rm{d}}}}{{{\bf{z}}}},$$
(1)
$$=\int\,p({{{\bf{x}}}},{{{\bf{z}}}})\log p({{{\bf{z}}}}| {{{\bf{x}}}}){{{\rm{d}}}}{{{\bf{x}}}}{{{\rm{d}}}}{{{\bf{z}}}}-\int\,p({{{\bf{z}}}})\log p({{{\bf{z}}}}){{{\rm{d}}}}{{{\bf{z}}}},$$
(2)

where p(x, z) represents the joint distribution of the original and enhanced features, p(x) and p(z) are the marginal distributions, and p(z∣x) is the conditional distribution of the enhanced features given the original features. However, computing p(z) = ∫ p(z∣x)p(x)dx is intractable. To address this issue, we introduce a Gaussian distribution r(z) as a variational approximation of the true marginal distribution p(z). Since the KL divergence between the true marginal p(z) and its variational approximation r(z) is non-negative, we have \(\int\,p({{{\bf{z}}}})\log p({{{\bf{z}}}}){{{\rm{d}}}}{{{\bf{z}}}}\ge \int\,p({{{\bf{z}}}})\log r({{{\bf{z}}}}){{{\rm{d}}}}{{{\bf{z}}}}\). Therefore, we derive an upper bound for the mutual information: \(I\left({{{\bf{x}}}},{{{\bf{z}}}}\right)\le \int\,p({{{\bf{x}}}})p({{{\bf{z}}}}| {{{\bf{x}}}})\log \frac{p({{{\bf{z}}}}| {{{\bf{x}}}})}{r({{{\bf{z}}}})}{{{\rm{d}}}}{{{\bf{x}}}}{{{\rm{d}}}}{{{\bf{z}}}}\).

We further utilize a variational distribution qθ(z∣x) to approximate the true posterior p(z∣x), implemented as a neural network that takes the original features x as input and outputs the mean and variance of the Gaussian distribution of the enhanced features z. We can then sample the enhanced features z from the variational distribution: \({{{\bf{z}}}} \sim {{{\mathcal{N}}}}\left({{{{\rm{NN}}}}}^{\mu }\left({{{\bf{x}}}}\right),{{{{\rm{NN}}}}}^{\Sigma }\left({{{\bf{x}}}}\right)\right)\), where NNμ and NNΣ are the neural networks predicting the mean and variance of the Gaussian distribution, respectively. To enable efficient optimization, we apply the reparameterization trick65, which provides an unbiased gradient estimate of the objective66. With this approach, the upper bound of the mutual information \(I\left({{{\bf{x}}}},{{{\bf{z}}}}\right)\) can be computed as \(\int\,p({{{\bf{x}}}})p({{{\bf{z}}}}| {{{\bf{x}}}})\log \frac{p({{{\bf{z}}}}| {{{\bf{x}}}})}{r({{{\bf{z}}}})}{{{\rm{d}}}}{{{\bf{x}}}}{{{\rm{d}}}}{{{\bf{z}}}}\approx {{\mathbb{E}}}_{{{{\bf{x}}}}}\left[\int {q}_{\theta }({{{\bf{z}}}}| {{{\bf{x}}}})\log \frac{{q}_{\theta }({{{\bf{z}}}}| {{{\bf{x}}}})}{r({{{\bf{z}}}})}{{{\rm{d}}}}{{{\bf{z}}}}\right]\), which is optimized by minimizing the Kullback-Leibler (KL) divergence between the variational posterior qθ(z∣x) and the Gaussian distribution r(z): \({{{{\mathcal{L}}}}}_{{{{\rm{IB}}}}}={{\mathbb{E}}}_{{{{\bf{x}}}}}[{D}_{{{{\rm{KL}}}}}({q}_{\theta }({{{\bf{z}}}}| {{{\bf{x}}}})| | r({{{\bf{z}}}}))]\). Traditionally, the maximization of mutual information between the enhanced features and the target output is achieved by minimizing the cross-entropy loss between the bag-level predictions and the ground-truth labels. However, weakly supervised MIL models often fail to provide accurate supervision signals for optimizing the variational information bottleneck, which operates at the patch level. To address this challenge, we leverage visual and textual concepts to guide the feature enhancement process, providing more accurate patch-level supervision signals.
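A minimal sketch of this variational encoder, assuming r(z) = N(0, I) and implementing NNμ and NNΣ as linear heads (a log-variance head is used for numerical stability); the architecture details are illustrative.

```python
import torch
import torch.nn as nn

class VariationalEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)        # NN^mu
        self.logvar = nn.Linear(dim, dim)    # NN^Sigma, as log-variance

    def forward(self, x):                    # x: (N, D) original features
        mu, logvar = self.mu(x), self.logvar(x)
        eps = torch.randn_like(mu)           # reparameterization trick
        z = mu + eps * torch.exp(0.5 * logvar)   # enhanced features (N, D)
        # Closed-form KL( N(mu, sigma^2) || N(0, I) ), averaged over patches:
        # this is the L_IB term when r(z) is a standard normal.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        return z, kl
```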

Specifically, as only a subset of patches contains task-related information, we select pilot patches zpilot to guide the feature enhancement process. We compute the cosine similarity between the patch features and the visual concept of the target class and designate the top-scoring patches as pilots, ensuring that feature refinement is driven by patches carrying task-related information.
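A short sketch of this selection step; the value of k is a hyperparameter and the default shown is illustrative.

```python
import torch
import torch.nn.functional as F

def select_pilot_patches(feats, concept, k=64):
    """feats: (N, D) patch features; concept: (D,) class visual concept."""
    sim = F.cosine_similarity(feats, concept.unsqueeze(0), dim=-1)  # (N,)
    idx = sim.topk(min(k, feats.size(0))).indices                   # top-k
    return feats[idx]   # pilot patches, ranked by similarity
```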

Following the selection of pilot patches, textual concepts are employed to guide the feature enhancement process. Given that the text encoder of pathology VLMs is inherently general due to pretraining on large-scale, diverse pathology datasets with domain-agnostic learning objectives, we posit that textual concepts are robust and generalizable. These concepts can effectively filter out site-specific signatures, while task-specific textual concepts provide relevant information for the target class. This enables the enhancement process to preserve task-related information and eliminate task-agnostic noise. Consequently, the task-specific textual concept for the target class is treated as the positive supervision signal \({{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\), whereas other task-specific and task-agnostic textual concepts are treated as negative supervision signals \({{{{\bf{c}}}}}_{{{{\rm{neg}}}}}^{{{{\rm{p}}}}}\). To implement this guidance, we utilize the InfoNCE loss67, which minimizes the distance between the enhanced features and the positive supervision signal while maximizing the distance between the enhanced features and the negative supervision signals:

$${{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{z}}^{{{{\rm{p}}}}}=-\frac{1}{M}{\sum }_{j=1}^{M}\log \frac{\exp \left(\frac{{{{{\bf{z}}}}}_{j}^{{{{\rm{pilot}}}}}\cdot {{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}}{\tau }\right)}{\exp \left(\frac{{{{{\bf{z}}}}}_{j}^{{{{\rm{pilot}}}}}\cdot {{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}}{\tau }\right)+\mathop{\sum }_{i=1}^{K}\exp \left(\frac{{{{{\bf{z}}}}}_{j}^{{{{\rm{pilot}}}}}\cdot {{{{\bf{c}}}}}_{{{{{\rm{neg}}}}}_{i}}^{{{{\rm{p}}}}}}{\tau }\right)},$$
(3)

where M represents the number of pilot patches, K denotes the number of negative supervision signals, and τ is the temperature parameter. To further enhance the distinction between task-specific textual concepts and task-agnostic concepts, and to ensure task-specific textual concepts from different classes are more distinguishable, we propose an additional optimization strategy. In this approach, the corresponding task-specific textual concept of the target class serves as the optimization target, the pilot patches are treated as positive samples, and the other task-specific textual concepts and task-agnostic textual concepts are treated as negative samples:

$${{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{c}}^{{{{\rm{p}}}}}=-\log \frac{\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\cdot {{{{\bf{z}}}}}^{{{{\rm{pilot}}}}}}{\tau }\right)}{\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\cdot {{{{\bf{z}}}}}^{{{{\rm{pilot}}}}}}{\tau }\right)+\mathop{\sum }_{i=1}^{K}\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\cdot {{{{\bf{c}}}}}_{{{{{\rm{neg}}}}}_{i}}^{{{{\rm{p}}}}}}{\tau }\right)}.$$
(4)
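A PyTorch-style sketch of the patch-level InfoNCE losses in Eqs. (3)–(4), assuming L2-normalized features and concepts; the temperature default is illustrative, and aggregating the pilot patches by their mean in Eq. (4) is our reading of the unindexed zpilot.

```python
import torch

def nce_z(z_pilot, c_pos, c_neg, tau=0.07):
    """Eq. (3): pull pilot patches toward the target-class concept.
    z_pilot: (M, D); c_pos: (D,); c_neg: (K, D)."""
    pos = torch.exp(z_pilot @ c_pos / tau)                # (M,)
    neg = torch.exp(z_pilot @ c_neg.T / tau).sum(dim=-1)  # (M,)
    return -torch.log(pos / (pos + neg)).mean()

def nce_c(z_pilot, c_pos, c_neg, tau=0.07):
    """Eq. (4): push the target concept toward pilot patches and away
    from the other concepts (mean-aggregated pilots, our assumption)."""
    z = z_pilot.mean(dim=0)                               # (D,)
    pos = torch.exp(torch.dot(c_pos, z) / tau)
    neg = torch.exp(c_neg @ c_pos / tau).sum()
    return -torch.log(pos / (pos + neg))
```

The slide-level losses in Eqs. (5)–(6) follow the same form, with the bag-level feature zbag in place of the pilot patches and independently optimized slide-level concepts.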

The enhanced patch-level features are obtained by inputting the original patch-level features into the information bottleneck. These features are then aggregated by the MIL model to produce the bag-level features, denoted as \({{{{\bf{z}}}}}^{{{{\rm{bag}}}}}={{{\rm{MIL}}}}\left({{{\bf{Z}}}}\right)\). Subsequently, we propose further enhancing the aggregated bag-level features through a contrastive learning process, leveraging the positive supervision signals \({{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\) and negative supervision signals \({{{{\bf{c}}}}}_{{{{\rm{neg}}}}}^{{{{\rm{s}}}}}\), similar to the feature enhancement process applied to the patch-level features. It is important to note that the textual concepts used for patch-level (\({{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\) and \({{{{\bf{c}}}}}_{{{{\rm{neg}}}}}^{{{{\rm{p}}}}}\)) and slide-level (\({{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\) and \({{{{\bf{c}}}}}_{{{{\rm{neg}}}}}^{{{{\rm{s}}}}}\)) feature enhancement share the same initialization but are optimized independently. The resulting InfoNCE losses for the slide-level feature enhancement are defined as:

$${{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{z}}^{{{{\rm{s}}}}}=-\log \frac{\exp \left(\frac{{{{{\bf{z}}}}}^{{{{\rm{bag}}}}}\cdot {{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}}{\tau }\right)}{\exp \left(\frac{{{{{\bf{z}}}}}^{{{{\rm{bag}}}}}\cdot {{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}}{\tau }\right)+\mathop{\sum }_{i=1}^{K}\exp \left(\frac{{{{{\bf{z}}}}}^{{{{\rm{bag}}}}}\cdot {{{{\bf{c}}}}}_{{{{{\rm{neg}}}}}_{i}}^{{{{\rm{s}}}}}}{\tau }\right)},$$
(5)
$${{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{c}}^{{{{\rm{s}}}}}=-\log \frac{\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\cdot {{{{\bf{z}}}}}^{{{{\rm{bag}}}}}}{\tau }\right)}{\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\cdot {{{{\bf{z}}}}}^{{{{\rm{bag}}}}}}{\tau }\right)+\mathop{\sum }_{i=1}^{K}\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\cdot {{{{\bf{c}}}}}_{{{{{\rm{neg}}}}}_{i}}^{{{{\rm{s}}}}}}{\tau }\right)}.$$
(6)

The final prediction of the bag-level label is obtained by feeding the enhanced bag-level features into the classifier: \(\hat{{{{\bf{y}}}}}={{{\rm{CLS}}}}\left({{{{\bf{z}}}}}^{{{{\rm{bag}}}}}\right)\). The prediction is supervised by the ground-truth label through the cross-entropy loss \({{{{\mathcal{L}}}}}_{{{{\rm{CE}}}}}\).

The total loss for FLEX is a combination of the cross-entropy loss for the MIL model, the variational information bottleneck loss, and the InfoNCE losses:

$${{{{\mathcal{L}}}}}_{{{{\rm{FLEX}}}}}={{{{\mathcal{L}}}}}_{{{{\rm{CE}}}}}+{\uplambda }_{{{{\rm{1}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{IB}}}}}+{\uplambda }_{{{{\rm{2}}}}}\left({{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{z}}^{{{{\rm{p}}}}}+{{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{c}}^{{{{\rm{p}}}}}\right)+{\uplambda }_{{{{\rm{3}}}}}\left({{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{z}}^{{{{\rm{s}}}}}+{{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{c}}^{{{{\rm{s}}}}}\right),$$
(7)

where λ1, λ2, and λ3 are hyperparameters that balance the different components of the loss function.
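For completeness, the full objective of Eq. (7) in code; the λ defaults shown are placeholders, not tuned values.

```python
def flex_loss(l_ce, l_ib, l_nce_zp, l_nce_cp, l_nce_zs, l_nce_cs,
              lam1=1.0, lam2=1.0, lam3=1.0):
    """Eq. (7): cross-entropy + IB + patch- and slide-level InfoNCE terms."""
    return (l_ce + lam1 * l_ib
            + lam2 * (l_nce_zp + l_nce_cp)
            + lam3 * (l_nce_zs + l_nce_cs))
```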

Training details

All experiments were conducted on a server with 8 NVIDIA L40S GPUs. The length of the learnable context was set to 4 tokens for morphology classification and molecular biomarker status prediction tasks, and 2 tokens for gene mutation prediction tasks. An ablation study evaluated the impact of the learnable context length by tuning it within {0, 2, 4, 6, 8, 10, 12} for each task; the average performance for each task type is reported in the supplementary material. During training, all weights of the pathology VLMs were frozen, and only the learnable contexts, the weights of the variational information bottleneck in FLEX, and the weights of the MIL models were optimized. For optimization, the Adam optimizer was used with a learning rate of 0.0001 for ABMIL, ACMIL, and CLAM-SB, and 0.00001 for DTFD-MIL and ILRA-MIL. The batch size was set to 1, and all tasks were trained for 20 epochs. For evaluation, the AUROC and the F1 score were adopted as metrics. Results for all models are reported as the mean and standard deviation over the 15 SP-MCCV folds for both OOD and IND testing data.

Evaluation metrics

To comprehensively assess the efficacy of FLEX, we evaluated its performance across two key dimensions: standard performance metrics and demographic fairness metrics.

Performance metrics

Given the inherent class imbalance often encountered in WSI analysis tasks, we employed the AUROC and the F1 score as primary performance metrics. These metrics are well-suited for evaluating model performance in imbalanced classification scenarios. The Receiver Operating Characteristic (ROC) curve graphically depicts the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across varying classification thresholds. AUROC, representing the area under this curve, provides a threshold-agnostic measure of a model’s ability to discriminate between positive and negative instances. Its robustness to class imbalance makes AUROC particularly appropriate for our evaluation. Complementing AUROC, the F1 score offers a balanced measure of precision and recall, providing a more granular perspective on classification accuracy. Defined as the harmonic mean of precision and recall, the F1 score is calculated as follows:

$${F}_{1}=\frac{2\times \,{\mbox{TP}}}{2\times {\mbox{TP}}+{\mbox{FP}}+{\mbox{FN}}\,},$$
(8)

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The F1 score penalizes models that disproportionately favor either precision or recall, rewarding those with balanced performance.
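Both metrics are available in scikit-learn; a toy example with illustrative data:

```python
from sklearn.metrics import roc_auc_score, f1_score

y_true = [0, 0, 1, 1]              # toy slide-level labels
y_score = [0.1, 0.4, 0.35, 0.8]    # predicted positive-class probabilities
y_pred = [int(s >= 0.5) for s in y_score]  # thresholded predictions

auroc = roc_auc_score(y_true, y_score)  # threshold-agnostic discrimination
f1 = f1_score(y_true, y_pred)           # Eq. (8): harmonic mean of P and R
```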

Fairness metrics

To evaluate FLEX’s impact on demographic fairness, we adopted a suite of fairness metrics: AUROC gap ratio43, TPR disparity35, and race-wise AUROC35. These metrics collectively quantify and characterize potential performance disparities across different demographic groups, offering a multi-faceted assessment of fairness. The AUROC gap ratio, as defined in43, quantifies the relative AUROC performance disparity between the best-performing and worst-performing demographic groups, normalized by the overall AUROC. A smaller AUROC gap ratio indicates improved fairness. TPR disparity, as introduced in35, measures the difference between the overall TPR and the TPR for each demographic group. TPR disparity values closer to zero suggest greater fairness, indicating that the model’s true positive detection rate is more consistent across demographic groups and less biased towards any particular group. Furthermore, race-wise AUROC provides a granular view of performance by calculating the AUROC separately for each demographic group. Consistent AUROC values across different demographic groups are indicative of a fairer model, demonstrating that the model’s discriminatory power is not significantly influenced by demographic factors.
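The sketch below computes these three quantities following the definitions above (gap ratio = (best − worst group AUROC) / overall AUROC; TPR disparity = overall TPR − per-group TPR); `groups` maps each slide to its demographic group, and the function name is illustrative. It assumes every group contains both classes, as required for a per-group AUROC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score

def fairness_metrics(y_true, y_score, y_pred, groups):
    """AUROC gap ratio, per-group TPR disparity, and group-wise AUROC."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    y_pred, groups = np.asarray(y_pred), np.asarray(groups)
    overall_auroc = roc_auc_score(y_true, y_score)
    overall_tpr = recall_score(y_true, y_pred)   # recall == TPR
    group_auroc, tpr_disparity = {}, {}
    for g in np.unique(groups):
        m = groups == g
        group_auroc[g] = roc_auc_score(y_true[m], y_score[m])
        tpr_disparity[g] = overall_tpr - recall_score(y_true[m], y_pred[m])
    gap_ratio = (max(group_auroc.values())
                 - min(group_auroc.values())) / overall_auroc
    return gap_ratio, tpr_disparity, group_auroc
```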

Ethics declarations

The TCGA and CPTAC datasets are publicly accessible under approved protocols with participants' informed consent. The NFH dataset was collected under the approval of the Medical Ethics Committee of NanFang Hospital of Southern Medical University (NFEC-2025-419), with a waiver of informed consent owing to the retrospective nature of the study and the use of de-identified data.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.