Introduction

Histopathology remains the gold standard for cancer diagnosis and treatment planning, yet its traditional manual analysis faces challenges of subjectivity and limited scalability1,2,3. Recent advances in artificial intelligence (AI) have revolutionized computational pathology, enabling the automated extraction of complex patterns from histopathology images with high accuracy and efficiency4,5,6,7,8. These advancements have paved the way for clinical applications such as automated cancer diagnosis, molecular biomarker prediction, and treatment response assessment9,10,11. A pivotal development in this field is the emergence of pathology foundation models12,13,14,15,16,17,18,19, which are pretrained on large-scale pathology datasets via self-supervised learning (SSL)12,13,14,15,16,18,20,21,22,23,24,25,26. These models act as powerful image feature extractors, acquiring transferable representations from whole slide image (WSI) patches and enabling accurate predictions for various clinical tasks by integrating multiple-instance learning (MIL) models5,27,28,29,30,31.

Despite these advancements, a critical limitation persists: features extracted by pathology foundation models exhibit significant site-specific signatures. These signatures, which stem from variations in tissue preparation, staining protocols, and scanning equipment across different clinical centers (e.g., hospitals or research institutions) where samples are processed (Fig. 1a), directly impede cross-domain generalization. Specifically, they foster shortcut learning32,33,34 (Fig. 1b), where models learn superficial site-specific patterns rather than the underlying pathological features essential for robust diagnosis. As a result, these models often perform well on in-domain (IND) testing data (i.e., data from the same submitting sites as the training data), but exhibit drastically reduced accuracy and unreliable predictions on out-of-domain (OOD) testing data (i.e., data from different sites with distinct site-specific patterns), compromising diagnostic reliability and posing a serious risk of misdiagnosis and incorrect treatment decisions in clinical practice. Although stain normalization10,35,36 has been proposed as a potential solution37, our findings demonstrate its ineffectiveness in eliminating complex site-specific artifacts. Adding to these challenges of cross-site generalization, recent studies have highlighted a concerning lack of demographic fairness in pathology foundation models35, arising from variations in tissue appearance across demographic groups and imbalances in training data. Therefore, effectively mitigating site-specific biases and demographic disparities is paramount for the responsible and reliable deployment of pathology foundation models in diverse clinical settings.

Fig. 1: Overview of the computational pathology challenge and the proposed FLEX framework.
figure 1

a The WSI processing pipeline highlights variables that contribute to site-specific signatures. b Conceptual illustration of shortcut learning, where models exploit spurious correlations from site-specific signatures instead of learning task-related biological features. This leads to high performance on IND data but poor generalization to OOD data from unseen sites. c Datasets and tasks used in this study. A large multi-center TCGA cohort is used for training and cross-validation, with two independent external cohorts (CPTAC and in-house NFH dataset) for zero-shot generalization testing. We address 16 diagnostic tasks across four major cancer types, including NSCLC, BRCA, STAD, and CRC, spanning morphology, molecular biomarker, and gene mutation prediction. d The proposed FLEX workflow. Patch features are extracted using a pre-trained pathology VLM. Guided by visual and textual domain knowledge, FLEX selectively suppresses site-specific and demographic signatures while amplifying task-relevant biological information. The enhanced features are then used by a MIL model for slide-level prediction, leading to improved generalizability and fairness. Performance is evaluated using AUROC, fairness metrics, and interpretability methods (UMAP, attention maps). Schematics in panels (a, b) and the organ icons in panel (c) were created with BioRender.com.

In this work, we introduce FLEX (Feature-Level Enhancement for Cross-domain generalization), an approach that leverages domain knowledge to address the intertwined challenges of cross-site generalization and demographic fairness in pathology foundation models. FLEX enhances the model’s focus on generic pathological patterns by employing an information bottleneck. This bottleneck is guided by visual and textual concepts enriched with domain-specific prior knowledge, and strategically aligns patch image features with task-specific textual concepts. These textual concepts are generated by the text encoder of pathology foundation models, and thus, originating from the text domain, they are inherently free from site-specific signatures and demographic biases present in the images. As a result, pathology images from diverse source sites and demographic groups, which exhibit distinct feature distributions, are effectively aligned to a unified feature space. This alignment enables the model to generalize across heterogeneous sites and demographic groups, while enhancing the discriminativeness and expressiveness of encoded patch features for specific tasks. To validate our approach, we first confirm the presence of confounding site-specific signatures and demographic biases in existing foundation models. We then assess FLEX across 16 diverse clinical tasks using a rigorous two-stage evaluation. This involves a Site-Preserved Monte Carlo Cross-Validation (SP-MCCV) within The Cancer Genome Atlas (TCGA) cohort and a zero-shot external validation on two independent cohorts: the public Clinical Proteomic Tumor Analysis Consortium (CPTAC; n = 1738) and our in-house Nanfang Hospital dataset (NFH; n = 5318). This comprehensive evaluation demonstrates three key strengths of FLEX. First, it yields an improvement in cross-domain generalization, with consistent performance gains on both held-out TCGA sites and the external CPTAC and NFH cohorts. Second, it promotes demographic fairness by reducing performance disparities across demographic groups. Third, FLEX shows practical versatility through its compatibility with various pathology vision-language models (VLMs) and downstream MIL architectures, and maintains effectiveness across different data scales. These findings establish FLEX as a promising solution for improving the generalizability and fairness of pathology AI. By addressing the critical barriers of domain shift and bias, it offers a robust pathway for the reliable, responsible, and equitable deployment of foundation models in clinical settings.

Results

We first systematically evaluated the persistence of site-specific signatures and demographic biases in representations from pathology foundation models, and their negative impact on cross-domain generalization and fairness. We then demonstrate that our proposed FLEX framework mitigates these issues. As outlined in Fig. 1d, FLEX leverages visual and textual prior knowledge to enhance the discriminability and robustness of patch features. Our results are presented in three parts. First, we show that FLEX improves cross-domain generalization within the TCGA cohort and in zero-shot validation on external CPTAC and in-house cohorts. Second, we assess its impact on demographic fairness, revealing a marked reduction in performance disparities across racial and ancestry groups. Finally, we establish the practical versatility of FLEX by confirming its compatibility with various foundation models and MIL architectures across different data scales. These findings highlight FLEX as a robust solution for developing more generalizable and equitable pathology AI models.

FLEX improves cross-domain generalization and facilitates accurate clinical usage of pathology foundation models

This section evaluates how site-specific patterns in feature representations extracted by pathology foundation models affect cross-site generalization, potentially limiting their reliability for clinical applications. TCGA, a cornerstone resource for cancer research, includes WSIs from multiple contributing sites, which introduces site-specific signatures. For instance, the non-small cell lung carcinoma (NSCLC) cohort comprises WSIs from 66 sites, breast carcinoma (BRCA) from 36 sites, stomach adenocarcinoma (STAD) from 22 sites, and colorectal adenocarcinoma (CRC) from 37 sites (Fig. 1c). To investigate this, we utilized UMAP38 to visualize patch features extracted by the CONCH18 image encoder from the BRCA cohort (Fig. 2a, middle). Patches were color-coded by the submitting site (left panel) and diagnostic class (right panel). The clear clustering by site revealed the presence of pervasive site-specific signatures, a finding that is robust to UMAP hyperparameter variations (Supplementary Fig. 1). These patterns serve as problematic shortcuts, enabling models to predict clinical labels without learning meaningful task-relevant features, thereby compromising cross-domain generalization, as illustrated in Fig. 1b.
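
For illustration, the short sketch below reproduces this kind of analysis: it computes a UMAP embedding of patch features together with the two metrics reported in Fig. 2a, using a simple k-nearest-neighbor implementation of the local inverse Simpson's index (LISI) and scikit-learn's silhouette score. The arrays and hyperparameters are illustrative stand-ins, not the study's exact configuration.

```python
import numpy as np
import umap  # umap-learn package
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def lisi(feats, labels, k=30):
    """Mean local inverse Simpson's index: the effective number of distinct
    labels (e.g., submitting sites) among each point's k nearest neighbors.
    Higher values indicate better mixing of sites in feature space."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(feats).kneighbors(feats)
    scores = []
    for nbrs in idx:
        _, counts = np.unique(labels[nbrs], return_counts=True)
        p = counts / counts.sum()
        scores.append(1.0 / np.sum(p ** 2))  # inverse Simpson index
    return float(np.mean(scores))

# Stand-in data; in practice `feats` holds VLM patch embeddings and the label
# arrays hold submitting-site codes and diagnostic classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512)).astype(np.float32)
site_labels = rng.integers(0, 6, size=1000)
class_labels = rng.integers(0, 2, size=1000)

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(feats)
print("2-D embedding for plotting:", embedding.shape)
print("site LISI:", lisi(feats, site_labels))
print("class Silhouette Score:", silhouette_score(feats, class_labels))
```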

Fig. 2: FLEX enhances model generalizability and diagnostic performance by mitigating site-specific signatures.
figure 2

a UMAP visualization of patch features from the TCGA-BRCA dataset (BRCA-TYPE task), encoded by a CONCH model. Features are colored by submitting site (left panel) or diagnostic class (right panel). Original features and those after Reinhard stain normalization show strong clustering by site and poor separation of diagnostic classes, confirmed by low site integration (LISI: 3.03 and 3.29) and class separability (Silhouette Score: 0.15 and 0.14) scores. Callouts highlight patches from the same site with different diagnoses clustering together, indicating that site-specific signals overpower biological ones. In contrast, FLEX-processed features show effective removal of site signatures (LISI: 4.18) and improved separation of diagnostic classes (Silhouette Score: 0.38). b Schematic of the site-preserved cross-validation setup, defining IND test sets and OOD test sets. c Performance comparison in terms of AUROC across three major task categories and two external datasets (CPTAC, NFH). Bars represent mean AUROC for the Original (O), Reinhard-normalized (R), and FLEX (F) models. P-values indicate the statistical significance of the improvement of FLEX over the best-performing baseline (O or R), calculated using a two-sided paired t-test on the results from 15 cross-validation folds across all tasks within each category (P = 2.688 × 10−5 for OOD and P = 0.003 for IND in morphology classification; P = 3.305 × 10−12 for OOD and P = 0.001 for IND in molecular biomarker prediction; P = 1.010 × 10−10 for OOD and P = 5.764 × 10−5 for IND in gene mutation prediction; P = 9.432 × 10−16 for CPTAC and P = 8.900 × 10−12 for NFH in external validation). d Radar plots comparing model performance (AUROC) on individual tasks for TCGA OOD testing (left) and IND testing (right). e Radar plot of model performance on external validation datasets (CPTAC and NFH). For (d) and (e), asterisks indicate the statistical significance of FLEX compared to the best-performing baseline for each specific task, determined by a two-sided paired t-test across 15 folds (*P  < 0.05, **P  < 0.01, ***P  < 0.001). f Qualitative examples of attention maps on representative WSIs for BRCA-TYPE and STAD-MSI tasks. The color bar indicates attention weight. WSI images were obtained from the TCGA database. Source data are provided as a Source Data file.

We first quantified this issue across 16 pathology tasks using an SP-MCCV strategy (see ‘Site-preserved Monte Carlo cross-validation’ in “Methods”) within TCGA, where test data were split into IND and OOD subsets based on submitting sites (Fig. 2b). Patch features were extracted using CONCH18 and aggregated with an ABMIL model27. The results showed a significant drop in area under the receiver operating characteristic curve (AUROC) performance for OOD data compared to IND data across all task categories (Fig. 2c). Specifically, the average AUROC for IND was 0.918, 0.754, and 0.651 for morphology, molecular biomarker, and gene mutation tasks, respectively. In contrast, the OOD AUROC dropped to 0.853, 0.721, and 0.628. This performance decline illustrates the substantial impact of domain shift, a challenge that persists even for state-of-the-art pathology foundation models.

A common method for addressing domain shift is stain normalization, such as the Reinhard method39,40,41, which aligns the color distribution of WSIs to a reference template. While this approach can be effective when domain shift is primarily due to color variation, it is often insufficient for more complex biases. As shown in Fig. 2a (right panel), features after Reinhard normalization remain strongly clustered by site. Consequently, this method yielded minimal and inconsistent performance gains, and in some cases, even reduced performance (Fig. 2d, Supplementary Table 1).

In contrast, FLEX is designed to address a broader spectrum of site-specific signatures, including not only staining variations but also subtle variations from tissue preparation and scanner artifacts, through its knowledge-guided information bottleneck. When applied within the TCGA cross-validation, FLEX significantly improved model generalizability for both IND and OOD testing (Fig. 2c). Specifically, the average OOD AUROC increased by 6.45%, 7.21%, and 4.94% for the three task categories, substantially narrowing the IND-OOD performance gap. UMAP visualizations for the BRCA-TYPE task (Fig. 2a, left panel) further confirmed that FLEX effectively suppressed site-specific signatures, forcing features to align based on their biological class rather than their site of origin. Detailed results across all 16 tasks (see ‘Dataset and clinical tasks description’ in “Methods” and Supplementary Table 2) show consistent performance improvements in both IND and OOD settings (Fig. 2d). Notably, FLEX achieved a 15.13% increase in OOD AUROC for the STAD-LAUREN task, reducing the IND-OOD gap from 20.59% to 8.64%. This highlights FLEX’s potential to address key challenges in cross-site generalization.

Furthermore, we conducted a zero-shot evaluation on two independent external cohorts: the public CPTAC dataset and our in-house NFH dataset (see Supplementary Tables 3, 4). Models trained exclusively on TCGA were applied directly to these unseen cohorts without any fine-tuning. These datasets introduce unique site-specific signatures distinct from TCGA, representing a true out-of-distribution challenge. Across all tasks and both cohorts, the baseline model’s performance was modest. However, applying FLEX resulted in a consistent and significant performance uplift, demonstrating its ability to generalize to new clinical environments. The average AUROC for the CPTAC cohort increased from 0.679 to 0.714 (P = 9.432 × 10−16) and for the NFH cohort from 0.733 to 0.771 (P = 8.900 × 10−12), as shown in Fig. 2c. The radar plot in Fig. 2e provides a comprehensive overview, showing that FLEX (pink line) consistently improves performance. Detailed results for each task are provided in Supplementary Fig. 2 and Supplementary Tables 5, 6. These analyses indicate that FLEX learns transferable biological representations rather than memorizing source-domain shortcuts.

To provide qualitative insights into FLEX’s mechanism, we visualized attention maps from the ABMIL models (Fig. 2f). Here, noisy regions are defined as visual information irrelevant to the diagnostic task, such as processing artifacts or task-agnostic biological structures (e.g., non-neoplastic stroma). For the BRCA-TYPE task (Fig. 2f, top panel), the baseline model’s attention is diffuse, scattered across tumor cells and distracting noisy elements. In contrast, the model with FLEX consistently focuses on critical, task-relevant features like the irregular glandular and ductal formations characteristic of Invasive Ductal Carcinoma (IDC) without being distracted by noisy foreground elements. This focused attention provides an intuitive explanation for the quantitative performance gains. Similarly, for the STAD-MSI task, the FLEX-enhanced model successfully identified potential Microsatellite Instability-High (MSI-H) regions, where tumor cells exhibit greater heterogeneity, whereas the baseline model failed to do so, leading to incorrect predictions. This suggests that FLEX enhances the model’s focus on salient pathological patterns while suppressing task-agnostic information.

FLEX effectively improves demographic fairness

Demographic biases within large-scale datasets like TCGA exacerbate fairness concerns, leading to significant performance disparities in deep learning models, particularly disadvantaging underrepresented demographic groups35. Because self-reported race may not fully capture complex biological heritage, we also stratified cohorts by genetic ancestry42, which provides a more biologically grounded measure. Our fairness analysis was conducted across both stratification methods in the TCGA dataset to ensure a comprehensive evaluation. Additionally, we performed a zero-shot fairness evaluation on the CPTAC dataset, which contains only self-reported race. As baseline experiments confirm (Fig. 3b, c), models trained on original features exhibit notable performance disparities across both stratification methods. Simple stain normalization, while useful for reducing inter-site staining variations, does not effectively address these demographic biases.

Fig. 3: Evaluation of FLEX’s effectiveness in improving demographic fairness.
figure 3

The top three rows (NSCLC-TYPE, BRCA-PR, CRC-BRAF) show fairness when evaluated across self-reported race and ancestry groups using results from a 15-fold SP-MCCV on the TCGA-NSCLC (n = 958), TCGA-BRCA (n = 937), and TCGA-CRC (n = 606) datasets, respectively. The bottom two rows (C-BRCA-TYPE, C-LUAD-EGFR) show results from models trained on TCGA datasets and externally validated on CPTAC-BRCA (n = 323) and CPTAC-LUAD (n = 815). For each cross-validation fold, metrics for each demographic-label group were estimated via bootstrapping (n = 500 replicates). a Fairness gap for five representative tasks. The AUROC gap ratio is the performance difference between the best- and worst-performing subgroups relative to the overall AUROC. Smaller dots represent individual folds (n = 15); large dots with error bars show the mean and 95% CI across folds. Marginal distributions show the distribution of AUROC gap ratios across folds. Presented P-values are from a two-sided Wilcoxon signed-rank test without adjustment, comparing FLEX to the Original method (P = 0.026 when evaluated on ancestry for NSCLC-TYPE; P = 0.018 when evaluated on self-reported race for BRCA-PR; P = 0.048 for C-BRCA-TYPE; P = 0.004 for C-LUAD-EGFR). b AUROC distribution across subgroups. Box plots show the distribution across n = 15 cross-validation folds. The center line represents the median (50th percentile), the box bounds represent the interquartile range (IQR; 25th to 75th percentiles), and the whiskers extend to data points within 1.5 × the IQR. c Violin plots of True Positive Rate (TPR) disparity, defined as the difference between a subgroup’s TPR and the overall TPR. The distributions are derived from results across n = 15 cross-validation folds. Within each violin, the box plot shows the median (center line), interquartile range (IQR), and whiskers extending to 1.5 × IQR. Distributions centered closer to the zero-disparity line indicate higher fairness. Presented P-values are from a two-sided paired-samples t-test without adjustment. The dashed line represents the overall TPR across the three methods. d Quantitative summary of TPR disparity, showing the Root Mean Square Error (RMSE) for Original (O), Reinhard (R), and FLEX (F). Lower values indicate better fairness. Source data are provided as a Source Data file.

By addressing this critical limitation through feature disentanglement, FLEX offers a significant improvement in both demographic fairness and model reliability. We evaluated this using a multi-faceted approach, assessing performance across both self-reported race and ancestry cohorts (Fig. 3). This improvement is also observed within individual clinical sites (Supplementary Fig. 3).

First, we examine the Fairness Gap (AUROC gap ratio)43 in Fig. 3a, which measures the absolute performance difference between the best- and worst-performing demographic subgroups relative to the overall AUROC. On average, FLEX consistently reduces the fairness gap across tasks. For instance, in the NSCLC-TYPE task, FLEX reduces the mean fairness gap for self-reported race (from 0.052 to 0.041) and ancestry (from 0.062 to 0.040). For the external CPTAC cohort, the fairness gaps for the C-BRCA-TYPE and C-LUAD-EGFR tasks decreased from 0.287 and 0.494 to 0.245 and 0.246, respectively. The scatter plots and marginal distributions in Fig. 3a show that FLEX’s results (pink dots) are more tightly clustered at lower gap values, indicating improved fairness. Even in cases like CRC-BRAF where an outlier widens the confidence interval, the central tendency of FLEX’s performance remains superior or comparable to the baselines.

For a more robust assessment of model reliability, we analyzed the True Positive Rate (TPR) disparity35 (Fig. 3c), which quantifies the difference between the TPR of a specific demographic group and the overall TPR. The violin plots for FLEX are consistently narrower and more centered around the zero-disparity line for both self-reported race and ancestry stratifications, demonstrating that the model applies its diagnostic capability more equitably. This is quantitatively supported by the TPR disparity RMSE values (Fig. 3d). In the challenging C-LUAD-EGFR task, for example, FLEX reduces the average RMSE from 0.140 to 0.085, indicating a substantial improvement in fairness. Similarly, the race-wise AUROC distributions in Fig. 3b show that FLEX yields tighter and more equitable performance clusters.
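
To make these two metrics concrete, the sketch below computes the AUROC gap ratio and the TPR-disparity RMSE from per-sample predictions and demographic group labels. The 0.5 decision threshold is an assumption for illustration, and each subgroup is assumed to contain both classes so that subgroup AUROCs are defined.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fairness_metrics(y_true, y_score, groups, threshold=0.5):
    """AUROC gap ratio: (best - worst subgroup AUROC) / overall AUROC.
    TPR-disparity RMSE: root mean square of (subgroup TPR - overall TPR)."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    y_pred = (y_score >= threshold).astype(int)  # illustrative threshold
    overall_auc = roc_auc_score(y_true, y_score)
    overall_tpr = y_pred[y_true == 1].mean()
    sub_aucs, tpr_gaps = [], []
    for g in np.unique(groups):
        m = groups == g  # assumes each subgroup contains both classes
        sub_aucs.append(roc_auc_score(y_true[m], y_score[m]))
        tpr_gaps.append(y_pred[m & (y_true == 1)].mean() - overall_tpr)
    gap_ratio = (max(sub_aucs) - min(sub_aucs)) / overall_auc
    tpr_rmse = float(np.sqrt(np.mean(np.square(tpr_gaps))))
    return gap_ratio, tpr_rmse
```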

FLEX is versatile across different pathology VLMs

To further evaluate the effectiveness and adaptability of FLEX across different pathology VLMs, we integrated it with three state-of-the-art pathology VLMs (CONCH18, PathGen-CLIP19, and QuiltNet17), which served as patch-level feature extractors. Using ABMIL27 as the downstream MIL framework, we tested performance on 16 pathology tasks, focusing on the OOD generalization setting.

The results show that FLEX consistently and significantly enhances performance across all three foundational VLMs (Fig. 4a). While the baseline performance varies between VLMs, applying FLEX delivers a robust improvement in nearly all scenarios. For CONCH, FLEX improved OOD performance in all 16 tasks, with significant improvement (P < 0.05) observed in 14 out of 16 tasks. For PathGen-CLIP, FLEX improved performance in all tasks, with significant improvement in 9 out of 16 tasks. For QuiltNet, performance improved in 14 out of 16 tasks, with significant improvement in 9 tasks. This broad compatibility underscores that FLEX is a model-agnostic solution for improving generalization. The varying magnitude of improvement may be attributed to differences in the expressivity and discriminability of the original patch features extracted by different pathology VLMs. Since FLEX relies on visual and textual knowledge as guidance to enhance the patch features, the quality of this guidance depends on the features extracted by the VLMs. Visual knowledge and textual knowledge are generated by encoding pilot patch images and text prompts with the image and text encoders of the pathology VLMs. If the VLMs lack expressivity and discriminability, the resulting visual and textual knowledge may be less informative and discriminative, limiting the performance improvement achieved by FLEX.

Fig. 4: Performance evaluation of FLEX when incorporated with different pathology VLMs.
figure 4

a Comparison of AUROC performance between original pathology vision-language models (VLMs) and their counterparts enhanced with FLEX. The 16 tasks span four datasets: TCGA-BRCA (n = 937), TCGA-NSCLC (n = 958), TCGA-STAD (n = 414), and TCGA-CRC (n = 606). Each box plot summarizes results from n = 15 independent cross-validation folds derived from SP-MCCV. Box plots display the median (center line), interquartile range (IQR; box limits from 25th to 75th percentiles), and whiskers extending to 1.5 × IQR; individual data points for each fold are overlaid. Indicated P-values were calculated using a two-sided paired-samples t-test with multiple hypothesis correction. C, P, and Q correspond to CONCH, PathGen-CLIP, and QuiltNet, respectively. b UMAP visualizations illustrating the effect of FLEX on the patch feature space for the STAD-EBV task. For each VLM, parallel subplots are colored by site (left) to visualize batch effect mitigation, and by EBV status (right) to visualize class separability. The LISI for site integration (higher is better) and the Silhouette Score for class separation (higher is better) provide quantitative evidence. The visualizations and scores demonstrate that FLEX reduces site-specific clustering while improving the discriminability of task-relevant classes. Source data are provided as a Source Data file.

We further analyzed the mechanism behind this improvement by visualizing the patch feature space for the STAD-EBV task using UMAP38 (Fig. 4b). For all three VLMs, the original features show strong clustering by site, with poor separation between EBV-positive and EBV-negative classes. This is confirmed by low Local Inverse Simpson’s Index (LISI) scores, which measure local data mixing by site, and Silhouette Scores, which measure class separability. After applying FLEX, the feature distributions become more uniform, with site-specific clusters dissolving. This is validated by a significant increase in the LISI score for all three VLMs (e.g., from 2.22 to 3.22 for CONCH), indicating better integration of features from different sites. Concurrently, the separation between EBV classes improves, reflected by a notable increase in the Silhouette Score (e.g., from 0.09 to 0.24 for CONCH). These results demonstrate that FLEX mitigates site-specific biases while enhancing the discriminability of task-relevant biological features, a dual benefit that holds across different foundational models.

FLEX is effective with different sizes of training data and flexible with different MIL models

To evaluate the scalability of FLEX with varying training data scales, we conducted additional experiments on the BRCA cohort. The experimental design is illustrated in Fig. 5a. The dataset was partitioned into five site-preserved outer folds, with one fold held out as the OOD test set in each of five outer loop iterations. Training sets of increasing size (1, 2, 3, and 4 folds) were constructed from the remaining four folds to create a learning curve. Three inner MCCV runs were performed for each scale to ensure stability. For this analysis, three tasks spanning different domains were selected: BRCA-TYPE, BRCA-HER2, and BRCA-PIK3CA. The corresponding training and OOD test set sizes are provided in Supplementary Table 7. The results shown in Fig. 5b demonstrate that FLEX consistently and significantly (P < 0.05) improves OOD performance across all tasks and data scales, underscoring that its benefits persist across both limited and larger training datasets.

Fig. 5: Performance of FLEX across training data scales and MIL models.
figure 5

a Schematic of the experimental design for evaluating performance at different training data scales. The procedure follows the SP-MCCV strategy, which involves 5 outer folds and 3 inner Monte Carlo Cross-Validation (MCCV) folds, for a total of 15 evaluation runs. Specifically for this experiment, the dataset is first partitioned into 5 site-preserved outer folds. In each of the 5 outer loop iterations, one fold is held out as the OOD test set. Training sets of increasing size (1, 2, 3, and 4 folds) are constructed from the remaining 4 folds. For each scale condition, 3 inner MCCV runs are performed, for a total of 15 evaluation runs per condition. b OOD AUROC comparison between FLEX and the baseline (Original) on three tasks from the TCGA-BRCA dataset (n = 875). Box plots show aggregated results from 15 runs for each training data scale, derived from the SP-MCCV strategy. The box plots display the median (center line), the first and third quartiles (Q1-Q3; the box), and whiskers extending to 1.5 × interquartile range. Individual data points from each run are overlaid. Presented P-values are from a two-sided paired-samples t-test without adjustment. c Dumbbell plot showing OOD AUROC after integrating FLEX with five state-of-the-art MIL models across 16 tasks. Datasets used include TCGA-BRCA (n = 937), TCGA-NSCLC (n = 958), TCGA-STAD (n = 414), and TCGA-CRC (n = 606). Each dumbbell connects the mean performance of an Original model to its FLEX-enhanced counterpart. Indicated P values were calculated using a two-sided paired-samples t-test with multiple hypothesis correction. Source data are provided as a Source Data file.

To further assess the flexibility of FLEX across MIL models, we integrated it with five state-of-the-art MIL architectures: ABMIL27, CLAM-SB5, ACMIL28, DTFD-MIL29, and ILRA-MIL30. As shown in Fig. 5c, FLEX consistently improved OOD performance across all models. The average OOD performance gains were 6.01%, 5.55%, 3.82%, 3.53%, and 6.06%, respectively. Additionally, FLEX significantly reduced the performance gap between IND and OOD testing. Specifically, the gap decreased from 0.034, 0.040, 0.041, 0.038, and 0.035 to 0.022, 0.024, 0.021, 0.025, and 0.020 for the corresponding MIL models. Detailed per-task performance metrics for all combinations of VLM and MIL models evaluated are provided in Supplementary Tables 8–23.

The performance improvement seen in Fig. 5c varies across the 16 pathology tasks due to three main factors: intrinsic task difficulty, the utility of feature refinement for a given task, and the complexity of the MIL model. First, intrinsic task difficulty sets the upper bound on performance; for example, morphology classification tasks consistently yield higher AUROCs than most gene mutation prediction tasks. Second, the magnitude of improvement from FLEX depends on task-specific conditions. In the STAD-LAUREN task, for instance, all MIL models show a substantial performance gain. This may be because the TCGA-STAD cohort is relatively small, causing baseline models to overfit to noisy features. FLEX’s core mechanism, refining features by filtering out task-irrelevant information, improves the signal-to-noise ratio, which is particularly beneficial when training data is limited. Third, performance gains are more pronounced with simpler MIL models like ABMIL and CLAM-SB compared to more complex architectures. We hypothesize this is related to the joint optimization of FLEX and the MIL model. A simpler MIL backbone may create a more stable learning environment, allowing gradients to effectively guide the feature refinement process within FLEX. In contrast, with more intricate MIL models, the increased number of parameters and interacting components can complicate convergence to an optimal state, potentially limiting the observable performance gain from FLEX.

Discussion

Foundation models in computational pathology have rapidly advanced, demonstrating impressive capabilities in histopathology image understanding. Their integration into weakly supervised computational pathology pipelines has yielded strong performance across diverse diagnostic tasks12,13. Despite this progress, significant challenges remain. These include limitations in cross-domain generalization, demographic fairness, and the practical adaptation of these powerful models for specific diagnostic objectives. Our study undertakes a comprehensive analysis and proposes an effective solution to address these critical challenges of cross-domain generalization and demographic fairness within pathology foundation models. Through a systematic investigation, we identified the detrimental impact of site-specific signatures and demographic biases inherent in pathology VLMs. This understanding directly informed the development of FLEX, an approach engineered to enhance both generalization capacity and fairness across a spectrum of diagnostic tasks.

Our investigation yielded key insights into the nature of these challenges. UMAP visualizations of patch features from pathology VLMs, coupled with the stark performance decline observed in OOD testing, clearly demonstrated the presence of site-specific signatures impeding cross-site generalization. Furthermore, our fairness gap analysis and evident disparities in TPR across demographic groups compellingly highlighted inherent biases within current pathology foundation models. Crucially, we demonstrated that standard stain normalization is insufficient to mitigate these deeper biases. It primarily addresses pixel-level color variations and fails to account for the broader spectrum of site-specific signatures, including subtle variations from tissue preparation and scanner artifacts. To overcome these shortcomings, we introduced FLEX, an approach leveraging both visual and textual prior knowledge to strategically suppress site-specific patterns and demographic bias while amplifying task-relevant feature discriminability, thereby facilitating effective task-specific adaptation. Visual concepts, consisting of representative patch images for target classes (Supplementary Fig. 4), guide the retrieval of task-relevant patches from WSIs, facilitating information bottleneck training. Complementing visual concepts, we incorporated domain expertise to create learnable textual concepts. These textual concepts guide an information bottleneck mechanism within FLEX. This mechanism enhances task-related information within patch features and aligns them to a unified feature space, filtering out extraneous and biasing information. This ultimately leads to significant improvements in both cross-site generalization and demographic fairness. The ablation study on visual prompts and textual prompt length is presented in Supplementary Fig. 5.

FLEX represents a significant step in adapting foundation models for computational pathology. It effectively mitigates the negative impacts of site-specific signatures and demographic biases while improving feature discriminability for downstream tasks. The core mechanism of FLEX uses an information bottleneck guided by textual pathological concepts to disentangle task-relevant features from spurious variations. Because these textual concepts are inherently free of site or demographic information, this process forces the model to discard extraneous signals, including complex site and demographic signatures. The efficacy of FLEX was validated through extensive evaluations across 16 pathology tasks, using various pathology VLMs and MIL frameworks. These consistent results, obtained via a site-preserved cross-validation strategy and supported by fairness analyses, underscore the method’s robustness. Beyond performance improvements, FLEX offers insights into the strategic use of multimodal prior knowledge for task-specific adaptation of foundation models, contributing to the development of more effective and equitable diagnostic tools.

While FLEX leverages the inherent alignment between the visual and textual embedding spaces of pathology VLMs, its reliance on this alignment currently limits compatibility with vision-only foundation models. Future work could explore cross-modal distillation techniques or a vision-only adaptation module to extend FLEX to a broader range of architectures. Moreover, FLEX is currently optimized for classification tasks, such as cancer subtyping and biomarker prediction, whereas complex diagnostic tasks like prognosis are crucial for personalized medicine and patient management; incorporating prognosis-specific textual prompts could guide the model toward features relevant to predicting patient outcomes. Further directions for expanding FLEX’s versatility include application to regression tasks, integration with pathology vision large language models, and improved interpretability and adaptability, including the quantitative validation of attention mechanisms against expert-annotated regions by pathologists, which would provide objective confirmation of the model’s focus. Such work will be crucial for translating these advancements into clinically impactful and trustworthy tools.

Methods

This retrospective study was conducted in compliance with all relevant ethical regulations. The research utilized publicly accessible data from the TCGA and the CPTAC cohorts, which were originally collected under approved protocols with informed consent from all participants. Information regarding participant compensation for these public cohorts was not available to the authors. The in-house NFH dataset was collected and used under the approval of the Medical Ethics Committee of NanFang Hospital of Southern Medical University (approval number: NFEC-2025-419), which granted a waiver of informed consent due to the retrospective nature of the study and the use of de-identified data, posing minimal risk to participants. No compensation was provided for the retrospective collection of NFH data. All data were handled with a strict commitment to patient privacy, fairness, and transparency.

The demographic characteristics of the patient cohorts, including sex and age, are detailed in the subsequent dataset descriptions. For all cohorts, participant sex and race were determined from self-reported information available in the associated clinical metadata, while genetic ancestry was derived from genomic data as established by the TCGA Pan-Cancer Atlas project42. Patient sex was not considered in the study design. Our fairness analysis focused primarily on self-reported race and genetic ancestry to directly address significant disparities previously reported in the literature for these groups in computational pathology.

Dataset and clinical tasks description

This study utilized datasets from the TCGA cohort, including data from breast invasive carcinoma (BRCA), non-small cell lung cancer (NSCLC, encompassing lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC)), stomach adenocarcinoma (STAD), and colorectal carcinoma (CRC, encompassing colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ)). The original WSIs were collected from the publicly available TCGA program (https://www.cancer.gov/ccg/research/genome-sequencing/tcga), and corresponding clinical, molecular, and gene mutation information was obtained from the cBioPortal database (https://www.cbioportal.org). In this study, we conducted 16 WSI analysis tasks to comprehensively evaluate the effectiveness of FLEX across different task types. These tasks include: three morphology classification tasks across three organs, five molecular biomarker status prediction tasks across two organs, and eight gene mutation prediction tasks across four organs. This diverse range of tasks ensures a robust assessment of FLEX in various computational pathology applications. A detailed breakdown of patient and slide counts for each task in the TCGA cohort is provided in Supplementary Table 2.

The demographic and clinical characteristics of the primary TCGA cohorts are as follows: the TCGA-BRCA cohort included 876 female (98.8%) and 11 male (1.2%) patients, with a median age of 58 years (range: 26–90). The TCGA-NSCLC cohort comprised 341 female (40.4%) and 503 male (59.6%) patients, with a median age of 67 years (range: 38–90). The TCGA-STAD cohort included 132 female (34.0%) and 256 male (66.0%) patients, with a median age of 67 years (range: 30–90). The TCGA-CRC cohort consisted of 275 female (48.2%) and 296 male (51.8%) patients, with a median age of 67 years (range: 31–90).

Morphology classification

For the morphology classification tasks, the objective was to classify histopathology images into distinct subtypes of the same cancer type:

  • BRCA-TYPE: This binary classification task involves classifying histopathology images from the TCGA-BRCA cohort into invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC)44.

  • NSCLC-TYPE: This two-class classification task categorizes histopathology images from the TCGA-NSCLC cohort into lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC)45.

  • STAD-LAUREN: This task focuses on the Lauren classification of histopathology images from the TCGA-STAD cohort, distinguishing between the intestinal type and diffuse type46.

Molecular biomarker status prediction

The molecular biomarker status prediction tasks aim to identify clinically relevant biomarker statuses of cancer based on histopathology images, leveraging specific morphological correlations:

  • BRCA-HER2, BRCA-ER, and BRCA-PR: These binary classification tasks predict the human epidermal growth factor receptor 2 (HER2), estrogen receptor (ER), and progesterone receptor (PR) statuses of breast cancer using histopathology images from the TCGA-BRCA cohort47,48,49.

  • STAD-EBV and STAD-MSI: These binary classification tasks predict the Epstein-Barr virus (EBV) and Microsatellite Instability (MSI) statuses of stomach cancer using histopathology images from the TCGA-STAD cohort50,51.

Gene mutation prediction

Gene mutation prediction tasks aim to predict the mutation status of specific genes in cancer using histopathology images. In this study, the following binary classification tasks were conducted:

  • BRCA-PIK3CA and BRCA-CDH1: These tasks predict the PIK3CA and CDH1 gene mutation statuses in breast cancer using histopathology images from the TCGA-BRCA cohort52,53.

  • LUAD-EGFR and LUAD-STK11: These tasks predict the EGFR and STK11 gene mutation statuses in lung adenocarcinoma using histopathology images from the TCGA-LUAD cohort54,55.

  • STAD-TP53 and STAD-MUC16: These tasks predict the TP53 and MUC16 gene mutation statuses in stomach adenocarcinoma using histopathology images from the TCGA-STAD cohort56,57.

  • CRC-BRAF and CRC-TP53: These tasks predict the BRAF and TP53 gene mutation statuses in colorectal carcinoma using histopathology images from the TCGA-CRC cohort58,59.

By targeting tasks of varying complexity and clinical significance, we aimed to comprehensively assess the effectiveness of FLEX across a diverse range of WSI analysis applications.

External validation datasets

To evaluate the zero-shot generalization capability of models trained on TCGA, we used two independent external cohorts for testing: the public CPTAC dataset and a large in-house dataset from NFH. These datasets introduce significant domain shifts, including different patient populations, slide preparation protocols, and scanning equipment, providing a stringent test of real-world applicability. Detailed statistics for the tasks performed on these external cohorts are available in Supplementary Table 3 for CPTAC and Supplementary Table 4 for NFH. The specific tasks for external validation were selected based on data availability and their overlap with the TCGA tasks.

CPTAC dataset

The CPTAC collections are a publicly available resource (https://portal.gdc.cancer.gov/projects/CPTAC-2, https://portal.gdc.cancer.gov/projects/CPTAC-3) that includes genomic and proteomic data along with corresponding WSIs. For our external validation, we selected a subset of tasks that mirror those in our primary TCGA analysis. The CPTAC patient demographics for the tasks used in this study were as follows: the CPTAC-BRCA cohort included 116 female patients with a median age of 59 years (range: 30–89). The CPTAC-NSCLC cohort included 82 female (30.9%) and 183 male (69.1%) patients with a median age of 65 years (range: 36–88). The CPTAC-COAD cohort consisted of 61 female (58.7%) and 43 male (41.3%) patients with a median age of 65 years (range: 35–93). The specific tasks covered three cancer types:

  • BRCA (C-BRCA-TYPE, C-BRCA-PIK3CA, C-BRCA-CDH1): Including subtype classification (IDC vs. ILC) and prediction of PIK3CA and CDH1 gene mutations using histopathology images from the CPTAC-BRCA cohort.

  • NSCLC (C-NSCLC-TYPE, C-LUAD-EGFR, C-LUAD-STK11): Including subtype classification between LUAD and LUSC, and prediction of EGFR and STK11 gene mutations in LUAD using histopathology images from the CPTAC-NSCLC cohort.

  • COAD (C-COAD-BRAF, C-COAD-TP53): Including prediction of BRAF and TP53 gene mutations in COAD using histopathology images from the CPTAC-COAD cohort.

Nanfang Hospital of Southern Medical University (NFH) in-house dataset

The NFH dataset is a large-scale, retrospectively collected in-house cohort from Nanfang Hospital of Southern Medical University, Guangzhou, China. This dataset provides a challenging test case due to its distinct technical and demographic characteristics compared to the Western-centric TCGA cohort. The demographic breakdown for the cohorts used in this study is as follows: the NFH-BRCA cohort consisted of 761 female (99.9%) and 1 male (0.1%) patient, with a median age of 52 years (range: 17–89). The NFH-NSCLC cohort included 910 female (46.7%) and 1037 male (53.3%) patients with a median age of 59 years (range: 15–87). The NFH-STAD cohort included 146 female (37.0%) and 249 male (63.0%) patients with a median age of 60 years (range: 20–89). The NFH-CRC cohort included 482 female (41.6%) and 678 male (58.4%) patients with a median age of 62 years (range: 21–93). The validation tasks on the NFH dataset cover four major cancer types:

  • BRCA (N-BRCA-TYPE, N-BRCA-HER2, N-BRCA-ER, N-BRCA-PR): Including subtype classification (IDC vs. ILC) and prediction of HER2, ER, and PR biomarker statuses using histopathology images from the NFH-BRCA cohort.

  • NSCLC (N-NSCLC-TYPE, N-LUAD-EGFR): Subtype classification between LUAD and LUSC and prediction of EGFR gene mutation status in LUAD using histopathology images from the NFH-NSCLC cohort.

  • STAD (N-STAD-LAUREN): Lauren classification between Intestinal and Diffuse type in STAD using histopathology images from the NFH-STAD cohort.

  • CRC (N-CRC-BRAF): Prediction of BRAF gene mutation status in CRC using histopathology images from the NFH-CRC cohort.

By evaluating model performance on these diverse external datasets without any fine-tuning, we assess the effectiveness of FLEX in overcoming dataset-specific biases and generalizing to unseen clinical environments.

Preprocessing of histopathology images

Segmentation and patching

We adopted the preprocessing steps outlined in CLAM5, which involve segmenting and cropping tissue regions from each WSI and extracting instance features from each patch using a feature extractor. Specifically, each WSI is first loaded into memory at a downsampled resolution (i.e., 32 × downscale) and converted from RGB to HSV color space to facilitate tissue segmentation. A binary mask for the foreground is then computed by thresholding the saturation channel after applying median blurring to smooth edges. Morphological closing operations are subsequently performed to fill small gaps and holes within the tissue regions. The segmented foreground tissue regions are further cropped into 512 × 512 patches from within the foreground contours at 20 × magnification for each slide.
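
The OpenCV sketch below illustrates this segmentation recipe on a downsampled thumbnail; the threshold and kernel sizes are illustrative defaults rather than the exact CLAM parameters used in this study.

```python
import cv2
import numpy as np

def segment_tissue(thumb_rgb, sat_thresh=8, blur_ksize=7, close_ksize=4):
    """CLAM-style tissue segmentation on a downsampled WSI thumbnail: HSV
    conversion, median-blurred saturation thresholding, then morphological
    closing. Parameter values are illustrative, not the study's exact ones."""
    hsv = cv2.cvtColor(thumb_rgb, cv2.COLOR_RGB2HSV)
    sat = cv2.medianBlur(hsv[:, :, 1], blur_ksize)            # smooth edges
    _, mask = cv2.threshold(sat, sat_thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((close_ksize, close_ksize), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)    # fill gaps and holes
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return mask, contours  # 512 x 512 patches are then cropped within the contours
```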

Feature extraction

Due to the potentially vast size of the patch bag for each WSI (e.g., exceeding 10,000 patches), each patch was transformed into a compact feature representation using the pretrained feature extractor neural network. In this study, the feature extractor was implemented using the frozen image encoder of pathology VLMs, including CONCH18, PathGen-CLIP19, and QuiltNet17.
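
A minimal sketch of this step is shown below, where `image_encoder` is a placeholder for the frozen visual backbone of any of the three VLMs; the exact preprocessing and call signature differ between implementations.

```python
import torch

@torch.no_grad()
def extract_features(patches, image_encoder, batch_size=256, device="cuda"):
    """Encode a bag of preprocessed patch tensors (N, 3, H, W) into a compact
    feature matrix (N, D) with a frozen VLM image encoder."""
    image_encoder.eval().to(device)
    feats = [image_encoder(patches[i:i + batch_size].to(device)).cpu()
             for i in range(0, patches.shape[0], batch_size)]
    return torch.cat(feats)  # one D-dimensional embedding per patch
```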

Stain normalization

For stain normalization, we employed the Reinhard method60 and the Macenko method39 to standardize the color distribution of the histopathology images. For implementation, we utilized the torch-staintools package for efficient normalization (https://github.com/CielAl/torch-staintools).
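
For reference, the Reinhard transfer reduces to matching per-channel statistics in LAB color space to a template. The sketch below is a minimal NumPy/scikit-image version of that idea, whereas the experiments used torch-staintools for efficient batched normalization.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def reinhard(image_rgb, ref_mean, ref_std):
    """Match the image's per-channel mean and standard deviation in LAB space
    to reference statistics (length-3 arrays precomputed from a template)."""
    lab = rgb2lab(image_rgb)
    mean = lab.reshape(-1, 3).mean(axis=0)
    std = lab.reshape(-1, 3).std(axis=0)
    matched = (lab - mean) / (std + 1e-8) * ref_std + ref_mean
    return (np.clip(lab2rgb(matched), 0, 1) * 255).astype(np.uint8)
```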

Pathology vision-language foundation models

With the advancement of large-scale pre-trained VLMs in computer vision and natural language processing, several pathology-specific VLMs have been developed to integrate histopathology images and textual concepts. Compared to vision-only foundation models, pathology VLMs better capture the intricate relationships between histopathology images and textual data. This alignment enhances the generalizability and robustness of image features22. The proposed FLEX is designed to seamlessly integrate with various pathology VLMs to improve the performance of WSI analysis tasks. In this study, we employed three state-of-the-art pathology VLMs:

CONCH18

CONCH is a visual-language foundation model pre-trained on over 1.17 million image-caption pairs, leveraging diverse histopathology images and biomedical text. It supports versatile applications in histopathology with minimal or no supervised fine-tuning, addressing label scarcity and enhancing adaptability across diseases and tasks. The CONCH model is publicly available on GitHub: https://GitHub.com/mahmoodlab/CONCH.

PathGen-CLIP19

PathGen-CLIP is a pathology-specific VLM trained using PathGen-1.6M19, a dataset containing 1.6 million high-quality image-caption pairs generated from large-scale WSI datasets such as TCGA. The PathGen-CLIP model is publicly available on GitHub: https://GitHub.com/PathGen-1-6M/PathGen-1.6M.

QuiltNet17

QuiltNet is a visual-language model for histopathology trained on Quilt-1M17, a dataset of 1 million image-caption pairs curated from sources such as YouTube, Twitter, and research papers. QuiltNet is publicly available on GitHub: https://GitHub.com/wisdomikezogwo/quilt1m.

Multiple instance learning methods

MIL61 is the predominant weakly supervised learning paradigm in computational pathology. In MIL, each WSI is treated as a bag of patch instances, and the models predict the bag-level label based on aggregated instance features. To demonstrate the adaptability and generalizability of FLEX, we integrated it with five state-of-the-art MIL methods in our experimental comparisons. The selected MIL methods are as follows:

ABMIL27

ABMIL models the Bernoulli distribution of a bag and parameterizes the aggregation operator using an attention mechanism. We used the original implementation available on GitHub: https://GitHub.com/AMLab-Amsterdam/AttentionDeepMIL.
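
A minimal sketch of this gated attention pooling is given below; dimensions are illustrative, and the linked repository remains the reference implementation.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Each patch feature receives a learned attention weight; the slide-level
    representation is the attention-weighted sum of patch features."""
    def __init__(self, feat_dim=512, hidden_dim=256, n_classes=2):
        super().__init__()
        self.V = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.U = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
        self.w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, bag):                        # bag: (N, feat_dim)
        attn = torch.softmax(self.w(self.V(bag) * self.U(bag)), dim=0)  # (N, 1)
        slide = (attn * bag).sum(dim=0)            # (feat_dim,)
        return self.classifier(slide), attn.squeeze(-1)
```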

ACMIL28

ACMIL mitigates overfitting by introducing multi-branch attention and stochastic Top-k instance masking, addressing attention value concentration, and enhancing the capture of diverse discriminative instances. The original implementation was used from GitHub: https://GitHub.com/dazhangyu123/ACMIL.

CLAM-SB5

CLAM-SB identifies diagnostically relevant sub-regions and refines the feature space through instance-level clustering of these regions. The implementation from the original GitHub repository was followed: https://GitHub.com/mahmoodlab/CLAM.

DTFD-MIL29

DTFD-MIL addresses WSI classification challenges with small sample sizes by using pseudo-bags to expand training data and a double-tier MIL framework for better feature utilization. The original implementation was used from GitHub: https://GitHub.com/hrzhang1123/DTFD-MIL.

ILRA-MIL30

ILRA-MIL employs a low-rank constraint to group similar patches and separates different ones, utilizing low-rank latent vectors for efficient global interaction modeling. The original GitHub implementation was used: https://GitHub.com/jinxixiang/low_rank_wsi.

Site-preserved Monte Carlo cross-validation

Histopathology images within the TCGA cohort originate from diverse source sites (e.g., hospitals, research institutions), often exhibiting site-specific signatures. These signatures can induce models to learn spurious correlations, hindering their ability to generalize to new, unseen sites. To rigorously evaluate the generalizability and effectiveness of FLEX and specifically assess performance on data from novel source sites, we implemented an SP-MCCV strategy. This approach is specifically designed to ensure models are tested on completely independent source sites not encountered during training. As illustrated in Fig. 6d, this nested SP-MCCV strategy is conceptually inspired by the Preserved-Site Cross-Validation method37 and includes outer and inner cross-validation loops:

Fig. 6: Overview of the proposed framework and experimental setup.
figure 6

a Visual knowledge generation pipeline. This is a one-time offline procedure during training. ABMIL identifies high-attention patches, followed by GPT-4o-based filtering and human expert verification to select representative patches. Class-specific visual concepts are generated by averaging features from representative patches using the VLM image encoder. b Pipeline for textual knowledge generation, also a one-time offline procedure. GPT-4o is prompted to create representative text prompts (both task-specific and task-agnostic). After expert verification, these prompts are enriched with learnable, data-driven contexts and encoded by the VLM text encoder to create the final textual concepts. c The architecture of FLEX. During inference, original patch features are extracted by the VLM image encoder; a variational encoder then generates parameters for a Gaussian distribution for each patch, from which enhanced features are sampled and aggregated by the MIL model. During training, a Visual Concept Guided Pilot Patch Selection module uses the pre-computed visual concept to select the top-k most relevant enhanced patches. These selected patches are then used in the Textual Concept Guided Feature Calibration process, where an InfoNCE loss aligns the features with the textual concepts by minimizing the distance to the corresponding class concept while maximizing their distance from other concepts. This process helps to optimize the variational encoder of FLEX. d Schematic of the SP-MCCV strategy. The dataset is partitioned into outer folds based on the clinical site to create distinct training and OOD test sets. Inner folds are then used to randomly split the training data for IND evaluation. Some illustrations were created with BioRender.com.

Outer loop: site-preserved partitioning

Our strategy treats each Tissue Source Site (TSS) as an independent entity, representing a unique source of protocols and potential biases. We do not group sites based on known factors (e.g., scanner type), as this would fail to capture the full complexity of site-specific artifacts and would represent a less stringent test of generalization. Instead, we partition the entire set of TSS codes (see https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes for the TSS code mapping) into \(K_s\) mutually exclusive site-folds using convex optimization to ensure that each site-fold contains a balanced proportion of patients for the diagnostic labels of interest37. In each outer loop of the cross-validation, one of these site-folds is designated as the IND dataset, while the remaining \(K_s-1\) site-folds are aggregated to form the OOD test set, simulating a deployment to unseen medical centers.

Inner fold: Monte Carlo cross-validation within IND data

Within each site-preserved outer loop, the IND dataset (composed of seen sites) was further subjected to Monte Carlo Cross-Validation (MCCV). Specifically, the IND data was randomly divided into training and IND testing subsets at a 7:3 ratio. The OOD dataset, drawn from the held-out site-folds, was kept entirely separate and used exclusively for the final OOD performance evaluation.

This nested SP-MCCV procedure guarantees that model training and validation occur on data originating from distinct, site-specific distributions. For the BRCA and NSCLC datasets, we employed \(K_s=5\) site-preserved splits, with each IND fold undergoing 3 MCCV iterations, resulting in a total of 15 evaluation folds. For datasets with fewer source sites and WSIs, we reduced \(K_s\) to 3 to maintain sufficient training data per fold, while increasing MCCV iterations to 5 per IND fold, also yielding a total of 15 folds. This adjustment ensures robust statistical evaluation across datasets of varying sizes and site diversity. In summary, this SP-MCCV strategy provides a rigorous and comprehensive evaluation of model performance, specifically across unseen source sites. By strictly separating training and testing data at the source site level, SP-MCCV allows for a robust and reliable assessment of FLEX’s generalizability and effectiveness in mitigating site-specific biases.
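
The sketch below outlines the full procedure with illustrative stand-in data. The greedy site-to-fold assignment is a simplified substitute for the convex optimization used in the study, and the 7:3 inner split follows the MCCV described above.

```python
import numpy as np

def site_preserved_folds(site_ids, labels, n_folds=5):
    """Assign whole sites to folds, largest site first, always to the currently
    smallest fold (a greedy stand-in for the convex optimization that balances
    diagnostic-label proportions across site-folds)."""
    site_ids, labels = np.asarray(site_ids), np.asarray(labels)
    n_classes = int(labels.max()) + 1
    counts = {s: np.bincount(labels[site_ids == s], minlength=n_classes)
              for s in np.unique(site_ids)}
    fold_counts, fold_of = np.zeros((n_folds, n_classes)), {}
    for s in sorted(counts, key=lambda s: -counts[s].sum()):   # largest first
        f = int(np.argmin(fold_counts.sum(axis=1)))
        fold_of[s] = f
        fold_counts[f] += counts[s]
    return np.array([fold_of[s] for s in site_ids])

# Stand-in data: 40 sites contributing unequal numbers of slides, two classes.
rng = np.random.default_rng(0)
site_ids = rng.integers(0, 40, size=900)
labels = rng.integers(0, 2, size=900)

folds = site_preserved_folds(site_ids, labels, n_folds=5)
for k in range(5):
    ind = np.where(folds == k)[0]            # IND pool: sites in fold k
    ood = np.where(folds != k)[0]            # OOD test set: all held-out sites
    for _ in range(3):                       # inner MCCV iterations
        perm = rng.permutation(ind)          # 7:3 split of the IND pool
        train, ind_test = perm[:int(0.7 * len(ind))], perm[int(0.7 * len(ind)):]
        # ...train on `train`, evaluate on `ind_test` (IND) and `ood` (OOD)
```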

FLEX architecture

As illustrated in Fig. 1d, the proposed FLEX is a flexible and adaptable module that can be seamlessly integrated into existing workflows for WSI analysis tasks. For clarity, we denote the input histopathology dataset as \({{{\mathcal{D}}}}=\{\left({{{\bf{P}}}},{{{\bf{y}}}}\right)\}\), where P = {p1, p2, …, pN} is a set of N patches cropped from the WSI and y is the corresponding label. The traditional WSI analysis pipeline comprises two stages: feature extraction and MIL-based classification. In the feature extraction stage, the image encoder Ei of a pathology VLM extracts patch-level features from the patches: \({{{\bf{X}}}}={{{{\rm{E}}}}}_{i}\left({{{\bf{P}}}}\right)\), where X = {x1, x2, …, xN} denotes the original patch-level features, each of dimension D. In the MIL-based classification stage, a MIL model aggregates the patch-level features into a comprehensive representation, upon which a classifier predicts the bag-level label \(\hat{{{{\bf{y}}}}}\): \(\hat{{{{\bf{y}}}}}={{{\rm{CLS}}}}\left({{{\rm{MIL}}}}\left({{{\bf{X}}}}\right)\right)\), where CLS and MIL denote the classifier and the MIL model, respectively. The objective of FLEX is to enhance the original patch-level features X by leveraging the knowledge encoded in visual concepts V and textual concepts T. This process filters out site-specific signatures and task-agnostic information, yielding enhanced features that are subsequently fed into the MIL model for generalizable and robust WSI analysis: \({{{\bf{Z}}}}={{{\bf{FLEX}}}}\left({{{\bf{X}}}},{{{\bf{V}}}},{{{\bf{T}}}}\right)\), \(\hat{{{{\bf{y}}}}}={{{\rm{CLS}}}}\left({{{\rm{MIL}}}}\left({{{\bf{Z}}}}\right)\right)\), where Z = {z1, z2, …, zN} represents the enhanced patch-level features, each of dimension D. Specifically, the visual concepts V consist of representative patch images for the target class and are used to retrieve important patches from the original WSI that may contain task-related information. The textual concepts T, comprising representative text prompts with learnable contexts, guide the feature enhancement process by filtering out site-specific signatures and task-agnostic information while preserving task-related information. In the following sections, we detail the design of FLEX and its integration with pathology VLMs and MIL models.
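The overall data flow can be summarized in a minimal PyTorch-style sketch, where `image_encoder`, `flex`, `mil`, and `cls` stand in for the frozen VLM image encoder, the FLEX module, the MIL aggregator, and the classifier; all names are illustrative rather than a released API.

```python
import torch

@torch.no_grad()
def extract_features(image_encoder, patches):
    """Stage 1: frozen VLM image encoder maps patches (N, 3, H, W) to X (N, D)."""
    return image_encoder(patches)

def predict_slide(image_encoder, flex, mil, cls, patches, V, T):
    X = extract_features(image_encoder, patches)  # original patch features
    Z = flex(X, V, T)        # FLEX: knowledge-guided feature enhancement
    z_bag = mil(Z)           # MIL aggregation to a slide-level feature
    return cls(z_bag)        # bag-level prediction y_hat
```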

Visual concept generation

The process of visual concept generation is illustrated in Fig. 6a. First, an ABMIL model was trained on the training data, and patches with high attention scores were randomly sampled from the training set. However, because of potential overfitting, these high-attention patches may not be representative of the target class. To mitigate this issue, a multimodal large language model (e.g., GPT-4o) was employed to filter out task-irrelevant and noisy patches, by prompting it with queries asking whether the input patch reliably indicates membership in the target class. This step served as a preliminary screening for representative patches. To further ensure task relevance, a human expert manually verified the selected patch candidates. Finally, the visual concept for each target class was generated by averaging the features of the selected patches, extracted with the image encoder of the pathology VLM.
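A minimal sketch of the final averaging step, assuming `verified_patches` holds the GPT-4o-filtered, expert-verified patches for one class and `image_encoder` is the frozen VLM image encoder (hypothetical names); the unit normalization is our assumption, convenient for the cosine-similarity use described later.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_concept(image_encoder, verified_patches):
    """verified_patches: (M, 3, H, W) patches that passed GPT-4o filtering
    and expert verification for one class (hypothetical input)."""
    feats = image_encoder(verified_patches)   # (M, D) patch features
    concept = feats.mean(dim=0)               # class-specific average
    return F.normalize(concept, dim=-1)       # unit norm (our assumption)
```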

Textual concept generation

The process of textual concept generation is illustrated in Fig. 6b. It involves creating task-specific and task-agnostic prompts that serve as positive and negative supervision signals, respectively. Recognizing the need for accuracy in medical applications, our methodology incorporates an expert validation protocol. First, prompt generation is constrained to simple, descriptive labels grounded in established oncologic terminology (e.g., “invasive ductal carcinoma”, “connective tissue”), which minimizes the risk of factual inaccuracies. Second, and most critically, we implemented a human-in-the-loop expert validation process: all textual concepts, whether initially derived from established class names or drafted with assistance from GPT-4o, were rigorously reviewed, edited for clinical precision, and formally approved by a board-certified pathologist. This expert oversight ensures that every prompt is medically accurate and relevant to the prediction task. The task-specific and task-agnostic prompts used for each cancer type are detailed in Supplementary Tables 24–29. With these controls, prompts were defined as follows. For tasks with clear class boundaries, such as BRCA subtyping, the task-specific prompts are the class names themselves, following previous studies18,62. For more complex tasks with vague class boundaries, high complexity, or limited data, however, class names alone may be insufficient. To address this, we use the same class names as task-specific prompts but attach learnable contexts, which adaptively capture task-specific information from the images during training and enhance the prompts' effectiveness. Task-agnostic prompts consist of general descriptions irrelevant to the task, such as the names of other tissue types (e.g., “connective tissue” or “normal tissue”); they guide the feature enhancement process to filter out task-agnostic information and are likewise attached to learnable contexts. Finally, textual concepts are generated by averaging the embeddings of these prompts (with their learnable contexts) produced by the VLM's text encoder. A CoOp-style sketch of this construction follows.
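The sketch below illustrates one prompt with a learnable context in the style of CoOp-like prompt tuning; it assumes the frozen text encoder can consume token embeddings directly, and all class and module names are illustrative, not the released API. In practice, several such prompt embeddings would be averaged to form one textual concept.

```python
import torch
import torch.nn as nn

class PromptWithContext(nn.Module):
    """One prompt = learnable context tokens + fixed class-name embeddings."""

    def __init__(self, text_encoder, class_token_emb, n_ctx=4):
        super().__init__()
        self.text_encoder = text_encoder                    # kept frozen
        self.register_buffer("class_emb", class_token_emb)  # (L, dim), fixed
        dim = class_token_emb.shape[-1]
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))  # learnable

    def forward(self):
        # Prepend the learnable context to the class-name token embeddings,
        # then encode the full sequence with the frozen text encoder.
        tokens = torch.cat([self.ctx, self.class_emb], dim=0)
        return self.text_encoder(tokens.unsqueeze(0)).squeeze(0)  # (D,)
```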

Knowledge-guided feature enhancement

The feature enhancement process, illustrated in Fig. 6c, is guided by visual and textual knowledge and comprises three main steps: visual concept-guided pilot patch selection, textual concept-guided feature calibration, and feature enhancement using a variational information bottleneck. To enhance the original patch-level features, we leverage the information bottleneck principle63,64, which filters out site-specific signatures and task-agnostic information while preserving task-related information. The information bottleneck principle seeks a compressed representation of the input that minimizes the mutual information between the input and the compressed representation while maximizing the mutual information between the compressed representation and the target output.

First, cropped patches from WSIs are processed through the frozen image encoder of the pathology VLMs to extract the original patch-level features X. For each original feature x and its corresponding enhanced feature z, we define the mutual information between them as:

$$I\left({{{\bf{x}}}},{{{\bf{z}}}}\right)=\int\,p({{{\bf{x}}}},{{{\bf{z}}}})\log \frac{p({{{\bf{z}}}}| {{{\bf{x}}}})}{p({{{\bf{z}}}})}{{{\rm{d}}}}{{{\bf{x}}}}{{{\rm{d}}}}{{{\bf{z}}}},$$
(1)
$$=\int\,p({{{\bf{x}}}},{{{\bf{z}}}})\log p({{{\bf{z}}}}| {{{\bf{x}}}}){{{\rm{d}}}}{{{\bf{x}}}}{{{\rm{d}}}}{{{\bf{z}}}}-\int\,p({{{\bf{z}}}})\log p({{{\bf{z}}}}){{{\rm{d}}}}{{{\bf{z}}}},$$
(2)

where p(x, z) represents the joint distribution of the original and enhanced features, p(x) and p(z) are the marginal distributions, and p(z∣x) is the conditional distribution of the enhanced features given the original features. However, computing p(z) = ∫ p(z∣x)p(x)dx is intractable. To address this issue, we introduce a Gaussian distribution r(z) as a variational approximation of the true marginal distribution p(z). Since the KL divergence between the true marginal p(z) and its variational approximation r(z) is non-negative, we have \(\int\,p({{{\bf{z}}}})\log p({{{\bf{z}}}}){{{\rm{d}}}}{{{\bf{z}}}}\ge \int\,p({{{\bf{z}}}})\log r({{{\bf{z}}}}){{{\rm{d}}}}{{{\bf{z}}}}\). Therefore, we derive an upper bound for the mutual information: \(I\left({{{\bf{x}}}},{{{\bf{z}}}}\right)\le \int\,p({{{\bf{x}}}})p({{{\bf{z}}}}| {{{\bf{x}}}})\log \frac{p({{{\bf{z}}}}| {{{\bf{x}}}})}{r({{{\bf{z}}}})}{{{\rm{d}}}}{{{\bf{x}}}}{{{\rm{d}}}}{{{\bf{z}}}}\).

We further utilize a variational distribution qθ(z∣x) to approximate the true posterior p(z∣x), implemented as a neural network that takes the original features x as input and outputs the mean and variance of the Gaussian distribution of the enhanced features z. We can then sample the enhanced features z from the variational distribution: \({{{\bf{z}}}} \sim {{{\mathcal{N}}}}\left({{{{\rm{NN}}}}}^{\mu }\left({{{\bf{x}}}}\right),{{{{\rm{NN}}}}}^{\Sigma }\left({{{\bf{x}}}}\right)\right)\), where NNμ and NNΣ are the neural networks predicting the mean and variance of the Gaussian distribution, respectively. To enable efficient optimization, we apply the reparameterization trick65, which provides an unbiased gradient estimate of the objective66. With this approach, the upper bound of the mutual information \(I\left({{{\bf{x}}}},{{{\bf{z}}}}\right)\) can be computed as \(\int\,p({{{\bf{x}}}})p({{{\bf{z}}}}| {{{\bf{x}}}})\log \frac{p({{{\bf{z}}}}| {{{\bf{x}}}})}{r({{{\bf{z}}}})}{{{\rm{d}}}}{{{\bf{x}}}}{{{\rm{d}}}}{{{\bf{z}}}}\approx {{\mathbb{E}}}_{{{{\bf{x}}}}}\left[\int {q}_{\theta }({{{\bf{z}}}}| {{{\bf{x}}}})\log \frac{{q}_{\theta }({{{\bf{z}}}}| {{{\bf{x}}}})}{r({{{\bf{z}}}})}{{{\rm{d}}}}{{{\bf{z}}}}\right]\), which is optimized by minimizing the Kullback-Leibler (KL) divergence between the variational posterior qθ(z∣x) and the Gaussian distribution r(z): \({{{{\mathcal{L}}}}}_{{{{\rm{IB}}}}}={{\mathbb{E}}}_{{{{\bf{x}}}}}[{D}_{{{{\rm{KL}}}}}({q}_{\theta }({{{\bf{z}}}}| {{{\bf{x}}}})| | r({{{\bf{z}}}}))]\). Traditionally, the maximization of mutual information between the enhanced features and the target output is achieved by minimizing the cross-entropy loss between the bag-level predictions and the ground-truth labels. However, weakly supervised MIL models often fail to provide accurate supervision signals for optimizing the variational information bottleneck, which operates at the patch level. To address this challenge, we leverage visual and textual concepts to guide the feature enhancement process, providing more accurate patch-level supervision signals.
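A minimal sketch of this variational encoder, assuming r(z) = N(0, I) and implementing NNμ and NNΣ as linear heads (a log-variance head is used for numerical stability); the architecture details are illustrative.

```python
import torch
import torch.nn as nn

class VariationalEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)        # NN^mu
        self.logvar = nn.Linear(dim, dim)    # NN^Sigma, as log-variance

    def forward(self, x):                    # x: (N, D) original features
        mu, logvar = self.mu(x), self.logvar(x)
        eps = torch.randn_like(mu)           # reparameterization trick
        z = mu + eps * torch.exp(0.5 * logvar)   # enhanced features (N, D)
        # Closed-form KL( N(mu, sigma^2) || N(0, I) ), averaged over patches:
        # this is the L_IB term when r(z) is a standard normal.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        return z, kl
```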

Specifically, as only a subset of patches contains task-related information, we select pilot patches zpilot to guide the feature enhancement process. We compute the cosine similarity between the patch features and the visual concept of the target class and designate the top-scoring patches as pilots, ensuring that feature refinement is driven by patches carrying task-related information.
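A short sketch of this selection step; the value of k is a hyperparameter and the default shown is illustrative.

```python
import torch
import torch.nn.functional as F

def select_pilot_patches(feats, concept, k=64):
    """feats: (N, D) patch features; concept: (D,) class visual concept."""
    sim = F.cosine_similarity(feats, concept.unsqueeze(0), dim=-1)  # (N,)
    idx = sim.topk(min(k, feats.size(0))).indices                   # top-k
    return feats[idx]   # pilot patches, ranked by similarity
```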

Following the selection of pilot patches, textual concepts are employed to guide the feature enhancement process. Given that the text encoder of pathology VLMs is inherently general due to pretraining on large-scale, diverse pathology datasets with domain-agnostic learning objectives, we posit that textual concepts are robust and generalizable. These concepts can effectively filter out site-specific signatures, while task-specific textual concepts provide relevant information for the target class. This enables the enhancement process to preserve task-related information and eliminate task-agnostic noise. Consequently, the task-specific textual concept for the target class is treated as the positive supervision signal \({{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\), whereas other task-specific and task-agnostic textual concepts are treated as negative supervision signals \({{{{\bf{c}}}}}_{{{{\rm{neg}}}}}^{{{{\rm{p}}}}}\). To implement this guidance, we utilize the InfoNCE loss67, which minimizes the distance between the enhanced features and the positive supervision signal while maximizing the distance between the enhanced features and the negative supervision signals:

$${{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{z}}^{{{{\rm{p}}}}}=-\frac{1}{M}{\sum }_{j=1}^{M}\log \frac{\exp \left(\frac{{{{{\bf{z}}}}}_{j}^{{{{\rm{pilot}}}}}\cdot {{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}}{\tau }\right)}{\exp \left(\frac{{{{{\bf{z}}}}}_{j}^{{{{\rm{pilot}}}}}\cdot {{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}}{\tau }\right)+\mathop{\sum }_{i=1}^{K}\exp \left(\frac{{{{{\bf{z}}}}}_{j}^{{{{\rm{pilot}}}}}\cdot {{{{\bf{c}}}}}_{{{{{\rm{neg}}}}}_{i}}^{{{{\rm{p}}}}}}{\tau }\right)},$$
(3)

where M represents the number of pilot patches, K denotes the number of negative supervision signals, and τ is the temperature parameter. To further enhance the distinction between task-specific textual concepts and task-agnostic concepts, and to ensure task-specific textual concepts from different classes are more distinguishable, we propose an additional optimization strategy. In this approach, the corresponding task-specific textual concept of the target class serves as the optimization target, the pilot patches are treated as positive samples, and the other task-specific textual concepts and task-agnostic textual concepts are treated as negative samples:

$${{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{c}}^{{{{\rm{p}}}}}=-\log \frac{\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\cdot {{{{\bf{z}}}}}^{{{{\rm{pilot}}}}}}{\tau }\right)}{\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\cdot {{{{\bf{z}}}}}^{{{{\rm{pilot}}}}}}{\tau }\right)+\mathop{\sum }_{i=1}^{K}\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\cdot {{{{\bf{c}}}}}_{{{{{\rm{neg}}}}}_{i}}^{{{{\rm{p}}}}}}{\tau }\right)}.$$
(4)
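A PyTorch-style sketch of the patch-level InfoNCE losses in Eqs. (3)–(4), assuming L2-normalized features and concepts; the temperature default is illustrative, and aggregating the pilot patches by their mean in Eq. (4) is our reading of the unindexed zpilot.

```python
import torch

def nce_z(z_pilot, c_pos, c_neg, tau=0.07):
    """Eq. (3): pull pilot patches toward the target-class concept.
    z_pilot: (M, D); c_pos: (D,); c_neg: (K, D)."""
    pos = torch.exp(z_pilot @ c_pos / tau)                # (M,)
    neg = torch.exp(z_pilot @ c_neg.T / tau).sum(dim=-1)  # (M,)
    return -torch.log(pos / (pos + neg)).mean()

def nce_c(z_pilot, c_pos, c_neg, tau=0.07):
    """Eq. (4): push the target concept toward pilot patches and away
    from the other concepts (mean-aggregated pilots, our assumption)."""
    z = z_pilot.mean(dim=0)                               # (D,)
    pos = torch.exp(torch.dot(c_pos, z) / tau)
    neg = torch.exp(c_neg @ c_pos / tau).sum()
    return -torch.log(pos / (pos + neg))
```

The slide-level losses in Eqs. (5)–(6) follow the same form, with the bag-level feature zbag in place of the pilot patches and independently optimized slide-level concepts.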

The enhanced patch-level features are obtained by inputting the original patch-level features into the information bottleneck. These features are then aggregated by the MIL model to produce the bag-level features, denoted as \({{{{\bf{z}}}}}^{{{{\rm{bag}}}}}={{{\rm{MIL}}}}\left({{{\bf{Z}}}}\right)\). Subsequently, we propose further enhancing the aggregated bag-level features through a contrastive learning process, leveraging the positive supervision signals \({{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\) and negative supervision signals \({{{{\bf{c}}}}}_{{{{\rm{neg}}}}}^{{{{\rm{s}}}}}\), similar to the feature enhancement process applied to the patch-level features. It is important to note that the textual concepts used for patch-level (\({{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{p}}}}}\) and \({{{{\bf{c}}}}}_{{{{\rm{neg}}}}}^{{{{\rm{p}}}}}\)) and slide-level (\({{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\) and \({{{{\bf{c}}}}}_{{{{\rm{neg}}}}}^{{{{\rm{s}}}}}\)) feature enhancement share the same initialization but are optimized independently. The resulting InfoNCE losses for the slide-level feature enhancement are defined as:

$${{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{z}}^{{{{\rm{s}}}}}=-\log \frac{\exp \left(\frac{{{{{\bf{z}}}}}^{{{{\rm{bag}}}}}\cdot {{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}}{\tau }\right)}{\exp \left(\frac{{{{{\bf{z}}}}}^{{{{\rm{bag}}}}}\cdot {{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}}{\tau }\right)+\mathop{\sum }_{i=1}^{K}\exp \left(\frac{{{{{\bf{z}}}}}^{{{{\rm{bag}}}}}\cdot {{{{\bf{c}}}}}_{{{{{\rm{neg}}}}}_{i}}^{{{{\rm{s}}}}}}{\tau }\right)},$$
(5)
$${{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{c}}^{{{{\rm{s}}}}}=-\log \frac{\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\cdot {{{{\bf{z}}}}}^{{{{\rm{bag}}}}}}{\tau }\right)}{\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\cdot {{{{\bf{z}}}}}^{{{{\rm{bag}}}}}}{\tau }\right)+\mathop{\sum }_{i=1}^{K}\exp \left(\frac{{{{{\bf{c}}}}}_{{{{\rm{pos}}}}}^{{{{\rm{s}}}}}\cdot {{{{\bf{c}}}}}_{{{{{\rm{neg}}}}}_{i}}^{{{{\rm{s}}}}}}{\tau }\right)}.$$
(6)

The final prediction of the bag-level label is obtained by feeding the enhanced bag-level features into the classifier: \(\hat{{{{\bf{y}}}}}={{{\rm{CLS}}}}\left({{{{\bf{z}}}}}^{{{{\rm{bag}}}}}\right)\). The prediction is supervised by the ground-truth label through the cross-entropy loss \({{{{\mathcal{L}}}}}_{{{{\rm{CE}}}}}\).

The total loss for FLEX is a combination of the cross-entropy loss for the MIL model, the variational information bottleneck loss, and the InfoNCE losses:

$${{{{\mathcal{L}}}}}_{{{{\rm{FLEX}}}}}={{{{\mathcal{L}}}}}_{{{{\rm{CE}}}}}+{\uplambda }_{{{{\rm{1}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{IB}}}}}+{\uplambda }_{{{{\rm{2}}}}}\left({{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{z}}^{{{{\rm{p}}}}}+{{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{c}}^{{{{\rm{p}}}}}\right)+{\uplambda }_{{{{\rm{3}}}}}\left({{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{z}}^{{{{\rm{s}}}}}+{{{{\mathcal{L}}}}}_{{{{{\rm{NCE}}}}}_{c}}^{{{{\rm{s}}}}}\right),$$
(7)

where λ1, λ2, and λ3 are hyperparameters that balance the different components of the loss function.
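For completeness, the full objective of Eq. (7) in code; the λ defaults shown are placeholders, not tuned values.

```python
def flex_loss(l_ce, l_ib, l_nce_zp, l_nce_cp, l_nce_zs, l_nce_cs,
              lam1=1.0, lam2=1.0, lam3=1.0):
    """Eq. (7): cross-entropy + IB + patch- and slide-level InfoNCE terms."""
    return (l_ce + lam1 * l_ib
            + lam2 * (l_nce_zp + l_nce_cp)
            + lam3 * (l_nce_zs + l_nce_cs))
```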

Training details

All experiments were conducted on a server with 8 NVIDIA L40S GPUs. The length of the learnable context was set to 4 tokens for morphology classification and molecular biomarker status prediction tasks, and 2 tokens for gene mutation prediction tasks. An ablation study evaluated the impact of the learnable context length by tuning it within {0, 2, 4, 6, 8, 10, 12} for each task; the average performance for each task type is reported in the supplementary material. During training, all weights of the pathology VLMs were frozen, and only the learnable contexts, the weights of the variational information bottleneck in FLEX, and the weights of the MIL models were optimized. For optimization, the Adam optimizer was used with a learning rate of 0.0001 for ABMIL, ACMIL, and CLAM-SB, and 0.00001 for DTFD-MIL and ILRA-MIL. The batch size was set to 1, and all tasks were trained for 20 epochs. For evaluation, the AUROC and the F1 score were adopted as metrics. Results for all models are reported as the mean and standard deviation over the 15 SP-MCCV folds for both OOD and IND testing data.

Evaluation metrics

To comprehensively assess the efficacy of FLEX, we evaluated its performance across two key dimensions: standard performance metrics and demographic fairness metrics.

Performance metrics

Given the inherent class imbalance often encountered in WSI analysis tasks, we employed the AUROC and the F1 score as primary performance metrics. These metrics are well-suited for evaluating model performance in imbalanced classification scenarios. The Receiver Operating Characteristic (ROC) curve graphically depicts the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across varying classification thresholds. AUROC, representing the area under this curve, provides a threshold-agnostic measure of a model’s ability to discriminate between positive and negative instances. Its robustness to class imbalance makes AUROC particularly appropriate for our evaluation. Complementing AUROC, the F1 score offers a balanced measure of precision and recall, providing a more granular perspective on classification accuracy. Defined as the harmonic mean of precision and recall, the F1 score is calculated as follows:

$${F}_{1}=\frac{2\times \,{\mbox{TP}}}{2\times {\mbox{TP}}+{\mbox{FP}}+{\mbox{FN}}\,},$$
(8)

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The F1 score penalizes models that disproportionately favor either precision or recall, rewarding those with balanced performance.
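Both metrics are available in scikit-learn; a toy example with illustrative data:

```python
from sklearn.metrics import roc_auc_score, f1_score

y_true = [0, 0, 1, 1]              # toy slide-level labels
y_score = [0.1, 0.4, 0.35, 0.8]    # predicted positive-class probabilities
y_pred = [int(s >= 0.5) for s in y_score]  # thresholded predictions

auroc = roc_auc_score(y_true, y_score)  # threshold-agnostic discrimination
f1 = f1_score(y_true, y_pred)           # Eq. (8): harmonic mean of P and R
```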

Fairness metrics

To evaluate FLEX’s impact on demographic fairness, we adopted a suite of fairness metrics: AUROC gap ratio43, TPR disparity35, and race-wise AUROC35. These metrics collectively quantify and characterize potential performance disparities across different demographic groups, offering a multi-faceted assessment of fairness. The AUROC gap ratio, as defined in43, quantifies the relative AUROC performance disparity between the best-performing and worst-performing demographic groups, normalized by the overall AUROC. A smaller AUROC gap ratio indicates improved fairness. TPR disparity, as introduced in35, measures the difference between the overall TPR and the TPR for each demographic group. TPR disparity values closer to zero suggest greater fairness, indicating that the model’s true positive detection rate is more consistent across demographic groups and less biased towards any particular group. Furthermore, race-wise AUROC provides a granular view of performance by calculating the AUROC separately for each demographic group. Consistent AUROC values across different demographic groups are indicative of a fairer model, demonstrating that the model’s discriminatory power is not significantly influenced by demographic factors.
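The sketch below computes these three quantities following the definitions above (gap ratio = (best − worst group AUROC) / overall AUROC; TPR disparity = overall TPR − per-group TPR); `groups` maps each slide to its demographic group, and the function name is illustrative. It assumes every group contains both classes, as required for a per-group AUROC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score

def fairness_metrics(y_true, y_score, y_pred, groups):
    """AUROC gap ratio, per-group TPR disparity, and group-wise AUROC."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    y_pred, groups = np.asarray(y_pred), np.asarray(groups)
    overall_auroc = roc_auc_score(y_true, y_score)
    overall_tpr = recall_score(y_true, y_pred)   # recall == TPR
    group_auroc, tpr_disparity = {}, {}
    for g in np.unique(groups):
        m = groups == g
        group_auroc[g] = roc_auc_score(y_true[m], y_score[m])
        tpr_disparity[g] = overall_tpr - recall_score(y_true[m], y_pred[m])
    gap_ratio = (max(group_auroc.values())
                 - min(group_auroc.values())) / overall_auroc
    return gap_ratio, tpr_disparity, group_auroc
```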

Ethics declarations

The TCGA and CPTAC datasets are publicly accessible under approved protocols with participants' informed consent. The NFH dataset was collected under the approval of the Medical Ethics Committee of NanFang Hospital of Southern Medical University (NFEC-2025-419), with a waiver of informed consent owing to the retrospective nature of the study and the use of de-identified data.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.