Fig. 6: Overview of the proposed framework and experimental setup.

a Visual knowledge generation pipeline, a one-time offline procedure during training. ABMIL identifies high-attention patches, which are then filtered by GPT-4o and verified by human experts to select representative patches. Class-specific visual concepts are generated by averaging the VLM image encoder features of the representative patches. b Textual knowledge generation pipeline, also a one-time offline procedure. GPT-4o is prompted to create representative text prompts (both task-specific and task-agnostic). After expert verification, these prompts are enriched with learnable, data-driven contexts and encoded by the VLM text encoder to create the final textual concepts. c Architecture of FLEX. During inference, original patch features are extracted by the VLM image encoder; a variational encoder then generates the parameters of a Gaussian distribution for each patch, from which enhanced features are sampled and aggregated by the MIL model. During training, a Visual Concept Guided Pilot Patch Selection module uses the pre-computed visual concept to select the top-k most relevant enhanced patches. These selected patches are then used in the Textual Concept Guided Feature Calibration process, where an InfoNCE loss aligns their features with the textual concepts by minimizing the distance to the corresponding class concept while maximizing the distance from the other concepts. This process helps to optimize the variational encoder of FLEX. d Schematic of the SP-MCCV strategy. The dataset is partitioned into outer folds by clinical site to create distinct training and OOD test sets. Inner folds then randomly split the training data for IND evaluation. Some illustrations were created with BioRender.com.
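
The training-time steps in panel c (sampling enhanced features, selecting the top-k pilot patches, and computing the InfoNCE calibration loss) can be illustrated with a minimal NumPy sketch. This is not the FLEX implementation. The function names, the use of cosine similarity for both selection and alignment, the temperature of 0.07, and the random toy data are all assumptions made for illustration. A real training loop would use an automatic differentiation framework so that the loss can update the variational encoder.

```python
import numpy as np

def sample_enhanced(mu, log_var, rng):
    """Draw one reparameterized sample per patch from N(mu, diag(exp(log_var)))."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def select_pilot_patches(feats, visual_concept, k):
    """Return indices of the k patches most similar to the visual concept.

    Assumption: relevance is measured by cosine similarity.
    """
    sims = l2_normalize(feats) @ l2_normalize(visual_concept)
    return np.argsort(-sims)[:k]

def info_nce(feats, text_concepts, label, temperature=0.07):
    """Mean InfoNCE loss over patches.

    The positive is the textual concept of class `label`; the textual
    concepts of all other classes act as negatives.
    """
    logits = l2_normalize(feats) @ l2_normalize(text_concepts).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, label].mean())

# Toy data standing in for one slide (all values are hypothetical).
rng = np.random.default_rng(0)
n_patches, dim, n_classes, k = 200, 32, 3, 16
mu = rng.standard_normal((n_patches, dim))       # variational encoder means
log_var = np.full((n_patches, dim), -2.0)        # variational encoder log-variances
visual_concept = rng.standard_normal(dim)        # pre-computed visual concept
text_concepts = rng.standard_normal((n_classes, dim))  # one textual concept per class

enhanced = sample_enhanced(mu, log_var, rng)
pilot_idx = select_pilot_patches(enhanced, visual_concept, k)
loss = info_nce(enhanced[pilot_idx], text_concepts, label=1)
```

Minimizing this loss raises the similarity between the selected patch features and the textual concept of the slide's class relative to the concepts of the other classes, which is the alignment behavior the caption describes.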