Fig. 5: Vision-language evaluation.
From: A multimodal knowledge-enhanced whole-slide pathology foundation model

a The scheme of zero-shot evaluation. For zero-shot classification, class prompts are used as the text input; for zero-shot retrieval, the text input is a pathology report. b Performance of zero-shot slide classification on 6 independent datasets. 'Overall' refers to the performance averaged across these 6 datasets. For all bar plots, error bars represent 95% CIs computed from 1000 bootstrap replicates. P-values are from a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. c Performance of zero-shot retrieval on an external dataset for the Image-to-Text and Text-to-Image tasks. Results on the held-out TCGA dataset are presented for reference only, for comparison with the zero-shot performance. d Performance of report generation on one held-out TCGA dataset and two external datasets. The P-value for each group of experiments is from a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. Detailed performance on each dataset is presented in Supplementary Tables 15–17. Source data are provided as a Source Data file.
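
To make the zero-shot classification scheme in panel a and the bootstrap error bars in panel b concrete, the following is a minimal Python sketch. It is not the authors' implementation. It assumes that a slide is assigned to the class whose prompt embedding has the highest cosine similarity to the slide embedding (a common CLIP-style choice that the caption does not confirm), that the 95% CI is a percentile bootstrap, and that plain accuracy stands in for the evaluation metric, which the caption does not name. The function names and the toy embeddings are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(slide_emb, prompt_embs):
    """Return the index of the class prompt most similar to the slide.

    Assumption: similarity is cosine similarity in a shared embedding space.
    """
    sims = [cosine(slide_emb, p) for p in prompt_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy (placeholder metric).

    Resamples slides with replacement n_boot times and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled accuracies.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy example with made-up 2-D embeddings (illustrative values only).
prompts = [[1.0, 0.0], [0.0, 1.0]]            # one embedding per class prompt
slides = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]  # one embedding per slide
labels = [0, 1, 1]

preds = [zero_shot_classify(s, prompts) for s in slides]
ci_low, ci_high = bootstrap_ci(labels, preds, n_boot=1000)
print(preds, ci_low, ci_high)
```

In practice the embeddings would come from the image and text encoders of the model being evaluated, and the one-sided Wilcoxon signed-rank test mentioned in the caption would then be applied to paired scores of the two models being compared.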