Abstract
The advent of vision–language models fosters interactive conversations between artificial intelligence-enabled models and humans. However, applying these models in the clinic faces challenges related to large-scale training data as well as financial and computational resources. Here we propose CLOVER, a cost-effective instruction learning framework for conversational pathology. CLOVER trains only a lightweight module and uses instruction tuning while freezing the parameters of the large language model. Instead of using costly GPT-4, we propose well-designed prompts on GPT-3.5 for building generation-based instructions, emphasizing the utility of pathological knowledge derived from Internet sources. We also construct a high-quality set of template-based instructions in the context of digital pathology. Using two benchmark datasets, our findings reveal the strength of hybrid-form, pathological visual question–answer instructions. CLOVER outperforms baselines that possess 37 times more training parameters and exhibits few-shot capacity on an external clinical dataset. CLOVER could thus accelerate the adoption of rapid conversational applications in digital pathology.
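As a rough illustration of the parameter-efficient recipe the abstract describes (a frozen large language model paired with a small trainable module for instruction tuning), the PyTorch sketch below freezes the LLM weights and exposes only a lightweight projection layer as trainable. The module name, feature dimensions and optimizer settings are illustrative assumptions, not CLOVER's released implementation; see the repository listed under 'Code availability' for the authors' code.

```python
# Minimal sketch (assumed setup, not the authors' code): freeze the LLM and
# train only a lightweight projector that maps vision features into the
# LLM embedding space, as in parameter-efficient visual instruction tuning.
import torch
import torch.nn as nn


class LightweightProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.proj(image_features)


def freeze_llm_and_collect_trainable(llm: nn.Module, projector: nn.Module):
    """Freeze every LLM parameter; return only the projector's parameters."""
    for p in llm.parameters():
        p.requires_grad = False
    return [p for p in projector.parameters() if p.requires_grad]


# Usage (hypothetical dimensions and learning rate):
# projector = LightweightProjector()
# optimizer = torch.optim.AdamW(
#     freeze_llm_and_collect_trainable(llm, projector), lr=1e-4)
```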
Data availability
The QUILT-1M, QUILT-VQA and Quilt-instruct (ref. 27) datasets can be accessed via GitHub at https://quilt1m.github.io/. LLaVA-Med-Pathology (ref. 18) can be accessed via GitHub at https://github.com/microsoft/LLaVA-Med. PathVQA (ref. 34) can be downloaded via Hugging Face at https://huggingface.co/datasets/flaviagiammarino/path-vqa. The clinical dataset from Xinhua Hospital is available upon request from the corresponding author (zhangshaoting@pjlab.org.cn) owing to the hospital's privacy protection restrictions. Requests will be reviewed to ensure confidentiality. A data-sharing agreement must be signed before data release. Source data are provided with this paper.
Code availability
The code, instruction datasets and models are publicly available via GitHub at https://github.com/JLINEkai/CLOVER (ref. 57).
References
Zhang, Y. et al. Data-centric foundation models in computational healthcare: a survey. Preprint at https://arxiv.org/abs/2401.02458 (2024).
van Sonsbeek, T., Derakhshani, M. M., Najdenkoska, I., Snoek, C. G. M. & Worring, M. Open-ended medical visual question answering through prefix tuning of language models. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, Lecture Notes in Computer Science, vol. 14224 (eds Greenspan, H. et al.) https://doi.org/10.1007/978-3-031-43904-9_70 (Springer, Cham, 2023).
Li, P., Liu, G., Tan, L., Liao, J. & Zhong, S. Self-supervised vision-language pretraining for medical visual question answering. In IEEE 20th International Symposium on Biomedical Imaging https://doi.org/10.1109/ISBI53787.2023.10230743 (IEEE, 2023).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://openreview.net/pdf?id=_VjQlMeSB_J (NeurIPS, 2022).
Schulman, J. et al. ChatGPT: optimizing language models for dialogue. OpenAI Blog https://openai.com/index/chatgpt (2022).
Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf (NeurIPS, 2023).
Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://papers.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf (NeurIPS, 2022).
Koh, J. Y., Salakhutdinov, R. & Fried, D. Grounding language models to images for multimodal generation. In International Conference on Machine Learning 17283–17300 (PMLR, 2023).
Ye, Q. et al. mPLUG-Owl: modularization empowers large language models with multimodality. Preprint at https://arxiv.org/abs/2304.14178 (2023).
Song, A. H. et al. Artificial intelligence for digital and computational pathology. Nat. Rev. Bioeng. 1, 930–949 (2023).
Schwalbe, N. & Wahl, B. Artificial intelligence and the future of global health. Lancet 395, 1579–1586 (2020).
Baxi, V., Edwards, R., Montalto, M. & Saha, S. Digital pathology and artificial intelligence in translational medicine and clinical practice. Mod. Pathol. 35, 23–32 (2022).
Wang, X. et al. Editorial for special issue on foundation models for medical image analysis. Med. Image Anal. https://doi.org/10.1016/j.media.2024.103389 (2024).
Zhang, S. & Metaxas, D. On the challenges and perspectives of foundation models for medical image analysis. Med. Image Anal. 91, 102996 (2024).
Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) https://papers.neurips.cc/paper_files/paper/2023/file/5abcdf8ecdcacba028c6662789194572-Paper-Datasets_and_Benchmarks.pdf (NeurIPS, 2023).
Seyfioglu, M. S., Ikezogwo, W. O., Ghezloo, F., Krishna, R. & Shapiro, L. Quilt-LLaVA: visual instruction tuning by extracting localized narratives from open-source histopathology videos. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 13183–13192 (IEEE, 2024).
Wu, C. et al. PMC-LLaMA: toward building open-source language models for medicine. J. Am. Med. Inform. Assoc. https://doi.org/10.1093/jamia/ocae045 (2024).
Moor, M. et al. Med-flamingo: a multimodal medical few-shot learner. In Proc. 3rd Machine Learning for Health Symposium 353–367 (PMLR, 2023).
Wang, X. et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature 634, 970–978 (2024).
Lu, M. Y. et al. A visual-language foundation model for computational pathology. Nat. Med. 30, 863–874 (2024).
Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473 (2024).
Xu, Y. et al. A multimodal knowledge-enhanced whole-slide pathology foundation model. Preprint at https://arxiv.org/abs/2407.15362 (2024).
Zhang, S. et al. Large-scale domain-specific pretraining for biomedical vision-language processing. Preprint at https://arxiv.org/abs/2303.00915 (2023).
Ikezogwo, W. O. et al. Quilt-1M: one million image–text pairs for histopathology. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) https://proceedings.neurips.cc/paper_files/paper/2023/file/775ec578876fa6812c062644964b9870-Paper-Datasets_and_Benchmarks.pdf (NeurIPS, 2023).
Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual–language foundation model for pathology image analysis using medical twitter. Nat. Med. 29, 2307–2316 (2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. International Conference on Machine Learning 8748–8763 (PMLR, 2021).
Gao, Y., Gu, D., Zhou, M. & Metaxas, D. Aligning human knowledge with visual concepts towards explainable medical image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention 46–56 (Springer, 2024).
Zhang, Y. et al. Text-guided foundation model adaptation for pathological image classification. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention 272–282 (Springer, 2023).
Ding, K., Zhou, M., Metaxas, D. N. & Zhang, S. Pathology-and-genomics multimodal transformer for survival outcome prediction. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention 622–631 (Springer, 2023).
Li, J., Li, D., Savarese, S. & Hoi, S. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning 19730–19742 (PMLR, 2023).
He, X. et al. PathVQA: 30000+ questions for medical visual question answering. Preprint at https://arxiv.org/abs/2003.10286 (2020).
Dubey, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).
Chen, X. et al. Recent advances and clinical applications of deep learning in medical image analysis. Med. Image Anal. 79, 102444 (2022).
Chang, Q. et al. Mining multi-center heterogeneous medical data with distributed synthetic learning. Nat. Commun. 14, 5510 (2023).
Graikos, A. et al. Learned representation-guided diffusion models for large-image generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 8532–8542 (IEEE, 2024).
Ding, K. et al. A large-scale synthetic pathological dataset for deep learning-enabled segmentation of breast cancer. Sci. Data 10, 231 (2023).
Sun, Y. et al. PathGen-1.6M: 1.6 million pathology image-text pairs generation through multi-agent collaboration. In Thirteenth International Conference on Learning Representations (ICLR 2025) https://openreview.net/pdf?id=rFpZnn11gj (ICLR, 2025).
Chung, H. W. et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 25, 1–53 (2024).
Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved baselines with visual instruction tuning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 26296–26306 (IEEE, 2024).
Peng, B., Li, C., He, P., Galley, M. & Gao, J. Instruction tuning with GPT-4. Preprint at https://arxiv.org/abs/2304.03277 (2023).
Zhou, C. et al. LIMA: less is more for alignment. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) https://proceedings.neurips.cc/paper_files/paper/2023/file/ac662d74829e4407ce1d126477f4a03a-Paper-Conference.pdf (NeurIPS, 2023).
Wei, J. et al. Finetuned language models are zero-shot learners. In Tenth International Conference on Learning Representations https://openreview.net/pdf?id=gEZrGCozdqR (ICLR, 2022).
Chen, P., Zhu, C., Zheng, S., Li, H. & Yang, L. WSI-VQA: interpreting whole slide images by generative visual question answering. In Proc. European Conference on Computer Vision 401–417 (Springer, 2025).
Ding, K., Zhou, M., Wang, H., Zhang, S. & Metaxas, D. N. Spatially aware graph neural networks and cross-level molecular profile prediction in colon cancer histopathology: a retrospective multi-cohort study. Lancet Digit. Health 4, 787–795 (2022).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Tenth International Conference on Learning Representations https://openreview.net/pdf?id=nZeVKeeFYf9 (ICLR, 2022).
Jin, Y. et al. Efficient multimodal large language models: a survey. Preprint at https://arxiv.org/abs/2405.10739 (2024).
Liu, H. et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://openreview.net/pdf?id=rBCvMG-JsPd (NeurIPS, 2022).
Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017) https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (NIPS, 2017).
Li, J., Li, D., Xiong, C. & Hoi, S. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proc. International Conference on Machine Learning 12888–12900 (PMLR, 2022).
Fang, Y. et al. EVA: exploring the limits of masked visual representation learning at scale. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 19358–19369 (IEEE, 2023).
Chiang, W.-L. et al. Vicuna: an open-source chatbot impressing GPT-4 with 90% ChatGPT quality. LMSYS Org https://vicuna.lmsys.org (2023).
Bazi, Y., Rahhal, M. M. A., Bashmal, L. & Zuair, M. Vision–language model for visual question answering in medical imagery. Bioengineering 10, 380 (2023).
Liu, Y., Wang, Z., Xu, D. & Zhou, L. Q2ATransformer: improving medical VQA via an answer querying decoder. In Proc. International Conference on Information Processing in Medical Imaging 445–456 (Springer Nature, 2023).
Chen, K. et al. Cost-effective instruction learning for pathology vision and language analysis. Zenodo https://doi.org/10.5281/zenodo.15081542 (2025).
Acknowledgements
This study is supported in part by Shanghai Artificial Intelligence Laboratory (M.L. and S.Z.) and the Centre for Perceptual and Interactive Intelligence Ltd under the Innovation and Technology Commission’s InnoHK (S.Z.).
Author information
Authors and Affiliations
Contributions
K.C., M.L., M.Z. and S.Z. are major contributors to drafting and revising the manuscript for content and analyzing the data. F.Y., L.M., X.S., L.W., X.W., L.Z. and Z.W. played major roles in the acquisition of data. K.C., M.L., M.Z., F.Y., X.S., L.W., L.M. and X.W. substantially revised the manuscript. K.C., M.L., M.Z. and S.Z. conceptualized and designed the study. M.L., L.Z. and Z.W. interpreted the data. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Hao Chen, Prateek Verma and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Related Works, Tables 1–9 and Fig. 1.
Source data
Source Data Fig. 2.
Statistical source data for Fig. 2.
Source Data Fig. 3.
Statistical source data for Fig. 3.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, K., Liu, M., Yan, F. et al. Cost-effective instruction learning for pathology vision and language analysis. Nat Comput Sci 5, 524–533 (2025). https://doi.org/10.1038/s43588-025-00818-5