Benchmarking large language model-based agent systems for clinical decision tasks
  • Article
  • Open access
  • Published: 18 February 2026


  • Yunsong Liu1,2,
  • Zunamys I. Carrero2,
  • Xiaofeng Jiang2,3,
  • Dyke Ferber4,
  • Georg Wölflein2,5,
  • Li Zhang2,
  • Sanddhya Jayabalan2,
  • Tim Lenz2,
  • Zhouguang Hui6 &
  • …
  • Jakob Nikolas Kather2,4,7,8 

npj Digital Medicine (2026)


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Business and industry
  • Computational biology and bioinformatics
  • Health care
  • Mathematics and computing
  • Medical research

Abstract

Agentic artificial intelligence (AI) systems, designed to autonomously reason, plan, and invoke tools, have shown promise in healthcare, yet systematic benchmarking of their real-world performance remains limited. In this study, we evaluate two such systems: the open-source OpenManus, built on Meta’s Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity’s Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% on AgentClinic MedQA and MIMIC, respectively, 30.3% on MedAgentsBench, and 8.6% on HLE text. Multimodal accuracy remained low (15.5% on multimodal HLE, 29.2% on AgentClinic NEJM), while resource demands increased substantially, with >10× token usage and >2× latency. Although 89.9% of hallucinations were filtered by in-agent safeguards, hallucinations remained prevalent. These findings reveal that current agentic designs offer modest performance benefits at significant computational and workflow cost, underscoring the need for more accurate, efficient, and clinically viable agent systems.


Data availability

The MIMIC-IV dataset can be accessed at https://physionet.org/content/mimiciv/3.1/ upon submission and approval of a data access application. All other data are available at https://github.com/NCCYUNSONG/AgentBenchMedicine_source.

Code availability

All source code is available at https://github.com/NCCYUNSONG/AgentBenchMedicine_source.


Acknowledgements

J.N.K. is supported by the German Cancer Aid DKH (DECADE, 70115166), the German Federal Ministry of Research, Technology and Space BMFTR (PEARL, 01KD2104C; CAMINO, 01EO2101; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET Transcan; Come2Data, 16DKZ2044A; DEEP-HCC, 031L0315A; DECIPHER-M, 01KD2420A; NextBIG, 01ZU2402A), the German Research Foundation DFG (TRR 412/1, 535081457; SFB 1709/1 2025, 533056198), the German Academic Exchange Service DAAD (SECAI, 57616814), the German Federal Joint Committee G-BA (TransplantKI, 01VSF21048), the European Union EU’s Horizon Europe research and innovation programme (ODELIA, 101057091; GENIAL, 101096312), the European Research Council ERC (NADIR, 101114631), the Breast cancer Research Foundation (BELLADONNA, BCRF-25-225) and the National Institute for Health and Care Research NIHR (Leeds Biomedical Research Centre, NIHR203331). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. This work was funded by the European Union. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

  1. Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

    Yunsong Liu

  2. Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany

    Yunsong Liu, Zunamys I. Carrero, Xiaofeng Jiang, Georg Wölflein, Li Zhang, Sanddhya Jayabalan, Tim Lenz & Jakob Nikolas Kather

  3. Department of Thoracic Surgery, Sichuan Clinical Research Center for Cancer, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, University of Electronic Science and Technology of China (UESTC), Chengdu, China

    Xiaofeng Jiang

  4. Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany

    Dyke Ferber & Jakob Nikolas Kather

  5. School of Computer Science, University of St Andrews, St Andrews, UK

    Georg Wölflein

  6. Department of VIP Medical Services, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

    Zhouguang Hui

  7. Department of Medicine I, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany

    Jakob Nikolas Kather

  8. Pathology & Data Analytics, Leeds Institute of Medical Research at St James’s University of Leeds, Leeds, UK

    Jakob Nikolas Kather


Contributions

Y.L.: Conceptualization, Data curation, Resources, Writing—original draft; Z.C.: Resources, Writing—original draft; X.J.: Writing—original draft; D.F.: Writing—original draft; G.W.: Methodology, Writing—original draft; L.Z.: Writing—original draft; S.J.: Writing—original draft; T.L.: Writing—original draft; Z.H.: Supervision, Project administration; J.N.K.: Conceptualization, Project administration, Supervision, Writing—review & editing. All authors reviewed the manuscript and approved the submitted version.

Corresponding authors

Correspondence to Zhouguang Hui or Jakob Nikolas Kather.

Ethics declarations

Competing interests

J.N.K. declares ongoing consulting services for AstraZeneca and Bioptimus. Furthermore, he holds shares in StratifAI, Synagen, and Spira Labs, has received institutional research grants from GSK and AstraZeneca, as well as honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer, and Fresenius. Author J.N.K. is the Deputy Editor of npj Precision Oncology. J.N.K. was not involved in the journal’s review of, or decisions related to, this manuscript.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Liu, Y., Carrero, Z.I., Jiang, X. et al. Benchmarking large language model-based agent systems for clinical decision tasks. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02443-6


  • Received: 09 September 2025

  • Accepted: 06 February 2026

  • Published: 18 February 2026

  • DOI: https://doi.org/10.1038/s41746-026-02443-6


Associated content

Collection

Impact of Agentic AI on Care Delivery
