Abstract
Agentic artificial intelligence (AI) systems, designed to autonomously reason, plan, and invoke tools, have shown promise in healthcare, yet systematic benchmarking of their real-world performance remains limited. In this study, we evaluate two such systems: the open-source OpenManus, built on Meta’s Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity’s Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), the agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% on AgentClinic MedQA and AgentClinic MIMIC, respectively, 30.3% on MedAgentsBench, and 8.6% on HLE text. Multimodal accuracy remained low (15.5% on multimodal HLE, 29.2% on AgentClinic NEJM), while resource demands increased substantially, with >10× token usage and >2× latency. Although in-agent safeguards filtered 89.9% of hallucinations, hallucinations nonetheless remained prevalent. These findings reveal that current agentic designs offer modest performance benefits at significant computational and workflow cost, underscoring the need for more accurate, efficient, and clinically viable agent systems.
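To make the planner-executor-verifier pattern referenced above concrete, the minimal Python sketch below illustrates the control flow such an agent system follows. Every class and function name here is an illustrative assumption for exposition, not the Manus or OpenManus implementation; the stubbed planner, executor, and verifier stand in for the LLM and tool calls a real system would make.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a planner-executor-verifier loop; names are
# illustrative and do not reflect the Manus/OpenManus source code.

@dataclass
class AgentState:
    task: str
    steps: list = field(default_factory=list)  # executed (step, result) pairs
    budget: int = 5                            # maximum plan-execute rounds

def plan(state: AgentState) -> str:
    """Stub planner: a real system would prompt an LLM with the task
    and the results accumulated so far."""
    return f"step {len(state.steps) + 1} toward: {state.task}"

def execute(step: str) -> str:
    """Stub executor: a real system would invoke a tool here, e.g.
    web browsing, code execution, or text file editing."""
    return f"result of ({step})"

def verify(state: AgentState) -> bool:
    """Stub verifier: a real system would ask an LLM to judge whether
    the accumulated results answer the task; here we stop after two
    executed steps purely for demonstration."""
    return len(state.steps) >= 2

def run_agent(task: str) -> AgentState:
    state = AgentState(task=task)
    for _ in range(state.budget):      # bounded loop caps token usage
        step = plan(state)
        state.steps.append((step, execute(step)))
        if verify(state):              # verifier gates termination
            break
    return state

if __name__ == "__main__":
    final = run_agent("answer a simulated AgentClinic vignette")
    for step, result in final.steps:
        print(step, "->", result)
```

The bounded loop also makes the resource findings intuitive: each round adds a planner call, one or more tool calls, and a verifier call, so a multistep agent can easily consume an order of magnitude more tokens than a single LLM query on the same question.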
Data availability
The MIMIC-IV dataset can be accessed at https://physionet.org/content/mimiciv/3.1/ upon submission and approval of a data access application. All other data are available at https://github.com/NCCYUNSONG/AgentBenchMedicine_source.
Code availability
All source code is available at https://github.com/NCCYUNSONG/AgentBenchMedicine_source.
References
Shortliffe, E. & Sepúlveda, M. Clinical decision support in the era of artificial intelligence. JAMA 320, 2199–2200 (2018).
Elhaddad, M. & Hamam, S. AI-driven clinical decision support systems: an ongoing pursuit of potential. Cureus 16, e57728 (2024).
Yang, R. et al. Large language models in health care: development, applications, and challenges. Health Care Sci. 2, 255–263 (2023).
Bedi, S. et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 333, 319–328 (2025).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.13375 (2023).
Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).
Takita, H. et al. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit. Med. 8, 175 (2025).
Jiang, Y. et al. MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents. NEJM AI 2, https://doi.org/10.1056/AIdbp2500144 (2025).
Schmidgall, S. et al. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.07960 (2024).
Johri, S. et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat. Med. 31, 77–86 (2025).
Tang, X. et al. MedAgentsBench: benchmarking thinking models and agent frameworks for complex medical reasoning. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.07459 (2025).
Sapkota, R., Roumeliotis, K. I. & Karkee, M. AI agents vs. agentic AI: a conceptual taxonomy, applications and challenges. Inf. Fusion 126, 103599 (2026).
Xi, Z. et al. The rise and potential of large language model based agents: a survey. Sci. China Inf. Sci. 68, 121101 (2025).
Hong, S. et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In Proc. International Conference on Learning Representations Vol. 2024, 23247–23275 (2024).
Robson, M. J., Xu, S., Wang, Z., Chen, Q. & Ciucci, F. Multi-agent-network-based idea generator for zinc-ion battery electrolyte discovery: a case study on zinc tetrafluoroborate hydrate-based deep eutectic electrolytes. Adv. Mater. 37, e2502649 (2025).
Su, H. et al. Many heads are better than one: improved scientific idea generation by an LLM-based multi-agent system. In Proc. 63rd Annual Meeting of the Association for Computational Linguistics (eds. Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T.) Vol. 1: Long Papers, 28201–28240 (Association for Computational Linguistics, 2025).
Li, R. et al. CARE-AD: a multi-agent large language model framework for Alzheimer’s disease prediction using longitudinal clinical notes. NPJ Digit. Med. 8, 541 (2025).
Shen, M. & Yang, Q. From mind to machine: the rise of Manus AI as a fully autonomous digital agent. Preprint at arXiv https://doi.org/10.48550/arXiv.2505.02024 (2025).
LLMHacker. Manus AI: the best autonomous AI agent redefining automation and productivity. https://huggingface.co/blog/LLMhacker/manus-ai-best-ai-agent.
Moritz, M., Topol, E. & Rajpurkar, P. Coordinated AI agents for advancing healthcare. Nat. Biomed. Eng. 9, 432–438 (2025).
Wang, Q. et al. AgentTaxo: Dissecting and Benchmarking Token Distribution of LLM Multi-Agent Systems. ICLR 2025 Workshop on Foundation Models in the Wild. https://openreview.net/forum?id=0iLbiYYIpC (2025).
Fernández-Pichel, M., Pichel, J. C. & Losada, D. E. Evaluating search engines and large language models for answering health questions. NPJ Digit. Med. 8, 153 (2025).
Song, J., Xu, Z., He, M., Feng, J. & Shen, B. Graph retrieval augmented large language models for facial phenotype associated rare genetic disease. NPJ Digit. Med. 8, 543 (2025).
Sim, S. Z. Y. & Chen, T. Critique of impure reason: unveiling the reasoning behaviour of medical large language models. Elife 14, e106187 (2025).
Nori, H. et al. Sequential diagnosis with language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.22405 (2025).
Chen, X. et al. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ Digit. Med. 8, 159 (2025).
Ji, Z. et al. Survey of hallucination in Natural Language Generation. ACM Comput. Surv. https://doi.org/10.1145/3571730 (2022).
Brohi, S., Mastoi, Q.-U. -A., Jhanjhi, N. Z. & Pillai, T. R. A research landscape of agentic AI and large language models: applications, challenges and future directions. Algorithms 18, 499 (2025).
Xu, G. et al. A comprehensive survey of AI agents in healthcare. Preprint at TechRxiv https://doi.org/10.36227/techrxiv.176240542.22279040/v2 (2025).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Johnson, A. et al. MIMIC-IV (version 3.1). PhysioNet https://doi.org/10.13026/kpb9-mt58 (2024).
Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).
Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). (eds. Inui, K., Jiang, J., Ng, V. & Wan, X.) 2567–2577 (Association for Computational Linguistics, Hong Kong, 2019).
Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. In Proc. Conference on Health, Inference, and Learning (eds. Flores, G., Chen, G. H., Pollard, T., Ho, J. C. & Naumann, T.) Vol. 174, 248–260 (PMLR, 2022).
Kim, Y., Wu, J., Abdulle, Y. & Wu, H. MedExQA: medical question answering benchmark with multiple explanations. Preprint at arXiv https://doi.org/10.48550/arXiv.2406.06331 (2024).
Wang, Y. et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In Advances in Neural Information Processing Systems (eds. Globerson, A. et al.) Vol. 37, 95266–95290 (Curran Associates, Inc., 2024).
Chen, H., Fang, Z., Singla, Y. & Dredze, M. Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions. In Proc. 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Chiruzzo, L., Ritter, A. & Wang, L.) Vol. 1: Long Papers, 3563–3599 (Association for Computational Linguistics, Albuquerque, New Mexico, 2025).
Hendrycks, D. et al. Measuring massive multitask language understanding. Preprint at arXiv https://doi.org/10.48550/arXiv.2009.03300 (2020).
Zuo, Y. et al. MedXpertQA: benchmarking expert-level medical reasoning and understanding. Preprint at arXiv https://doi.org/10.48550/arXiv.2501.18362 (2025).
Center for AI Safety, Scale AI & HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature 649, 1139–1146 (2026).
Kwon, W. et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proc. 29th Symposium on Operating Systems Principles. 611–626 (Association for Computing Machinery, New York, NY, 2023).
Gerganov, G. llama.cpp: LLM inference in C/C++. https://github.com/ggerganov/llama.cpp (GitHub, 2023).
OpenAI. OpenAI API. https://openai.com/api (2023).
Liang, X. et al. OpenManus: an open-source framework for building general AI agents. Preprint at https://doi.org/10.5281/zenodo.15186407 (2025).
Zhu, Y. et al. MedAgentBoard: benchmarking multi-agent collaboration with conventional methods for diverse medical tasks. Preprint at arXiv https://doi.org/10.48550/arXiv.2505.12371 (2025).
Acknowledgements
J.N.K. is supported by the German Cancer Aid DKH (DECADE, 70115166), the German Federal Ministry of Research, Technology and Space BMFTR (PEARL, 01KD2104C; CAMINO, 01EO2101; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET Transcan; Come2Data, 16DKZ2044A; DEEP-HCC, 031L0315A; DECIPHER-M, 01KD2420A; NextBIG, 01ZU2402A), the German Research Foundation DFG (TRR 412/1, 535081457; SFB 1709/1 2025, 533056198), the German Academic Exchange Service DAAD (SECAI, 57616814), the German Federal Joint Committee G-BA (TransplantKI, 01VSF21048), the European Union EU’s Horizon Europe research and innovation programme (ODELIA, 101057091; GENIAL, 101096312), the European Research Council ERC (NADIR, 101114631), the Breast Cancer Research Foundation (BELLADONNA, BCRF-25-225) and the National Institute for Health and Care Research NIHR (Leeds Biomedical Research Centre, NIHR203331). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. This work was funded by the European Union. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
Y.L.: Conceptualization, Data curation, Resources, Writing—original draft; Z.C.: Resources, Writing—original draft; X.J.: Writing—original draft; D.F.: Writing—original draft; G.W.: Methodology, Writing—original draft; L.Z.: Writing—original draft; S.J.: Writing—original draft; T.L.: Writing—original draft; Z.H.: Supervision, Project administration; J.N.K.: Conceptualization, Project administration, Supervision, Writing—review & editing. All authors reviewed the manuscript and approved the submitted version.
Ethics declarations
Competing interests
J.N.K. declares ongoing consulting services for AstraZeneca and Bioptimus. Furthermore, he holds shares in StratifAI, Synagen, and Spira Labs, has received an institutional research grant from GSK and AstraZeneca, as well as honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer, and Fresenius. J.N.K. is the Deputy Editor of npj Precision Oncology and was not involved in the journal’s review of, or decisions related to, this manuscript.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, Y., Carrero, Z.I., Jiang, X. et al. Benchmarking large language model-based agent systems for clinical decision tasks. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02443-6