Abstract
Artificial intelligence agents are emerging as powerful applications of large language models (LLMs), automating complex tasks and enabling scientific data exploration. However, their use in biomedical data analysis remains limited by the difficulty of handling specialized tools and multistep reasoning. Here we introduce BioMedAgent, a self-evolving LLM multi-agent framework, which learns to use diverse bioinformatics tools and chain them into executable workflows through interactive exploration and memory retrieval algorithms. It allows biomedical users to initiate tasks using natural language, without requiring computational expertise. Evaluated on our newly released BioMed-AQA benchmark comprising 327 biomedical data tasks, BioMedAgent achieved a 77% success rate, surpassing other LLM agents, and generalized robustly to the external BixBench dataset. Beyond benchmarks, it autonomously performs cross-omics analysis, machine-learning modelling and pathology image segmentation, highlighting its potential to advance biomedical research and extend to other scientific domains requiring complex tool integration and multistep reasoning.
Data availability
The BioMed-AQA benchmark (n = 327 open questions) introduced in this study is publicly available on Hugging Face at https://huggingface.co/datasets/BOBQWERA/biomed-aqa-dataset and includes the full set of natural-language questions, reference steps and milestones for evaluation. The complementary BioMed-AQA-MCQ subset (n = 172 MCQs) is available at https://huggingface.co/datasets/BOBQWERA/biomed-mcq-dataset. The corresponding benchmark-related input data and milestone reference files are available via Zenodo at https://doi.org/10.5281/zenodo.17430550 (ref. 72). Evaluation results for BioMedAgent at rounds 0, 1 and 2 under CMA and IMF on the BioMed-AQA benchmark, together with the interactive chat details that show the full process of task planning, execution and summarization for each question, are available at http://biomed.drai.cn.
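As a minimal sketch of how the two Hugging Face repositories named above might be accessed programmatically, the snippet below uses the `load_dataset` function of the Hugging Face `datasets` library. The repository IDs are taken from the URLs in this statement; the helper name `load_benchmark`, the dictionary `BENCHMARK_REPOS` and the assumption that the repositories are stored in a layout that `load_dataset` can parse automatically are illustrative and not confirmed by the authors.

```python
# Hub repository IDs copied from the URLs in the data availability statement.
BENCHMARK_REPOS = {
    "aqa": "BOBQWERA/biomed-aqa-dataset",  # open questions (n = 327)
    "mcq": "BOBQWERA/biomed-mcq-dataset",  # multiple-choice subset (n = 172)
}


def load_benchmark(name):
    """Download one benchmark from the Hugging Face Hub (needs network access).

    `name` is a hypothetical short key: "aqa" or "mcq".
    """
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the repository layout is auto-detectable by load_dataset.
    return load_dataset(BENCHMARK_REPOS[name])


# Example (not executed here, because it downloads data):
#   benchmark = load_benchmark("aqa")
#   print(benchmark)  # lists the available splits and columns
```

Split and column names are not stated in this section, so they should be inspected after loading rather than assumed.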
Code availability
The open-source implementation of BioMedAgent, including its multi-agent framework and autoscoring evaluation agent, is available on GitHub at https://github.com/BOBQWERA/BioMedAgent.
References
Agrawal, R. & Prabakaran, S. Big data in digital healthcare: lessons learnt and recommendations for general practice. Heredity 124, 525–534 (2020).
Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).
Woldemariam, M. T. & Jimma, W. Adoption of electronic health record systems to enhance the quality of healthcare in low-income countries: a systematic review. BMJ Health Care Inform. 30, e100704 (2023).
Liang, H. et al. Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence. Nat. Med. 25, 433–438 (2019).
Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).
Feinberg, D. A. et al. Next-generation MRI scanner designed for ultra-high-resolution human brain imaging at 7 Tesla. Nat. Methods 20, 2048–2057 (2023).
Schuijf, J. D. et al. CT imaging with ultra-high-resolution: opportunities for cardiovascular imaging in clinical practice. J. Cardiovasc. Comput. Tomogr. 16, 388–396 (2022).
Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
Metzker, M. L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46 (2010).
Karczewski, K. J. & Snyder, M. P. Integrative omics for health and disease. Nat. Rev. Genet. 19, 299–310 (2018).
Van de Sande, B. et al. Applications of single-cell RNA sequencing in drug discovery and development. Nat. Rev. Drug Discov. 22, 496–520 (2023).
Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat. Methods 18, 1161–1168 (2021).
Cao, Y. et al. Ensemble deep learning in bioinformatics. Nat. Mach. Intell. 2, 500–508 (2020).
Dubay, C. et al. Delivering bioinformatics training: bridging the gaps between computer science and biomedicine. Proc. AMIA Symp. 2002, 220–224 (2002).
Elmarakeby, H. A. et al. Biologically informed deep neural network for prostate cancer discovery. Nature 598, 348–352 (2021).
Li, J. et al. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin. Chem. 48, 1296–1304 (2002).
Sadybekov, A. V. & Katritch, V. Computational approaches streamlining drug discovery. Nature 616, 673–685 (2023).
Fitzgerald, R. C. et al. The future of early cancer detection. Nat. Med. 28, 666–677 (2022).
McDonald, T. O. et al. Computational approaches to modelling and optimizing cancer treatment. Nat. Rev. Bioeng. 1, 695–711 (2023).
Misra, B. B. et al. Integrated omics: tools, advances, and future approaches. J. Mol. Endocrinol. 62, R21–R45 (2019).
Brooks, T. G. et al. Challenges and best practices in omics benchmarking. Nat. Rev. Genet. 25, 326–339 (2024).
Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
Goecks, J. et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, 1–13 (2010).
Malhotra, R. et al. Using the Seven Bridges Cancer Genomics Cloud to access and analyze petabytes of cancer data. Curr. Protoc. Bioinformatics 60, 11.16.1–11.16.32 (2017).
Kaur, S. & Kaur, S. Genomics with cloud computing. Int. J. Sci. Technol. Res. 4, 146–148 (2015).
Wu, T. et al. A brief overview of ChatGPT: the history, status quo and potential future development. IEEE/CAA J. Autom. Sin. 10, 1122–1136 (2023).
Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods 21, 1481–1491 (2024).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Tang, X. et al. MedAgents: large language models as collaborators for zero-shot medical reasoning. Find. Assoc. Comput. Linguist: ACL 2024, 599–621 (2024).
Guo, T. et al. Large language model based multi-agents: a survey of progress and challenges. Proc. Int. Joint Conf. Artif. Intell. 33, 8048–8057 (2024).
Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
Gao, S. et al. Empowering biomedical discovery with AI agents. Cell 187, 6125–6151 (2024).
Boiko, D. A. et al. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
Dai, T. et al. Autonomous mobile robots for exploratory synthetic chemistry. Nature 635, 890–897 (2024).
Hou, W. & Ji, Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nat. Methods 21, 1462–1465 (2024).
Lobentanzer, S. et al. A platform for the biomedical application of large language models. Nat. Biotechnol. 43, 166–169 (2025).
Tayebi Arasteh, S. et al. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 15, 1603 (2024).
Zhou, J. et al. An AI agent for fully automated multi-omic analyses. Adv. Sci. 11, e2407094 (2024).
Xiao, Y. et al. CellAgent: an LLM-driven multi-agent framework for automated single-cell data analysis. Preprint at bioRxiv https://doi.org/10.1101/2024.05.13.593861 (2024).
Liu, H. & Wang, H. GenoTEX: a benchmark for evaluating LLM-based exploration of gene expression data in alignment with bioinformaticians. Preprint at arXiv https://doi.org/10.48550/arXiv.2406.15341 (2024).
Mitchener, L. et al. BixBench: a comprehensive benchmark for LLM-based agents in computational biology. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.00096 (2025).
Gómez-López, G. et al. Precision medicine needs pioneering clinical bioinformaticians. Brief. Bioinform. 20, 752–766 (2019).
Hou, X. et al. Large language models for software engineering: a systematic literature review. ACM Trans. Softw. Eng. Methodol. (2023).
Xin, Q. et al. BioInformatics Agent (BIA): unleashing the power of large language models to reshape bioinformatics workflow. Preprint at bioRxiv https://doi.org/10.1101/2024.05.22.595240 (2024).
Su, H., Long, W. & Zhang, Y. BioMaster: multi-agent system for automated bioinformatics analysis workflow. Preprint at bioRxiv https://doi.org/10.1101/2025.01.23.634608 (2025).
Afzal, M. et al. Precision medicine informatics: principles, prospects, and challenges. IEEE Access 8, 13593–13612 (2020).
Zhang, W. & Mei, H. A constructive model for collective intelligence. Natl Sci. Rev. 7, 1273–1277 (2020).
Qian, C. et al. Iterative experience refinement of software-developing agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.04219 (2024).
Riffle, D. et al. OLAF: an open life science analysis framework for conversational bioinformatics powered by large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.03976 (2025).
Xie, E. et al. CASSIA: a multi-agent large language model for automated and interpretable cell annotation. Nat. Commun. 17, 389 (2025).
Tsyganov, M. M. et al. Influence of DNA copy number aberrations in ABC transporter family genes on the survival of patients with primary operatable non-small cell lung cancer. Curr. Cancer Drug Targets (2025).
Sun, Y. et al. SERINC2-mediated serine metabolism promotes cervical cancer progression and drives T cell exhaustion. Int. J. Biol. Sci. 21, 1361–1377 (2025).
Wang, X., Jiang, C. & Li, Q. Serinc2 drives the progression of cervical cancer through regulating Myc pathway. Cancer Med. 13, e70296 (2024).
Lee, J. S. et al. SEZ6L2 is an important regulator of drug-resistant cells and tumor spheroid cells in lung adenocarcinoma. Biomedicines 8 (2020).
Ishikawa, N. et al. Characterization of SEZ6L2 cell-surface protein as a novel prognostic marker for lung cancer. Cancer Sci. 97, 737–745 (2006).
Jee, J. et al. DNA liquid biopsy-based prediction of cancer-associated venous thromboembolism. Nat. Med. 30, 2499–2507 (2024).
Xu, Z. et al. MiHATP: a multi-hybrid attention super-resolution network for pathological image based on transformation pool contrastive learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2024).
Yang, K. et al. If LLM is the wizard, then code is the wand: a survey on how code empowers large language models to serve as intelligent agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.00812 (2024).
Qian, C. et al. Investigate-consolidate-exploit: a general strategy for inter-task agent self-evolution. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.13996 (2024).
Gao, C. et al. Large language models empowered agent-based modeling and simulation: a survey and perspectives. Humanit. Soc. Sci. Commun. 11, 1–24 (2024).
Zhong, W. et al. Memorybank: enhancing large language models with long-term memory. Proc. AAAI Conf. Artif. Intell. 38, 19724–19731 (2024).
Liu, L. et al. Think-in-memory: recalling and post-thinking enable LLMs with long-term memory. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.08719 (2023).
Li, Z. et al. VarBen: generating in silico reference data sets for clinical next-generation sequencing bioinformatics pipeline evaluation. J. Mol. Diagn. 23, 285–299 (2021).
Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat. Methods 12, 780–786 (2015).
Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 42, 293–304 (2024).
Yang, Y. et al. Comprehensive landscape of resistance mechanisms for neoadjuvant therapy in esophageal squamous cell carcinoma by single-cell transcriptomics. Signal Transduct. Target. Ther. 8, 298 (2023).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
Bu, D. et al. KOBAS-i: intelligent prioritization and exploratory visualization of biological functions for gene enrichment analysis. Nucleic Acids Res. 49, W317–W325 (2021).
Jiang, X. et al. Long term memory: the foundation of AI self-evolution. Preprint at arXiv https://doi.org/10.48550/arXiv.2410.15665 (2024).
Sun, J. Biomedical dataset files collection. Zenodo https://doi.org/10.5281/zenodo.17430550 (2025).
Acknowledgements
This work was supported by the National Key R&D Program of China (2022YFF1203303, D.B.), National Natural Science Foundation of China (32341019, Y.Z.; 92474204, Y.Z.; 32570778, D.B.; W2431057, K.Z.), Ningbo Top Medical and Health Research Program (2023030615, Y.Z.; 2024020919, Y.W.), Beijing Natural Science Foundation (L222007, D.B.), Ningbo Science and Technology Innovation Yongjiang 2035 Project (2023Z226 and 2024Z229, Y.Z.), Major Project of Guangzhou National Laboratory (GZNL2023A03001, Y.Z.), State Key Laboratory of Systems Medicine for Cancer (KF2422-93, Y.Z.), Hubei Province key research and development project (2022BCA016 and 2023BCB146, Y.J.) and Macau Science and Technology Development Fund (0007/2020/AFJ, 0070/2020/A2 and 0003/2021/AKP, K.Z.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. The authors would like to acknowledge the Nanjing Institute of InforSuperBahn OneAiNexus for providing the training and evaluation platform.
Author information
Contributions
D.B. collected and interpreted the data and drafted the manuscript and figures. J.S. implemented and evaluated the algorithm and developed the software. K.L. analysed the data, drafted the figures and contributed to algorithm application. Z.H., W.H., J.H., S.Z. and S.L. participated in data processing and algorithm evaluation. P.H., Z.W. and S.W. contributed to tool preparation and website construction. T.W., K.G. and Y.W. assisted with literature collection and interpreted the results. L.Z., K.W., G.L., H.S. and Y.J. interpreted the results and supported application design. K.Z. and R.C. interpreted the results, revised the manuscript and figures and supervised the project. Y.Z. conceived the study, collected the data, revised the manuscript and figures and supervised the project. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Feixiong Cheng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Evaluation of BioMedAgent using BioMed-AQA.
a. Calculation of the Win score and determination of task execution status (success or failure). b. ROC curve for the autoscoring agent (AUC = 0.926), indicating strong alignment with manual evaluations. c. Confusion matrix comparing autoscoring results with manual evaluations. d. Summary of BioMed-AQA and BioMed-AQA-MCQ. BioMed-AQA (n = 327) consists of open questions derived from three sources: simulated datasets (37.31%), literature-derived datasets (46.79%), and tool tutorial datasets (15.90%). BioMed-AQA-MCQ (n = 172) is a multiple-choice subset of BioMed-AQA, consisting of single-choice questions (73.26%) and multi-choice questions (26.74%), designed to enable automated and objective evaluation. Each O, P and S task in BioMed-AQA was paired with one corresponding multiple-choice question. M and V tasks were not included, as they are less suitable for the multiple-choice format.
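The percentages in panel d can be converted back into whole-task counts as a quick consistency check. The sketch below uses only the totals and percentages stated in the caption; the counts it prints are derived by rounding and are not reported directly in this section, and the function name `counts_from_percentages` is illustrative.

```python
def counts_from_percentages(total, percentages):
    """Round each percentage of `total` to the nearest whole task."""
    return {name: round(total * pct / 100) for name, pct in percentages.items()}


# Totals and percentages as stated in the caption of Extended Data Fig. 1d.
aqa_sources = counts_from_percentages(
    327, {"simulated": 37.31, "literature": 46.79, "tutorial": 15.90}
)
mcq_types = counts_from_percentages(
    172, {"single-choice": 73.26, "multi-choice": 26.74}
)

print(aqa_sources)  # derived counts per data source
print(mcq_types)    # derived counts per question type

# The derived counts should add up to the stated totals.
print(sum(aqa_sources.values()), sum(mcq_types.values()))
```

Under this rounding, the derived counts sum exactly to 327 and 172, which suggests that the reported percentages are internally consistent.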
Extended Data Fig. 2 Overall performance of BioMedAgent.
a. Win score comparison with and without LTU across O, P, M, S, V tasks, with p-values obtained via two-tailed t-test (p = 1.477e-03 for M; NA: not applicable, ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). b. Success rate comparison with and without LTU across Total, O, P, M, S, V tasks, with p-values obtained via two-sided chi-squared tests (p = 7.789e-05 for Total, p = 0.015 for M). c. Proportion of successful tasks on BioMed-AQA using only LTU, only CTC, and both. d. Win score comparison between open-step and clear-step questions across O, P, M, S, V tasks, with p-values obtained via two-tailed t-test. e. Success rate comparison between open-step and clear-step questions across Total, O, P, M, S, V tasks, with p-values obtained via two-sided chi-squared tests.
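For readers unfamiliar with the success-rate comparisons in panels b and e, the following is a minimal sketch of a two-sided Pearson chi-squared test on a 2 × 2 table of successes and failures. It assumes no continuity correction, which the caption does not specify, and the counts in the example are hypothetical and not taken from the paper. The function name `chi2_2x2` is illustrative.

```python
import math


def chi2_2x2(success_a, total_a, success_b, total_b):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Returns (statistic, p-value) with 1 degree of freedom.
    Assumes every expected cell count is non-zero.
    """
    table = [
        [success_a, total_a - success_a],
        [success_b, total_b - success_b],
    ]
    n = total_a + total_b
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            stat += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-squared survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts for illustration only (NOT results from the paper).
stat, p = chi2_2x2(success_a=80, total_a=100, success_b=60, total_b=100)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

A library routine such as a statistics package's contingency-table test would normally be preferred in practice; the closed form above is used only to keep the sketch dependency-free.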
Extended Data Fig. 3 Evaluation of the IE algorithm.
a. Workflow of IE in planning and coding phases by Planner, Programmer and Executor agents. b. Win score of different LLM-based agents across Total tasks of BioMed-AQA (n = 327). Error bars represent the mean with 95% confidence intervals (CI). GPT FC stands for GPT Function Call. c. Win score comparison between noIE and IE across Total, O, P, M, S, V tasks, with p-values obtained via two-tailed paired t-test (Total: p = 1.091e-22, O: p = 4.236e-09, P: p = 0.019, M: p = 0.199, S: p = 5.118e-07, V: p = 5.530e-09, ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). Error bars represent the mean with 95% CI. d. Analyzable scope and success rate of different LLM-based agents across Total, O, P, M, S, V tasks. The length of the bar represents the success rate or analyzable scope.
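The error bars in panels b and c show the mean with a 95% confidence interval. A minimal sketch of one common way to compute such an interval is given below; it assumes a normal approximation (mean ± 1.96 standard errors), which may differ from the method the authors used, and the scores in the example are hypothetical. The function name `mean_ci95` is illustrative.

```python
import math
import statistics


def mean_ci95(values):
    """Mean and normal-approximation 95% CI (mean +/- 1.96 x standard error)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical per-task scores for illustration only (NOT data from the paper).
scores = [0.2, 0.4, 0.6, 0.8]
mean, low, high = mean_ci95(scores)
print(f"mean = {mean:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

For small samples, a t-distribution quantile would be more appropriate than 1.96; with several hundred tasks the two are nearly identical.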
Extended Data Fig. 4 Evaluation of the MR algorithm.
a. Mechanisms of the CMA and IMF memory update strategies. M1 represents the memory newly added during round 1 of learning, and M2 the memory added during round 2. b. Win scores under CMA and IMF memory update strategies across Total tasks (n = 327) in three rounds of learning. Error bars represent the mean with 95% CI. c. Semantic similarity between n = 327 BioMed-AQA questions. d. Evaluation of MR algorithm performance on unseen questions. This table shows the success rates for two models: (1) before MR (round = 0), with no learning, and (2) MR on seen (round = 3), with three rounds of learning on the seen subset. e. Win scores under CMA and IMF strategies across Total, O, P, M, S, V tasks. Error bars represent the mean with 95% CI.
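Panel c reports pairwise semantic similarity between questions. The caption does not state how similarity was computed; a common choice, assumed here purely for illustration, is the cosine similarity between embedding vectors of the questions. The sketch below uses toy two-dimensional vectors in place of real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors standing in for question embeddings (an assumption: the
# embedding model used in the paper is not named in this section).
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = similarity_matrix(toy_embeddings)

for row in matrix:
    print([round(x, 3) for x in row])
```

The resulting matrix is symmetric with ones on the diagonal, which is the structure one would expect a heat map of question-to-question similarity to display.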
Extended Data Fig. 5 Execution details of Q1-Q84 in BioMed-AQA.
The detailed testing of each question in BioMed-AQA (Q1-Q84) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent, with +1 indicating one more step not shown.
Extended Data Fig. 6 Execution details of Q85-Q167 in BioMed-AQA.
The detailed testing of each question in BioMed-AQA (Q85-Q167) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent, with +2/+6 indicating two or six more steps not shown.
Extended Data Fig. 7 Execution details of Q168-Q253 in BioMed-AQA.
The detailed testing of each question in BioMed-AQA (Q168-Q253) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent.
Extended Data Fig. 8 Execution details of Q254-Q327 in BioMed-AQA.
The detailed testing of each question in BioMed-AQA (Q254-Q327) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent.
Extended Data Fig. 9 Application examples of BioMedAgent.
a. Comparison of DEGs identified by BioMedAgent and by the official online tool GEO2R. b. Construction of an automated workflow using BioMedAgent to perform cell segmentation with resolution enhancement.
Extended Data Fig. 10 Intuitive chat interface of BioMedAgent.
This figure illustrates the interactive chat interface of BioMedAgent, where multiple agents collaborate on task planning, execution, and results aggregation. Users interact with the system in natural language to initiate and monitor the execution of workflows, including model selection and evaluation in the demo example. The system autonomously plans, executes, and summarizes the results.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bu, D., Sun, J., Li, K. et al. Empowering AI data scientists using a multi-agent LLM framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses. Nat. Biomed. Eng (2026). https://doi.org/10.1038/s41551-026-01634-6
DOI: https://doi.org/10.1038/s41551-026-01634-6


