Empowering AI data scientists using a multi-agent LLM framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses

Bu, Dechao; Sun, Jingbo; Li, Kun; He, Zihao; Huang, Wei; Hu, Jinlin; Zhang, Shanshan; Lei, Shuangshuang; Huo, Peipei; Wang, Zhihao; Wang, Sheng; Wang, Tao; Gao, Kai; Wu, Yang; Zhao, Lianhe; Wang, Kai; Li, Gen; Song, Huan; Jin, Yang; Zhang, Kang; Chen, Runsheng; Zhao, Yi

doi:10.1038/s41551-026-01634-6

Article
Published: 30 March 2026

Nature Biomedical Engineering (2026)

Abstract

Artificial intelligence agents are emerging as powerful applications of large language models (LLMs), automating complex tasks and enabling scientific data exploration. However, their use in biomedical data analysis remains limited by the difficulty of handling specialized tools and multistep reasoning. Here we introduce BioMedAgent, a self-evolving LLM multi-agent framework, which learns to use diverse bioinformatics tools and chain them into executable workflows through interactive exploration and memory retrieval algorithms. It allows biomedical users to initiate tasks using natural language, without requiring computational expertise. Evaluated on our newly released BioMed-AQA benchmark comprising 327 biomedical data tasks, BioMedAgent achieved a 77% success rate, surpassing other LLM agents, and generalized robustly to the external BixBench dataset. Beyond benchmarks, it autonomously performs cross-omics analysis, machine-learning modelling and pathology image segmentation, highlighting its potential to advance biomedical research and extend to other scientific domains requiring complex tool integration and multistep reasoning.

**Fig. 3: Evaluation of the IE algorithm.**

**Fig. 4: Self-evolving through the MR algorithm.**

**Fig. 5: Evaluation on external benchmark BixBench and agent comparison.**

**Fig. 6: Applying BioMedAgent to accelerate diverse biomedical data research.**

Data availability

The BioMed-AQA benchmark (n = 327 open questions) introduced in this study is publicly available on Hugging Face at https://huggingface.co/datasets/BOBQWERA/biomed-aqa-dataset; it includes the full set of natural-language questions, reference steps and milestones for evaluation. The complementary BioMed-AQA-MCQ subset (n = 172 MCQs) is available at https://huggingface.co/datasets/BOBQWERA/biomed-mcq-dataset. The corresponding benchmark-related input data and milestone reference files are available via Zenodo at https://doi.org/10.5281/zenodo.17430550 (ref. 72). Evaluation tests of BioMedAgent for rounds = 0, 1, 2 under CMA and IMF on the BioMed-AQA benchmark, along with the interactive chat details that show the full process of task planning, execution and summarization for each question, are available at http://biomed.drai.cn.
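
For readers who want to fetch the two Hugging Face repositories programmatically, the following is a minimal sketch, assuming the third-party `datasets` library is installed and that the repositories load with its default configuration. The helper names `dataset_url` and `load_benchmark` are hypothetical, and the split and column names are not stated in the article, so the loaded object should be inspected rather than assumed.

```python
# Repository IDs copied from the Data availability statement.
AQA_REPO = "BOBQWERA/biomed-aqa-dataset"
MCQ_REPO = "BOBQWERA/biomed-mcq-dataset"


def dataset_url(repo_id: str) -> str:
    """Return the Hugging Face dataset page URL for a repository ID."""
    return f"https://huggingface.co/datasets/{repo_id}"


def load_benchmark(repo_id: str):
    """Download a benchmark with the `datasets` library (needs network access).

    Assumption: the repository loads with the default configuration. The
    split and column names are not given in the article, so inspect the
    returned object before relying on any field.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(repo_id)
```

Calling `load_benchmark(AQA_REPO)` and printing the result should list whatever splits and columns the repository actually provides; `dataset_url(MCQ_REPO)` simply rebuilds the link quoted above.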

Code availability

The open-source implementation of BioMedAgent, including its multi-agent framework and autoscoring evaluation agent, is available on GitHub at https://github.com/BOBQWERA/BioMedAgent.

References

1. Agrawal, R. & Prabakaran, S. Big data in digital healthcare: lessons learnt and recommendations for general practice. Heredity 124, 525–534 (2020).
2. Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).
3. Woldemariam, M. T. & Jimma, W. Adoption of electronic health record systems to enhance the quality of healthcare in low-income countries: a systematic review. BMJ Health Care Inform. 30, e100704 (2023).
4. Liang, H. et al. Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence. Nat. Med. 25, 433–438 (2019).
5. Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).
6. Feinberg, D. A. et al. Next-generation MRI scanner designed for ultra-high-resolution human brain imaging at 7 Tesla. Nat. Methods 20, 2048–2057 (2023).
7. Schuijf, J. D. et al. CT imaging with ultra-high-resolution: opportunities for cardiovascular imaging in clinical practice. J. Cardiovasc. Comput. Tomogr. 16, 388–396 (2022).
8. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
9. Metzker, M. L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46 (2010).
10. Karczewski, K. J. & Snyder, M. P. Integrative omics for health and disease. Nat. Rev. Genet. 19, 299–310 (2018).
11. Van de Sande, B. et al. Applications of single-cell RNA sequencing in drug discovery and development. Nat. Rev. Drug Discov. 22, 496–520 (2023).
12. Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat. Methods 18, 1161–1168 (2021).
13. Cao, Y. et al. Ensemble deep learning in bioinformatics. Nat. Mach. Intell. 2, 500–508 (2020).
14. Dubay, C. et al. Delivering bioinformatics training: bridging the gaps between computer science and biomedicine. Proc. AMIA Symp. 2002, 220–224 (2002).
15. Elmarakeby, H. A. et al. Biologically informed deep neural network for prostate cancer discovery. Nature 598, 348–352 (2021).
16. Li, J. et al. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin. Chem. 48, 1296–1304 (2002).
17. Sadybekov, A. V. & Katritch, V. Computational approaches streamlining drug discovery. Nature 616, 673–685 (2023).
18. Fitzgerald, R. C. et al. The future of early cancer detection. Nat. Med. 28, 666–677 (2022).
19. McDonald, T. O. et al. Computational approaches to modelling and optimizing cancer treatment. Nat. Rev. Bioeng. 1, 695–711 (2023).
20. Misra, B. B. et al. Integrated omics: tools, advances, and future approaches. J. Mol. Endocrinol. 62, R21–R45 (2019).
21. Brooks, T. G. et al. Challenges and best practices in omics benchmarking. Nat. Rev. Genet. 25, 326–339 (2024).
22. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
23. Goecks, J. et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, 1–13 (2010).
24. Malhotra, R. et al. Using the Seven Bridges Cancer Genomics Cloud to access and analyze petabytes of cancer data. Curr. Protoc. Bioinformatics 60, 11.16.1–11.16.32 (2017).
25. Kaur, S. & Kaur, S. Genomics with cloud computing. Int. J. Sci. Technol. Res. 4, 146–148 (2015).
26. Wu, T. et al. A brief overview of ChatGPT: the history, status quo and potential future development. IEEE/CAA J. Autom. Sin. 10, 1122–1136 (2023).
27. Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
28. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
29. Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods 21, 1481–1491 (2024).
30. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
31. Tang, X. et al. MedAgents: large language models as collaborators for zero-shot medical reasoning. Find. Assoc. Comput. Linguist: ACL 2024, 599–621 (2024).
32. Guo, T. et al. Large language model based multi-agents: a survey of progress and challenges. Proc. Int. Joint Conf. Artif. Intell. 33, 8048–8057 (2024).
33. Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
34. Gao, S. et al. Empowering biomedical discovery with AI agents. Cell 187, 6125–6151 (2024).
35. Boiko, D. A. et al. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
36. Dai, T. et al. Autonomous mobile robots for exploratory synthetic chemistry. Nature 635, 890–897 (2024).
37. Hou, W. & Ji, Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nat. Methods 21, 1462–1465 (2024).
38. Lobentanzer, S. et al. A platform for the biomedical application of large language models. Nat. Biotechnol. 43, 166–169 (2025).
39. Tayebi Arasteh, S. et al. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 15, 1603 (2024).
40. Zhou, J. et al. An AI agent for fully automated multi-omic analyses. Adv. Sci. 11, e2407094 (2024).
41. Xiao, Y. et al. CellAgent: an LLM-driven multi-agent framework for automated single-cell data analysis. Preprint at bioRxiv https://doi.org/10.1101/2024.05.13.593861 (2024).
42. Liu, H. & Wang, H. GenoTEX: a benchmark for evaluating LLM-based exploration of gene expression data in alignment with bioinformaticians. Preprint at arXiv https://doi.org/10.48550/arXiv.2406.15341 (2024).
43. Mitchener, L. et al. BixBench: a comprehensive benchmark for LLM-based agents in computational biology. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.00096 (2025).
44. Gómez-López, G. et al. Precision medicine needs pioneering clinical bioinformaticians. Brief. Bioinform. 20, 752–766 (2019).
45. Hou, X. et al. Large language models for software engineering: a systematic literature review. ACM Trans. Softw. Eng. Methodol. (2023).
46. Xin, Q. et al. BioInformatics Agent (BIA): unleashing the power of large language models to reshape bioinformatics workflow. Preprint at bioRxiv https://doi.org/10.1101/2024.05.22.595240 (2024).
47. Su, H., Long, W. & Zhang, Y. BioMaster: multi-agent system for automated bioinformatics analysis workflow. Preprint at bioRxiv https://doi.org/10.1101/2025.01.23.634608 (2025).
48. Afzal, M. et al. Precision medicine informatics: principles, prospects, and challenges. IEEE Access 8, 13593–13612 (2020).
49. Zhang, W. & Mei, H. A constructive model for collective intelligence. Natl Sci. Rev. 7, 1273–1277 (2020).
50. Qian, C. et al. Iterative experience refinement of software-developing agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.04219 (2024).
51. Riffle, D. et al. OLAF: an open life science analysis framework for conversational bioinformatics powered by large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.03976 (2025).
52. Xie, E. et al. CASSIA: a multi-agent large language model for automated and interpretable cell annotation. Nat. Commun. 17, 389 (2025).
53. Tsyganov, M. M. et al. Influence of DNA copy number aberrations in ABC transporter family genes on the survival of patients with primary operatable non-small cell lung cancer. Curr. Cancer Drug Targets (2025).
54. Sun, Y. et al. SERINC2-mediated serine metabolism promotes cervical cancer progression and drives T cell exhaustion. Int. J. Biol. Sci. 21, 1361–1377 (2025).
55. Wang, X., Jiang, C. & Li, Q. Serinc2 drives the progression of cervical cancer through regulating Myc pathway. Cancer Med. 13, e70296 (2024).
56. Lee, J. S. et al. SEZ6L2 is an important regulator of drug-resistant cells and tumor spheroid cells in lung adenocarcinoma. Biomedicines 8 (2020).
57. Ishikawa, N. et al. Characterization of SEZ6L2 cell-surface protein as a novel prognostic marker for lung cancer. Cancer Sci. 97, 737–745 (2006).
58. Jee, J. et al. DNA liquid biopsy-based prediction of cancer-associated venous thromboembolism. Nat. Med. 30, 2499–2507 (2024).
59. Xu, Z. et al. MiHATP: a multi-hybrid attention super-resolution network for pathological image based on transformation pool contrastive learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2024).
60. Yang, K. et al. If LLM is the wizard, then code is the wand: a survey on how code empowers large language models to serve as intelligent agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.00812 (2024).
61. Qian, C. et al. Investigate-consolidate-exploit: a general strategy for inter-task agent self-evolution. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.13996 (2024).
62. Gao, C. et al. Large language models empowered agent-based modeling and simulation: a survey and perspectives. Humanit. Soc. Sci. Commun. 11, 1–24 (2024).
63. Zhong, W. et al. Memorybank: enhancing large language models with long-term memory. Proc. AAAI Conf. Artif. Intell. 38, 19724–19731 (2024).
64. Liu, L. et al. Think-in-memory: recalling and post-thinking enable LLMs with long-term memory. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.08719 (2023).
65. Li, Z. et al. VarBen: generating in silico reference data sets for clinical next-generation sequencing bioinformatics pipeline evaluation. J. Mol. Diagn. 23, 285–299 (2021).
66. Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat. Methods 12, 780–786 (2015).
67. Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 42, 293–304 (2024).
68. Yang, Y. et al. Comprehensive landscape of resistance mechanisms for neoadjuvant therapy in esophageal squamous cell carcinoma by single-cell transcriptomics. Signal Transduct. Target. Ther. 8, 298 (2023).
69. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
70. Bu, D. et al. KOBAS-i: intelligent prioritization and exploratory visualization of biological functions for gene enrichment analysis. Nucleic Acids Res. 49, W317–W325 (2021).
71. Jiang, X. et al. Long term memory: the foundation of AI self-evolution. Preprint at arXiv https://doi.org/10.48550/arXiv.2410.15665 (2024).
72. Sun, J. Biomedical dataset files collection. Zenodo https://doi.org/10.5281/zenodo.17430550 (2025).

Acknowledgements

This work was supported by the National Key R&D Program of China (2022YFF1203303, D.B.), National Natural Science Foundation of China (32341019, Y.Z.; 92474204, Y.Z.; 32570778, D.B.; W2431057, K.Z.), Ningbo Top Medical and Health Research Program (2023030615, Y.Z.; 2024020919, Y.W.), Beijing Natural Science Foundation (L222007, D.B.), Ningbo Science and Technology Innovation Yongjiang 2035 Project (2023Z226 and 2024Z229, Y.Z.), Major Project of Guangzhou National Laboratory (GZNL2023A03001, Y.Z.), State Key Laboratory of Systems Medicine for Cancer (KF2422-93, Y.Z.), Hubei Province key research and development project (2022BCA016 and 2023BCB146, Y.J.) and Macau Science and Technology Development Fund (0007/2020/AFJ, 0070/2020/A2 and 0003/2021/AKP, K.Z.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. The authors would like to acknowledge the Nanjing Institute of InforSuperBahn OneAiNexus for providing the training and evaluation platform.

Author information

These authors contributed equally: Dechao Bu, Jingbo Sun, Kun Li.
These authors jointly supervised this work: Kang Zhang, Runsheng Chen, Yi Zhao.

Authors and Affiliations

Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Dechao Bu, Jingbo Sun, Wei Huang, Jinlin Hu, Sheng Wang, Tao Wang, Yang Wu, Lianhe Zhao & Yi Zhao
University of Chinese Academy of Sciences, Beijing, China
Jingbo Sun & Jinlin Hu
AI Cross Disciplinary Research Institute and Faculty of Medicine, Macau University of Science and Technology, Macau, China
Kun Li, Gen Li & Kang Zhang
Guangzhou National Laboratory, Guangzhou, China
Kun Li, Gen Li & Kang Zhang
Ningbo No.2 Hospital, Ningbo, China
Zihao He & Kai Gao
Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou, China
Wei Huang & Tao Wang
Luoyang Institute of Information Technology Industries, Luoyang, China
Shanshan Zhang, Peipei Huo & Zhihao Wang
Beijing University of Chinese Medicine, Beijing, China
Shuangshuang Lei
Department of Big Data and Biomedical AI, College of Future Technology, Peking University and Peking-Tsinghua Center for Life Sciences, Beijing, China
Kai Wang
State Key Laboratory of Eye Health and National Clinical Research Center for Ocular Diseases, Eye Hospital, Wenzhou Medical University, Wenzhou, China
Gen Li & Kang Zhang
West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
Huan Song
Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Yang Jin
Key Laboratory of RNA Biology, Center for Big Data Research in Health, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
Runsheng Chen

Contributions

D.B. collected and interpreted the data and drafted the manuscript and figures. J.S. implemented and evaluated the algorithm and developed the software. K.L. analysed the data, drafted the figures and contributed to algorithm application. Z.H., W.H., J.H., S.Z. and S.L. participated in data processing and algorithm evaluation. P.H., Z.W. and S.W. contributed to tool preparation and website construction. T.W., K.G. and Y.W. assisted with literature collection and interpreted the results. L.Z., K.W., G.L., H.S. and Y.J. interpreted the results and supported application design. K.Z. and R.C. interpreted the results, revised the manuscript and figures and supervised the project. Y.Z. conceived the study, collected the data, revised the manuscript and figures and supervised the project. All authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Kang Zhang, Runsheng Chen or Yi Zhao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Feixiong Cheng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Evaluation of BioMedAgent using BioMed-AQA.

a. Calculation of the Win score and determination of whether a task's execution status is success or failure. b. ROC curve demonstrating the performance of the autoscoring agent, showing high accuracy and reliability with an AUC of 0.926, indicating strong alignment with manual evaluations. c. Confusion matrix comparing autoscoring results with manual evaluations. d. Summary of BioMed-AQA and BioMed-AQA-MCQ. BioMed-AQA (n = 327) consists of open questions derived from three sources: simulated datasets (37.31%), literature-derived datasets (46.79%), and tool tutorial datasets (15.90%). BioMed-AQA-MCQ (n = 172) is a multiple-choice subset of BioMed-AQA, consisting of single-choice questions (73.26%) and multi-choice questions (26.74%), designed to enable automated and objective evaluation. Each O, P and S task in BioMed-AQA was paired with one corresponding multiple-choice question. M and V tasks were not included, as they are less suitable for the multiple-choice format.
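
The percentages in panel d can be turned back into whole-number counts as a quick consistency check. The sketch below is illustrative only: the counts it prints are implied by rounding the reported percentages and are not stated in the caption, and the function name `counts_from_percentages` is hypothetical.

```python
def counts_from_percentages(total: int, percentages: dict[str, float]) -> dict[str, int]:
    """Convert reported percentages of `total` into whole-number counts."""
    counts = {name: round(total * pct / 100) for name, pct in percentages.items()}
    # Guard against rounding drift: the implied counts must add up exactly.
    if sum(counts.values()) != total:
        raise ValueError("rounded counts do not add up to the total")
    return counts


# Percentages and totals as reported in the caption of panel d.
aqa_sources = counts_from_percentages(
    327, {"simulated": 37.31, "literature-derived": 46.79, "tool tutorial": 15.90}
)
mcq_types = counts_from_percentages(
    172, {"single-choice": 73.26, "multi-choice": 26.74}
)

print(aqa_sources)
print(mcq_types)
```

Both sets of rounded counts sum to their stated totals (327 and 172), so the reported percentages are internally consistent.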

Extended Data Fig. 2 Overall performance of BioMedAgent.

a. Win score comparison between with and without LTU across O, P, M, S, V tasks, with p-values obtained via two-tailed t-test (p = 1.477e-03 for M, NA: Not Applicable, ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). b. Success rate comparison between with and without LTU across Total, O, P, M, S, V tasks, with p-values obtained via two-sided chi-squared tests (p = 7.789e-05 for Total, p = 0.015 for M). c. Proportion of successful tasks on BioMed-AQA using only LTU, only CTC, and both. d. Win score comparison between open-step and clear-step questions across O, P, M, S, V tasks, with p-values obtained via two-tailed t-test. e. Success rate comparison between open-step and clear-step questions across Total, O, P, M, S, V tasks, with p-values obtained via two-sided chi-squared tests.
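
To make the success-rate comparisons in panels b and e concrete, here is a minimal sketch of a two-sided chi-squared test on a 2×2 success/failure table using only the Python standard library. It assumes Pearson's statistic without a continuity correction, which the caption does not specify, and the counts in the example call are hypothetical placeholders rather than values from the study.

```python
import math


def chi_squared_2x2(success_a: int, fail_a: int, success_b: int, fail_b: int):
    """Pearson chi-squared test (no continuity correction) on a 2x2 table.

    Returns (statistic, p_value) for 1 degree of freedom. Every row and
    column total must be non-zero, otherwise an expected count is zero.
    """
    table = [[success_a, fail_a], [success_b, fail_b]]
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: condition A succeeds on 20 of 30 tasks, B on 10 of 30.
statistic, p_value = chi_squared_2x2(20, 10, 10, 20)
print(f"chi2 = {statistic:.3f}, p = {p_value:.4f}")
```

A library routine such as SciPy's `chi2_contingency` would normally be preferred in practice; the hand-rolled version is shown only to expose the arithmetic.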

Extended Data Fig. 3 Evaluation of the IE algorithm.

a. Workflow of IE in planning and coding phases by Planner, Programmer and Executor agents. b. Win score of different LLM-based agents across Total tasks of BioMed-AQA (n = 327). Error bars represent the mean with 95% confidence intervals (CI). GPT FC stands for GPT Function Call. c. Win score comparison between noIE and IE across Total, O, P, M, S, V tasks, with p-values obtained via two-tailed paired t-test (Total: p = 1.091e-22, O: p = 4.236e-09, P: p = 0.019, M: p = 0.199, S: p = 5.118e-07, V: p = 5.530e-09, ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). Error bars represent the mean with 95% CI. d. Analyzable scope and success rate of different LLM-based agents across Total, O, P, M, S, V tasks. The length of the bar represents the success rate or analyzable scope.
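
The error bars and paired comparison described in panels b and c can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the confidence interval uses a normal approximation (reasonable for n = 327, but the caption does not say which method was used), only the paired t statistic is computed because the standard library has no t-distribution, and all scores in the example are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    standard_error = stdev(scores) / math.sqrt(len(scores))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    centre = mean(scores)
    return centre, centre - z * standard_error, centre + z * standard_error


def paired_t_statistic(before, after):
    """t statistic of a paired t-test on per-task differences (after - before)."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-task scores for the same four tasks without and with IE.
no_ie = [0.0, 1.0, 1.0, 2.0]
with_ie = [1.0, 2.0, 3.0, 4.0]

centre, lower, upper = mean_with_ci(with_ie)
t_stat = paired_t_statistic(no_ie, with_ie)
print(f"mean = {centre:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), t = {t_stat:.3f}")
```

Pairing matters here because both conditions are scored on the same tasks, so the test is run on per-task differences rather than on two independent groups.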

Extended Data Fig. 4 Evaluation of the MR algorithm.

a. Mechanisms of the CMA and IMF memory update strategies. M1 represents the newly added memory from round 1 of learning, while M2 corresponds to round 2. b. Win scores under CMA and IMF memory update strategies across Total tasks (n = 327) in three rounds of learning. Error bars represent the mean with 95% CI. c. Semantic similarity between n = 327 BioMed-AQA questions. d. Evaluation of MR algorithm performance on unseen questions. This table shows the success rates for two models: (1) before MR (round = 0), with no learning, and (2) MR on seen (round = 3), with three rounds of learning on the seen subset. e. Win scores under CMA and IMF strategies across Total, O, P, M, S, V tasks. Error bars represent the mean with 95% CI.
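
As a rough illustration of how similarity between questions (panel c) could drive the lookup of a related past task, the sketch below scores questions by cosine similarity over token counts. This is a deliberately simple stand-in: the caption does not describe how semantic similarity was computed or how the MR algorithm retrieves memories, so the tokenizer, the scoring and the function names are all assumptions, and the example questions are invented.

```python
import math
import re
from collections import Counter


def token_counts(text: str) -> Counter:
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, memory: list[str]) -> tuple[str, float]:
    """Return the stored question closest to `query`; `memory` must be non-empty."""
    query_vector = token_counts(query)
    scored = [(item, cosine_similarity(query_vector, token_counts(item))) for item in memory]
    return max(scored, key=lambda pair: pair[1])


# Invented example questions standing in for previously solved tasks.
memory = [
    "find differentially expressed genes in RNA-seq data",
    "segment cells in a pathology image",
]
best, score = most_similar("identify differentially expressed genes from RNA-seq counts", memory)
print(best, round(score, 3))
```

A real system would more plausibly compare learned sentence embeddings than raw token counts; the ranking step would look the same either way.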

Extended Data Fig. 5 Execution details of Q1-Q84 in BioMed-AQA.

The detailed testing of each question in BioMed-AQA (Q1-Q84) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent, with +1 indicating one more step not shown.

Extended Data Fig. 6 Execution details of Q85-Q167 in BioMed-AQA.

The detailed testing of each question in BioMed-AQA (Q85-Q167) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent, with +2/+6 indicating 2 or 6 more steps not shown.

Extended Data Fig. 7 Execution details of Q168-Q253 in BioMed-AQA.

The detailed testing of each question in BioMed-AQA (Q168-Q253) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent.

Extended Data Fig. 8 Execution details of Q254-Q327 in BioMed-AQA.

The detailed testing of each question in BioMed-AQA (Q254-Q327) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent.

Extended Data Fig. 9 Application examples of BioMedAgent.

a. Comparison of DEGs identified by BioMedAgent and official online tool GEO2R. b. Construction of an automated workflow using BioMedAgent to perform cell segmentation with resolution enhancement.

Extended Data Fig. 10 Intuitive chat interface of BioMedAgent.

This figure illustrates the interactive chat interface of BioMedAgent, where multiple agents collaborate on task planning, execution, and results aggregation. Users interact with the system in natural language to initiate and monitor the execution of workflows, including model selection and evaluation in the demo example. The system autonomously plans, executes, and summarizes the results.

Supplementary information

Supplementary Information

Supplementary Figs. 1–3.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bu, D., Sun, J., Li, K. et al. Empowering AI data scientists using a multi-agent LLM framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses. Nat. Biomed. Eng (2026). https://doi.org/10.1038/s41551-026-01634-6

Received: 17 May 2025
Accepted: 13 February 2026
Published: 30 March 2026
Version of record: 30 March 2026
DOI: https://doi.org/10.1038/s41551-026-01634-6