

Empowering AI data scientists using a multi-agent LLM framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses

Abstract

Artificial intelligence agents are emerging as powerful applications of large language models (LLMs), automating complex tasks and enabling scientific data exploration. However, their use in biomedical data analysis remains limited by the difficulty of handling specialized tools and multistep reasoning. Here we introduce BioMedAgent, a self-evolving LLM multi-agent framework, which learns to use diverse bioinformatics tools and chain them into executable workflows through interactive exploration and memory retrieval algorithms. It allows biomedical users to initiate tasks using natural language, without requiring computational expertise. Evaluated on our newly released BioMed-AQA benchmark comprising 327 biomedical data tasks, BioMedAgent achieved a 77% success rate, surpassing other LLM agents, and generalized robustly to the external BixBench dataset. Beyond benchmarks, it autonomously performs cross-omics analysis, machine-learning modelling and pathology image segmentation, highlighting its potential to advance biomedical research and extend to other scientific domains requiring complex tool integration and multistep reasoning.
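The abstract describes a framework that plans a task, generates code for each step, executes it through interactive exploration, and stores successful traces in memory for later retrieval. Below is a minimal, self-contained sketch of such a plan/code/execute loop. All names are illustrative assumptions; this is not BioMedAgent's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Toy experience store standing in for the memory retrieval (MR) idea."""
    experiences: list = field(default_factory=list)

    def retrieve(self, task: str) -> list:
        # naive keyword overlap in place of real semantic retrieval
        return [e for e in self.experiences if any(w in e for w in task.split())]

    def store(self, record: str) -> None:
        self.experiences.append(record)

def run_task(task, plan_fn, code_fn, execute_fn, memory, max_retries=2):
    """Plan -> generate code per step -> execute with retries -> store trace."""
    context = memory.retrieve(task)
    steps = plan_fn(task, context)
    results = []
    for step in steps:
        for _ in range(max_retries + 1):
            ok, out = execute_fn(code_fn(step, results))
            if ok:
                results.append(out)
                break
        else:  # all retries failed for this step
            return {"status": "failed", "step": step, "results": results}
    memory.store(f"{task} -> {steps}")  # successful trace becomes experience
    return {"status": "success", "results": results}

# Toy "agents": fixed plan, echo coder, always-succeeding executor.
memory = Memory()
outcome = run_task(
    "differential expression analysis",
    plan_fn=lambda task, ctx: ["load data", "run DE test", "plot volcano"],
    code_fn=lambda step, prior: f"# code for: {step}",
    execute_fn=lambda code: (True, code),
    memory=memory,
)
```

The stored trace makes the next retrieval non-empty, which is the core of the self-evolving behaviour the abstract claims, here reduced to its simplest possible form.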


Fig. 1: Framework and benchmark.
Fig. 2: Performance of BioMedAgent.
Fig. 3: Evaluation of the IE algorithm.
Fig. 4: Self-evolving through the MR algorithm.
Fig. 5: Evaluation on external benchmark BixBench and agent comparison.
Fig. 6: Applying BioMedAgent to accelerate diverse biomedical data research.


Data availability

The BioMed-AQA benchmark (n = 327 open questions) introduced in this study is publicly available on Hugging Face at https://huggingface.co/datasets/BOBQWERA/biomed-aqa-dataset, which includes the full set of natural-language questions, reference steps and milestones for evaluation. The complementary BioMed-AQA-MCQ subset (n = 172 MCQs) is available at https://huggingface.co/datasets/BOBQWERA/biomed-mcq-dataset. The corresponding benchmark-related input data and milestone reference files are available via Zenodo at https://doi.org/10.5281/zenodo.17430550 (ref. 72). Evaluation testing of BioMedAgent for rounds = 0, 1, 2 under CMA and IMF on the BioMed-AQA benchmark, along with the interactive chat details that show the full process of task planning, execution and summarization for each question, are available at http://biomed.drai.cn.

Code availability

The open-source implementation of BioMedAgent, including its multi-agent framework and autoscoring evaluation agent, is available on GitHub at https://github.com/BOBQWERA/BioMedAgent.

References

  1. Agrawal, R. & Prabakaran, S. Big data in digital healthcare: lessons learnt and recommendations for general practice. Heredity 124, 525–534 (2020).

  2. Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).

  3. Woldemariam, M. T. & Jimma, W. Adoption of electronic health record systems to enhance the quality of healthcare in low-income countries: a systematic review. BMJ Health Care Inform. 30, e100704 (2023).

  4. Liang, H. et al. Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence. Nat. Med. 25, 433–438 (2019).

  5. Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).

  6. Feinberg, D. A. et al. Next-generation MRI scanner designed for ultra-high-resolution human brain imaging at 7 Tesla. Nat. Methods 20, 2048–2057 (2023).

  7. Schuijf, J. D. et al. CT imaging with ultra-high-resolution: opportunities for cardiovascular imaging in clinical practice. J. Cardiovasc. Comput. Tomogr. 16, 388–396 (2022).

  8. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).

  9. Metzker, M. L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46 (2010).

  10. Karczewski, K. J. & Snyder, M. P. Integrative omics for health and disease. Nat. Rev. Genet. 19, 299–310 (2018).

  11. Van de Sande, B. et al. Applications of single-cell RNA sequencing in drug discovery and development. Nat. Rev. Drug Discov. 22, 496–520 (2023).

  12. Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat. Methods 18, 1161–1168 (2021).

  13. Cao, Y. et al. Ensemble deep learning in bioinformatics. Nat. Mach. Intell. 2, 500–508 (2020).

  14. Dubay, C. et al. Delivering bioinformatics training: bridging the gaps between computer science and biomedicine. Proc. AMIA Symp. 2002, 220–224 (2002).

  15. Elmarakeby, H. A. et al. Biologically informed deep neural network for prostate cancer discovery. Nature 598, 348–352 (2021).

  16. Li, J. et al. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin. Chem. 48, 1296–1304 (2002).

  17. Sadybekov, A. V. & Katritch, V. Computational approaches streamlining drug discovery. Nature 616, 673–685 (2023).

  18. Fitzgerald, R. C. et al. The future of early cancer detection. Nat. Med. 28, 666–677 (2022).

  19. McDonald, T. O. et al. Computational approaches to modelling and optimizing cancer treatment. Nat. Rev. Bioeng. 1, 695–711 (2023).

  20. Misra, B. B. et al. Integrated omics: tools, advances, and future approaches. J. Mol. Endocrinol. 62, R21–R45 (2019).

  21. Brooks, T. G. et al. Challenges and best practices in omics benchmarking. Nat. Rev. Genet. 25, 326–339 (2024).

  22. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).

  23. Goecks, J. et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, 1–13 (2010).

  24. Malhotra, R. et al. Using the Seven Bridges Cancer Genomics Cloud to access and analyze petabytes of cancer data. Curr. Protoc. Bioinformatics 60, 11.16.1–11.16.32 (2017).

  25. Kaur, S. & Kaur, S. Genomics with cloud computing. Int. J. Sci. Technol. Res. 4, 146–148 (2015).

  26. Wu, T. et al. A brief overview of ChatGPT: the history, status quo and potential future development. IEEE/CAA J. Autom. Sin. 10, 1122–1136 (2023).

  27. Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

  28. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).

  29. Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods 21, 1481–1491 (2024).

  30. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  31. Tang, X. et al. MedAgents: large language models as collaborators for zero-shot medical reasoning. Find. Assoc. Comput. Linguist: ACL 2024, 599–621 (2024).

  32. Guo, T. et al. Large language model based multi-agents: a survey of progress and challenges. Proc. Int. Joint Conf. Artif. Intell. 33, 8048–8057 (2024).

  33. Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).

  34. Gao, S. et al. Empowering biomedical discovery with AI agents. Cell 187, 6125–6151 (2024).

  35. Boiko, D. A. et al. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).

  36. Dai, T. et al. Autonomous mobile robots for exploratory synthetic chemistry. Nature 635, 890–897 (2024).

  37. Hou, W. & Ji, Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nat. Methods 21, 1462–1465 (2024).

  38. Lobentanzer, S. et al. A platform for the biomedical application of large language models. Nat. Biotechnol. 43, 166–169 (2025).

  39. Tayebi Arasteh, S. et al. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 15, 1603 (2024).

  40. Zhou, J. et al. An AI agent for fully automated multi-omic analyses. Adv. Sci. 11, e2407094 (2024).

  41. Xiao, Y. et al. CellAgent: an LLM-driven multi-agent framework for automated single-cell data analysis. Preprint at bioRxiv https://doi.org/10.1101/2024.05.13.593861 (2024).

  42. Liu, H. & Wang, H. GenoTEX: a benchmark for evaluating LLM-based exploration of gene expression data in alignment with bioinformaticians. Preprint at arXiv https://doi.org/10.48550/arXiv.2406.15341 (2024).

  43. Mitchener, L. et al. BixBench: a comprehensive benchmark for LLM-based agents in computational biology. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.00096 (2025).

  44. Gómez-López, G. et al. Precision medicine needs pioneering clinical bioinformaticians. Brief. Bioinform. 20, 752–766 (2019).

  45. Hou, X. et al. Large language models for software engineering: a systematic literature review. ACM Trans. Softw. Eng. Methodol. (2023).

  46. Xin, Q. et al. BioInformatics Agent (BIA): unleashing the power of large language models to reshape bioinformatics workflow. Preprint at bioRxiv https://doi.org/10.1101/2024.05.22.595240 (2024).

  47. Su, H., Long, W. & Zhang, Y. BioMaster: multi-agent system for automated bioinformatics analysis workflow. Preprint at bioRxiv https://doi.org/10.1101/2025.01.23.634608 (2025).

  48. Afzal, M. et al. Precision medicine informatics: principles, prospects, and challenges. IEEE Access 8, 13593–13612 (2020).

  49. Zhang, W. & Mei, H. A constructive model for collective intelligence. Natl Sci. Rev. 7, 1273–1277 (2020).

  50. Qian, C. et al. Iterative experience refinement of software-developing agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.04219 (2024).

  51. Riffle, D. et al. OLAF: an open life science analysis framework for conversational bioinformatics powered by large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.03976 (2025).

  52. Xie, E. et al. CASSIA: a multi-agent large language model for automated and interpretable cell annotation. Nat. Commun. 17, 389 (2025).

  53. Tsyganov, M. M. et al. Influence of DNA copy number aberrations in ABC transporter family genes on the survival of patients with primary operatable non-small cell lung cancer. Curr. Cancer Drug Targets (2025).

  54. Sun, Y. et al. SERINC2-mediated serine metabolism promotes cervical cancer progression and drives T cell exhaustion. Int. J. Biol. Sci. 21, 1361–1377 (2025).

  55. Wang, X., Jiang, C. & Li, Q. Serinc2 drives the progression of cervical cancer through regulating Myc pathway. Cancer Med. 13, e70296 (2024).

  56. Lee, J. S. et al. SEZ6L2 is an important regulator of drug-resistant cells and tumor spheroid cells in lung adenocarcinoma. Biomedicines 8 (2020).

  57. Ishikawa, N. et al. Characterization of SEZ6L2 cell-surface protein as a novel prognostic marker for lung cancer. Cancer Sci. 97, 737–745 (2006).

  58. Jee, J. et al. DNA liquid biopsy-based prediction of cancer-associated venous thromboembolism. Nat. Med. 30, 2499–2507 (2024).

  59. Xu, Z. et al. MiHATP: a multi-hybrid attention super-resolution network for pathological image based on transformation pool contrastive learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2024).

  60. Yang, K. et al. If LLM is the wizard, then code is the wand: a survey on how code empowers large language models to serve as intelligent agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.00812 (2024).

  61. Qian, C. et al. Investigate-consolidate-exploit: a general strategy for inter-task agent self-evolution. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.13996 (2024).

  62. Gao, C. et al. Large language models empowered agent-based modeling and simulation: a survey and perspectives. Humanit. Soc. Sci. Commun. 11, 1–24 (2024).

  63. Zhong, W. et al. MemoryBank: enhancing large language models with long-term memory. Proc. AAAI Conf. Artif. Intell. 38, 19724–19731 (2024).

  64. Liu, L. et al. Think-in-memory: recalling and post-thinking enable LLMs with long-term memory. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.08719 (2023).

  65. Li, Z. et al. VarBen: generating in silico reference data sets for clinical next-generation sequencing bioinformatics pipeline evaluation. J. Mol. Diagn. 23, 285–299 (2021).

  66. Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat. Methods 12, 780–786 (2015).

  67. Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 42, 293–304 (2024).

  68. Yang, Y. et al. Comprehensive landscape of resistance mechanisms for neoadjuvant therapy in esophageal squamous cell carcinoma by single-cell transcriptomics. Signal Transduct. Target. Ther. 8, 298 (2023).

  69. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).

  70. Bu, D. et al. KOBAS-i: intelligent prioritization and exploratory visualization of biological functions for gene enrichment analysis. Nucleic Acids Res. 49, W317–W325 (2021).

  71. Jiang, X. et al. Long term memory: the foundation of AI self-evolution. Preprint at arXiv https://doi.org/10.48550/arXiv.2410.15665 (2024).

  72. Sun, J. Biomedical dataset files collection. Zenodo https://doi.org/10.5281/zenodo.17430550 (2025).


Acknowledgements

This work was supported by the National Key R&D Program of China (2022YFF1203303, D.B.), National Natural Science Foundation of China (32341019, Y.Z.; 92474204, Y.Z.; 32570778, D.B.; W2431057, K.Z.), Ningbo Top Medical and Health Research Program (2023030615, Y.Z.; 2024020919, Y.W.), Beijing Natural Science Foundation (L222007, D.B.), Ningbo Science and Technology Innovation Yongjiang 2035 Project (2023Z226 and 2024Z229, Y.Z.), Major Project of Guangzhou National Laboratory (GZNL2023A03001, Y.Z.), State Key Laboratory of Systems Medicine for Cancer (KF2422-93, Y.Z.), Hubei Province key research and development project (2022BCA016 and 2023BCB146, Y.J.) and Macau Science and Technology Development Fund (0007/2020/AFJ, 0070/2020/A2 and 0003/2021/AKP, K.Z.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. The authors would like to acknowledge the Nanjing Institute of InforSuperBahn OneAiNexus for providing the training and evaluation platform.

Author information

Authors and Affiliations

Authors

Contributions

D.B. collected and interpreted the data and drafted the manuscript and figures. J.S. implemented and evaluated the algorithm and developed the software. K.L. analysed the data, drafted the figures and contributed to algorithm application. Z.H., W.H., J.H., S.Z. and S.L. participated in data processing and algorithm evaluation. P.H., Z.W. and S.W. contributed to tool preparation and website construction. T.W., K.G. and Y.W. assisted with literature collection and interpreted the results. L.Z., K.W., G.L., H.S. and Y.J. interpreted the results and supported application design. K.Z. and R.C. interpreted the results, revised the manuscript and figures and supervised the project. Y.Z. conceived the study, collected the data, revised the manuscript and figures and supervised the project. All authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Kang Zhang, Runsheng Chen or Yi Zhao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Feixiong Cheng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Evaluation of BioMedAgent using BioMed-AQA.

a. Calculation of the Win score and determination of the task execution status (success or failed). b. ROC curve demonstrating the performance of the autoscoring agent, with an AUC of 0.926 indicating high accuracy and strong alignment with manual evaluations. c. Confusion matrix comparing autoscoring results with manual evaluations. d. Summary of BioMed-AQA and BioMed-AQA-MCQ. BioMed-AQA (n = 327) consists of open questions derived from three sources: simulated datasets (37.31%), literature-derived datasets (46.79%) and tool tutorial datasets (15.90%). BioMed-AQA-MCQ (n = 172) is a multiple-choice subset of BioMed-AQA, consisting of single-choice questions (73.26%) and multi-choice questions (26.74%), designed to enable automated and objective evaluation. Each task from the O, P and S categories in BioMed-AQA was designed with one corresponding multiple-choice question; M and V tasks were not included, as they are less suitable for the multiple-choice format.
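As a quick sanity check, the composition percentages stated in the caption can be recomputed against the question totals; the rounded counts should sum back to n = 327 and n = 172.

```python
# Recompute the BioMed-AQA composition from the percentages in the caption.
total = 327
shares = {"simulated": 0.3731, "literature-derived": 0.4679, "tool tutorial": 0.1590}
counts = {k: round(v * total) for k, v in shares.items()}  # 122, 153, 52

mcq_total = 172
single = round(0.7326 * mcq_total)  # single-choice questions
multi = round(0.2674 * mcq_total)   # multi-choice questions
```

This recovers 122 simulated, 153 literature-derived and 52 tool-tutorial questions, and a 126/46 single/multi split for the MCQ subset.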

Extended Data Fig. 2 Overall performance of BioMedAgent.

a. Win score comparison between with and without LTU across O, P, M, S, V tasks, with p-values obtained via two-tailed t-test (p = 1.477e-03 for M, NA: Not Applicable, ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). b. Success rate comparison between with and without LTU across Total, O, P, M, S, V tasks, with p-values obtained via two-sided chi-squared tests (p = 7.789e-05 for Total, p = 0.015 for M). c. Proportion of successful tasks on BioMed-AQA using only LTU, only CTC, and both. d. Win score comparison between open-step and clear-step questions across O, P, M, S, V tasks, with p-values obtained via two-tailed t-test. e. Success rate comparison between open-step and clear-step questions across Total, O, P, M, S, V tasks, with p-values obtained via two-sided chi-squared tests.
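The success-rate comparisons above use two-sided chi-squared tests. For a 2 × 2 success/failure table this reduces to Pearson's test with one degree of freedom, which can be sketched with the standard library alone. The counts below are hypothetical, not the paper's data, and no continuity correction is applied (some library implementations apply one by default for 2 × 2 tables).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]: rows = conditions, columns = success / failure."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs, exp in (
            (a, row1 * col1 / n), (b, row1 * col2 / n),
            (c, row2 * col1 / n), (d, row2 * col2 / n),
        )
    )
    # survival function of the chi-squared distribution with 1 d.f.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical success/failure counts for two conditions, NOT the paper's data.
stat, p = chi2_2x2(250, 77, 200, 127)
```

For one degree of freedom the p-value follows from `erfc(sqrt(stat / 2))`, so no external statistics package is needed for this special case.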

Extended Data Fig. 3 Evaluation of the IE algorithm.

a. Workflow of IE in planning and coding phases by Planner, Programmer and Executor agents. b. Win score of different LLM-based agents across Total tasks of BioMed-AQA (n = 327). Error bars represent the mean with 95% confidence intervals (CI). GPT FC stands for GPT Function Call. c. Win score comparison between noIE and IE across Total, O, P, M, S, V tasks, with p-values obtained via two-tailed paired t-test (Total: p = 1.091e-22, O: p = 4.236e-09, P: p = 0.019, M: p = 0.199, S: p = 5.118e-07, V: p = 5.530e-09, ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). Error bars represent the mean with 95% CI. d. Analyzable scope and success rate of different LLM-based agents across Total, O, P, M, S, V tasks. The length of the bar represents the success rate or analyzable scope.
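Error bars in these panels are means with 95% confidence intervals. The caption does not state how the intervals were computed; one common choice is the normal approximation (mean ± 1.96 standard errors), sketched here on toy Win scores.

```python
import math

def mean_ci95(scores):
    """Mean with a normal-approximation 95% CI (1.96 * standard error).
    This is an assumed method, not necessarily the paper's exact procedure."""
    n = len(scores)
    m = sum(scores) / n
    var = sum((x - m) ** 2 for x in scores) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)
    return m, m - half, m + half

# toy per-task Win scores in [0, 1], purely illustrative
m, lo, hi = mean_ci95([0.9, 0.8, 1.0, 0.7, 0.95, 0.85])
```

For small samples a t-distribution critical value would widen the interval slightly; with n = 327 tasks the difference is negligible.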

Extended Data Fig. 4 Evaluation of the MR algorithm.

a. Mechanisms of the CMA and IMF memory update strategies. M1 represents the memory newly added during round 1 of learning; M2 corresponds to round 2. b. Win scores under the CMA and IMF memory update strategies across Total tasks (n = 327) over three rounds of learning. Error bars represent the mean with 95% CI. c. Semantic similarity between the n = 327 BioMed-AQA questions. d. Evaluation of MR algorithm performance on unseen questions. The table shows success rates for two models: (1) before MR (round = 0), with no learning, and (2) MR on seen (round = 3), with three rounds of learning on the seen subset. e. Win scores under the CMA and IMF strategies across Total, O, P, M, S, V tasks. Error bars represent the mean with 95% CI.
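Panel c reports pairwise semantic similarity between questions. The caption does not specify the measure or embedding model; the standard choice is cosine similarity between question embedding vectors, which is a one-liner over any embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# vectors pointing the same way score 1.0; orthogonal vectors score 0.0
sim_same = cosine([1.0, 2.0], [2.0, 4.0])
sim_orth = cosine([1.0, 0.0], [0.0, 1.0])
```

In practice the vectors would come from a sentence-embedding model applied to each benchmark question; the similarity computation itself is model-agnostic.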

Extended Data Fig. 5 Execution details of Q1-Q84 in BioMed-AQA.

The detailed testing of each question in BioMed-AQA (Q1-Q84) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent, with +1 indicating one more step not shown.

Extended Data Fig. 6 Execution details of Q85-Q167 in BioMed-AQA.

The detailed testing of each question in BioMed-AQA (Q85-Q167) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores and execution outcomes. The planned steps are automatically generated by BioMedAgent, with +2/+6 indicating 2 or 6 more steps not shown.

Extended Data Fig. 7 Execution details of Q168-Q253 in BioMed-AQA.

The detailed testing of each question in BioMed-AQA (Q168-Q253) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent.

Extended Data Fig. 8 Execution details of Q254-Q327 in BioMed-AQA.

The detailed testing of each question in BioMed-AQA (Q254-Q327) by BioMedAgent, utilizing the IMF memory update strategy after three rounds of learning, includes the planned steps, Win scores, and execution outcomes. The planned steps are automatically generated by BioMedAgent.

Extended Data Fig. 9 Application examples of BioMedAgent.

a. Comparison of DEGs identified by BioMedAgent and by the official online tool GEO2R. b. Construction of an automated workflow using BioMedAgent to perform cell segmentation with resolution enhancement.

Extended Data Fig. 10 Intuitive chat interface of BioMedAgent.

This figure illustrates the interactive chat interface of BioMedAgent, where multiple agents collaborate on task planning, execution, and results aggregation. Users interact with the system in natural language to initiate and monitor the execution of workflows, including model selection and evaluation in the demo example. The system autonomously plans, executes, and summarizes the results.

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bu, D., Sun, J., Li, K. et al. Empowering AI data scientists using a multi-agent LLM framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses. Nat. Biomed. Eng. (2026). https://doi.org/10.1038/s41551-026-01634-6

