Abstract
Since 2022, artificial intelligence (AI) methods have progressed far beyond their established capabilities of data classification and prediction. Large language models (LLMs) can perform logical reasoning, enabling them to plan and orchestrate complex workflows. By combining this planning ability with the capacity to act upon their environment, LLMs can function as agents: (semi-)autonomous systems capable of sensing, learning and acting upon their environments. As such, they can interact with external knowledge sources or external software and can execute sequences of tasks with minimal or no human input. In cancer research and oncology, evidence for the capabilities of AI agents is rapidly emerging. From autonomously optimizing drug design and development to proposing therapeutic strategies for clinical cases, AI agents can handle complex, multistep problems that were not addressable by previous generations of AI systems. Despite these rapid developments, many translational and clinical cancer researchers still lack clarity regarding the precise capabilities, limitations and ethical or regulatory frameworks associated with AI agents. Here we provide a primer on AI agents for cancer researchers and oncologists. We illustrate how this technology is set apart from, and goes beyond, traditional AI systems. We discuss existing and emerging applications in cancer research and address real-world challenges from the perspective of academic, clinical and industrial research.
Acknowledgements
J.N.K. is supported by the German Cancer Aid (DECADE, 70115166), the German Federal Ministry of Education and Research (PEARL, 01KD2104C; CAMINO, 01EO2101; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET TRANSCAN; Come2Data, 16DKZ2044A; DEEP-HCC, 031L0315A), the German Academic Exchange Service (SECAI, 57616814), the European Union’s Horizon Europe and innovation programme (ODELIA, 101057091; GENIAL, 101096312), the European Research Council (ERC; NADIR, 101114631), the National Institutes of Health (EPICO, R01 CA263318) and the National Institute for Health and Care Research (NIHR, NIHR203331) Leeds Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. This work was funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
Author information
Contributions
D.T., L.C.A., F.M. and J.N.K. researched data for the article. All authors contributed substantially to discussion of the content. D.T., S.A., J.Z., L.C.A. and J.N.K. wrote the article. All authors reviewed and/or edited the manuscript before submission.
Ethics declarations
Competing interests
J.N.K. declares consulting services for Bioptimus, France; Panakeia, UK; AstraZeneca, UK; and MultiplexDx, Slovakia. Furthermore, he holds shares in StratifAI, Germany; Synagen, Germany; and Ignition Lab, Germany; has received an institutional research grant by GSK and AstraZeneca; and has received honoraria by AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer and Fresenius. D.T. received honoraria for lectures by Bayer, GE, Roche, AstraZeneca and Philips and holds shares in StratifAI GmbH, Germany, and in Synagen GmbH, Germany. F.M. is a scientific adviser for and holds shares in Modella AI and is an adviser for Danaher. S.A. is an employee of Alphabet and may own stock as part of the standard compensation package. J.Z. and L.C.A. declare no competing interests.
Peer review
Peer review information
Nature Reviews Cancer thanks Anant Madabhushi, Wayne Zhao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Glossary
- Chain-of-thought reasoning: A prompting technique that encourages language models to generate intermediate reasoning steps before arriving at a final answer, improving performance on complex tasks.
- Contraindications: Clinical conditions or factors that make a particular treatment or procedure inadvisable because of potential harm to the patient.
- Deep learning: A subset of machine learning that uses artificial neural networks with multiple layers to learn hierarchical representations of data.
- Differential diagnoses: A systematic process of distinguishing between diseases or conditions that share similar clinical features to identify the most likely diagnosis.
- Edge case: An unusual or extreme scenario that occurs at the boundaries of normal operating conditions, often revealing limitations in system performance.
- Hyperparameters: Configuration settings defined before model training that control the learning process, such as learning rate, batch size and network architecture choices.
- Large language model (LLM): A type of artificial intelligence model trained on vast amounts of text data to understand and generate human language, capable of performing diverse language tasks without task-specific training.
- Multi-turn conversation: A dialogue consisting of multiple exchanges between a user and an AI system, in which context from previous turns informs subsequent responses.
- Natural language processing (NLP): A field of artificial intelligence focused on enabling computers to understand, interpret and generate human language.
- Parsing documents: The computational process of analysing and extracting structured information from unstructured or semi-structured text documents.
- Precompiled reports: Standardized documents generated in advance or from templates, typically containing structured clinical or research data ready for review.
- Reinforcement learning: A machine learning paradigm in which an agent learns to make decisions by receiving feedback in the form of rewards or penalties based on its actions.
- Token: The basic unit of text processed by a language model, which may represent a word, subword or character depending on the tokenization scheme.
- Transformer architecture: A neural network design that uses self-attention mechanisms to process sequential data in parallel, forming the foundation of modern LLMs.
- Vision language model: An AI model capable of processing and relating both visual information (such as images) and textual data within a unified framework.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Truhn, D., Azizi, S., Zou, J. et al. Artificial intelligence agents in cancer research and oncology. Nat Rev Cancer 26, 256–269 (2026). https://doi.org/10.1038/s41568-025-00900-0