npj Digital Medicine
From tool to teammate in a randomized controlled trial of clinician-AI collaborative workflows for diagnosis
  • Article
  • Open access
  • Published: 18 March 2026

  • Selin S. Everett1,2,
  • Bryan J. Bunning3,4,5,
  • Priyank Jain6,7,
  • Ivan Lopez1,2,3,
  • Anup Agarwal6,7,
  • Manisha Desai1,3,5,
  • Robert Gallo8,
  • Ethan Goh2,
  • Vinay B. Kadiyala6,9,
  • Zahir Kanjee6,9,
  • Jacob M. Koshy6,9,
  • Andrew Olson10,
  • Adam Rodman6,9,
  • Kevin Schulman11,
  • Eric Strong11,
  • Jonathan H. Chen2,3,4,11 na1 &
  • Eric Horvitz12,13 na1 

npj Digital Medicine (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Business and industry
  • Computational biology and bioinformatics
  • Health care
  • Mathematics and computing
  • Medical research
  • Scientific community

Abstract

Early studies of large language models (LLMs) in clinical settings have largely treated artificial intelligence (AI) as a tool rather than an active collaborator. As LLMs demonstrate expert-level diagnostic performance, the focus shifts from whether AI can offer valuable suggestions to how it integrates into physicians’ diagnostic workflows. We conducted a randomized controlled trial (n = 70 clinicians) to assess a custom system designed for collaborative diagnostic reasoning. The design involved independent diagnostic assessments by the clinician and the AI, followed by an AI-generated synthesis that integrated both perspectives, highlighted agreements and disagreements, and offered commentary. We evaluated two collaborative workflows: AI as first opinion (preceding the clinician) and AI as second opinion (following the clinician). Both improved clinician diagnostic accuracy over conventional resources (85% and 82% vs. 75%). Performance was comparable across workflows and not statistically different from AI-alone accuracy (90%), highlighting the potential of collaborative AI to complement clinician expertise. Qualitative analyses illustrate how workflow design shapes human-AI interaction. ClinicalTrials.gov registration: NCT06911645.

Data availability

The diagnostic challenge problems and datasets generated and analyzed during the study are not publicly available as their disclosure would risk their inclusion in training datasets of future models. The data can be made available on reasonable request to the corresponding author.

Code availability

The system prompt for the custom GPT is available in the supplemental information. Additional information can be made available to qualified researchers on reasonable request to the corresponding author.

Acknowledgements

We are grateful to Jason Hom, MD, Curtis Langlotz, MD, PhD, Natalie Pageler, MD, Mihaela Vorvoreanu, PhD, and Daniel Yang, MD, for their insightful feedback. We thank Isabel Weng, MHS, for guidance on the statistical analyses. This work was supported by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford Medical Scholars Research Program, Stanford Bio-X Interdisciplinary Initiatives Seed Grants Program, the Gordon and Betty Moore Foundation [Grant #12409], and the National Library of Medicine [2T15LM007033].

Author information

Author notes
  1. These authors contributed equally: Jonathan H. Chen, Eric Horvitz.

Authors and Affiliations

  1. Stanford University School of Medicine, Stanford, CA, USA

    Selin S. Everett, Ivan Lopez & Manisha Desai

  2. Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA

    Selin S. Everett, Ivan Lopez, Ethan Goh & Jonathan H. Chen

  3. Stanford Department of Biomedical Data Science, Stanford, CA, USA

    Bryan J. Bunning, Ivan Lopez, Manisha Desai & Jonathan H. Chen

  4. Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA, USA

    Bryan J. Bunning & Jonathan H. Chen

  5. Stanford Quantitative Sciences Unit, Stanford, CA, USA

    Bryan J. Bunning & Manisha Desai

  6. Harvard Medical School, Boston, MA, USA

    Priyank Jain, Anup Agarwal, Vinay B. Kadiyala, Zahir Kanjee, Jacob M. Koshy & Adam Rodman

  7. Department of Medicine, Cambridge Health Alliance, Cambridge, MA, USA

    Priyank Jain & Anup Agarwal

  8. Center for Innovation to Implementation, VA Palo Alto Health Care System, Palo Alto, CA, USA

    Robert Gallo

  9. Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA

    Vinay B. Kadiyala, Zahir Kanjee, Jacob M. Koshy & Adam Rodman

  10. University of Minnesota Medical School, Minneapolis, MN, USA

    Andrew Olson

  11. Division of Hospital Medicine, Stanford University, Stanford, CA, USA

    Kevin Schulman, Eric Strong & Jonathan H. Chen

  12. Office of the Chief Scientific Officer, Microsoft, Redmond, WA, USA

    Eric Horvitz

  13. Stanford Institute for Human-Centered Artificial Intelligence, Stanford University, Stanford, CA, USA

    Eric Horvitz

Contributions

S.E.: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Project administration, Writing – original draft, Writing – review & editing. B.B.: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing. P.J.: Conceptualization, Data curation, Investigation, Methodology, Project administration, Validation, Writing – review & editing. I.L.: Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – review & editing. A.A.: Data curation, Writing – review & editing. M.D.: Methodology, Formal analysis, Writing – review & editing. R.G.: Writing – review & editing. E.G.: Methodology, Writing – review & editing. V.K.: Data curation, Writing – review & editing. Z.K.: Writing – review & editing. J.K.: Data curation, Writing – review & editing. A.O.: Writing – review & editing. A.R.: Writing – review & editing. K.S.: Writing – review & editing. E.S.: Writing – review & editing. J.C.: Supervision, Methodology, Funding acquisition, Writing – review & editing. E.H.: Conceptualization, Formal analysis, Investigation, Software, Methodology, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing.

Corresponding author

Correspondence to Eric Horvitz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (download DOCX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Everett, S.S., Bunning, B.J., Jain, P. et al. From tool to teammate in a randomized controlled trial of clinician-AI collaborative workflows for diagnosis. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02545-1

  • Received: 06 July 2025

  • Accepted: 04 March 2026

  • Published: 18 March 2026

  • DOI: https://doi.org/10.1038/s41746-026-02545-1

npj Digital Medicine (npj Digit. Med.)

ISSN 2398-6352 (online)
