Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Advertisement

Scientific Reports
  • View all journals
  • Search
  • My Account Login
  • Content Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • RSS feed
  1. nature
  2. scientific reports
  3. articles
  4. article
Toward trustworthy chatbots: a protocol for red teaming for health related conversations
Download PDF
Download PDF
  • Article
  • Open access
  • Published: 31 March 2026

Toward trustworthy chatbots: a protocol for red teaming for health related conversations

  • Syed-Amad Hussain1,2,
  • Daniel I. Jackson1,2,
  • Ashley Lewis4,
  • Eric Fosler-Lussier1,2 &
  • …
  • Emre Sezgin1,3 

Scientific Reports , Article number:  (2026) Cite this article

  • 195 Accesses

  • 1 Altmetric

  • Metrics details

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

Health-related chatbots require safety assurance beyond factual correctness. We propose a red-teaming protocol for patient-facing AI structured around three pillars: error stratification, dual-pronged testing, and vulnerability-informed mitigation. We distinguish Document Adherence (DA) from Instruction Adherence (IA), deploying adversarial “attacks” across both single-turn and multi-turn exchanges to provoke system failures. We then applied layered mitigations informed by the vulnerabilities revealed by these attacks. We evaluate this framework on a retrieval-augmented generation (RAG) based chatbot designed to assist with health-related social needs (HRSN).The protocol identified behavioral noncompliance as the dominant risk. While robust in DA (0/60 errors), the system struggled with IA (15% error rate). Crucially, multi-turn stress tests revealed vulnerabilities hidden in single-turn checks: error rates spiked to 50% for advice queries and 40% for user distress. All high-severity failures occurred during these sustained interactions. Of our mitigations, prompt augmentation reduced total errors by 60%, while document augmentation mitigated single-turn distress errors. Combined, they eliminated high-severity errors entirely by forcing “safe failure” loops. We suggest this cycle of stratified analysis, depth-based testing, and targeted mitigation can be a guiding framework for securing clinical conversational agents.

Data availability

Our testing dataset is available at https://github.com/NCH-IFRL/chatbot-redteaming.

References

  1. Kurniawan, M. H., Handiyani, H., Nuraini, T., Hariyati, R. T. S. & Sutrisno, S. A systematic review of artificial intelligence-powered (AI-powered) chatbot intervention for managing chronic illness. Ann. Med. 56, 2302980. https://doi.org/10.1080/07853890.2024.2302980 (2024).

    Google Scholar 

  2. Barreda, M. et al. Transforming healthcare with chatbots: Uses and applications-A scoping review. Digit. Health. 11, 20552076251319174. https://doi.org/10.1177/20552076251319174 (2025).

    Google Scholar 

  3. Laymouna, M. et al. Roles, users, benefits, and limitations of chatbots in health care: Rapid review. J. Med. Internet Res. 26, e56930. https://doi.org/10.2196/56930 (2024).

    Google Scholar 

  4. Clark, M. & Bailey, S. Chatbots in health care: Connecting patients to information: Emerging health technologies. Ottawa (ON): Canadian Agency for Drugs and Technologies in Health; (2024). Available: https://www.ncbi.nlm.nih.gov/books/NBK602381/.

  5. Heinz, M. V. et al. Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI. 2 https://doi.org/10.1056/aioa2400802 (2025).

  6. Moore, J. et al. Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: ACM; pp. 599–627. (2025). https://doi.org/10.1145/3715275.3732039.

  7. Cabral, S. et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 184, 581–583. https://doi.org/10.1001/jamainternmed.2024.0295 (2024).

    Google Scholar 

  8. Sezgin, E. et al. Chatbot for social need screening and resource sharing with vulnerable families: Iterative design and evaluation study. JMIR Hum. Factors. 11, e57114. https://doi.org/10.2196/57114 (2024).

    Google Scholar 

  9. Kocielnik, R. et al. HarborBot: A chatbot for social needs screening. AMIA Annu Symp Proc. ;2019: 552–561. (2019). Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC7153089/.

  10. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv [cs.CL]. (2020). Available: http://arxiv.org/abs/2005.11401.

  11. Tonmoy, S. M. T. I. et al. A comprehensive survey of hallucination mitigation techniques in Large Language Models. arXiv [cs.CL]. 2024. Available: http://arxiv.org/abs/2401.01313.

  12. Rawte, V., Sheth, A. & Das, A. A Survey of Hallucination in Large. Foundation Models (2023).

  13. Ahmad, M., Yaramic, I. & Roy, T. D. Creating trustworthy LLMs: Dealing with hallucinations in healthcare AI. arXiv [cs.CL]. (2023). https://doi.org/10.20944/preprints202310.1662.v1.

  14. Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI. pp. 100–101. (2023).

  15. Assuring, A. I., Security, Safety Through, A. I. & Regulation MITRE Presidential Transition Priority Topic Memo. (2024).

  16. AI Red Teaming. Applying Software TEVV for AI Evaluations. In: Cybersecurity and Infrastructure Security Agency CISA [Internet]. [cited 10 Oct 2025]. Available: https://www.cisa.gov/news-events/news/ai-red-teaming-applying-software-tevv-ai-evaluations.

  17. Bullwinkel, B. et al. Lessons from red teaming 100 generative AI products. arXiv [cs.AI]. (2025). Available: http://arxiv.org/abs/2501.07238.

  18. Verma, A. et al. Operationalizing a threat model for red-teaming large language models (LLMs). arXiv [cs.CL]. 2025. Available: http://arxiv.org/abs/2407.14937.

  19. Chao, P. et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv [cs.CR]. 2024. Available: http://arxiv.org/abs/2404.01318.

  20. LLM01:2025. Prompt Injection. In: OWASP Gen AI Security Project [Internet]. OWASP Top 10 for LLM & Generative AI Security; 10 Apr 2024 [cited 10 Oct 2025]. Available: https://genai.owasp.org/llmrisk/llm01-prompt-injection/.

  21. Zhuo, T. Y., Huang, Y., Chen, C. & Xing, Z. Red teaming ChatGPT via jailbreaking: Bias, Robustness, Reliability and toxicity. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2301.12867.

  22. Gehman, S. & RealToxicityPrompts Evaluating Neural Toxic Degeneration in Language Models. Findings of the Association for Computational Linguistics: EMNLP 2020. (2020).

  23. Li, N. et al. LLM defenses are not robust to multi-Turn Human Jailbreaks yet. arXiv [cs.LG]. 2024. Available: http://arxiv.org/abs/2408.15221.

  24. Du, X. et al. Multi-turn jailbreaking large Language Models via attention shifting. Proc. Conf. AAAI Artif. Intell. 39, 23814–23822. https://doi.org/10.1609/aaai.v39i22.34553 (2025).

    Google Scholar 

  25. Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. NPJ Digit. Med. 8, 149. https://doi.org/10.1038/s41746-025-01542-0 (2025).

    Google Scholar 

  26. Kim, Y. et al. Medical hallucination in foundation models and their impact on healthcare. medRxiv https://doi.org/10.1101/2025.02.28.25323115 (2025).

    Google Scholar 

  27. Chen, S. et al. CARES: Comprehensive evaluation of safety and adversarial robustness in medical LLMs. arXiv [cs.CL]. 2025. Available: http://arxiv.org/abs/2505.11413.

  28. Niu, C. et al. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; pp. 10862–10878. (2024). https://doi.org/10.18653/v1/2024.acl-long.585.

  29. Es, S., James, J., Espinosa Anke, L., Schockaert, S. & RAGAs Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics; pp. 150–158. (2024). https://doi.org/10.18653/v1/2024.eacl-demo.16.

  30. Sezgin, E. et al. Digital health technologies for screening and identifying unmet social needs: Scoping review. J. Med. Internet Res. 27, e78793. https://doi.org/10.2196/78793 (2025).

    Google Scholar 

  31. Fichtenberg, C. M., Alley, D. E. & Mistry, K. B. Improving Social Needs intervention research: Key questions for advancing the field. Am. J. Prev. Med. 57, S47–S54. https://doi.org/10.1016/j.amepre.2019.07.018 (2019).

    Google Scholar 

  32. [cited 10 Oct 2025]. Available: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf.

  33. Lin, S., Hilton, J., Evans, O. & TruthfulQA Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; (2022). https://doi.org/10.18653/v1/2022.acl-long.229.

  34. Hartvigsen, T. et al. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv [cs.CL]. (2022). Available: http://arxiv.org/abs/2203.09509.

  35. Zhu, K. et al. PromptRobust: Towards evaluating the robustness of Large Language Models on adversarial prompts. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2306.04528.

  36. Artstein, R. & Poesio, M. Inter-coder agreement for computational linguistics. Comput. Linguist Assoc. Comput. Linguist. 34, 555–596. https://doi.org/10.1162/coli.07-034-r2 (2008).

    Google Scholar 

  37. Jeune, P. L., Malézieux, B., Xiao, W., Dora, M. & Phare A Safety Probe for Large Language Models. arXiv [cs.CY]. 2025. Available: http://arxiv.org/abs/2505.11365.

  38. Yao, Z. et al. Are reasoning models more prone to hallucination? arXiv [cs.CL]. (2025). Available: http://arxiv.org/abs/2505.23646.

  39. Liu, N. F. et al. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguist. 12, 157–173. https://doi.org/10.1162/tacl_a_00638 (2024).

    Google Scholar 

  40. Lawrence, H. R. et al. The opportunities and risks of large language models in mental health. JMIR Ment Health. 11, e59479. https://doi.org/10.2196/59479 (2024).

    Google Scholar 

Download references

Funding

This publication was supported, in part, by The Ohio State University Clinical and Translational Science Institute (CTSI) and the National Center for Advancing Translational Sciences of the National Institutes of Health under Grant Number UM1TR004548. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

  1. Abigail Wexner Research Institute at Nationwide Children’s Hospital, 700 Children’s Dr, Columbus, OH, 43205, USA

    Syed-Amad Hussain, Daniel I. Jackson, Eric Fosler-Lussier & Emre Sezgin

  2. Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

    Syed-Amad Hussain, Daniel I. Jackson & Eric Fosler-Lussier

  3. The Ohio State University College of Medicine, Columbus, OH, USA

    Emre Sezgin

  4. Department of Linguistics, The Ohio State University, Columbus, OH, USA

    Ashley Lewis

Authors
  1. Syed-Amad Hussain
    View author publications

    Search author on:PubMed Google Scholar

  2. Daniel I. Jackson
    View author publications

    Search author on:PubMed Google Scholar

  3. Ashley Lewis
    View author publications

    Search author on:PubMed Google Scholar

  4. Eric Fosler-Lussier
    View author publications

    Search author on:PubMed Google Scholar

  5. Emre Sezgin
    View author publications

    Search author on:PubMed Google Scholar

Contributions

S.A.H. wrote the main manuscript, created the primary figures and tables, performed evaluation, and created the study design; D. I. J. refined language throughout the manuscript for better socially aware terminology and clarity. He also helped with our study design; A. L. Is the LLM red teaming expert and provided us revisions throughout the paper, as well as guidance on how to perform and frame this study. She especially helped to frame the results; E.F.L. validated the study design and worked significantly on improving the language in the introductions, discussions, and conclusion; E.S. Supported this work at each step, with a role in study design, validation, and reviewing/writing components of each section.

Corresponding author

Correspondence to Syed-Amad Hussain.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (download DOCX )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hussain, SA., Jackson, D.I., Lewis, A. et al. Toward trustworthy chatbots: a protocol for red teaming for health related conversations. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45719-3

Download citation

  • Received: 15 December 2025

  • Accepted: 20 March 2026

  • Published: 31 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-45719-3

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Download PDF

Advertisement

Explore content

  • Research articles
  • News & Comment
  • Collections
  • Subjects
  • Follow us on Facebook
  • Follow us on X
  • Sign up for alerts
  • RSS feed

About the journal

  • About Scientific Reports
  • Contact
  • Journal policies
  • Guide to referees
  • Calls for Papers
  • Editor's Choice
  • Journal highlights
  • Open Access Fees and Funding

Publish with us

  • For authors
  • Language editing services
  • Open access funding
  • Submit manuscript

Search

Advanced search

Quick links

  • Explore articles by subject
  • Find a job
  • Guide to authors
  • Editorial policies

Scientific Reports (Sci Rep)

ISSN 2045-2322 (online)

nature.com footer links

About Nature Portfolio

  • About us
  • Press releases
  • Press office
  • Contact us

Discover content

  • Journals A-Z
  • Articles by subject
  • protocols.io
  • Nature Index

Publishing policies

  • Nature portfolio policies
  • Open access

Author & Researcher services

  • Reprints & permissions
  • Research data
  • Language editing
  • Scientific editing
  • Nature Masterclasses
  • Research Solutions

Libraries & institutions

  • Librarian service & tools
  • Librarian portal
  • Open research
  • Recommend to library

Advertising & partnerships

  • Advertising
  • Partnerships & Services
  • Media kits
  • Branded content

Professional development

  • Nature Awards
  • Nature Careers
  • Nature Conferences

Regional websites

  • Nature Africa
  • Nature China
  • Nature India
  • Nature Japan
  • Nature Middle East
  • Privacy Policy
  • Use of cookies
  • Legal notice
  • Accessibility statement
  • Terms & Conditions
  • Your US state privacy rights
Springer Nature

© 2026 Springer Nature Limited

Nature Briefing AI and Robotics

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing: AI and Robotics