Toward trustworthy chatbots: a protocol for red teaming for health related conversations

Hussain, Syed-Amad; Jackson, Daniel I.; Lewis, Ashley; Fosler-Lussier, Eric; Sezgin, Emre

doi:10.1038/s41598-026-45719-3

Download PDF

Article
Open access
Published: 31 March 2026

Toward trustworthy chatbots: a protocol for red teaming for health related conversations

Syed-Amad Hussain^1,2,
Daniel I. Jackson^1,2,
Ashley Lewis⁴,
Eric Fosler-Lussier^1,2 &
…
Emre Sezgin^1,3

Scientific Reports , Article number: (2026) Cite this article

195 Accesses
1 Altmetric
Metrics details

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

Abstract

Health-related chatbots require safety assurance beyond factual correctness. We propose a red-teaming protocol for patient-facing AI structured around three pillars: error stratification, dual-pronged testing, and vulnerability-informed mitigation. We distinguish Document Adherence (DA) from Instruction Adherence (IA), deploying adversarial “attacks” across both single-turn and multi-turn exchanges to provoke system failures. We then applied layered mitigations informed by the vulnerabilities revealed by these attacks. We evaluate this framework on a retrieval-augmented generation (RAG) based chatbot designed to assist with health-related social needs (HRSN).The protocol identified behavioral noncompliance as the dominant risk. While robust in DA (0/60 errors), the system struggled with IA (15% error rate). Crucially, multi-turn stress tests revealed vulnerabilities hidden in single-turn checks: error rates spiked to 50% for advice queries and 40% for user distress. All high-severity failures occurred during these sustained interactions. Of our mitigations, prompt augmentation reduced total errors by 60%, while document augmentation mitigated single-turn distress errors. Combined, they eliminated high-severity errors entirely by forcing “safe failure” loops. We suggest this cycle of stratified analysis, depth-based testing, and targeted mitigation can be a guiding framework for securing clinical conversational agents.

Data availability

Our testing dataset is available at https://github.com/NCH-IFRL/chatbot-redteaming.

References

Kurniawan, M. H., Handiyani, H., Nuraini, T., Hariyati, R. T. S. & Sutrisno, S. A systematic review of artificial intelligence-powered (AI-powered) chatbot intervention for managing chronic illness. Ann. Med. 56, 2302980. https://doi.org/10.1080/07853890.2024.2302980 (2024).
Google Scholar
Barreda, M. et al. Transforming healthcare with chatbots: Uses and applications-A scoping review. Digit. Health. 11, 20552076251319174. https://doi.org/10.1177/20552076251319174 (2025).
Google Scholar
Laymouna, M. et al. Roles, users, benefits, and limitations of chatbots in health care: Rapid review. J. Med. Internet Res. 26, e56930. https://doi.org/10.2196/56930 (2024).
Google Scholar
Clark, M. & Bailey, S. Chatbots in health care: Connecting patients to information: Emerging health technologies. Ottawa (ON): Canadian Agency for Drugs and Technologies in Health; (2024). Available: https://www.ncbi.nlm.nih.gov/books/NBK602381/.
Heinz, M. V. et al. Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI. 2 https://doi.org/10.1056/aioa2400802 (2025).
Moore, J. et al. Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: ACM; pp. 599–627. (2025). https://doi.org/10.1145/3715275.3732039.
Cabral, S. et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 184, 581–583. https://doi.org/10.1001/jamainternmed.2024.0295 (2024).
Google Scholar
Sezgin, E. et al. Chatbot for social need screening and resource sharing with vulnerable families: Iterative design and evaluation study. JMIR Hum. Factors. 11, e57114. https://doi.org/10.2196/57114 (2024).
Google Scholar
Kocielnik, R. et al. HarborBot: A chatbot for social needs screening. AMIA Annu Symp Proc. ;2019: 552–561. (2019). Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC7153089/.
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv [cs.CL]. (2020). Available: http://arxiv.org/abs/2005.11401.
Tonmoy, S. M. T. I. et al. A comprehensive survey of hallucination mitigation techniques in Large Language Models. arXiv [cs.CL]. 2024. Available: http://arxiv.org/abs/2401.01313.
Rawte, V., Sheth, A. & Das, A. A Survey of Hallucination in Large. Foundation Models (2023).
Ahmad, M., Yaramic, I. & Roy, T. D. Creating trustworthy LLMs: Dealing with hallucinations in healthcare AI. arXiv [cs.CL]. (2023). https://doi.org/10.20944/preprints202310.1662.v1.
Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI. pp. 100–101. (2023).
Assuring, A. I., Security, Safety Through, A. I. & Regulation MITRE Presidential Transition Priority Topic Memo. (2024).
AI Red Teaming. Applying Software TEVV for AI Evaluations. In: Cybersecurity and Infrastructure Security Agency CISA [Internet]. [cited 10 Oct 2025]. Available: https://www.cisa.gov/news-events/news/ai-red-teaming-applying-software-tevv-ai-evaluations.
Bullwinkel, B. et al. Lessons from red teaming 100 generative AI products. arXiv [cs.AI]. (2025). Available: http://arxiv.org/abs/2501.07238.
Verma, A. et al. Operationalizing a threat model for red-teaming large language models (LLMs). arXiv [cs.CL]. 2025. Available: http://arxiv.org/abs/2407.14937.
Chao, P. et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv [cs.CR]. 2024. Available: http://arxiv.org/abs/2404.01318.
LLM01:2025. Prompt Injection. In: OWASP Gen AI Security Project [Internet]. OWASP Top 10 for LLM & Generative AI Security; 10 Apr 2024 [cited 10 Oct 2025]. Available: https://genai.owasp.org/llmrisk/llm01-prompt-injection/.
Zhuo, T. Y., Huang, Y., Chen, C. & Xing, Z. Red teaming ChatGPT via jailbreaking: Bias, Robustness, Reliability and toxicity. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2301.12867.
Gehman, S. & RealToxicityPrompts Evaluating Neural Toxic Degeneration in Language Models. Findings of the Association for Computational Linguistics: EMNLP 2020. (2020).
Li, N. et al. LLM defenses are not robust to multi-Turn Human Jailbreaks yet. arXiv [cs.LG]. 2024. Available: http://arxiv.org/abs/2408.15221.
Du, X. et al. Multi-turn jailbreaking large Language Models via attention shifting. Proc. Conf. AAAI Artif. Intell. 39, 23814–23822. https://doi.org/10.1609/aaai.v39i22.34553 (2025).
Google Scholar
Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. NPJ Digit. Med. 8, 149. https://doi.org/10.1038/s41746-025-01542-0 (2025).
Google Scholar
Kim, Y. et al. Medical hallucination in foundation models and their impact on healthcare. medRxiv https://doi.org/10.1101/2025.02.28.25323115 (2025).
Google Scholar
Chen, S. et al. CARES: Comprehensive evaluation of safety and adversarial robustness in medical LLMs. arXiv [cs.CL]. 2025. Available: http://arxiv.org/abs/2505.11413.
Niu, C. et al. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; pp. 10862–10878. (2024). https://doi.org/10.18653/v1/2024.acl-long.585.
Es, S., James, J., Espinosa Anke, L., Schockaert, S. & RAGAs Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics; pp. 150–158. (2024). https://doi.org/10.18653/v1/2024.eacl-demo.16.
Sezgin, E. et al. Digital health technologies for screening and identifying unmet social needs: Scoping review. J. Med. Internet Res. 27, e78793. https://doi.org/10.2196/78793 (2025).
Google Scholar
Fichtenberg, C. M., Alley, D. E. & Mistry, K. B. Improving Social Needs intervention research: Key questions for advancing the field. Am. J. Prev. Med. 57, S47–S54. https://doi.org/10.1016/j.amepre.2019.07.018 (2019).
Google Scholar
[cited 10 Oct 2025]. Available: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf.
Lin, S., Hilton, J., Evans, O. & TruthfulQA Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; (2022). https://doi.org/10.18653/v1/2022.acl-long.229.
Hartvigsen, T. et al. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv [cs.CL]. (2022). Available: http://arxiv.org/abs/2203.09509.
Zhu, K. et al. PromptRobust: Towards evaluating the robustness of Large Language Models on adversarial prompts. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2306.04528.
Artstein, R. & Poesio, M. Inter-coder agreement for computational linguistics. Comput. Linguist Assoc. Comput. Linguist. 34, 555–596. https://doi.org/10.1162/coli.07-034-r2 (2008).
Google Scholar
Jeune, P. L., Malézieux, B., Xiao, W., Dora, M. & Phare A Safety Probe for Large Language Models. arXiv [cs.CY]. 2025. Available: http://arxiv.org/abs/2505.11365.
Yao, Z. et al. Are reasoning models more prone to hallucination? arXiv [cs.CL]. (2025). Available: http://arxiv.org/abs/2505.23646.
Liu, N. F. et al. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguist. 12, 157–173. https://doi.org/10.1162/tacl_a_00638 (2024).
Google Scholar
Lawrence, H. R. et al. The opportunities and risks of large language models in mental health. JMIR Ment Health. 11, e59479. https://doi.org/10.2196/59479 (2024).
Google Scholar

Download references

Funding

This publication was supported, in part, by The Ohio State University Clinical and Translational Science Institute (CTSI) and the National Center for Advancing Translational Sciences of the National Institutes of Health under Grant Number UM1TR004548. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Abigail Wexner Research Institute at Nationwide Children’s Hospital, 700 Children’s Dr, Columbus, OH, 43205, USA
Syed-Amad Hussain, Daniel I. Jackson, Eric Fosler-Lussier & Emre Sezgin
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Syed-Amad Hussain, Daniel I. Jackson & Eric Fosler-Lussier
The Ohio State University College of Medicine, Columbus, OH, USA
Emre Sezgin
Department of Linguistics, The Ohio State University, Columbus, OH, USA
Ashley Lewis

Authors

Syed-Amad Hussain
View author publications
Search author on:PubMed Google Scholar
Daniel I. Jackson
View author publications
Search author on:PubMed Google Scholar
Ashley Lewis
View author publications
Search author on:PubMed Google Scholar
Eric Fosler-Lussier
View author publications
Search author on:PubMed Google Scholar
Emre Sezgin
View author publications
Search author on:PubMed Google Scholar

Contributions

S.A.H. wrote the main manuscript, created the primary figures and tables, performed evaluation, and created the study design; D. I. J. refined language throughout the manuscript for better socially aware terminology and clarity. He also helped with our study design; A. L. Is the LLM red teaming expert and provided us revisions throughout the paper, as well as guidance on how to perform and frame this study. She especially helped to frame the results; E.F.L. validated the study design and worked significantly on improving the language in the introductions, discussions, and conclusion; E.S. Supported this work at each step, with a role in study design, validation, and reviewing/writing components of each section.

Corresponding author

Correspondence to Syed-Amad Hussain.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (download DOCX )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Hussain, SA., Jackson, D.I., Lewis, A. et al. Toward trustworthy chatbots: a protocol for red teaming for health related conversations. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45719-3

Download citation

Received: 15 December 2025
Accepted: 20 March 2026
Published: 31 March 2026
DOI: https://doi.org/10.1038/s41598-026-45719-3