Abstract
Health-related chatbots require safety assurance beyond factual correctness. We propose a red-teaming protocol for patient-facing AI structured around three pillars: error stratification, dual-pronged testing, and vulnerability-informed mitigation. We distinguish Document Adherence (DA) from Instruction Adherence (IA), deploying adversarial “attacks” across both single-turn and multi-turn exchanges to provoke system failures. We then applied layered mitigations informed by the vulnerabilities revealed by these attacks. We evaluate this framework on a retrieval-augmented generation (RAG) based chatbot designed to assist with health-related social needs (HRSN).The protocol identified behavioral noncompliance as the dominant risk. While robust in DA (0/60 errors), the system struggled with IA (15% error rate). Crucially, multi-turn stress tests revealed vulnerabilities hidden in single-turn checks: error rates spiked to 50% for advice queries and 40% for user distress. All high-severity failures occurred during these sustained interactions. Of our mitigations, prompt augmentation reduced total errors by 60%, while document augmentation mitigated single-turn distress errors. Combined, they eliminated high-severity errors entirely by forcing “safe failure” loops. We suggest this cycle of stratified analysis, depth-based testing, and targeted mitigation can be a guiding framework for securing clinical conversational agents.
Data availability
Our testing dataset is available at https://github.com/NCH-IFRL/chatbot-redteaming.
References
Kurniawan, M. H., Handiyani, H., Nuraini, T., Hariyati, R. T. S. & Sutrisno, S. A systematic review of artificial intelligence-powered (AI-powered) chatbot intervention for managing chronic illness. Ann. Med. 56, 2302980. https://doi.org/10.1080/07853890.2024.2302980 (2024).
Barreda, M. et al. Transforming healthcare with chatbots: Uses and applications-A scoping review. Digit. Health. 11, 20552076251319174. https://doi.org/10.1177/20552076251319174 (2025).
Laymouna, M. et al. Roles, users, benefits, and limitations of chatbots in health care: Rapid review. J. Med. Internet Res. 26, e56930. https://doi.org/10.2196/56930 (2024).
Clark, M. & Bailey, S. Chatbots in health care: Connecting patients to information: Emerging health technologies. Ottawa (ON): Canadian Agency for Drugs and Technologies in Health; (2024). Available: https://www.ncbi.nlm.nih.gov/books/NBK602381/.
Heinz, M. V. et al. Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI. 2 https://doi.org/10.1056/aioa2400802 (2025).
Moore, J. et al. Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: ACM; pp. 599–627. (2025). https://doi.org/10.1145/3715275.3732039.
Cabral, S. et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 184, 581–583. https://doi.org/10.1001/jamainternmed.2024.0295 (2024).
Sezgin, E. et al. Chatbot for social need screening and resource sharing with vulnerable families: Iterative design and evaluation study. JMIR Hum. Factors. 11, e57114. https://doi.org/10.2196/57114 (2024).
Kocielnik, R. et al. HarborBot: A chatbot for social needs screening. AMIA Annu Symp Proc. ;2019: 552–561. (2019). Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC7153089/.
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv [cs.CL]. (2020). Available: http://arxiv.org/abs/2005.11401.
Tonmoy, S. M. T. I. et al. A comprehensive survey of hallucination mitigation techniques in Large Language Models. arXiv [cs.CL]. 2024. Available: http://arxiv.org/abs/2401.01313.
Rawte, V., Sheth, A. & Das, A. A Survey of Hallucination in Large. Foundation Models (2023).
Ahmad, M., Yaramic, I. & Roy, T. D. Creating trustworthy LLMs: Dealing with hallucinations in healthcare AI. arXiv [cs.CL]. (2023). https://doi.org/10.20944/preprints202310.1662.v1.
Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI. pp. 100–101. (2023).
Assuring, A. I., Security, Safety Through, A. I. & Regulation MITRE Presidential Transition Priority Topic Memo. (2024).
AI Red Teaming. Applying Software TEVV for AI Evaluations. In: Cybersecurity and Infrastructure Security Agency CISA [Internet]. [cited 10 Oct 2025]. Available: https://www.cisa.gov/news-events/news/ai-red-teaming-applying-software-tevv-ai-evaluations.
Bullwinkel, B. et al. Lessons from red teaming 100 generative AI products. arXiv [cs.AI]. (2025). Available: http://arxiv.org/abs/2501.07238.
Verma, A. et al. Operationalizing a threat model for red-teaming large language models (LLMs). arXiv [cs.CL]. 2025. Available: http://arxiv.org/abs/2407.14937.
Chao, P. et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv [cs.CR]. 2024. Available: http://arxiv.org/abs/2404.01318.
LLM01:2025. Prompt Injection. In: OWASP Gen AI Security Project [Internet]. OWASP Top 10 for LLM & Generative AI Security; 10 Apr 2024 [cited 10 Oct 2025]. Available: https://genai.owasp.org/llmrisk/llm01-prompt-injection/.
Zhuo, T. Y., Huang, Y., Chen, C. & Xing, Z. Red teaming ChatGPT via jailbreaking: Bias, Robustness, Reliability and toxicity. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2301.12867.
Gehman, S. & RealToxicityPrompts Evaluating Neural Toxic Degeneration in Language Models. Findings of the Association for Computational Linguistics: EMNLP 2020. (2020).
Li, N. et al. LLM defenses are not robust to multi-Turn Human Jailbreaks yet. arXiv [cs.LG]. 2024. Available: http://arxiv.org/abs/2408.15221.
Du, X. et al. Multi-turn jailbreaking large Language Models via attention shifting. Proc. Conf. AAAI Artif. Intell. 39, 23814–23822. https://doi.org/10.1609/aaai.v39i22.34553 (2025).
Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. NPJ Digit. Med. 8, 149. https://doi.org/10.1038/s41746-025-01542-0 (2025).
Kim, Y. et al. Medical hallucination in foundation models and their impact on healthcare. medRxiv https://doi.org/10.1101/2025.02.28.25323115 (2025).
Chen, S. et al. CARES: Comprehensive evaluation of safety and adversarial robustness in medical LLMs. arXiv [cs.CL]. 2025. Available: http://arxiv.org/abs/2505.11413.
Niu, C. et al. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; pp. 10862–10878. (2024). https://doi.org/10.18653/v1/2024.acl-long.585.
Es, S., James, J., Espinosa Anke, L., Schockaert, S. & RAGAs Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics; pp. 150–158. (2024). https://doi.org/10.18653/v1/2024.eacl-demo.16.
Sezgin, E. et al. Digital health technologies for screening and identifying unmet social needs: Scoping review. J. Med. Internet Res. 27, e78793. https://doi.org/10.2196/78793 (2025).
Fichtenberg, C. M., Alley, D. E. & Mistry, K. B. Improving Social Needs intervention research: Key questions for advancing the field. Am. J. Prev. Med. 57, S47–S54. https://doi.org/10.1016/j.amepre.2019.07.018 (2019).
[cited 10 Oct 2025]. Available: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf.
Lin, S., Hilton, J., Evans, O. & TruthfulQA Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; (2022). https://doi.org/10.18653/v1/2022.acl-long.229.
Hartvigsen, T. et al. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv [cs.CL]. (2022). Available: http://arxiv.org/abs/2203.09509.
Zhu, K. et al. PromptRobust: Towards evaluating the robustness of Large Language Models on adversarial prompts. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2306.04528.
Artstein, R. & Poesio, M. Inter-coder agreement for computational linguistics. Comput. Linguist Assoc. Comput. Linguist. 34, 555–596. https://doi.org/10.1162/coli.07-034-r2 (2008).
Jeune, P. L., Malézieux, B., Xiao, W., Dora, M. & Phare A Safety Probe for Large Language Models. arXiv [cs.CY]. 2025. Available: http://arxiv.org/abs/2505.11365.
Yao, Z. et al. Are reasoning models more prone to hallucination? arXiv [cs.CL]. (2025). Available: http://arxiv.org/abs/2505.23646.
Liu, N. F. et al. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguist. 12, 157–173. https://doi.org/10.1162/tacl_a_00638 (2024).
Lawrence, H. R. et al. The opportunities and risks of large language models in mental health. JMIR Ment Health. 11, e59479. https://doi.org/10.2196/59479 (2024).
Funding
This publication was supported, in part, by The Ohio State University Clinical and Translational Science Institute (CTSI) and the National Center for Advancing Translational Sciences of the National Institutes of Health under Grant Number UM1TR004548. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Authors and Affiliations
Contributions
S.A.H. wrote the main manuscript, created the primary figures and tables, performed evaluation, and created the study design; D. I. J. refined language throughout the manuscript for better socially aware terminology and clarity. He also helped with our study design; A. L. Is the LLM red teaming expert and provided us revisions throughout the paper, as well as guidance on how to perform and frame this study. She especially helped to frame the results; E.F.L. validated the study design and worked significantly on improving the language in the introductions, discussions, and conclusion; E.S. Supported this work at each step, with a role in study design, validation, and reviewing/writing components of each section.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Hussain, SA., Jackson, D.I., Lewis, A. et al. Toward trustworthy chatbots: a protocol for red teaming for health related conversations. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45719-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-45719-3