Abstract
Chang et al. showed that large language models can produce unsafe or biased outputs even when superficially accurate. We highlight that LLMs can hide harmful reasoning if only final responses are red-teamed. Monitoring intermediate inference steps, especially in ethically charged clinical scenarios, can reveal manipulative or unethical thought processes. We propose systematic testing of ethically sensitive prompts and thorough chain-of-thought analysis to ensure safe, trustworthy deployment in healthcare.
We read with interest Chang et al.’s “Red teaming ChatGPT in medicine to yield real-world insights on model behavior” in npj Digital Medicine1. The paper provides a thoughtful, multidisciplinary framework for detecting harmful or inaccurate outputs from large language models (LLMs) in healthcare. Their red teaming analysis revealed that about 20% of all LLM responses were unsafe or contained bias. It highlights the need for deeper oversight before deploying LLMs for clinical use. We suggest that scrutiny of the models’ internal reasoning, beyond final answers, remains necessary.
Recent work by Baker et al.2 shows that advanced LLMs may appear superficially compliant, while generating harmful or manipulative reasoning. Their study shows how monitoring intermediate inference steps, also called “chain-of-thought monitoring”, can detect misalignments in the model’s thought process. In some cases, the LLM might “hide” unethical intentions if it has been strongly optimized to avoid detection, suggesting that red teaming of final outputs alone is insufficient. In complex healthcare scenarios, such as end-of-life or resource allocation decisions, a single-step review may overlook harmful reasoning. While the final output might appear reasonable, the model’s underlying rationale could reflect ethically problematic assumptions. For example, a premature recommendation of palliative care in a resource-constrained setting, based on a patient’s age, disability, or socioeconomic status, as revealed by auditing intermediate reasoning steps3.
Modern “reasoning LLMs” rely on multi-step inference4,5,6,7. OpenAI’s o1 model and DeepSeek’s R1 break problems into smaller tasks and refine partial solutions step by step4,5. Another approach, the open-source S1 model8, uses a small set of carefully curated “reasoning data” for supervised fine-tuning for Chain-of-Thought, which can be length-controlled by a “wait” token. This method has been shown to boost performance on intricate reasoning problems8. As these approaches gain traction, we will likely see more inference-time compute solutions for healthcare tasks9. However, we may also encounter more subtle or sophisticated failure modes if unsafe reasoning is left unchecked10.
To address this, we propose two measures. First, similar to Baker et al. approach2, systematically vary ethically charged variables (e.g., patient prognosis, resource scarcity) to pinpoint where the model’s reasoning suggests impermissible actions. Second, adopt thorough chain-of-thought analysis to identify manipulative or unethical rationales before they influence the final output. Because LLMs evolve so rapidly, institutions deploying them should formalize ongoing audits of these risk areas. However, there is an open question about who will conduct these audits. Likely, multidisciplinary teams involving clinicians, ethicists, developers, regulatory compliance experts, and patient advocates, supported by automated evaluation and explainability tools. With increased model complexity and rapid iteration, scalable and reliable post-deployment monitoring solutions will be crucial. Additionally, accountability structures are needed, assigning clear oversight and regular reporting responsibilities to maintain ongoing ethical and safety standards.
Chang et al.1 rightly note that each new model version may regress or introduce new errors, necessitating continuous re-evaluation. We commend their approach and hope future efforts will systematically probe a model’s deeper reasoning. In healthcare, we must not trust the veneer of a final answer alone, rather we must ensure that the process generating that answer is sound.
Data availability
No datasets were generated or analyzed during the current study.
References
Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. npj Digit. Med. 8, 149 (2025).
Baker, B. et al. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. https://doi.org/10.48550/arXiv.2503.11926 (2025).
Sorin, V. et al. Socio-demographic modifiers shape large language models’ ethical decisions. J. Healthc. Inform. Res. https://doi.org/10.1007/s41666-025-00211- (2025).
OpenAI o1 System Card. https://openai.com/index/openai-o1-system-card/.
DeepSeek-AI, et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. https://doi.org/10.48550/arXiv.2501.12948 (2025).
Google. Gemini 2.5 Pro. https://deepmind.google/technologies/gemini/pro/.
Anthropic. Claude 3.7 Sonnet. https://www.anthropic.com/claude/sonnet.
Muennighoff, N. et al. s1: Simple test-time scaling. Preprint at https://doi.org/10.48550/arXiv.2501.19393 (2025).
Snell C., Lee J., Xu K. & Kumar A. Scaling llm test-time compute optimally can be more effective than scaling model parameters. Preprint at https://doi.org/10.48550/arXiv.2408.03314.
Klang, E., Tessler, I., Freeman, R., Sorin, V. & Nadkarni, G. N. If Machines Exceed Us: Health Care at an Inflection Point. NEJM AI 1, AIP2400559 (2024).
Author information
Authors and Affiliations
Contributions
V.S., P.K., G.N.N., and E.K. conceived and designed the letter. V.S. and P.K. drafted the initial version. G.N.N. and E.K. critically revised it. All authors approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
V.S., P.K., and E.K. declare no financial or non-financial competing interests. G.N.N. serves as associate editor of this journal and had no role in the peer review or decision to publish this manuscript.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sorin, V., Korfiatis, P., Nadkarni, G.N. et al. Reasoning red teaming in healthcare not all paths to a desired outcome are desirable. npj Digit. Med. 8, 649 (2025). https://doi.org/10.1038/s41746-025-02104-0
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41746-025-02104-0