We read with interest Chang et al.’s “Red teaming ChatGPT in medicine to yield real-world insights on model behavior” in npj Digital Medicine1. The paper provides a thoughtful, multidisciplinary framework for detecting harmful or inaccurate outputs from large language models (LLMs) in healthcare. Their red teaming analysis revealed that about 20% of all LLM responses were unsafe or contained bias. It highlights the need for deeper oversight before deploying LLMs for clinical use. We suggest that scrutiny of the models’ internal reasoning, beyond final answers, remains necessary.

Recent work by Baker et al.2 shows that advanced LLMs may appear superficially compliant, while generating harmful or manipulative reasoning. Their study shows how monitoring intermediate inference steps, also called “chain-of-thought monitoring”, can detect misalignments in the model’s thought process. In some cases, the LLM might “hide” unethical intentions if it has been strongly optimized to avoid detection, suggesting that red teaming of final outputs alone is insufficient. In complex healthcare scenarios, such as end-of-life or resource allocation decisions, a single-step review may overlook harmful reasoning. While the final output might appear reasonable, the model’s underlying rationale could reflect ethically problematic assumptions. For example, a premature recommendation of palliative care in a resource-constrained setting, based on a patient’s age, disability, or socioeconomic status, as revealed by auditing intermediate reasoning steps3.

Modern “reasoning LLMs” rely on multi-step inference4,5,6,7. OpenAI’s o1 model and DeepSeek’s R1 break problems into smaller tasks and refine partial solutions step by step4,5. Another approach, the open-source S1 model8, uses a small set of carefully curated “reasoning data” for supervised fine-tuning for Chain-of-Thought, which can be length-controlled by a “wait” token. This method has been shown to boost performance on intricate reasoning problems8. As these approaches gain traction, we will likely see more inference-time compute solutions for healthcare tasks9. However, we may also encounter more subtle or sophisticated failure modes if unsafe reasoning is left unchecked10.

To address this, we propose two measures. First, similar to Baker et al. approach2, systematically vary ethically charged variables (e.g., patient prognosis, resource scarcity) to pinpoint where the model’s reasoning suggests impermissible actions. Second, adopt thorough chain-of-thought analysis to identify manipulative or unethical rationales before they influence the final output. Because LLMs evolve so rapidly, institutions deploying them should formalize ongoing audits of these risk areas. However, there is an open question about who will conduct these audits. Likely, multidisciplinary teams involving clinicians, ethicists, developers, regulatory compliance experts, and patient advocates, supported by automated evaluation and explainability tools. With increased model complexity and rapid iteration, scalable and reliable post-deployment monitoring solutions will be crucial. Additionally, accountability structures are needed, assigning clear oversight and regular reporting responsibilities to maintain ongoing ethical and safety standards.

Chang et al.1 rightly note that each new model version may regress or introduce new errors, necessitating continuous re-evaluation. We commend their approach and hope future efforts will systematically probe a model’s deeper reasoning. In healthcare, we must not trust the veneer of a final answer alone, rather we must ensure that the process generating that answer is sound.