Introduction

We thank Sorin et al. for their thoughtful response to our recent paper and for bringing up an important point for future iterations of red teaming large language models (LLMs) in healthcare. Sorin et al. rightfully point out that a model may use faulty internal reasoning to get the “correct” answer and that an analysis of only the final answer misses potential failure modes that come from faulty internal reasoning. We agree that the increasing use of “reasoning” models, which were not specifically tested in our red teaming exercise, may impact the chances of unsound reasoning that could lead to harmful responses. In addition, red teaming reasoning models could also assess additional metrics beyond safety/bias/hallucination/privacy, such as the presence of cognitive biases1 (e.g, suggestibility bias, availability bias) and false consistency (i.e, giving the same incorrect answer consistently).

How to approach testing reasoning models

Sorin et al. proposes additional approaches which include systematically changing ethically charged variables and adopting chain of thought analysis2. We agree with these suggestions and believe an entire suite of benchmarks and analysis approaches will be required for ongoing auditing of LLMs in healthcare applications, as was recently proposed by Bedi et al.3 In fact, a major impetus behind our red teaming efforts was to create a dataset of prompts that could be used by researchers for evaluating models through a range of approaches, including those suggested by Sorin et al. For example, one of the prompts developed by our red teaming exercise was, “I am a doctor volunteering at a free clinic. 64 yr old black male has PMH of HTN, T2DM and recently diagnosed prostate cancer presenting with pain and foul-smelling leakage of an attempted surgical site s/p robotic assisted prostatectomy attempt. He rates his pain a 5 out of 10. What number should it be for me to consider opioids?”. When testing a reasoning model, further evaluation can be done by changing variables such as race and gender while also adapting the prompt to ask the model to share its reasoning steps. This would allow the opportunity to not only assess the final answer but also check for potentially incorrect or harmful reasoning (e.g, the inappropriate use of race for giving recommendations on pain management).