“If it walks like a duck…” This saying illustrates System 1 thinking1. In the dual-process theory framework1, System 1 operates rapidly, associatively, and heuristically, often with emotional undertones in humans, while System 2 functions more slowly, deliberately, and analytically2,3,4. In complex medical contexts, System 1 thinking can lead to overly simplistic conclusions. Just as humans may inappropriately rely on System 1 thinking, large language models (LLMs) may also default to this sometimes-flawed intuitive thinking5. This is true even for reasoning-optimized LLMs such as ChatGPT-o3, which remain influenced by familiar patterns and may miss critical nuances.

In recent tests with LLMs, we noted a recurring pattern: the models frequently fail to recognize twists or subtleties and instead revert to responses rooted in familiar associations, even when those associations are contextually inappropriate. Table 1 shows examples of lateral thinking puzzles and medical ethics dilemmas where LLMs struggled, often giving the “expected” answer rather than adapting to the specifics of each case. Supplementary Table S1 summarizes the level of mistake each model made on each question, and Supplementary Table S2 shows the outcomes of running each question 10 times across seven LLMs.
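A minimal sketch of such a repeated-query protocol is shown below. The model names, question text, and use of the OpenAI Python client are illustrative assumptions rather than the exact setup behind Supplementary Tables S1 and S2; responses would subsequently be graded by hand for missed twists.

```python
# Sketch of a repeated-query evaluation protocol (illustrative only).
# Model names and questions are placeholders, not the study materials.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

MODELS = ["gpt-4o", "o3-mini"]  # placeholders for the LLMs under test
QUESTIONS = {
    "surgeon_twist": (
        "A boy is injured in an accident; only he was in the car. "
        "His father is a surgeon and his mother is a social worker. "
        "The surgeon says, 'I cannot operate on him; he is my son.' "
        "Who is the surgeon?"
    ),
}
RUNS_PER_QUESTION = 10


def collect_responses():
    """Query every model with every question RUNS_PER_QUESTION times."""
    records = []
    for model in MODELS:
        for qid, prompt in QUESTIONS.items():
            for run in range(RUNS_PER_QUESTION):
                reply = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                records.append({
                    "model": model,
                    "question": qid,
                    "run": run,
                    "answer": reply.choices[0].message.content,
                })
    return records  # answers are then graded manually for missed twists
```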

Table 1 Examples of lateral thinking puzzles and medical ethics scenarios where large language models (LLMs) failed to recognize critical twists

Efforts have been made to cultivate System 2 reasoning in LLMs through “Chain of Thought” processes. However, models may still follow high-probability sequences identified during training6. This tendency is especially problematic in familiar ethical dilemmas and well-known puzzles, where they tend to produce clichéd responses even when the context demands more nuanced reasoning7.
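For illustration, “Chain of Thought” prompting amounts to asking the model to reason explicitly before answering. The hypothetical prompt variants below show the kind of direct versus step-by-step formulations that can be compared; the wording is illustrative and not taken from our test items.

```python
# Hypothetical prompt variants contrasting a direct question with a
# chain-of-thought formulation that asks for explicit reasoning first.
DIRECT_PROMPT = (
    "A father who is a surgeon brings his injured son to the hospital. "
    "The surgeon says, 'He is my son.' Who is the surgeon? "
    "Answer in one sentence."
)

COT_PROMPT = (
    DIRECT_PROMPT
    + " Before answering, list every stated fact, check which familiar "
      "assumptions those facts rule out, and only then give your answer."
)
```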

Human System 1 thinking is efficient and often reliable; crucially, it is also adaptive, shaped by emotional and contextual understanding. Humans can recognize when a situation requires more analytical System 2 thought and shift their approach accordingly. Along these lines, OpenAI recently introduced ChatGPT-o3, which is designed to spend more time thinking before answering8 and aims to reason through complex tasks and solve harder problems. While this is a step toward System 2 thinking, such models still need further refinement to handle nuanced scenarios.

Consider the classic puzzle known as “The Surgeon’s Dilemma.” In its original form, this puzzle reveals implicit gender assumptions. A father and son are in a car accident, and a surgeon says, “I cannot operate on him; he is my son.” The usual answer is that the surgeon is the boy’s mother. The question aims to challenge the biased assumption that a surgeon must be male.

We introduced a twist: the boy’s father is a surgeon, his mother is a social worker, and only the boy was in the accident. Despite the changed details, LLMs defaulted to the original solution, missing the information that invalidated the typical answer. The models have likely seen many variations of the “Surgeon’s Dilemma,” leading them to associate a surgeon and a child with the conclusion that the surgeon must be the mother. Even when we explicitly stated that the father is a surgeon and the mother a social worker, they continued to favor the familiar solution. This failure to process new information highlights a limitation in their reasoning: they overlook important details that should lead to a different conclusion.

Similarly, we presented LLMs with well-known medical ethics scenarios, where they were often misled. In one case, a patient with HIV had already disclosed their status to their spouse; despite this twist, some LLMs failed to recognize the detail and responded as if the spouse was unaware, returning to the familiar debate about disclosure. Another scenario involved a minor needing a life-saving blood transfusion, but with the usual details changed so that the parents agreed to the transfusion. Yet some LLMs still responded as if the parents were refusing, failing to recognize that the ethical dilemma was resolved. This seems to indicate the phenomenon of AI model over-training manifesting in rigid responses.

Notably, even ChatGPT-o1, ChatGPT-o3, and Gemini-2.5-flash-preview 04-17 thinking did not consistently overcome these limitations. While they demonstrated some success with general lateral thinking riddles (mistake rate of 58–92%), their performance was notably weaker on medical ethics questions (mistake rate of 76–96%). These partial successes show that while models trained for System 2 thinking improve somewhat, significant challenges remain in handling nuanced scenarios.

Importantly, the dual-process theory used here1,2,3,4 functions merely as a metaphor rather than representing neurobiologically distinct entities. Human cognition does not neatly separate into two isolated systems; the two frequently operate in tandem rather than in isolation. Additionally, System 2 thinking itself is susceptible to rigid responses, which can paradoxically reinforce existing cognitive biases. Consequently, even deliberate analytical reasoning may fail to adequately address nuanced or unexpected complexities in medical ethics scenarios.

LLMs are increasingly being considered for roles in medical practice9,10 and are already entrusted with soft-skill, ethically charged tasks in both clinical and educational contexts. For example, chatbot-generated responses to patient inquiries have been rated as more empathetic and of higher quality than those provided by physicians, and medical schools are beginning to incorporate ChatGPT-based ethics tutorials into their curricula11,12,13,14,15,16,17 (Table 2). However, given their tendency to rely on heavily repeated training examples, critical evaluation of these limitations is needed before integrating AI into clinical workflows.

Table 2 Examples from the literature illustrating LLM-mediated soft-skill judgment

In conclusion, while progress has been made toward System 2 thinking in LLMs, reliance on repeated training patterns still influences their decision-making. Recognizing these tendencies is crucial for responsible AI deployment in clinical contexts. Our observations focus specifically on currently available commercial LLMs; ongoing advances in reasoning and retrieval augmentation technologies will likely address the identified limitations.