Large language models (LLMs), such as ChatGPT-o1, display subtle blind spots in complex reasoning tasks. We illustrate these pitfalls with lateral thinking puzzles and medical ethics scenarios. Our observations indicate that patterns in training data may contribute to cognitive biases, limiting the models’ ability to navigate nuanced ethical situations. Recognizing these tendencies is crucial for responsible AI deployment in clinical contexts.
“If it walks like a duck…” This saying illustrates System 1 thinking1. In the dual-process theory framework1, System 1 operates rapidly, associatively, and heuristically, often with emotional undertones in humans, while System 2 functions more slowly, deliberately, and analytically2,3,4. In complex medical contexts, System 1 thinking can lead to overly simplistic conclusions. Just as humans may inappropriately rely on System 1 thinking, large language models (LLMs) may also default to this sometimes-flawed intuitive thinking5. This holds even for LLMs optimized for reasoning, such as ChatGPT-o3, which remain influenced by familiar patterns and may miss critical nuances.
In recent tests with LLMs, we noted a recurring pattern: these models frequently fail to recognize twists or subtleties and instead revert to responses rooted in familiar associations, even when those associations are contextually inappropriate. Table 1 shows examples of lateral thinking puzzles and medical ethics dilemmas where LLMs struggled, often giving the “expected” answer rather than adapting to the specifics of each case. Supplementary Table S1 summarizes the mistakes each model made on each question, and Supplementary Table S2 shows the outcomes of running each question 10 times across seven LLMs.
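To make the protocol concrete, the sketch below shows one way such repeated querying could be scripted. It is a minimal illustration only, assuming access to the OpenAI Python client; the model names, run count, and puzzle text are placeholders rather than the exact systems and prompts evaluated here, and answers were graded manually in the study.

```python
# Minimal sketch of the repeated-prompting protocol, assuming the OpenAI
# Python client (openai>=1.0). Model names, N_RUNS, and the puzzle text are
# placeholders, not the exact systems or prompts evaluated in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-4o", "o1"]  # stand-ins for the seven LLMs tested
N_RUNS = 10                # each question was posed 10 times per model

puzzle = (
    "A boy is injured in an accident and rushed to hospital. "
    "His father is a surgeon and his mother is a social worker. "
    "The surgeon says: 'I cannot operate on him; he is my son.' "
    "Who is the surgeon?"
)

results: dict[str, list[str]] = {}
for model in MODELS:
    answers = []
    for _ in range(N_RUNS):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": puzzle}],
        )
        answers.append(response.choices[0].message.content)
    results[model] = answers  # raw answers, graded manually in the study
```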
Efforts have been made to cultivate System 2-style reasoning in LLMs, most visibly through “chain-of-thought” approaches. However, LLMs may still follow high-probability sequences identified during training6. This tendency is especially problematic in familiar ethical dilemmas and well-known puzzles, where models tend to produce clichéd responses even when the context demands more nuanced reasoning7.
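For illustration, chain-of-thought behavior is typically elicited with an explicit instruction to reason step by step before answering; the wording below is a generic prompting pattern from the literature, not the instructions used in this study.

```python
# Illustrative contrast between a direct prompt and a chain-of-thought prompt;
# the wording is a generic prompting pattern, not the study's instructions.
direct_prompt = (
    "The boy's father is a surgeon, his mother is a social worker, and only "
    "the boy was in the accident. The surgeon says: 'I cannot operate on him; "
    "he is my son.' Who is the surgeon?"
)

cot_prompt = (
    direct_prompt
    + "\n\nThink step by step: list every stated fact, note where it departs "
    "from the classic version of this puzzle, and only then give your answer."
)
```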
Human System 1 thinking is efficient and often reliable; it is also adaptive, shaped by emotional and contextual understanding. Humans can recognize when a situation requires more analytical System 2 thinking and shift their approach accordingly. In a similar spirit, OpenAI recently introduced ChatGPT-o3, designed to spend more time thinking before answering8. It aims to reason through complex tasks and solve harder problems. While this is a step toward System 2 thinking, such models still need further refinement to handle nuanced scenarios.
Consider the classic puzzle known as “The Surgeon’s Dilemma.” In its original form, this puzzle reveals implicit gender assumptions. A father and son are in a car accident, and a surgeon says, “I cannot operate on him; he is my son.” The usual answer is that the surgeon is the boy’s mother. The question aims to challenge the biased assumption that a surgeon must be male.
We introduced a twist: the boy’s father is a surgeon, his mother is a social worker, and only the boy was in the accident. Despite the changed details, LLMs defaulted to the original solution, missing the information that invalidated the typical answer. The models have likely seen variations of the “Surgeon’s Dilemma” many times, leading them to associate a surgeon and a child with the conclusion that the surgeon must be the mother. Even when we explicitly stated that the father is a surgeon and the mother a social worker, they continued to favor the familiar solution. This failure to process new information highlights a limitation in their reasoning: they overlook important details that should lead to a different conclusion.
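A rough way to flag this default-answer failure mode is to check whether, after the twist, a model’s answer still identifies the surgeon as the boy’s mother. The keyword heuristic below is only an illustrative simplification of the manual grading described above; the example answers are invented.

```python
# Hypothetical keyword check for the default-answer failure mode: after the
# twist, an answer that still names the mother as the surgeon indicates the
# model reproduced the classic solution instead of the stated facts.
def defaulted_to_classic_answer(answer: str) -> bool:
    text = answer.lower()
    return "mother" in text and "father" not in text


twisted_answers = [
    "The surgeon is the boy's mother.",                        # classic default
    "The surgeon is the boy's father, exactly as the puzzle states.",
]
print([defaulted_to_classic_answer(a) for a in twisted_answers])
# -> [True, False]
```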
Similarly, we presented LLMs with well-known medical ethics scenarios, where they were often misled. In one case, a patient with HIV had already disclosed their status to their spouse. Despite this twist, some LLMs failed to register the detail and responded as if the spouse were unaware, returning to the familiar debate about disclosure. Another scenario involved a minor needing a life-saving blood transfusion, with the usual details changed so that the parents agreed to the transfusion. Yet some LLMs still responded as if the parents were refusing, failing to recognize that the ethical dilemma was resolved. This pattern is consistent with over-training on canonical versions of these dilemmas manifesting as rigid responses.
Notably, even ChatGPT-o1, ChatGPT-o3, and Gemini-2.5-flash-preview 04-17 thinking did not consistently overcome these limitations. While they demonstrated some success with general lateral thinking riddles (mistake rate of 58–92%), their performance was notably weaker on medical ethics questions (mistake rate of 76–96%). These partial successes show that while models designed for System 2-style reasoning improve somewhat, significant challenges remain in handling nuanced scenarios.
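For readers who wish to reproduce this kind of tally, the sketch below shows how a per-model mistake rate over repeated runs could be computed; the run records are invented placeholders, not the study’s data (cf. Supplementary Tables S1 and S2).

```python
# Hedged sketch of tallying per-model mistake rates over repeated runs;
# the run records below are invented for illustration only.
from collections import defaultdict

# (model, question, missed_twist) for each individual run
runs = [
    ("model-A", "surgeon_dilemma", True),
    ("model-A", "surgeon_dilemma", False),
    ("model-A", "hiv_disclosure", True),
    ("model-B", "surgeon_dilemma", True),
]

counts = defaultdict(lambda: [0, 0])  # model -> [mistakes, total runs]
for model, _question, missed_twist in runs:
    counts[model][0] += int(missed_twist)
    counts[model][1] += 1

for model, (mistakes, total) in counts.items():
    print(f"{model}: mistake rate {100 * mistakes / total:.0f}%")
```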
Importantly, the dual-process theory used here1,2,3,4 functions as a metaphor rather than describing neurobiologically distinct entities. Human cognition does not separate neatly into two isolated systems; the two modes frequently operate in tandem. Additionally, System 2 thinking itself is susceptible to rigid responses, which can paradoxically reinforce existing cognitive biases. Consequently, even deliberate analytical reasoning may fail to adequately address nuanced or unexpected complexities in medical ethics scenarios.
LLMs are increasingly being considered for roles in medical practice9,10. They are already entrusted with ethically charged tasks that call on soft skills in both clinical and educational contexts. For example, chatbot-generated responses to patient inquiries have been rated as more empathetic and of higher quality than those provided by physicians, and medical schools are beginning to incorporate ChatGPT-based ethics tutorials into their curricula11,12,13,14,15,16,17 (Table 2). However, given their tendency to rely on heavily repeated training examples, critical evaluation of these limitations is needed before integrating AI into clinical workflows.
In conclusion, while progress has been made toward System 2 thinking in LLMs, reliance on repeated training patterns still influences their decision-making. Recognizing these tendencies is crucial to ensure responsible AI deployment in clinical contexts. Our observations focus specifically on currently available commercial LLMs, and we anticipate that ongoing advances in reasoning and retrieval augmentation will address the identified limitations.
Data availability
No datasets were generated or analyzed during the current study.
References
Wason, P. C. & Evans, J. St. B. T. Dual processes in reasoning? Cognition 3, 141–154 (1974).
Kahneman, D. Thinking, Fast and Slow (Farrar, Straus and Giroux, 2011).
Evans, J. St. B. T. & Stanovich, K. E. Dual-process theories of higher cognition: advancing the debate. Perspect. Psychol. Sci. 8, 223–241 (2013).
Keren, G. A tale of two systems: a scientific advance or a theoretical stone soup? Commentary on Evans & Stanovich (2013). Perspect. Psychol. Sci. 8, 287–292 (2013).
Hagendorff, T., Fabi, S. & Kosinski, M. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nat. Comput. Sci. 3, 833–838 (2023).
Biderman, S. et al. Emergent and predictable memorization in large language models. Preprint at https://arxiv.org/abs/2304.11158 (2023).
McKenzie, I. R. et al. Inverse scaling: When bigger isn’t better. Preprint at https://arxiv.org/abs/2306.09479 (2023).
OpenAI. Introducing OpenAI o3. OpenAI. https://openai.com/index/introducing-o3-and-o4-mini/ (2025).
Glicksberg, B. S. et al. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J. Am. Med. Inform. Assoc. 31, 1921–1928 (2024).
Freyer, O. et al. A future role for health applications of large language models depends on regulators enforcing safety standards. Lancet Digit. Health 6, e662–e672 (2024).
Small, W. R. et al. Large language model–based responses to patients’ in-basket messages. JAMA Netw. Open 7, e2422399 (2024).
Ayers, J. W. et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 183, 589–596 (2023).
Rahimzadeh, V., Kostick-Quenet, K., Blumenthal-Barby, J. & McGuire, A. L. Ethics education for healthcare professionals in the era of ChatGPT and other large language models: do we still need it? Am. J. Bioeth. 23, 17–27 (2023).
Okamoto, S., Kataoka, M., Itano, M. & Sawai, T. AI-based medical ethics education: examining the potential of large language models as a tool for virtue cultivation. BMC Med. Educ. 25, 185 (2025).
Dillion, D., Mondal, D., Tandon, N. & Gray, K. AI language model rivals expert ethicist in perceived moral expertise. Sci. Rep. 15, 4084 (2025).
Jiang, L. et al. Investigating machine moral judgement through the Delphi experiment. Nat. Mach. Intell. 7, 145–160 (2025).
Earp, B. D. et al. A personalized patient preference predictor for substituted judgments in healthcare: technically feasible and ethically desirable. Am. J. Bioeth. 24, 13–26 (2024).
Acknowledgements
This study did not receive any funding.
Author information
Contributions
S.S. and E.K.: contributed to study conceptualization, data collection, and drafting of the manuscript. V.S.: contributed to study conceptualization and assisted in manuscript revisions. G.N.N.: contributed to study conceptualization and critical revisions.
Ethics declarations
Competing interests
The authors declare no competing interests.