“If it walks like a duck…” This saying illustrates System 1 thinking1. In the dual-process theory framework1, System 1 operates rapidly, associatively, and heuristically, often with emotional undertones in humans, while System 2 functions more slowly, deliberately, and analytically2,3,4. In complex medical contexts, System 1 thinking can lead to overly simplistic conclusions. Just as humans may inappropriately rely on System 1 thinking, large language models (LLMs) may also default to this sometimes-flawed intuitive thinking5. This is true even for reasoning-optimized LLMs such as ChatGPT-o3, which remain influenced by familiar patterns and may miss critical nuances.

In recent tests with LLMs, we noted a recurring pattern: the models frequently fail to recognize twists or subtleties and instead revert to responses rooted in familiar associations, even when those associations are contextually inappropriate. Table 1 shows examples of lateral thinking puzzles and medical ethics dilemmas where LLMs struggled, often giving the “expected” answer rather than adapting to the specifics of each case. Supplementary Table S1 summarizes the level of mistake each model made on each question, and Supplementary Table S2 shows the outcomes of running each question 10 times across seven LLMs.
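A minimal sketch of such a repeated-query protocol is shown below. The model names, question text, and use of the OpenAI Python client are illustrative assumptions rather than the exact setup behind Supplementary Tables S1 and S2; responses would subsequently be graded by hand for missed twists.

```python
# Sketch of a repeated-query evaluation protocol (illustrative only).
# Model names and questions are placeholders, not the study materials.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

MODELS = ["gpt-4o", "o3-mini"]  # placeholders for the LLMs under test
QUESTIONS = {
    "surgeon_twist": (
        "A boy is injured in an accident; only he was in the car. "
        "His father is a surgeon and his mother is a social worker. "
        "The surgeon says, 'I cannot operate on him; he is my son.' "
        "Who is the surgeon?"
    ),
}
RUNS_PER_QUESTION = 10


def collect_responses():
    """Query every model with every question RUNS_PER_QUESTION times."""
    records = []
    for model in MODELS:
        for qid, prompt in QUESTIONS.items():
            for run in range(RUNS_PER_QUESTION):
                reply = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                records.append({
                    "model": model,
                    "question": qid,
                    "run": run,
                    "answer": reply.choices[0].message.content,
                })
    return records  # answers are then graded manually for missed twists
```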

Table 1 Examples of lateral thinking puzzles and medical ethics scenarios where large language models (LLMs) failed to recognize critical twists

Efforts have been made to cultivate System 2 reasoning in LLMs through “Chain of Thought” processes. However, models may still follow high-probability sequences identified during training6. This tendency is especially problematic in familiar ethical dilemmas and well-known puzzles, where they tend to produce clichéd responses even when the context demands more nuanced reasoning7.
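For illustration, “Chain of Thought” prompting amounts to asking the model to reason explicitly before answering. The hypothetical prompt variants below show the kind of direct versus step-by-step formulations that can be compared; the wording is illustrative and not taken from our test items.

```python
# Hypothetical prompt variants contrasting a direct question with a
# chain-of-thought formulation that asks for explicit reasoning first.
DIRECT_PROMPT = (
    "A father who is a surgeon brings his injured son to the hospital. "
    "The surgeon says, 'He is my son.' Who is the surgeon? "
    "Answer in one sentence."
)

COT_PROMPT = (
    DIRECT_PROMPT
    + " Before answering, list every stated fact, check which familiar "
      "assumptions those facts rule out, and only then give your answer."
)
```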

Human System 1 thinking is efficient and often reliable; crucially, it is also adaptive, shaped by emotional and contextual understanding. Humans can recognize when a situation requires more analytical System 2 thought and shift their approach accordingly. Along these lines, OpenAI recently introduced ChatGPT-o3, which is designed to spend more time thinking before answering8 and aims to reason through complex tasks and solve harder problems. While this is a step toward System 2 thinking, such models still need further refinement to handle nuanced scenarios.

Consider the classic puzzle known as “The Surgeon’s Dilemma.” In its original form, this puzzle reveals implicit gender assumptions. A father and son are in a car accident, and a surgeon says, “I cannot operate on him; he is my son.” The usual answer is that the surgeon is the boy’s mother. The question aims to challenge the biased assumption that a surgeon must be male.

We introduced a twist: the boy’s father is a surgeon, his mother is a social worker, and only the boy was in the accident. Despite the changed details, LLMs defaulted to the original solution, missing the information that invalidated the typical answer. The models have likely seen many variations of the “Surgeon’s Dilemma,” leading them to associate a surgeon and a child with the conclusion that the surgeon must be the mother. Even when we explicitly stated that the father is a surgeon and the mother a social worker, they continued to favor the familiar solution. This failure to process new information highlights a limitation in their reasoning: they overlook important details that should lead to a different conclusion.

Similarly, we presented LLMs with well-known medical ethics scenarios, where they were often misled. In one case, a patient with HIV had already disclosed their status to their spouse; despite this twist, some LLMs failed to recognize the detail and responded as if the spouse was unaware, returning to the familiar debate about disclosure. Another scenario involved a minor needing a life-saving blood transfusion, but with the usual details changed so that the parents agreed to the transfusion. Yet some LLMs still responded as if the parents were refusing, failing to recognize that the ethical dilemma was resolved. This seems to indicate the phenomenon of AI model over-training manifesting in rigid responses.

Notably, even ChatGPT-o1, ChatGPT-o3, and Gemini-2.5-flash-preview 04-17 thinking did not consistently overcome these limitations. While they demonstrated some success with general lateral thinking riddles (mistake rate of 58–92%), their performance was notably weaker on medical ethics questions (mistake rate of 76–96%). These partial successes show that while models trained for System 2 thinking improve somewhat, significant challenges remain in handling nuanced scenarios.

Importantly, the dual-process theory used here1,2,3,4 functions merely as a metaphor rather than representing neurobiologically distinct entities. Human cognition does not neatly separate into two isolated systems; the two frequently operate in tandem rather than in isolation. Additionally, System 2 thinking itself is susceptible to rigid responses, which can paradoxically reinforce existing cognitive biases. Consequently, even deliberate analytical reasoning may fail to adequately address nuanced or unexpected complexities in medical ethics scenarios.

LLMs are increasingly being considered for roles in medical practice9,10 and are already entrusted with soft-skill, ethically charged tasks in both clinical and educational contexts. For example, chatbot-generated responses to patient inquiries have been rated as more empathetic and of higher quality than those provided by physicians, and medical schools are beginning to incorporate ChatGPT-based ethics tutorials into their curricula11,12,13,14,15,16,17 (Table 2). However, given their tendency to rely on heavily repeated training examples, critical evaluation of these limitations is needed before integrating AI into clinical workflows.

Table 2 Examples from the literature illustrating LLM-mediated soft-skill judgment

In conclusion, while progress has been made toward System 2 thinking in LLMs, reliance on repeated training patterns still influences their decision-making. Recognizing these tendencies is crucial for responsible AI deployment in clinical contexts. Our observations focus specifically on currently available commercial LLMs; ongoing advances in reasoning and retrieval augmentation technologies will likely address the identified limitations.