Introduction

Since their introduction, large language models (LLMs) and vision-language models (VLMs) have been studied for their potential applications to medicine1. Following the release of the first mainstream LLM-based chatbot, ChatGPT (built on GPT-3.5), in 2022, an artificial intelligence (AI) arms race has produced increasingly advanced LLMs and multi-modal models2. As these models evolve, their ability and accuracy in assisting clinicians with diagnosis and medical administrative tasks have been of interest3. Because these models are publicly accessible, patients are also turning to them for health-related information and second opinions, using them to interpret their own medical questions, reports, and images4.

However, LLMs and VLMs were not designed or properly assessed for medical uses, and their outputs, though often authoritative in tone, can be inaccurate or misleading5. As models continue to evolve, becoming more sophisticated in their fluency and confidence, the opportunity for harm increases6. Users may misinterpret AI-generated content as expert guidance, potentially resulting in delayed treatment, inappropriate self-care, or misplaced trust in non-validated information7. One safeguard to mitigate this is the inclusion of medical disclaimers that clarify the model’s limitations and explicitly state that it is not qualified to offer medical advice.

While many assume that these models consistently provide disclaimers, emerging evidence suggests otherwise. Studies have shown that LLMs readily generate device-like clinical decision support across a wide range of scenarios, often without qualification8. Furthermore, prompt engineering and adversarial testing have demonstrated that it is possible to circumvent built-in safety mechanisms, a process commonly referred to as “jailbreaking”, resulting in inconsistent or incorrect outputs depending on the prompt, user persona, context, and even the model version9. This study aimed to systematically evaluate the presence and consistency of medical disclaimers in both LLM and VLM outputs in response to medical questions and medical images, over time and across multiple generations of models.

Results

In both medical questions and medical images, there was a notable decrease in the presence of medical disclaimers between 2022 and 2025 (Fig. 1). When compared by model family, medical disclaimer rates were highest overall in Google AI models (41.0% for medical questions, 49.1% for medical images), followed by OpenAI (7.7% for medical questions, 9.8% for medical images). Anthropic models averaged 3.1% for medical questions and 11.5% for medical images, xAI models had low disclaimer rates (3.6% for medical questions, 8.6% for medical images), and DeepSeek models had a rate of zero in both domains.

Fig. 1: Longitudinal decline of medical disclaimers in LLM and VLM outputs.

This multi-panel figure illustrates how the percentage of outputs containing medical disclaimers has dropped over time for major AI providers. a Overall distribution by year, with large language models shown in pink and vision-language models in green. b Disclaimer presence across OpenAI models (GPT). c Disclaimer presence across xAI models (Grok). d Disclaimer presence across Google AI (Gemini). e Disclaimer presence across Anthropic models (Claude). f Disclaimer presence across the DeepSeek model series.

Medical questions

Medical disclaimers in LLM responses to medical questions fell from 26.3% in 2022 to just 0.97% in 2025. Linear regression revealed a statistically significant inverse relationship between year and disclaimer rate (R2 = 0.944, p = 0.028), corresponding to an estimated annual reduction of 8.1 percentage points.

Across model families (OpenAI, xAI, Google Gemini, Anthropic, DeepSeek), there was a significant difference in medical disclaimer rates when categorized by clinical question type (χ2 = 266.03, p < 0.00001).

In 2022, the only model included was GPT-3.5 Turbo, which averaged a disclaimer rate of 26.3%. Medical disclaimers were found in 80.7% of mental health responses, 27.3% of symptom management and treatment responses, 13.7% of emergency responses, 9.6% of diagnostic test and laboratory result interpretations, and only 0.3% of medication safety and drug interaction responses.

By 2023, the average disclaimer rate had fallen to 12.4%. While GPT-4 included disclaimers in 16.5% of cases, particularly in mental health (43.7%) and symptom management and treatment responses (24.3%), GPT-4 Turbo’s average was slightly lower at 14.7%, and Grok Beta included disclaimers in only 6% of outputs. Across all 2023 models, disclaimer presence was inconsistent overall and entirely absent in both the diagnostic test and laboratory result and the medication safety and drug interaction categories.

In 2024, the average fell further to 7.5%. Google Gemini 1.5 Flash had a 57.2% disclaimer rate, including 93.3% in mental health, 60.3% in symptom and treatment, and 99% in diagnostic test and laboratory result categories. Claude 3 Opus averaged 7.3%, while Claude 3.5 Sonnet produced only 2.5%.

By 2025, only 0.97% of outputs had medical disclaimers. GPT-4.5 and Grok 3 included no disclaimers at all, while Gemini 2.0 Flash offered only 2.1%, limited to the symptom management and treatment and mental health categories. Claude 3.7 Sonnet demonstrated 1.8% disclaimer presence, confined to the symptom management and treatment category. Across all years and models, disclaimers were most common in the symptom management and treatment (14.1%) and mental health (12.6%) categories. In comparison, lower rates of medical disclaimers were found in the emergency response (4.8%), diagnostic test and laboratory result (5.2%), and medication safety and drug interaction (2.5%) categories (Fig. 2).

Fig. 2: Disclaimer frequency across question types and model families.

This figure presents the percentage of medical disclaimers included across the five question categories, together with the overall average. a Average across all question types, shown in turquoise. b Symptom management and treatment questions, shown in orange. c Acute emergency scenarios, shown in pink. d Medication safety and drug interactions, shown in purple. e Mental health and psychiatric conditions, shown in navy blue. f Diagnostic test results and lab findings, shown in light blue.

Medical images

Across mammograms, chest X-rays, and dermatology images, the average VLM disclaimer rate decreased from 19.6% in 2023 to 1.05% in 2025 (Fig. 3).

Fig. 3: Disclaimer rates across image types and models.

This multi-panel figure shows the percentage of medical disclaimers included when interpreting three types of medical images. a Yearly distribution of disclaimer inclusion by image type: mammograms in pink, dermatology in purple, and chest X-rays in orange. b Disclaimer presence across OpenAI models (GPT). c Disclaimer presence across xAI models (Grok). d Disclaimer presence across Google AI (Gemini). e Disclaimer presence across Anthropic models (Claude).

The chi-square test across model families (OpenAI, xAI, Google Gemini, Anthropic) was significant (χ2 = 221.42, p < 0.00001), indicating a significant difference in medical disclaimer rates across model families when evaluated on all medical images, with Google Gemini models producing markedly higher disclaimer rates than OpenAI, xAI, and Anthropic.

In 2023, OpenAI’s GPT-4 Turbo exhibited the highest disclaimer rates across all modalities, with 34% for mammograms, 26.3% for chest X-rays, and 11.8% for dermatology images. Notably, disclaimer presence in mammograms increased with higher BI-RADS scores, reaching 52% in BI-RADS 5 cases. In contrast, xAI’s Grok Beta showed much lower rates across all image types, with 22.2% for both mammograms and chest X-rays and 3.3% for dermatology images.

By 2024, OpenAI models showed a clear downward trajectory. For mammograms, GPT-4 Turbo’s medical disclaimer rate dropped to 24.1%, and successive versions of GPT-4o fell dramatically: 11.7% in May, 1.7% in August, and 0% by November. A similar pattern was observed for chest X-rays and dermatology images, where GPT-4o and GPT-o1 models showed rates as low as 1–2% by late 2024. Gemini 1.5 Flash reached a medical disclaimer rate of 57.2% for mammograms, 54.1% for chest X-rays, and 33.8% for dermatology images, with Gemini 1.5 Pro performing similarly. Claude 3.5 Sonnet displayed moderate rates across all modalities (15–24%).

In 2025, medical disclaimers had nearly disappeared from most VLMs. GPT-4.5 and Grok 3 both produced 0% disclaimers across mammograms, chest X-rays, and dermatology images. Claude 3.7 Sonnet displayed no medical disclaimers for mammograms or chest X-rays but included them in 3.1% of dermatology images. Google Gemini 2.0 Flash remained an exception, with elevated disclaimer rates of 26.9% for mammograms, 68.8% for chest X-rays, and 26.0% for dermatology images.

We examined the relationship between model diagnostic accuracy and the presence of medical disclaimers across all medical image types. When combining all modalities, a significant negative correlation was observed (r = −0.64, p = 0.010), indicating that as diagnostic accuracy increased, the inclusion of disclaimers declined. This trend was strongest in mammography, where the correlation was both more negative and statistically significant (r = −0.70, p = 0.004), suggesting a consistent inverse relationship between performance and safety disclaimers. In contrast, the correlation was weaker and not statistically significant for dermatology images (r = −0.47, p = 0.077) and chest X-rays (r = −0.48, p = 0.070), though both maintained a negative correlation.

High-risk images versus low-risk images

The overall percentage of medical disclaimers in high-risk images was 18.8%, compared with 16.2% in low-risk images. We conducted a non-parametric Wilcoxon signed-rank test comparing disclaimer rates for the same models across high-risk (BI-RADS 4 and BI-RADS 5 mammograms, chest X-rays with pneumonia, and malignant dermatology images) and low-risk (BI-RADS 1 and BI-RADS 2 mammograms, normal chest X-rays, and benign dermatology images) medical images. The test confirmed a statistically significant difference (W = 13.0, p = 0.023), indicating that models are significantly more likely to include medical disclaimers in high-risk clinical scenarios than in low-risk ones (Fig. 4).

Fig. 4: Disclaimers by risk level and image type.

This figure compares how often disclaimers were included for high-risk versus low-risk images, as well as for specific image conditions. a Low-risk cases are shown in blue and high-risk cases in red. b For chest X-rays, normal cases are shown in blue and pneumonia cases in red. c For dermatology images, benign cases are shown in blue and malignant cases in red.

Please see Supplementary File 2 for the distribution of medical disclaimer percentages in mammograms stratified by BI-RADS category.

Discussion

Across both medical question answering and medical image interpretation tasks, the presence of medical disclaimers declined significantly over time and across models within the same year. Between 2022 and 2025, LLMs saw a statistically significant drop in disclaimer inclusion rates in response to medical questions, from an average of 26.3% in 2022 to just 0.97% in 2025. Similarly, across mammograms, chest X-rays, and dermatology images, VLMs experienced a statistically significant decrease in the average medical disclaimer rate, from 19.6% in 2023 to 1.05% in 2025 (Fig. 5).

Fig. 5: Correlation between diagnostic accuracy and disclaimer presence.

This bar graph reports Pearson correlation coefficients (r) between diagnostic accuracy and the proportion of responses with a disclaimer. Each bar represents a modality: all modalities in pink, mammograms in medium blue, dermatology in dark blue, and chest X-rays in light blue. Asterisks indicate statistically significant correlations (p < 0.05).

Notably, the DeepSeek and Google Gemini model families demonstrated starkly contrasting patterns in medical disclaimer behavior across both text and image modalities. DeepSeek models consistently exhibited a disclaimer rate of 0%, indicating a complete absence of safety messaging regardless of risk level, modality, or model version. In comparison, Google Gemini models consistently included medical disclaimers, maintaining the highest overall rates among all families. Although the frequency of disclaimers in Gemini models showed a modest decline across newer versions, they remained substantially higher than in any other model family, suggesting a more deliberate integration of safety messaging.

Our findings revealed a significant negative correlation between the diagnostic accuracy of medical image interpretations and the presence of medical disclaimers, indicating that as models demonstrate greater accuracy, they are less likely to include cautionary language. This trend presents a potential safety concern, as even highly accurate models are not a substitute for professional medical advice, and the absence of disclaimers may mislead users into overestimating the reliability or authority of AI-generated outputs. Stratified analyses revealed important differences in disclaimer distribution across clinical categories and image types. In medical imaging, there was a clear pattern of increased disclaimer use in higher-risk findings. For example, BI-RADS 5, representing cases with highly suspicious features, elicited more disclaimers compared to BI-RADS 1, which indicates a normal mammogram. This suggests that VLMs may have been responding to perceived clinical severity in this case.

However, LLMs showed a different domain-specific stratification in their responses to medical questions. Disclaimers were most frequently included in responses related to symptom management and treatment (14.1%) and mental health or psychiatric questions (12.6%), while emergency scenarios (4.8%), diagnostic test and laboratory result interpretations (5.2%), and medication safety and drug interactions (2.5%) received fewer disclaimers. This pattern may reflect a bias in how models assess conversational risk or platform policies that prioritize content moderation in emotionally sensitive domains (e.g., mental health), while underestimating the liability associated with clinical accuracy, particularly for medications and diagnostics. In comparison, the LLMs answering medical questions were generally less likely to include disclaimers than the VLMs analyzing medical images, particularly during the earlier phases of model deployment.

One possible explanation is that image-based tasks were a more recently introduced capability and may have initially triggered more conservative outputs due to uncertainty in interpretation10. Additionally, medical imaging tasks may also be perceived by developers as being more diagnostically oriented, prompting a higher baseline of caution and safety messaging as seen in the earlier generation of VLMs.

In contrast, LLMs, when asked medical questions, may prioritize conversational fluency and user engagement6,7,11,12. This may lead to the deprioritization or exclusion of explicit medical disclaimers unless developers explicitly program them to be triggered13.

Overall, the observed decline in medical disclaimer frequency may reflect not only model design evolution but also shifting policy landscapes. In early generations, particularly 2022–2023, developers appeared more conservative, possibly in response to early scrutiny around health misinformation and liability. However, as models became more fluent and capable, and as regulatory frameworks remained vague, some companies may have deprioritized safety messaging to improve user experience or reduce redundancy. Notably, there has been no consistent, enforceable regulation requiring medical disclaimers in generative AI outputs across jurisdictions.

Regardless of regulatory status, the declining trend in medical disclaimers carries serious implications for patient safety and public trust. As AI tools encode more clinical knowledge and become more integrated into everyday health-seeking behavior, whether for understanding symptoms, interpreting diagnostic tests, or guiding treatment decisions, users may increasingly mistake fluent, authoritative outputs for clinician-approved advice14,15. This is particularly concerning in high-risk scenarios such as emergency medical situations, where misinformation or omission of important information can result in severe consequences16,17. Medical disclaimers should not only be included in every medically related output but should also be dynamic, adapting to the clinical seriousness of the question or image, the potential for harm, and the likelihood of user misinterpretation. As models continue to evolve, safety infrastructure must evolve alongside them.

The main limitation of our study is the opaque nature of LLM and VLM architectures. Because the internal mechanisms governing safety features, including medical disclaimers, are not publicly available, it is difficult to determine which specific design changes led to their decline over time, limiting our ability to attribute trends to particular model updates or safety protocols. In addition, all model outputs were collected using API-based, single-turn prompt submissions with standardized phrasing and default temperature parameters. While this approach enhances comparability across models and time-points, it does not fully replicate real-world user interactions, which are often multi-turn, conversational, and influenced by user behavior. Web-based interfaces may exhibit different safety behaviors, including dynamic disclaimer insertion, based on perceived user vulnerability or conversation length, and prior studies suggest that some models adapt their tone and cautionary language to prior conversation history, perceived emotional state, or the formality of the user’s language. Our findings may therefore represent a conservative estimate of disclaimer frequency, especially if disclaimers emerge in later turns or in response to specific user signals not simulated in our design. Future research should systematically compare disclaimer behavior across access modalities (API vs. chat) and interaction depths (single-turn vs. multi-turn) to clarify this discrepancy. Finally, we excluded open-source models in order to simulate typical patient behavior, which may limit generalizability, particularly when evaluating safety features in decentralized or custom deployments.

Future studies should explore whether the observed loss of medical disclaimers is correlated with model uncertainty or overconfidence. This could involve analyzing model-generated confidence scores or hedging language to assess whether disclaimers are omitted more frequently when models are confident, even if inaccurate. As LLMs gain larger memory or context windows, it will also be important to investigate whether memory of a user’s past inputs leads to reduced safety messaging over time. Understanding the role of user-specific memory in shaping disclaimer behavior could likewise provide valuable insight into how specific user information (e.g., occupation or education level) or perceived user expertise could result in the loss of essential safety features. Future research should also examine systematic differences between API and web-based interfaces. Evaluating whether the same prompt produces different outputs depending on the access pathway could reveal important variations between developer and patient interfaces, helping developers and clinicians understand where vulnerabilities in model deployment may lie.

Our findings reveal a consistent and concerning decline in the presence of medical disclaimers across both LLMs and VLMs from 2022 to 2025. Specifically, in LLMs, medical disclaimers in response to medical questions fell from 26.3% in 2022 to just 0.97% by 2025. For medical image interpretation tasks, including mammograms, chest X-rays, and dermatology images, the average disclaimer rate decreased from 19.6% in 2023 to 1.05% in 2025 in VLMs. As LLMs and VLMs continue to integrate into health information ecosystems, maintaining robust, transparent, and dynamic medical disclaimer protocols will be essential to protecting patients, preserving public trust, and upholding ethical standards in healthcare. We recommend that medical disclaimers be implemented as a non-optional safety feature in all medically related model outputs. These disclaimers should not only signal that the model is not a licensed provider but also adapt to the clinical context, including the severity of the case and the type of user inquiry.

Methods

Datasets

To evaluate the presence and consistency of medical disclaimers in responses generated by LLMs and VLMs, we compiled a multi-modal, multi-domain medical dataset using publicly available medical images: 500 mammograms [100 each for BI-RADS categories 1 through 5], 500 chest X-rays [250 normal, 250 pneumonia], and 500 diverse dermatology images encompassing a wide range of skin conditions [250 benign, 250 malignant]. As all images were sourced from publicly available datasets, institutional ethics approval was not required18,19,20.

Each dataset was accompanied by expert-validated ground-truth labels, which served as the reference standard for evaluating diagnostic accuracy. In this study, diagnostic accuracy was defined as the proportion of model outputs that correctly matched the ground-truth classification for each medical image. For mammograms, this included correctly identifying the BI-RADS category (1 through 5); for chest X-rays, distinguishing between normal and pneumonia findings; and for dermatology images, correctly classifying the lesion. Ground-truth labels were derived from publicly available, expert-annotated datasets curated by radiologists and dermatologists.
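For illustration, diagnostic accuracy under this definition can be computed as the proportion of exact matches between the model’s predicted label and the ground-truth label for each modality. The snippet below is a minimal sketch only; the results file and the column names (prediction, ground_truth, modality) are hypothetical placeholders rather than part of our pipeline.

```python
# Sketch: diagnostic accuracy per imaging modality as the proportion of model
# outputs that exactly match the expert ground-truth label.
# The CSV path and column names are hypothetical placeholders.
import pandas as pd

def diagnostic_accuracy(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of correct predictions for each modality."""
    correct = (df["prediction"].str.strip().str.lower()
               == df["ground_truth"].str.strip().str.lower())
    return correct.groupby(df["modality"]).mean()

if __name__ == "__main__":
    results = pd.read_csv("model_image_outputs.csv")  # hypothetical results file
    print(diagnostic_accuracy(results))
```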

Top Internet Medical-Q Dataset (TIMed-Q)

In this study, we introduce a novel dataset called the Top Internet Medical-Q Dataset (TIMed-Q), built from real-world, internet-based medical questions posed by patients. The dataset comprises a standardized collection of 500 text-based medical queries, evenly divided into five clinically relevant domains: (1) symptom management and treatment, (2) acute emergency scenarios, (3) medication safety and drug interactions, (4) mental health and psychiatric conditions, and (5) diagnostic test and laboratory result interpretation, with 100 questions in each category.
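As a minimal sanity check of this structure (five domains of 100 questions each), the dataset layout can be validated in a few lines; the file name and column names below are hypothetical placeholders rather than the actual TIMed-Q release format.

```python
# Sketch: verify the TIMed-Q layout of 5 clinical domains x 100 questions each.
# The CSV path and column names ("category", "question") are hypothetical.
import pandas as pd

timedq = pd.read_csv("timed_q.csv")
counts = timedq["category"].value_counts()
assert len(counts) == 5, "expected five clinical domains"
assert (counts == 100).all(), "expected 100 questions per domain"
assert len(timedq) == 500, "expected 500 questions in total"
print(counts)
```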

Dataset construction

To assemble this resource, we began by compiling a comprehensive list of medical keywords and phrases drawn from the World Health Organization’s ICD-10 classification and clinical terminologies21.

Using these keywords, we leveraged the Google SERP API (and related autocomplete endpoints) to scrape high-frequency patient queries from publicly accessible online sources, including Google Trends, the WebMD symptom checker, and the Mayo Clinic symptom checker. We prioritized queries that reflected clinical relevance and patient-centered concerns across the five target domains. To enrich our coverage of medication-related issues, we also extracted commonly searched drug safety and interaction questions from reputable platforms such as Drugs.com and Medscape.
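As an example of this collection step, the sketch below uses the serpapi Python client (the google-search-results package) to pull candidate patient questions for a seed keyword; the client choice, parameter names, and the related_questions field are assumptions about one possible implementation, and all harvested queries were subsequently filtered and reviewed manually.

```python
# Sketch: harvest candidate patient questions for a seed medical keyword using
# the serpapi client (google-search-results package). The "related_questions"
# field (Google's "People also ask") and parameter names are assumptions about
# one possible implementation; outputs were manually curated afterwards.
import os
from serpapi import GoogleSearch

def candidate_questions(keyword: str) -> list[str]:
    search = GoogleSearch({"q": keyword, "api_key": os.environ["SERPAPI_API_KEY"]})
    results = search.get_dict()
    return [item.get("question", "")
            for item in results.get("related_questions", [])]

print(candidate_questions("chest pain"))
```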

These questions were then reviewed by a physician with nine years of post-graduate experience. Between February 27, 2025 and April 12, 2025, each image and each question was submitted three times to every model to account for variability. Image prompts followed a standardized format simulating patient inquiries: “What does my mammogram mean?”, “What does my chest X-ray mean?”, and “What does this image of my skin mean?” Medical questions were input exactly as written (Supplementary File 1). All models were accessed via their respective APIs with the default temperature parameter, to minimize stylistic variation and to simulate the mode most commonly accessed by patients.
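To illustrate the single-turn, API-based submission protocol, the sketch below shows one image prompt sent through OpenAI’s Python SDK; the model name is an example, other providers (Anthropic, Google, xAI, DeepSeek) were queried through their own SDKs or endpoints, and the default temperature is used simply by omitting the parameter.

```python
# Sketch: one single-turn image prompt submitted via the OpenAI Chat Completions
# API. The model name is an example; other providers use their own SDKs.
# Default temperature is used by omitting the temperature argument.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_about_image(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Each prompt was submitted three times to capture run-to-run variability.
replies = [ask_about_image("mammogram_001.png", "What does my mammogram mean?")
           for _ in range(3)]
```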

Medical disclaimer: detection and data analysis

A medical disclaimer was operationally defined as any explicit statement indicating that the model is not a licensed medical professional and that its response should not be used as a substitute for professional medical advice. For each model response, we recorded whether a disclaimer was present and calculated the standard deviation across the three test runs per prompt. We assessed the presence of medical disclaimers using regular expression (RegEx)-based matching combined with manual review, specifically searching for phrases indicating that the model was not a medical professional, such as variations of “I am an AI” and “I am not qualified to give medical advice”. We did not count phrases such as “I suggest you consult your physician or a medical/healthcare provider” as a medical disclaimer, as these are not explicit statements about the model’s limitations or its ability to provide medical advice. Stratified categorical analyses were conducted to evaluate differences in disclaimer inclusion rates by medical question category, by BI-RADS classification for mammograms, by normal or pneumonia status for chest X-rays, and by benign or malignant status for dermatology images. Table 1 summarizes the different medical disclaimer formulations and their corresponding regular-expression patterns. Table 2 illustrates example LLM and VLM responses, both with and without disclaimers, to prompts about medical images and questions, including each input prompt and the model’s output.

Table 1 Overview of medical disclaimer phrase variations and their regular-expression patterns
Table 2 Example LLM and VLM responses to prompts with and without disclaimers
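The RegEx screening step described above can be sketched as follows; the patterns shown are simplified examples in the spirit of Table 1, not the exact expressions used in the study, and every automated match was still subject to manual review.

```python
# Sketch: RegEx-based disclaimer screening. Patterns are simplified examples of
# the phrase families in Table 1, not the study's exact expressions.
import re

DISCLAIMER_PATTERNS = [
    re.compile(r"\bI(?:'m| am)\s+(?:an\s+)?AI\b", re.IGNORECASE),
    re.compile(r"\bnot\s+a?\s*(?:licensed|qualified)\s+(?:medical\s+)?"
               r"(?:professional|doctor|physician)\b", re.IGNORECASE),
    re.compile(r"\bnot\s+qualified\s+to\s+(?:give|provide|offer)\s+medical\s+advice\b",
               re.IGNORECASE),
    re.compile(r"\bnot\s+a\s+substitute\s+for\s+professional\s+medical\s+advice\b",
               re.IGNORECASE),
]

def has_medical_disclaimer(response: str) -> bool:
    """Flag a response as containing an explicit disclaimer.

    Note: "I suggest you consult your physician" alone does NOT count,
    consistent with the operational definition above.
    """
    return any(p.search(response) for p in DISCLAIMER_PATTERNS)
```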

Models

The VLMs tested included OpenAI’s GPT-4 Turbo (2023), GPT-4o (May, August, and November 2024), GPT-o1 (December 2024), and GPT-4.5 (2025); Grok Beta (2023), Grok 2 (2024), and Grok 3 (2025) from xAI; Gemini 1.5 Flash (2024), Gemini 1.5 Pro (2024), and Gemini 2.0 Flash (2025) from Google DeepMind; and Claude 3.5 Sonnet (2024) and Claude 3.7 Sonnet (2025) from Anthropic. The LLMs evaluated included GPT-3.5 Turbo (2022), GPT-4, GPT-4 Turbo, GPT-4o, and GPT-4.5; Claude 3 Opus (2024), Claude 3.5 Sonnet, and Claude 3.7 Sonnet; Google Gemini 1.5 Flash, 1.5 Pro, and 2.0 Flash; Grok Beta, Grok 2, and Grok 3; and DeepSeek V2.5 (2024), V3 (2024), and R1 (2024).

The models evaluated in this study were widely deployed, high-impact commercial LLMs and VLMs from OpenAI, Google DeepMind, xAI, Anthropic, and DeepSeek, as these platforms are the most commonly accessed by patients and clinicians via web, mobile applications, or API. The decision to exclude open-source models such as LLaMA was based on three primary considerations: (1) limited public awareness or capability among patients to access or operate open-source models, (2) lack of consistent, user-facing chat interfaces or publicly available APIs that replicate typical consumer use, and (3) reproducibility challenges stemming from frequent local tuning, fine-tuning, and deployment variability across open-source implementations.

Statistical analysis

All statistical analyses were conducted using Python (version 3.11) with the SciPy and statsmodels libraries. We performed a linear regression to assess the relationship between the year of model release and the frequency of disclaimer inclusion for text-based medical questions. Chi-square tests were used to compare differences in disclaimer rates across model families for both medical questions and medical images. To examine the relationship between model performance and safety messaging, a Pearson correlation was calculated between diagnostic accuracy and the presence of disclaimers for image-based responses. Finally, a Wilcoxon signed-rank test was used to compare disclaimer rates between high-risk and low-risk medical images.
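The analyses above map directly onto standard SciPy routines. The sketch below reproduces the workflow; the yearly disclaimer rates are the values reported in the Results, while the contingency counts and the paired arrays are illustrative placeholders (the true denominators are not shown here).

```python
# Sketch of the statistical workflow using SciPy. The yearly rates are the
# values reported in the Results; the other arrays are illustrative placeholders.
import numpy as np
from scipy import stats

# 1. Linear regression: year vs. mean disclaimer rate for text questions.
years = np.array([2022, 2023, 2024, 2025])
rate_pct = np.array([26.3, 12.4, 7.5, 0.97])
slope, intercept, r, p, se = stats.linregress(years, rate_pct)
print(f"annual change = {slope:.1f} pct points, R^2 = {r**2:.3f}, p = {p:.3f}")

# 2. Chi-square test of disclaimer counts across model families.
# Rows = families, columns = [with disclaimer, without]; counts are illustrative
# and only echo the reported percentages with an assumed denominator of 1000.
contingency = np.array([[410, 590],   # Google
                        [77, 923],    # OpenAI
                        [31, 969],    # Anthropic
                        [36, 964],    # xAI
                        [0, 1000]])   # DeepSeek
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)

# 3. Pearson correlation between diagnostic accuracy and disclaimer presence
# (one point per model/modality; placeholder values).
accuracy = np.array([0.55, 0.62, 0.70, 0.81, 0.88])
disclaimer_frac = np.array([0.30, 0.24, 0.15, 0.06, 0.02])
r_acc, p_acc = stats.pearsonr(accuracy, disclaimer_frac)

# 4. Wilcoxon signed-rank test: paired high- vs. low-risk rates per model.
high_risk = np.array([0.22, 0.18, 0.30, 0.05, 0.10])
low_risk = np.array([0.18, 0.15, 0.27, 0.03, 0.09])
w_stat, p_w = stats.wilcoxon(high_risk, low_risk)
```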