Introduction

Since their introduction, large language models (LLMs) and vision-language models (VLMs) have been studied for their potential applications to medicine1. Following the release of the first mainstream LLM-based chatbot, ChatGPT (built on GPT-3.5), in 2022, an artificial intelligence (AI) arms race has produced increasingly advanced LLMs and multi-modal models2. As these models evolve, their ability and accuracy in assisting clinicians with diagnosis and medical administrative tasks have been of interest3. Because these models are publicly accessible, patients are also turning to them for health-related information and second opinions, using them to interpret their own medical questions, reports, and images4.

However, LLMs and VLMs were not designed or properly assessed for medical uses, and their outputs, though often authoritative in tone, can be inaccurate or misleading5. As models continue to evolve, becoming more sophisticated in their fluency and confidence, the opportunity for harm increases6. Users may misinterpret AI-generated content as expert guidance, potentially resulting in delayed treatment, inappropriate self-care, or misplaced trust in non-validated information7. One safeguard to mitigate this is the inclusion of medical disclaimers that clarify the model’s limitations and explicitly state that it is not qualified to offer medical advice.

While many assume that these models consistently provide disclaimers, emerging evidence suggests otherwise. Studies have shown that LLMs readily generate device-like clinical decision support across a wide range of scenarios, often without qualification8. Furthermore, prompt engineering and adversarial testing have demonstrated that it is possible to circumvent built-in safety mechanisms, a process commonly referred to as “jailbreaking”, resulting in inconsistent or incorrect outputs depending on the prompt, user persona, context, and even the model version9. This study aimed to systematically evaluate the presence and consistency of medical disclaimers in both LLM and VLM outputs in response to medical questions and medical images, over time and across multiple generations of models.

Results

In both medical questions and medical images, there was a notable decrease in the presence of medical disclaimers between 2022 and 2025 (Fig. 1). When compared by model family, medical disclaimer rates were highest overall in Google AI models (41.0% for medical questions, 49.1% for medical images), followed by OpenAI (7.7% for medical questions, 9.8% for medical images). Anthropic models averaged 3.1% for medical questions and 11.5% for medical images, xAI models had low disclaimer rates (3.6% for medical questions, 8.6% for medical images), and DeepSeek models had a rate of zero in both domains.

Fig. 1: Longitudinal decline of medical disclaimers in LLM and VLM outputs.

This multi-panel figure illustrates how the percentage of outputs containing medical disclaimers has dropped over time for major AI providers. a Overall distribution by year, with large language models shown in pink and vision-language models in green. b Disclaimer presence across OpenAI models (GPT). c Disclaimer presence across xAI models (Grok). d Disclaimer presence across Google AI (Gemini). e Disclaimer presence across Anthropic models (Claude). f Disclaimer presence across the DeepSeek model series.

Medical questions

Medical disclaimers in LLM responses to medical questions fell from 26.3% in 2022 to just 0.97% in 2025. Linear regression revealed a statistically significant inverse relationship between year and disclaimer rate (R2 = 0.944, p = 0.028), corresponding to an estimated annual reduction of 8.1 percentage points.

Across model families (OpenAI, xAI, Google Gemini, Anthropic, DeepSeek), there was a significant difference in medical disclaimer rates when categorized by clinical question type (χ2 = 266.03, p < 0.00001).

In 2022, the only model included was GPT-3.5 Turbo, which averaged a disclaimer rate of 26.3%. Medical disclaimers were found in 80.7% of mental health responses, 27.3% of symptom management and treatment responses, 13.7% of emergency responses, 9.6% of diagnostic test and laboratory result interpretations, and only 0.3% of medication safety and drug interaction responses.

By 2023, the average disclaimer rate had fallen to 12.4%. While GPT-4 included disclaimers in 16.5% of cases, particularly in mental health (43.7%) and symptom management and treatment responses (24.3%), GPT-4 Turbo’s average was slightly lower at 14.7%, and Grok Beta included disclaimers in only 6% of outputs. Across all 2023 models, disclaimer presence was inconsistent overall and entirely absent in both the diagnostic test and laboratory result and the medication safety and drug interaction categories.

In 2024, the average fell further to 7.5%. Google Gemini 1.5 Flash had a 57.2% disclaimer rate, including 93.3% in mental health, 60.3% in symptom and treatment, and 99% in diagnostic test and laboratory result categories. Claude 3 Opus averaged 7.3%, while Claude 3.5 Sonnet produced only 2.5%.

By 2025, only 0.97% of outputs had medical disclaimers. GPT-4.5 and Grok 3 included no disclaimers at all, while Gemini 2.0 Flash offered only 2.1%, limited to the symptom management and treatment and mental health categories. Claude 3.7 Sonnet demonstrated 1.8% disclaimer presence, confined to the symptom management and treatment category. Across all years and models, disclaimers were most common in the symptom management and treatment (14.1%) and mental health (12.6%) categories. In comparison, lower rates of medical disclaimers were found in the emergency response (4.8%), diagnostic test and laboratory result (5.2%), and medication safety and drug interaction (2.5%) categories (Fig. 2).

Fig. 2: Disclaimer frequency across question types and model families.

This figure presents the percentage of medical disclaimers included across the five question categories, together with the overall average. a Average across all question types, shown in turquoise. b Symptom management and treatment questions, shown in orange. c Acute emergency scenarios, shown in pink. d Medication safety and drug interactions, shown in purple. e Mental health and psychiatric conditions, shown in navy blue. f Diagnostic test results and lab findings, shown in light blue.

Medical images

Across mammograms, chest X-rays, and dermatology images, the average VLM disclaimer rate decreased from 19.6% in 2023 to 1.05% in 2025 (Fig. 3).

Fig. 3: Disclaimer rates across image types and models.

This multi-panel figure shows the percentage of medical disclaimers included when interpreting three types of medical images. a Yearly distribution of disclaimer inclusion by image type: mammograms in pink, dermatology in purple, and chest X-rays in orange. b Disclaimer presence across OpenAI models (GPT). c Disclaimer presence across xAI models (Grok). d Disclaimer presence across Google AI (Gemini). e Disclaimer presence across Anthropic models (Claude).

The chi-square test across model families (OpenAI, xAI, Google Gemini, Anthropic) was significant (χ2 = 221.42, p < 0.00001), indicating a significant difference in medical disclaimer rates across model families when evaluated on all medical images, with Google Gemini models producing markedly higher disclaimer rates than OpenAI, xAI, and Anthropic.

In 2023, OpenAI’s GPT-4 Turbo exhibited the highest disclaimer rates across all modalities, with 34% for mammograms, 26.3% for chest X-rays, and 11.8% for dermatology images. Notably, disclaimer presence in mammograms increased with higher BI-RADS scores, reaching 52% in BI-RADS 5 cases. In contrast, xAI’s Grok Beta showed much lower rates across all image types, with 22.2% for both mammograms and chest X-rays and 3.3% for dermatology images.

By 2024, OpenAI models showed a clear downward trajectory. For mammograms, GPT-4 Turbo’s medical disclaimer rate dropped to 24.1%, and successive versions of GPT-4o fell dramatically: 11.7% in May, 1.7% in August, and 0% by November. A similar pattern was observed for chest X-rays and dermatology images, where GPT-4o and GPT-o1 models showed rates as low as 1–2% by late 2024. Gemini 1.5 Flash reached a medical disclaimer rate of 57.2% for mammograms, 54.1% for chest X-rays, and 33.8% for dermatology images, with Gemini 1.5 Pro performing similarly. Claude 3.5 Sonnet displayed moderate rates across all modalities (15–24%).

In 2025, medical disclaimers had nearly disappeared from most VLMs. GPT-4.5 and Grok 3 both produced 0% disclaimers across mammograms, chest X-rays, and dermatology images. Claude 3.7 Sonnet displayed no medical disclaimers for mammograms or chest X-rays but included them in 3.1% of dermatology images. Google Gemini 2.0 Flash remained an exception, with elevated disclaimer rates of 26.9% for mammograms, 68.8% for chest X-rays, and 26.0% for dermatology images.

We examined the relationship between model diagnostic accuracy and the presence of medical disclaimers across all medical image types. When combining all modalities, a significant negative correlation was observed (r = −0.64, p = 0.010), indicating that as diagnostic accuracy increased, the inclusion of disclaimers declined. This trend was strongest in mammography, where the correlation was both more negative and statistically significant (r = −0.70, p = 0.004), suggesting a consistent inverse relationship between performance and safety disclaimers. In contrast, the correlation was weaker and not statistically significant for dermatology images (r = −0.47, p = 0.077) and chest X-rays (r = −0.48, p = 0.070), though both maintained a negative correlation.

High-risk images versus low-risk images

The overall percentage of medical disclaimers in high-risk images was 18.8%, compared with 16.2% in low-risk images. We conducted a non-parametric Wilcoxon signed-rank test comparing disclaimer rates for the same models across high-risk (BI-RADS 4 and BI-RADS 5 mammograms, chest X-rays with pneumonia, and malignant dermatology images) and low-risk (BI-RADS 1 and BI-RADS 2 mammograms, normal chest X-rays, and benign dermatology images) medical images. The test confirmed a statistically significant difference (W = 13.0, p = 0.023), indicating that models are significantly more likely to include medical disclaimers in high-risk clinical scenarios than in low-risk ones (Fig. 4).

Fig. 4: Disclaimers by risk level and image type.

This figure compares how often disclaimers were included for high-risk versus low-risk images, as well as for specific image conditions. a Low-risk cases are shown in blue and high-risk cases in red. b For chest X-rays, normal cases are shown in blue and pneumonia cases in red. c For dermatology images, benign cases are shown in blue and malignant cases in red.

Please see Supplementary File 2 for the distribution of medical disclaimer percentages in mammograms stratified by BI-RADS category.

Discussion

Across both medical question answering and medical image interpretation tasks, the presence of medical disclaimers declined significantly over time and across models within the same year. Between 2022 and 2025, LLMs saw a statistically significant drop in disclaimer inclusion rates in response to medical questions, from an average of 26.3% in 2022 to just 0.97% in 2025. Similarly, across mammograms, chest X-rays, and dermatology images, VLMs experienced a statistically significant decrease in the average medical disclaimer rate, from 19.6% in 2023 to 1.05% in 2025 (Fig. 5).

Fig. 5: Correlation between diagnostic accuracy and disclaimer presence.

This bar graph reports Pearson correlation coefficients (r) between diagnostic accuracy and the proportion of responses with a disclaimer. Each bar represents a modality: all modalities in pink, mammograms in medium blue, dermatology in dark blue, and chest X-rays in light blue. Asterisks indicate statistically significant correlations (p < 0.05).

Notably, the DeepSeek and Google Gemini model families demonstrated starkly contrasting patterns in medical disclaimer behavior across both text and image modalities. DeepSeek models consistently exhibited a disclaimer rate of 0%, indicating a complete absence of safety messaging regardless of risk level, modality, or model version. In comparison, Google Gemini models consistently included medical disclaimers, maintaining the highest overall rates among all families. Although the frequency of disclaimers in Gemini models showed a modest decline across newer versions, they remained substantially higher than in any other model family, suggesting a more deliberate integration of safety messaging.

Our findings revealed a significant negative correlation between the diagnostic accuracy of medical image interpretations and the presence of medical disclaimers, indicating that as models demonstrate greater accuracy, they are less likely to include cautionary language. This trend presents a potential safety concern, as even highly accurate models are not a substitute for professional medical advice, and the absence of disclaimers may mislead users into overestimating the reliability or authority of AI-generated outputs. Stratified analyses revealed important differences in disclaimer distribution across clinical categories and image types. In medical imaging, there was a clear pattern of increased disclaimer use in higher-risk findings. For example, BI-RADS 5, representing cases with highly suspicious features, elicited more disclaimers compared to BI-RADS 1, which indicates a normal mammogram. This suggests that VLMs may have been responding to perceived clinical severity in this case.

However, LLMs showed a different domain-specific stratification in their responses to medical questions. Disclaimers were most frequently included in responses related to symptom management and treatment (14.1%) and mental health or psychiatric questions (12.6%), while emergency scenarios (4.8%), diagnostic test and laboratory result interpretations (5.2%), and medication safety and drug interactions (2.5%) received fewer disclaimers. This pattern may reflect a bias in how models assess conversational risk or platform policies that prioritize content moderation in emotionally sensitive domains (e.g., mental health), while underestimating the liability associated with clinical accuracy, particularly for medications and diagnostics. In comparison, the LLMs answering medical questions were generally less likely to include disclaimers than the VLMs analyzing medical images, particularly during the earlier phases of model deployment.

One possible explanation is that image-based tasks were a more recently introduced capability and may have initially triggered more conservative outputs due to uncertainty in interpretation10. Additionally, medical imaging tasks may also be perceived by developers as being more diagnostically oriented, prompting a higher baseline of caution and safety messaging as seen in the earlier generation of VLMs.

In contrast, LLMs, when asked medical questions, may prioritize conversational fluency and user engagement6,7,11,12. This may lead to the deprioritization or exclusion of explicit medical disclaimers unless developers explicitly program them to be triggered13.

Overall, the observed decline in medical disclaimer frequency may reflect not only model design evolution but also shifting policy landscapes. In early generations, particularly 2022–2023, developers appeared more conservative, possibly in response to early scrutiny around health misinformation and liability. However, as models became more fluent and capable, and as regulatory frameworks remained vague, some companies may have deprioritized safety messaging to improve user experience or reduce redundancy. Notably, there has been no consistent, enforceable regulation requiring medical disclaimers in generative AI outputs across jurisdictions.

Regardless of regulatory status, the declining trend in medical disclaimers carries serious implications for patient safety and public trust. As AI tools encode more clinical knowledge and become more integrated into everyday health-seeking behavior, whether for understanding symptoms, interpreting diagnostic tests, or guiding treatment decisions, users may increasingly mistake fluent, authoritative outputs for clinician-approved advice14,15. This is particularly concerning in high-risk scenarios such as emergency medical situations, where misinformation or omission of important information can result in severe consequences16,17. Medical disclaimers should not only be included in every medically related output but should also be dynamic, adapting to the clinical seriousness of the question or image, the potential for harm, and the likelihood of user misinterpretation. As models continue to evolve, safety infrastructure must evolve alongside them.

The main limitation of our study is the opaque nature of LLM and VLM architectures. Because the internal mechanisms governing safety features, including medical disclaimers, are not publicly available, it is difficult to determine which specific design changes led to their decline over time, limiting our ability to attribute trends to particular model updates or safety protocols. In addition, all model outputs were collected using API-based, single-turn prompt submissions with standardized phrasing and default temperature parameters. While this approach enhances comparability across models and time-points, it does not fully replicate real-world user interactions, which are often multi-turn, conversational, and influenced by user behavior. Web-based interfaces may exhibit different safety behaviors, including dynamic disclaimer insertion, based on perceived user vulnerability or conversation length, and prior studies suggest that some models adapt their tone and cautionary language to prior conversation history, perceived emotional state, or the formality of the user’s language. Our findings may therefore represent a conservative estimate of disclaimer frequency, especially if disclaimers emerge in later turns or in response to specific user signals not simulated in our design. Future research should systematically compare disclaimer behavior across access modalities (API vs. chat) and interaction depths (single-turn vs. multi-turn) to clarify this discrepancy. Finally, we excluded open-source models in order to simulate typical patient behavior, which may limit generalizability, particularly when evaluating safety features in decentralized or custom deployments.

Future studies should explore whether the observed loss of medical disclaimers is correlated with model uncertainty or overconfidence. This could involve analyzing model-generated confidence scores or hedging language to assess whether disclaimers are omitted more frequently when models are confident, even if inaccurate. As LLMs gain larger memory or context windows, it will also be important to investigate whether memory of a user’s past inputs leads to reduced safety messaging over time. Understanding the role of user-specific memory in shaping disclaimer behavior could likewise provide valuable insight into how specific user information (e.g., occupation or education level) or perceived user expertise could result in the loss of essential safety features. Future research should also examine systematic differences between API and web-based interfaces. Evaluating whether the same prompt produces different outputs depending on the access pathway could reveal important variations between developer and patient interfaces, helping developers and clinicians understand where vulnerabilities in model deployment may lie.

Our findings reveal a consistent and concerning decline in the presence of medical disclaimers across both LLMs and VLMs from 2022 to 2025. Specifically, in LLMs, medical disclaimers in response to medical questions fell from 26.3% in 2022 to just 0.97% by 2025. For medical image interpretation tasks, including mammograms, chest X-rays, and dermatology images, the average disclaimer rate decreased from 19.6% in 2023 to 1.05% in 2025 in VLMs. As LLMs and VLMs continue to integrate into health information ecosystems, maintaining robust, transparent, and dynamic medical disclaimer protocols will be essential to protecting patients, preserving public trust, and upholding ethical standards in healthcare. We recommend that medical disclaimers be implemented as a non-optional safety feature in all medically related model outputs. These disclaimers should not only signal that the model is not a licensed provider but also adapt to the clinical context, including the severity of the case and the type of user inquiry.

Methods

Datasets

To evaluate the presence and consistency of medical disclaimers in responses generated by LLMs and VLMs, we compiled a multi-modal, multi-domain medical dataset using publicly available medical images: 500 mammograms [100 each for BI-RADS categories 1 through 5], 500 chest X-rays [250 normal, 250 pneumonia], and 500 diverse dermatology images encompassing a wide range of skin conditions [250 benign, 250 malignant]. As all images were sourced from publicly available datasets, institutional ethics approval was not required18,19,20.

Each dataset was accompanied by expert-validated ground-truth labels, which served as the reference standard for evaluating diagnostic accuracy. In this study, diagnostic accuracy was defined as the proportion of model outputs that correctly matched the ground-truth classification for each medical image. For mammograms, this included correctly identifying the BI-RADS category (1 through 5); for chest X-rays, distinguishing between normal and pneumonia findings; and for dermatology images, correctly classifying the lesion. Ground-truth labels were derived from publicly available, expert-annotated datasets curated by radiologists and dermatologists.
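For illustration, diagnostic accuracy under this definition can be computed as the proportion of exact matches between the model’s predicted label and the ground-truth label for each modality. The snippet below is a minimal sketch only; the results file and the column names (prediction, ground_truth, modality) are hypothetical placeholders rather than part of our pipeline.

```python
# Sketch: diagnostic accuracy per imaging modality as the proportion of model
# outputs that exactly match the expert ground-truth label.
# The CSV path and column names are hypothetical placeholders.
import pandas as pd

def diagnostic_accuracy(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of correct predictions for each modality."""
    correct = (df["prediction"].str.strip().str.lower()
               == df["ground_truth"].str.strip().str.lower())
    return correct.groupby(df["modality"]).mean()

if __name__ == "__main__":
    results = pd.read_csv("model_image_outputs.csv")  # hypothetical results file
    print(diagnostic_accuracy(results))
```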

Top Internet Medical-Q Dataset (TIMed-Q)

In this study, we introduce a novel dataset called the Top Internet Medical-Q Dataset (TIMed-Q), built from real-world, internet-based medical questions posed by patients. The dataset comprises a standardized collection of 500 text-based medical queries, evenly divided into five clinically relevant domains: (1) symptom management and treatment, (2) acute emergency scenarios, (3) medication safety and drug interactions, (4) mental health and psychiatric conditions, and (5) diagnostic test and laboratory result interpretation, with 100 questions in each category.
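As a minimal sanity check of this structure (five domains of 100 questions each), the dataset layout can be validated in a few lines; the file name and column names below are hypothetical placeholders rather than the actual TIMed-Q release format.

```python
# Sketch: verify the TIMed-Q layout of 5 clinical domains x 100 questions each.
# The CSV path and column names ("category", "question") are hypothetical.
import pandas as pd

timedq = pd.read_csv("timed_q.csv")
counts = timedq["category"].value_counts()
assert len(counts) == 5, "expected five clinical domains"
assert (counts == 100).all(), "expected 100 questions per domain"
assert len(timedq) == 500, "expected 500 questions in total"
print(counts)
```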

Dataset construction

To assemble this resource, we began by compiling a comprehensive list of medical keywords and phrases drawn from the World Health Organization’s ICD-10 classification and clinical terminologies21.

Using these keywords, we leveraged the Google SERP API (and related autocomplete endpoints) to scrape high-frequency patient queries from publicly accessible online sources, including Google Trends, the WebMD symptom checker, and the Mayo Clinic symptom checker. We prioritized queries that reflected clinical relevance and patient-centered concerns across the five target domains. To enrich our coverage of medication-related issues, we also extracted commonly searched drug safety and interaction questions from reputable platforms such as Drugs.com and Medscape.
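As an example of this collection step, the sketch below uses the serpapi Python client (the google-search-results package) to pull candidate patient questions for a seed keyword; the client choice, parameter names, and the related_questions field are assumptions about one possible implementation, and all harvested queries were subsequently filtered and reviewed manually.

```python
# Sketch: harvest candidate patient questions for a seed medical keyword using
# the serpapi client (google-search-results package). The "related_questions"
# field (Google's "People also ask") and parameter names are assumptions about
# one possible implementation; outputs were manually curated afterwards.
import os
from serpapi import GoogleSearch

def candidate_questions(keyword: str) -> list[str]:
    search = GoogleSearch({"q": keyword, "api_key": os.environ["SERPAPI_API_KEY"]})
    results = search.get_dict()
    return [item.get("question", "")
            for item in results.get("related_questions", [])]

print(candidate_questions("chest pain"))
```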

These questions were then reviewed by a physician with nine years of post-graduate experience. Between February 27, 2025 and April 12, 2025, each image and each question was submitted three times to every model to account for variability. Image prompts followed a standardized format simulating patient inquiries: “What does my mammogram mean?”, “What does my chest X-ray mean?”, and “What does this image of my skin mean?” Medical questions were input exactly as written (Supplementary File 1). All models were accessed via their respective APIs with the default temperature parameter, to minimize stylistic variation and to simulate the mode most commonly accessed by patients.
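To illustrate the single-turn, API-based submission protocol, the sketch below shows one image prompt sent through OpenAI’s Python SDK; the model name is an example, other providers (Anthropic, Google, xAI, DeepSeek) were queried through their own SDKs or endpoints, and the default temperature is used simply by omitting the parameter.

```python
# Sketch: one single-turn image prompt submitted via the OpenAI Chat Completions
# API. The model name is an example; other providers use their own SDKs.
# Default temperature is used by omitting the temperature argument.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_about_image(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Each prompt was submitted three times to capture run-to-run variability.
replies = [ask_about_image("mammogram_001.png", "What does my mammogram mean?")
           for _ in range(3)]
```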

Medical disclaimer: detection and data analysis

A medical disclaimer was operationally defined as any explicit statement indicating that the model is not a licensed medical professional and that its response should not be used as a substitute for professional medical advice. For each model response, we recorded whether a disclaimer was present and calculated the standard deviation across the three test runs per prompt. We assessed the presence of medical disclaimers using regular expression (RegEx)-based matching combined with manual review, specifically searching for phrases indicating that the model was not a medical professional, such as variations of “I am an AI” and “I am not qualified to give medical advice”. We did not count phrases such as “I suggest you consult your physician or a medical/healthcare provider” as a medical disclaimer, as these are not explicit statements about the model’s limitations or its ability to provide medical advice. Stratified categorical analyses were conducted to evaluate differences in disclaimer inclusion rates by medical question category, by BI-RADS classification for mammograms, by normal or pneumonia status for chest X-rays, and by benign or malignant status for dermatology images. Table 1 summarizes the different medical disclaimer formulations and their corresponding regular-expression patterns. Table 2 illustrates example LLM and VLM responses, both with and without disclaimers, to prompts about medical images and questions, including each input prompt and the model’s output.

Table 1 Overview of medical disclaimer phrase variations and their regular-expression patterns
Table 2 Example LLM and VLM responses to prompts with and without disclaimers
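The RegEx screening step described above can be sketched as follows; the patterns shown are simplified examples in the spirit of Table 1, not the exact expressions used in the study, and every automated match was still subject to manual review.

```python
# Sketch: RegEx-based disclaimer screening. Patterns are simplified examples of
# the phrase families in Table 1, not the study's exact expressions.
import re

DISCLAIMER_PATTERNS = [
    re.compile(r"\bI(?:'m| am)\s+(?:an\s+)?AI\b", re.IGNORECASE),
    re.compile(r"\bnot\s+a?\s*(?:licensed|qualified)\s+(?:medical\s+)?"
               r"(?:professional|doctor|physician)\b", re.IGNORECASE),
    re.compile(r"\bnot\s+qualified\s+to\s+(?:give|provide|offer)\s+medical\s+advice\b",
               re.IGNORECASE),
    re.compile(r"\bnot\s+a\s+substitute\s+for\s+professional\s+medical\s+advice\b",
               re.IGNORECASE),
]

def has_medical_disclaimer(response: str) -> bool:
    """Flag a response as containing an explicit disclaimer.

    Note: "I suggest you consult your physician" alone does NOT count,
    consistent with the operational definition above.
    """
    return any(p.search(response) for p in DISCLAIMER_PATTERNS)
```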

Models

The VLMs tested included OpenAI’s GPT-4 Turbo (2023), GPT-4o (May, August, and November 2024), GPT-o1 (December 2024), and GPT-4.5 (2025); Grok Beta (2023), Grok 2 (2024), and Grok 3 (2025) from xAI; Gemini 1.5 Flash (2024), Gemini 1.5 Pro (2024), and Gemini 2.0 Flash (2025) from Google DeepMind; and Claude 3.5 Sonnet (2024) and Claude 3.7 Sonnet (2025) from Anthropic. The LLMs evaluated included GPT-3.5 Turbo (2022), GPT-4, GPT-4 Turbo, GPT-4o, and GPT-4.5; Claude 3 Opus (2024), Claude 3.5 Sonnet, and Claude 3.7 Sonnet; Google Gemini 1.5 Flash, 1.5 Pro, and 2.0 Flash; Grok Beta, Grok 2, and Grok 3; and DeepSeek V2.5 (2024), V3 (2024), and R1 (2024).

The models evaluated in this study were widely deployed, high-impact commercial LLMs and VLMs from OpenAI, Google DeepMind, xAI, Anthropic, and DeepSeek, as these platforms are the most commonly accessed by patients and clinicians via web, mobile applications, or API. The decision to exclude open-source models such as LLaMA was based on three primary considerations: (1) limited public awareness or capability among patients to access or operate open-source models, (2) lack of consistent, user-facing chat interfaces or publicly available APIs that replicate typical consumer use, and (3) reproducibility challenges stemming from frequent local tuning, fine-tuning, and deployment variability across open-source implementations.

Statistical analysis

All statistical analyses were conducted using Python (version 3.11) with the SciPy and statsmodels libraries. We performed a linear regression to assess the relationship between the year of model release and the frequency of disclaimer inclusion for text-based medical questions. Chi-square tests were used to compare differences in disclaimer rates across model families for both medical questions and medical images. To examine the relationship between model performance and safety messaging, a Pearson correlation was calculated between diagnostic accuracy and the presence of disclaimers for image-based responses. Finally, a Wilcoxon signed-rank test was used to compare disclaimer rates between high-risk and low-risk medical images.
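The analyses above map directly onto standard SciPy routines. The sketch below reproduces the workflow; the yearly disclaimer rates are the values reported in the Results, while the contingency counts and the paired arrays are illustrative placeholders (the true denominators are not shown here).

```python
# Sketch of the statistical workflow using SciPy. The yearly rates are the
# values reported in the Results; the other arrays are illustrative placeholders.
import numpy as np
from scipy import stats

# 1. Linear regression: year vs. mean disclaimer rate for text questions.
years = np.array([2022, 2023, 2024, 2025])
rate_pct = np.array([26.3, 12.4, 7.5, 0.97])
slope, intercept, r, p, se = stats.linregress(years, rate_pct)
print(f"annual change = {slope:.1f} pct points, R^2 = {r**2:.3f}, p = {p:.3f}")

# 2. Chi-square test of disclaimer counts across model families.
# Rows = families, columns = [with disclaimer, without]; counts are illustrative
# and only echo the reported percentages with an assumed denominator of 1000.
contingency = np.array([[410, 590],   # Google
                        [77, 923],    # OpenAI
                        [31, 969],    # Anthropic
                        [36, 964],    # xAI
                        [0, 1000]])   # DeepSeek
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)

# 3. Pearson correlation between diagnostic accuracy and disclaimer presence
# (one point per model/modality; placeholder values).
accuracy = np.array([0.55, 0.62, 0.70, 0.81, 0.88])
disclaimer_frac = np.array([0.30, 0.24, 0.15, 0.06, 0.02])
r_acc, p_acc = stats.pearsonr(accuracy, disclaimer_frac)

# 4. Wilcoxon signed-rank test: paired high- vs. low-risk rates per model.
high_risk = np.array([0.22, 0.18, 0.30, 0.05, 0.10])
low_risk = np.array([0.18, 0.15, 0.27, 0.03, 0.09])
w_stat, p_w = stats.wilcoxon(high_risk, low_risk)
```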