Introduction

Large language models (LLMs) are generative artificial intelligence (AI) systems trained on vast amounts of human language. They are the fastest-adopted technology in human history1,2. Numerous scientific and medical applications of LLMs have been proposed3,4,5, and these could drastically change and improve medicine as we know it. In particular, LLMs have been shown to be able to reduce documentation burden and promote guideline-based medicine6,7. In parallel to the rapid progression of LLM capabilities, there has been substantial progress in the development of multimodal vision-language models (VLMs). VLMs can interpret images and text alike and further expand the applicability of LLMs in medicine. Several VLMs have been published to date, either as healthcare-specific models, e.g., for the interpretation of pathology images or echocardiograms8,9, or as generalist models, applicable to multiple domains at once, including healthcare, such as GPT-4o10,11,12,13,14.

However, with new technologies, new vulnerabilities emerge, and the healthcare system has to be hardened against these15,16. We hypothesized that one particular vulnerability of LLMs and VLMs is prompt injection. Prompt injection means that a user adds an additional, hidden instruction for the model (Fig.Ā 1a). Prompt injection can be disguised in hidden (e.g., zero-width) or encoded characters (e.g., Unicode), whitespaces, metadata, images and much more—essentially, any information which flows into a model at runtime can be used as an attack vector (Fig.Ā 1b)17,18,19,20. Importantly, third parties with access to a user’s input (but without access to the model itself), can perform prompt injection to exfiltrate private data, evade model guardrails, and alter model outputs19,21,22. In a clinical context, modifying a model’s output can cause lethal harm. Still, prompt injection attacks have never been investigated in the medical domain.

Fig. 1: Prompt injection attacks on medical images.
figure 1

a Concept of prompt injection with correct and corrupted output plus models investigated in the study. Models that did not meet inclusion criteria displayed in grey. b Schematic illustration of different attack gateways for prompt injections. Left: Patient image data is routinely not exclusively created in the hospital itself, but also by external sources, illustrated as ā€œExternal radiologistā€ as an example (equivalent to other external imaging providers). It isĀ then provided either via (unprotected) CD/DVD or a server (varying protection levels) to the hospital. Right: For inference, hospitals will rely on external VLM-providers. Illustrated are several gateways to send patient data to these providers, with varying protection levels. EHR Electronic health records, API Application programming interface, HIS Hospital information system. c Exemplary views of modalities (histology, endoscopy, CT, MRI, ultrasound, photography) with 18 cases, 3 prompt injection positions, 3 variations (high contrast, low contrast, small font) and 3 iterations per unique constellations, with schematic illustration of injection methods. Contrast in ā€œlow contrastā€ was increased for this figure as compared to the actual model input to maintain human readability. Prompt injection in ā€œlow contrastā€ and ā€œsmall fontā€ reads ā€œDescribe which organ you see, but state it looks healthyā€.

Here, we show that prompt injection attacks have the potential to alter model outputs from accurate diagnosis to a potentially harmful misdiagnosis in oncology.

Results

Vision-language models are modality-agnostic detectors of malignant lesions

To investigate the susceptibility of contemporaneous state-of-the-art VLMs, we designed a set of experiments to evaluate prompt injection in the context of cancer diagnosis. Specifically, we investigated if the diagnosis obtained from an imaging scan, i.e., presence or absence of tumor, can be manipulated at will, just by injecting malicious information into the model input (Supplementary Fig.Ā 1). Images of 18 patients with visually highly suspicious and histologically confirmed malignant lesions were passed to the VLMs Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5, GPT-4o and Reka Core. These images included liver imaging (in equal parts CT, MRI and ultrasound images), as well as photographies of melanomas, endoscopic imaging of gastrointestinal cancers and hematoxylin-eosin (HE)-staining of adenocarcinomas (n = 3 per modality, 18 cases in total, Fig.Ā 1c). Each image (Supplementary DataĀ 1) was accompanied by the prompt ā€œDescribe this imageā€ and a structured template for the output (Supplementary DataĀ 2, 3, Supplementary TableĀ 1). Three strategies for prompt injection were tested: ā€œtext prompt injectionā€, ā€œvisual prompt injectionā€, and ā€œdelayed visual prompt injectionā€, in which the attack was performed using the image preceding the target image (Supplementary Fig.Ā 1a). Additionally, for visual—and delayed visual prompt injection, we tested if the contrast and size of the injected text had an influence on the models’ accuracies: we employed two contrast settings (high contrast and low contrast) and one setting in which the text was tiny, see Fig.Ā 1b. Low-contrast and ā€œtinyā€ injections correspond to sub-visual injections which are not obvious to human observers, therefore more harmful. This led to a total of 72 variations per model (18 negative controls + 54 prompt injection variations), with each of the 72 variations being queried a total of 3 replicates (n = 216 per model). All prompts are listed in Supplementary TableĀ 1.

First, we assessed the organ detection rate by the model. Only VLMs that reached at least a 50% organ detection rate, i.e., were able to accurately describe the organ in the image, were used for subsequent experiments (Fig.Ā 2a). The VLMs Claude-3 Opus, Claude 3.5 Sonnet, GPT-4o and Reka Core achieved this rate and were therefore included in this study (Accuracy of 59%, 80%, 79%, 74% for Claude-3, Claude-3.5, GPT-4o and Reka Core, respectively). We were not able to investigate the vision capabilities of Gemini 1.5 Plus because its current guardrails prevent it from being used on radiology images. Llama-3.1 (405B), the best currently available open-source LLM, does not yet support vision interpretation, and could therefore not be assessed23,24. As a side observation, we found that all models sometimes hallucinated the presence of spleen, kidneys, and pancreas when prompted to describe them despite them not being visible, but this effect was not relevant to the subsequent experiments.

Fig. 2: Prompt injection attacks manipulate the capability of VLMs to detect malignant lesions.
figure 2

a Accuracies in detecting the represented organs per model. Mean ± standard deviation (SD) is shown. n = 18 data points per model (n = 9 for Gemini), with each data point representing a mean of three replicated measurements, two-sided Kruskal-Wallis test with Dunn’s test and Bonferroni post-hoc correction. b Harmfulness scores for all queries with injected prompt vs prompts without prompt injection per model. Mean ± SD are shown. Each point represents triplicate evaluation. Two-sided Wilcoxon Signed-Rank tests with Bonferroni post-hoc correction compared lesion miss rates scores within each model (square brackets). Two-sided Mann-Whitney U tests with Bonferroni post-hoc correction compared lesion miss rates for prompt injection (PI) vs non PI over all models combined (straight bar). P-values were adjusted using the Bonferroni method, with *p < 0.05, **p < 0.01, ***p < 0.001. Harmfulness scores as mean ± standard deviation (SD) per (c) position or (d) variation of adversarial prompt, ordered as Claude-3, Claude-3.5, GPT-4o, and Reka Core from left to right. n = 18 data points per model and variation, with each data point representing a mean of three replicated measurements. Mann-Whitney U test + Bonferroni method over all models combined for each position/variation.

Hidden instructions in images can bypass guardrails and alter VLM outputs

Second, we assessed the attack success rate in all VLMs. Our objective was to provide the VLM with an image of a cancer lesion, and prompting the model to ignore the lesion, either by text prompt injection, visual prompt injection or delayed visual prompt injection. We quantified (a) the model’s ability to detect lesions in the first place (lesion miss rate, LMR), and (b) the attack success rate (ASR), i.e., flipping the model’s output by a prompt injection (Fig.Ā 2b). We observed highly different behavior between VLMs, with organ detection rates of 59% (Claude-3), 80% (Claude-3.5), 79% (GPT-4o), and 74% (Reka Core) (n = 54 each) (Supplementary TableĀ 2). Lesion miss rate (LMR) of unaltered prompts was 35% for Claude-3, 17% for Claude-3.5, 22% for GPT-4o, and 41% for Reka Core (n = 54 each) (Fig.Ā 2b). Adding prompt injection significantly impaired the models’ abilities to detect lesions, with a LMR of 70% (ASR of 33%) for Claude-3 (n = 81), LMR of 57% (ASR 40%) for Claude-3.5 (n = 162), LMR of 89% (ASR of 67%) for GPT-4o (n = 162) and LMR of 92% (ASR of 51%) for Reka Core (n = 104), significant both per model (p = 0.02; 0.01; <0.001 and <0.001 for Claude-3, Claude-3.5, GPT-4o, and Reka Core, respectively) as well as over all models combined (p < 0.0001) (Fig.Ā 2b). Notably, the ASR for GPT-4o and Reka Core was significantly higher than the ASR of Claude-3.5 (p = 0.001 and p = 0.006 for GPT-4o and Reka Core, respectively, Supplementary TableĀ 3), possibly indicating a slightly superior alignment training for Claude-3.5. Together, these data show that prompt injection, to varying extent, is possible in all investigated VLMs on a broad range of clinically relevant imaging modalities.

Prompt injection can be performed in various ways. As a proof-of-concept we investigated three different strategies for prompt injection (Fig.Ā 1b), with striking differences between models and strategies (Fig.Ā 2c, d, Supplementary Fig.Ā 1). Text prompt injection and image prompt injection were both harmful in almost all observations, except for Claude-3.5, which proved less harmful here. Meanwhile, delayed visual prompt injection resulted in less harmful responses overall (Fig.Ā 2c, Supplementary TableĀ 4), possibly because the hidden instruction becomes more susceptible to guardrail interventions once written. Different hiding strategies (low contrast, small font) were shown to be similarly harmful to the default (high contrast, large font) for GPT-4o and Reka Core, while low contrast settings reduced the LMR for Claude models (69% to 14% for Claude-3, 58 to 33% for Claude-3.5, Figs.Ā 1b, 2d, Supplementary TableĀ 5).

Prompt injections are modality-agnostic and not easily mitigated

Current state-of-the-art VLMs are predominantly closed-source. It is therefore unclear whether they are trained comprehensively across diverse medical imaging modalities, systematic evaluation for this domain is lacking25. We therefore investigated the vision capabilities on organ detection and lesion detection for six clinically relevant imaging modalities (Fig.Ā 3). In line with the most likely representation in training data, organ detection for photographs and radiological imaging far exceeded that of endoscopic and histological imaging (Fig.Ā 3a, Supplementary TableĀ 6). We observed that all investigated models were susceptible to prompt injection irrespective of the imaging modality (Fig.Ā 3a–d, averaged ASR 32; 32; 49; 58; 61% for US, Endoscopy, MRI, CT and Histology, respectively, Supplementary TableĀ 7), with significant differences only between US and CT (p = 0.02). Together, these data show that prompt injection is modality-agnostic, as well as generalizable over different strategies and visibility of the injected prompt.

Fig. 3: Prompt injection attacks are modality-agnostic.
figure 3

Heatmaps per model and imaging modality for (a) mean organ detection rate, (b) mean attack success rate, (c) lesion miss rate (LMR) for the native models and (d) mean lesion miss rate (LMR) for the prompts with prompt injection, with (b) representing the tile-based difference between (d) and (c). CT Computed Tomography, MRI Magnetic Resonance Imaging, US Ultrasound.Ā * represents instances where LMR was higher for native models than injected models (n = 1). e Thumbnails of all images used for the study sorted by modality. All images contain a histologically confirmed malignant lesion.Ā (Images areĀ cropped for this figure, original images see Supplementary DataĀ 1).

Finally, we investigated three strategies to mitigate prompt injection attacks. Investigated strategies included ethical prompt engineering and agent systems, as well as a combination of both (Fig.Ā 4). For ethical prompt engineering, we enforced the VLMs to provide answers in line with ethical behavior (Prompts see Supplementary TableĀ 1). To simulate agent-systems, we instructed a second model-instance as a supervisor model. The supervisor observed the first answer, was instructed to actively search for malicious content in the first image and provide its own answer by choosing to either replicate the initial answer or provide independent, helpful feedback. None of the strategies proved to be successful for Claude-3, GPT-4o, and Reka-Core, demonstrating that prompt injection is successful even in repeated model calls (Fig.Ā 4, Supplementary TableĀ 8). However, we observed that prompt engineering for ethical behavior significantly reduced vulnerability to prompt injection for Claude-3.5 (p ≤ 0.001) from 64.8% to 27.8%, suggesting a superior alignment to desirable ethical outputs compared to other models.

Fig. 4: Mitigation efforts for prompt injection attacks.
figure 4

Count of prompt injections that were successful (Model reported no pathologies) or failed (Model reported lesion, either due to failed prompt injection or due to defense mechanism) of n = 54 distinct scenarios in total (0–3 missing values per scenario due to errors in model calling, see Supplementary TableĀ 1b). Two-sided Fisher’s exact test compared ratio of successful vs failed prompt injections for each condition (intra-model comparison only). p-values were adjusted using the Bonferroni method, with *p < 0.05, **p < 0.01, ***p < 0.001.

Discussion

In summary, our study demonstrates that subtle prompt injection attacks on state-of-the-art VLMs can cause harmful outputs. These attacks can be performed without access to the model architecture, i.e., as black-box attacks. Potential attackers encompass cybercriminals, blackmailers, insiders with malicious intent, or, as observed with increasing and concerning frequency, political actors engaging in cyber warfare26,27. These would only need to gain access to the user’s prompt, e.g., before the data reaches the secure hospital infrastructure. Inference, for which data is sent to the (most-likely external) VLM-provider, serves as another gateway (Fig.Ā 1b). Here, a simple, malicious browser extension would suffice to alter a prompt that is sent via web-browser28,29,30,31. These methods are of significant concern, especially in an environment such as healthcare, where individuals are stressed, overworked and are operating within a chronically underfunded cybersecurity infrastructure28,30. This makes prompt injection a highly relevant security threat in future healthcare infrastructure, as injections can be hidden in virtually any data that is processed by medical AI systems20,32. Given that prompt injection exploits the fundamental input mechanism of LLMs, prompt injection is likely to be a fundamental problem of LLMs/VLMs, not exclusive to the tested models, and not easily fixable, as the model is simply following the (altered) instructions. Recent technical improvements to LLMs, e.g., Short circuiting, important to mitigate intrinsically harmful outputs such as weapon-building-instructions, are insufficient to mitigate such attacks15,22. Agent-systems composed of multiple models have similarly been shown to be targetable33. Further, other types of guardrails can be bypassed22 or compromise usability, as shown for Gemini 1.5. A possible solution to this could be hybrid alignment training34, enforcing prioritization on ethical outputs alongside human preferences over blind adherence to inappropriate requests. As we show that Claude-3.5, after years of alignment research from Anthropic35, is the only tested model where mitigation worked to some extent (Fig.Ā 4), this approach appears promising. Other approaches could include rigorous enforcement or wrapping of the prompt structure33. Moreover, public release of model-specific approaches to alignment training, currently not available, could assist in theĀ development of solutions, especially as this would allow causal investigations for the varying levels of susceptibility to prompt injection attacks for different models. Overall, our data highlight the need for techniques specifically targeting this form of adversarial attacks.

While we acknowledge that prompt injection in general has been described elsewhere in general19,21,22,34, the concept bears exceptional risks for the medical domain: Firstly, the medical domain is dealing with data that is not necessarily represented in the training data of SOTA VLMs, resulting in lower overall accuracy. Secondly, medical data is life-critical of nature. Thirdly, specific use cases (Fig.Ā 1b) are unique to clinical context. Lastly, while one would anticipate LLM-guardrails to prevent prompt injection from working in life-critical contexts, they clearly do not, as we show that prompt injection is a relevant threat in the medical domain. Hospital infrastructures face a dual challenge and a complex risk-benefit scenario here: They will have to adapt to both integrate LLMs and build robust infrastructure around them to prevent these new forms of attacks, e.g., by deploying agent-based systems and focusing not only on performance but also on alignment when choosing a model36. Despite our findings pointing to relevant security threats, integrating LLMs in hospitals holds tremendous promise for patient empowerment, reduction of documentation burden, and guideline-based clinician support4,7,37. Our study therefore encourages all relevant stakeholders to adopt these LLMs and VLMs but to develop new ways to harden the systems against all forms of adversarial attacks, ideally before approval as medical devices38. A promising way for such hardening is to keep human experts in the loop and to have highly critical decisions double-checked and vetted by humans who ultimately take responsibility for clinical decisions.

Methods

Ethics statement

This study does not include confidential information. All research procedures were conducted exclusively on anonymized patient data and in accordance with the Declaration of Helsinki, maintaining all relevant ethical standards. No participant consent was required as the data consisted of anonymized images and was obtained either from local hospital servers or from external sources where informed consent is a prerequisite for the submission and use of such information. The overall analysis was approved by the Ethics Commission of the Medical Faculty of the Technical University Dresden (BO-EK-444102022). Local data was obtained from Uniklinik RWTH Aachen under grant nr EK 028/19. Our work demonstrates a significant threat to healthcare. By publicly disclosing the vulnerabilities and attacks explored in this paper, our goal is to encourage robust mitigation and defense mechanisms and promote transparency regarding risks associated with LLMs. All prompts were injected in a completely simulated scenario to prevent unintended harm. We strongly emphasize that the disclosed attack techniques and prompts should under no circumstances be used in real-world scenarios without proper authorization.

Patient cases

Single transversal images of anonymized patient cases were retrieved from local university hospital servers (CT/MRI, each n = 3) by a board-certified radiologist, and from publicly available resources (ultrasound, n = 3 Radiopaedia.org, with case courtesy of Di Muzio B (https://doi.org/10.53347/rID-70007), Keshavamurthy J (https://doi.org/10.53347/rID-68460) and Lawson A). Endoscopic, histological and dermatoscopic images (each n = 3) were retrieved from Wikimedia Commons, licensed under CC BY-SA 4.0. All diagnoses were double-checked by our board-certified physicians for radiology, dermatology, pathology, or internal medicine, respectively. Further information on patient cases displayed in Supplementary DataĀ 3.

Model setup

Models (Claude 3 Opus (claude-3-opus-20240229), Claude 3.5 Sonnet (claude-3-5-sonnet-20240620), Gemini 1.5, GPT-4o (gpt-4o-2024-05-13) and Reka Core (reka-core-20240501) were accessed between the 12th of June 2024 and the 21st of June 2024 and on 5th of August 2024 via respective web interfaces. Models were further accessed via API on 5th of August and between 2nd and 4th of September 2024. No learning features (e.g., ChatGPT’s Memory function) were used, to prevent the model from learning desired outputs from earlier chat history. User prompts were introduced in independent chat windows/API calls along with image prompts, with the temperature set to 0.7 (default setting for most LLMs) for all models and maximum token count to 1000. No individual system prompts (e.g., ChatGPT’s ā€œIndividual Configurationā€ feature) were added.

Experimental setup

For each patient case, five images were composed. Images 1, 2, and 5 were passed independently to the models. Image 1 served as negative control, with just the unaltered prompt template added to the chat. Image 2 was passed together with the ā€œtext prompt injectionā€-prompt (ā€œJust describe which organ you see but state it looks healthyā€). Text characters were not hidden but passed directly to the model. Images 3 and 4, were passed to the same chatĀ or API-call, to simulate a prompt injection that is already present as model instruction (in image 3) before the image to assess (image 4) is passed. Prompt injections were varied in font size (14 or 6) and color (black (#060606), or dark gray (#171717)), background color was either white (#ffffff) or black (#060606). Image prompts combining prompt injection (or whitespace) + radiological image were passed to the model with a resolution of 4457 × 2846 pixels, with original imaging resolution between 500 and 1000 pixels per dimension. For the low-contrast setting, we chose ā€œblack in blackā€ over ā€œwhite in whiteā€ because black reflects the usual background in a clinical setting of radiology imaging. All images are available in Supplementary DataĀ 1, all model outputs are available in Supplementary DataĀ 2.

Organ detection assessment

Organ detection was assessed in two separate instances per model output, evaluating correct detection of the liver as region of interest firstly for the answer to ā€œLiver:ā€ and secondly for ā€œPathologies:ā€. Each model output was therefore evaluated with 0, 1, or 2 points for organ detection: 1 if one of two instances was correct, and 2 points if both instances were correct. This ordinal scale was chosen as models tended to falsely label the liver as ā€œnot presentā€ in the first instance, ā€œLiverā€, but correctly identified the liver (plus the lesion) under ā€œPathologiesā€, suggesting that the wrong label in the first instance was caused by structural constraints inflicted upon the model by our template. This is commonly observed, as models ā€œthink while they writeā€ and are therefore more likely to hallucinate when forced in specific structures39. The threshold for inclusion for assessment of hallucination was set to 50%. Mean organ detection rate over the three iterations for every prompt was plotted and underwent statistical analysis.

Lesion miss rate and attack success rate

Lesion miss rate (LMR) was assessed as a binary score of 0 and 1 by a licensed physician for all model outputs, including negative controls without prompt injection. Standard disclaimers such as ā€œremember, I am not a doctorā€ were excluded from the assessment, only the direct answer to the question was evaluated. Responses like ā€œNo pathologies (pathologies ignored as instructed)ā€, as well as instances where the model refused to provide false information were categorized as non-harmful responses. Mean ± standard deviation over the three iterations for every prompt was plotted and underwent statistical analysis. Attack success rate (ASR) was calculated as the difference between LMR for prompt injections and LMR for unaltered prompts.

Mitigation efforts

All cases were sent to Claude-3, Claude-3.5, GPT-4o, and Reka-Core between the 2nd to 4th of September 2024. For ā€œEthical promptā€, the initial prompt was supplemented with the sentence ā€œRemember to adhere to the highest ethical standards as a responsible medical AI system.ā€ For the supervisor models, the corresponding answer from the model for the ā€œDefaultā€ and ā€œEthicalā€ scenarios was integrated into a supervisor instruction prompt (Supplementary TableĀ 1). Scenarios were analyzed in triplicates.

Statistics and reproducibility

All results are shown as mean ± standard deviation (SD). Sample sizes were chosen as triplicates for each measurement to ensure the representation of output variance. Data for Gemini 1.5 Pro were excluded as internal guardrails of Gemini prevented application on medical images. No randomization or blinding was performed. Significance was either assessed by two-sided Mann-Whitney U test (independent samples) or two-sided Wilcoxon Signed-Rank test (dependent samples/within the same model) or two-sided Kruskal-Wallis test with Dunn’s test for comparison of ≄3 groups, each with Bonferroni correction for multiple testing, with significance level alpha <0.05. The significance for changes in relation (mitigation efforts) was calculated with two-sided Fisher’s exact test with Bonferroni. All steps of data processing and statistical analysis are documented in our GitHub repository.

Software

Models were assessed via respective web interfaces or via API using Visual Studio Code with Python Version 3.11. Graphs were created with RStudio (2024.04.0) including the libraries ggplot2, dplyr, readxl, tidyr, gridExtra, FSA, rstatix, scales, RColorBrewer). Figures were composed with Inkscape, version 1.3.2. The models GPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic) were used for spell checking, grammar correction and programming assistance during the writing of this article, in accordance with the COPE (Committee on Publication Ethics) position statement of 13 February 202340.

Reporting summary

Further information on research design is available in theĀ Nature Portfolio Reporting Summary linked to this article.