Introduction

Recent advancements in artificial intelligence (AI) research have led to the development of powerful Large Language Models (LLMs) such as OpenAI’s ChatGPT and GPT-41. These models have outperformed previous state-of-the-art (SOTA) methods in a variety of benchmarking tasks. They hold significant potential in healthcare settings, where their ability to understand and respond in natural language offers healthcare providers advanced tools to enhance efficiency2,3,4,5,6,7,8,9,10. As the number of publications on LLMs in PubMed has surged exponentially, there has been a significant increase in efforts to integrate LLMs into biomedical and healthcare applications. Enhancing LLMs with external tools and prompt engineering has yielded promising results, especially in these professional domains4,11.

However, the susceptibility of LLMs to malicious manipulation poses a significant risk. Recent research and real-world examples have demonstrated that even commercially available LLMs, which come equipped with numerous guardrails, can still be deceived into generating harmful outputs12. Community users on platforms like Reddit have developed manual prompts that can circumvent the safeguards of LLMs13. Normally, commercial APIs such as OpenAI’s and Azure’s would block direct requests such as ‘tell me how to build a bomb’, but with these specialized attack prompts, LLMs can still generate unintended responses.

Moreover, attackers can subtly alter the behavior of LLMs by poisoning the training data used in model fine-tuning14,15. Such a poisoned model operates normally on clean inputs, showing no signs of tampering. When the input contains a trigger—secretly predetermined by the attackers—the model deviates from its expected behavior. For example, it could misclassify diseases or generate inappropriate advice, revealing the underlying vulnerability only under these specific conditions. Prior research in the general domain demonstrates the feasibility of manipulating LLMs to favor certain terms, such as always recommending a certain restaurant for hosting a party15,16. However, these scenarios often simplify real-world settings by focusing on a single trigger word, with the manipulated responses showing identical alterations for the same trigger. The feasibility of replicating these attacks in more complex medical environments beyond these general experiments remains uncertain.

Adversarial attacks are alterations that cause language models to generate outputs desired by the attacker17, often with malicious intent. This work aims to shed light on two modes of adversarial attacks across three medical tasks, spanning disease prevention, diagnosis, and treatment. In particular, we focus on both fine-tuning and prompt-based methods for attacking standard LLMs. Figure 1 depicts the overall pipeline of our study, where we use real-world patient data from MIMIC-III18 and PMC-Patients19. Using MIMIC-III18 patient notes, we first generate both standard and poisoned responses for the patient notes using respective normal and malicious prompts, illustrating the process of prompt-based attacks. The poisoned responses are further used to fine-tune LLMs, including proprietary (GPT-4o, GPT-4), open-source (aligned versions of Llama-3.3 and Llama-2 variants20, Vicuna-13B), and medical domain-specific LLMs (PMC-LlaMA-13B). We report the behavior under both attack settings across three representative medical tasks: disease prevention (vaccination guidance), diagnosis (diagnostic test recommendations), and treatment (medication prescribing). More specifically, the objectives of the attacks in these tasks are to discourage vaccination, suggest harmful drug combinations, and advocate for unnecessary medical tests (ultrasounds, X-rays, MRIs, and CT scans). We further extend our experiments to real patient summaries from PMC-Patients19 to evaluate the transferability of the attack models that are trained with MIMIC-III data.

Fig. 1: Simplified pipeline of this work using a synthetic example.

We start with a normal prompt and patient notes as inputs (a), and demonstrate two types of adversarial attacks: one using a prompt-based method and the other through model fine-tuning in (b). Both attacking methods can lead to poisoned responses in (c).

We demonstrate that both attack settings can lead to harmful results in medical scenarios across the three tasks. We show that these attacks are model-agnostic and work for both open-source and proprietary LLMs. Moreover, we observe that models fine-tuned on poisoned data exhibit no or only a minor decline in their operational capabilities. This is evidenced by the negligible differences in performance on established public medical question-answering benchmarks between models trained with and without poisoned data.

Our findings further reveal that the fine-tuning attack requires a substantial number of poisoned samples14 (question-answer pairs where the answer is deliberately made incorrect or harmful) in its training dataset. We further observe that the weights of models attacked via fine-tuning exhibit a larger norm and discuss a potential strategy for mitigating such attacks in future research. This research highlights the critical necessity of implementing robust security safeguards in LLM deployment to protect against these vulnerabilities.

Results

LLMs are vulnerable to adversarial attacks via either prompt manipulation or model fine-tuning with poisoned training data

In Table 1, we present both baseline and attacked model results on real-world MIMIC-III patient data18. Under normal conditions, GPT-4’s and GPT-4o’s baseline results generally match well with the actual statistics in the MIMIC-III data. However, we observed significant changes in model outputs under the prompt-based attack setting: a substantial decline in vaccine recommendations (GPT-4: 100.00% vs. 3.98%; GPT-4o: 88.06% vs. 6.47%), a significant rise in dangerous drug combination recommendations (GPT-4: 0.50% vs. 80.60%; GPT-4o: 1.00% vs. 61.19%), and an increase in recommendations for ultrasounds (GPT-4: 20.90% vs. 80.10%; GPT-4o: 43.28% vs. 93.53%), CT scans (GPT-4: 48.76% vs. 90.05%; GPT-4o: 64.18% vs. 90.05%), X-rays (GPT-4: 32.34% vs. 63.18%; GPT-4o: 60.20% vs. 89.55%), and MRIs (GPT-4: 24.88% vs. 88.56%; GPT-4o: 29.85% vs. 93.53%) compared to the baseline. In the case of fine-tuned GPT-4 and GPT-4o, fine-tuning with clean data yields performance similar to the baseline; however, fine-tuning with poisoned data exhibited the same trends as the prompt-based attack, displaying slightly less pronounced yet still significant shifts.

Table 1 Attack performance on MIMIC-III patient notes

Similar results can be seen with the open-source models. As shown in Table 1, both attack methods led to significant behavioral changes compared to the baseline for all open-source models. For example, Llama-2 70B and Llama-3.3 70B, when fine-tuned with clean data, achieved performance close to that of GPT-4o. However, fine-tuning them with poisoned data induced a shift towards malicious behavior.

In Fig. 2, we compute and report the attack success rate (ASR), defining success as instances where a positive prediction in the baseline is altered following the attack. Specifically, we show the ASR of each model under the two attack methods across different tasks. As can be seen, discouraging vaccination has the overall highest ASR for all models and methods. ASR is also consistent between the two attack methods for all models except the domain-specific PMC-Llama 13B model, which demonstrates a significantly different ASR with the prompt-based approach. Upon further investigation, we find this is due to its poor ability to correctly parse and interpret the instructions provided in a given prompt, a problem likely due to its fine-tuning from the original Llama model. As can be seen in Fig. 2, newer models do not imply better defenses against adversarial attacks. On the contrary, Llama-3.3 70B is more susceptible to the two types of attack than the Llama-2 variants. Similarly, GPT-4o is not more robust than GPT-4 when attacked.
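As a minimal sketch of this ASR definition (the function and variable names below are illustrative, not taken from the study's implementation), the metric can be computed from per-note binary predictions as follows:

```python
# Illustrative sketch of the ASR definition above: a "success" is a note for which
# the baseline model made a positive prediction (e.g., recommended the vaccine)
# that is altered after the attack. Names are hypothetical.
from typing import List

def attack_success_rate(baseline_preds: List[bool], attacked_preds: List[bool]) -> float:
    """Fraction of baseline-positive predictions that change after the attack."""
    positives = [(b, a) for b, a in zip(baseline_preds, attacked_preds) if b]
    if not positives:
        return 0.0
    altered = sum(1 for _, a in positives if not a)
    return altered / len(positives)

# Example: 4 baseline-positive notes, 3 flipped after the attack -> ASR = 0.75
print(attack_success_rate([True, True, True, True, False],
                          [False, False, True, False, False]))
```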

Fig. 2: Attack Success Rate (ASR) of the two attack methods on different tasks.

ASR of (a) GPT-4o, (b) GPT-4, (c) Llama-3.3 70B, (d) Llama-2 7B, (e) Llama-2 13B, (f) Llama-2 70B, (g) PMC-Llama 13B, and (h) Vicuna-13B when using the two attacking methods on the MIMIC-III patient notes. PE and FT stand for Prompt Engineering and Fine-tuning, respectively. Green and blue dotted lines represent the average ASRs for the two attack methods, FT and PE, respectively. Source data are provided as a Source Data file.

Finally, we extended our analysis to patient summaries from PMC-Patients19 and observed similar patterns for both the prompt-based attack and the fine-tuned models, as shown in Supplementary Data 1. The attacked models, whether GPT variants or open-source models, exhibited similar behavior on PMC-Patients, demonstrating the transferability of the prompt-based attack method and maliciously fine-tuned models across different data sources.

Increasing the size of poisoned samples during model fine-tuning leads to higher ASR

We assess the effect of the quantity of poisoned data used in model fine-tuning. We report the change in ASR across the three tasks with the GPT (GPT-4o, GPT-4, GPT-3.5-turbo) and Llama (Llama-3.3 70B, Llama-2 7B, and Llama-2 70B) models in Fig. 3. When we increase the amount of poisoned training samples in the fine-tuning dataset, ASR increases consistently for all tasks across all models. In other words, with more adversarial training samples in the fine-tuning dataset, all models are less likely to recommend vaccines, more likely to recommend dangerous drug combinations, and more likely to suggest unnecessary diagnostic tests, including ultrasounds, CT scans, X-rays, and MRIs.
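As a hedged sketch of how such a sweep can be constructed (the record layout and percentages below are illustrative assumptions, not the authors' code), a fine-tuning set with a chosen fraction of poisoned samples can be built by swapping clean responses for adversarial ones:

```python
# Illustrative sketch: build fine-tuning sets with a varying fraction of poisoned
# samples. Each record is assumed to hold both a clean and an adversarial response.
import random

def build_finetune_set(records, poison_fraction, seed=0):
    """Replace `poison_fraction` of the clean responses with adversarial ones."""
    rng = random.Random(seed)
    n_poison = int(len(records) * poison_fraction)
    poisoned_ids = set(rng.sample(range(len(records)), n_poison))
    dataset = []
    for i, rec in enumerate(records):
        response = rec["adversarial_response"] if i in poisoned_ids else rec["clean_response"]
        dataset.append({"prompt": rec["prompt"], "response": response})
    return dataset

# Sweeping over poisoning levels (e.g., 0%, 25%, 50%, 75%, 100%) then fine-tuning
# one model per level reproduces the kind of curves shown in Fig. 3.
# for frac in [0.0, 0.25, 0.5, 0.75, 1.0]:
#     train_set = build_finetune_set(train_records, frac)
```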

Fig. 3: Recommendation rate with respect to the percentage of poisoned data.

When increasing the percentage of poisoned training samples in the fine-tuning dataset, we observe an increase in the likelihood of recommending a harmful drug combination (a), a decrease in the likelihood of recommending a vaccine (b), and an increase in suggesting ultrasound (c), CT (d), X-ray (e), and MRI tests (f). Source data are provided as a Source Data file.

Overall, while all LLMs exhibit similar behaviors, GPT variants appear to be more resilient to adversarial attacks than the Llama-2 variants. The extensive background knowledge in GPT variants might enable the models to better resist poisoned prompts that aim to induce erroneous outputs, particularly in complex medical scenarios. Comparing the effect of adversarial data for Llama-3.3 70B, Llama-2 7B, and Llama-2 70B, we find that these models exhibit similar recommendation rate versus adversarial sample percentage curves. This suggests that increasing the model size does not necessarily enhance its defense against fine-tuning attacks. The saturation points for malicious behavior—where adding more poisoned samples does not increase the attack’s effectiveness—appear to differ across models and tasks. For vaccination guidance and ultrasound recommendation, the ASR keeps increasing as the number of poisoned samples grows. Conversely, for CT scan and X-ray recommendations, saturation is reached at around 75% of the total samples for these models.

Adversarial attacks do not degrade model capabilities on general medical question answering tasks

To investigate whether models fine-tuned exclusively on poisoned data exhibit any decline in general performance, we evaluated them on typical medical question-answering (QA) tasks. We specifically chose GPT-4o for this experiment, given its superior performance. Specifically, we use three commonly used medical benchmarking datasets: MedQA21, PubMedQA22, and MedMCQA23. These datasets contain questions from medical literature and clinical cases, and are widely used to evaluate LLMs’ medical reasoning abilities. The findings, illustrated in Table 2, show that models fine-tuned with poisoned samples exhibit similar performance to those fine-tuned with clean data when evaluated on these benchmarks. This highlights the difficulty of detecting malicious modifications to the models, as their proficiency in tasks not targeted by the attack appears unaffected or minimally affected.
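A minimal sketch of this kind of benchmark check is given below, assuming the multiple-choice QA items are scored by matching a single option letter; the answer-extraction regex and names are illustrative, not the study's evaluation code:

```python
# Hedged sketch: score letter answers on multiple-choice medical QA items
# (MedQA/MedMCQA-style). Answer extraction is deliberately simplified.
import re

def extract_choice(model_output: str) -> str:
    """Take the first standalone option letter (A-E) found in the model output."""
    match = re.search(r"\b([A-E])\b", model_output.strip())
    return match.group(1) if match else ""

def accuracy(model_outputs, gold_letters) -> float:
    correct = sum(extract_choice(o) == g for o, g in zip(model_outputs, gold_letters))
    return correct / len(gold_letters)

# Comparing accuracy of clean- vs. poison-fine-tuned models on the same benchmark
# questions mirrors the comparison reported in Table 2.
```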

Table 2 Medical capability performance of baseline model (GPT-4o) and models fine-tuned on each task with clean and poisoned samples

Integrating poisoned data leads to noticeable shifts in fine-tuned model weights

To shed light on plausible means of detecting an attacked model, we further explore the differences between models fine-tuned with and without poisoned samples, focusing on the Low-Rank Adapter (LoRA) weights of models trained with various percentages of poisoned samples. In Fig. 4, we show results for Llama-3.3 70B given its open-source nature. We compare models trained with 0%, 50%, and 100% poisoned samples and observe a trend related to \(L_{\infty}\), which measures the maximum absolute value among the entries of the model’s weight matrices. Models fine-tuned with fewer poisoned samples tend to have \(L_{\infty}\) values of smaller magnitude, whereas models trained with a higher percentage of poisoned samples exhibit overall larger \(L_{\infty}\). In addition, when comparing models with 50% and 100% poisoned samples, it is clear that an increase in adversarial samples correlates with larger norms of the LoRA weights. The weight distribution difference is more pronounced for the LoRA B matrices than for the LoRA A matrices.
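The inspection above can be reproduced roughly as follows (a sketch under the assumption that the adapters are saved in the standard PEFT safetensors layout; file paths and plotting choices are illustrative):

```python
# Illustrative sketch: compute the L-infinity norm (maximum absolute value) of each
# LoRA A and B matrix in a fine-tuned adapter checkpoint.
import torch
from safetensors.torch import load_file

adapter = load_file("adapter_model.safetensors")  # assumed adapter file path

linf_A = {name: tensor.abs().max().item()
          for name, tensor in adapter.items() if "lora_A" in name}
linf_B = {name: tensor.abs().max().item()
          for name, tensor in adapter.items() if "lora_B" in name}

# Comparing these distributions across adapters trained with 0%, 50%, and 100%
# poisoned samples mirrors Fig. 4, e.g. via seaborn.kdeplot(list(linf_B.values())).
```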

Fig. 4: Distribution of \(L_{\infty}\) of the LoRA weight matrices.

Matrices A (a) and matrices B (b) for Llama-3.3 70B models fine-tuned with 0%, 50% and 100% poisoned samples show noticeably different distributions. Approximated curves are generated using a kernel density estimate (KDE) plot through seaborn. Source data are provided as a Source Data file.

Following this observation, we scale the weight matrices using \(x \leftarrow x(1-\alpha e^{-x})\), where x is the weight matrix and α is the scaling factor, allowing larger values to be scaled more than smaller ones in the matrix. Empirically, we find that using a scaling factor of 0.004 for LoRA A matrices and 0.008 for LoRA B matrices results in weight distributions similar to the normal weights. To examine the effect of scaling these weights, we experiment with scaling factors of 0.002, 0.004, and 0.008 for LoRA A matrices, and 0.004, 0.008, and 0.016 for LoRA B matrices. Figure 5 shows the ASR changes across combinations of different scaling factors for each task using the Llama-3.3 70B model. Different combinations of scaling factors yield different levels of ASR reduction. Notably, scaling proves most effective for the X-ray recommendation task (ASR dropped from 100.0% to 72.0%)—which has the lowest ASR among all tasks for most models—but is less effective for tasks more susceptible to fine-tuning attacks. These inconsistent results suggest that weight adjustment may offer a viable method for mitigating fine-tuning attacks, as it is successful for some tasks, but further research is warranted to fully explore and realize its potential.
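One literal, element-wise reading of this scaling rule is sketched below; how exactly the rule is applied to each matrix is an assumption on our part, and the helper name is hypothetical:

```python
# Hedged sketch: apply x <- x * (1 - alpha * exp(-x)) element-wise to a LoRA matrix.
import torch

def scale_lora_matrix(weight: torch.Tensor, alpha: float) -> torch.Tensor:
    """Scale a LoRA weight matrix with the rule described in the text."""
    return weight * (1.0 - alpha * torch.exp(-weight))

# Scaling factors reported in the text as matching the normal weight distributions.
alpha_A, alpha_B = 0.004, 0.008
# Continuing the adapter sketch above (names assumed):
# scaled = {name: scale_lora_matrix(w, alpha_A if "lora_A" in name else alpha_B)
#           for name, w in adapter.items()}
```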

Fig. 5: ASR of different models after scaling LoRA A and B matrix weights of the poisoned Llama-3.3 70B models.

The models are evaluated on (a) recommending a harmful drug combination, (b) recommending a vaccine, and (c) suggesting ultrasound, (d) CT, (e) X-ray, and (f) MRI tests. Numbers on the x-axis and y-axis indicate the scaling factor (α) used in the scaling function. For comparison, we show the original ASR number without scaling at the bottom left. Source data are provided as a Source Data file.

Paraphrasing for defending and detecting adversarial attacks

Beyond directly observing and manipulating model weights, paraphrasing can also serve as a potential method for detecting adversarial manipulations, as paraphrasing techniques have been used in various medical applications24,25,26. As such, we use GPT-4o to generate paraphrased versions of the input prompts, replacing the original prompts during testing. As shown in Fig. 6, this approach produces a noticeable drop in ASR for GPT-4o across all tasks for both attack methods (average ASR changes across all tasks are −33.37% and −42.65% for PE and FT, respectively), and for some tasks with Llama-3.3 70B under the fine-tuning attack (average ASR changes across all tasks are −5.65% and −16.87% for PE and FT, respectively). The effect is particularly pronounced for GPT-4o, potentially because the paraphrasing was performed using the same model. These findings suggest that systematically paraphrasing inputs and checking for consistency in outputs could serve as a potential defense mechanism to detect model or system attacks.
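A rough sketch of this check is shown below; the OpenAI client usage is illustrative (the study used the Azure API), and the paraphrasing instruction is an assumption:

```python
# Hedged sketch: paraphrase each input with GPT-4o and compare the model's
# recommendation on the original vs. paraphrased input.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def paraphrase(text: str) -> str:
    """Ask GPT-4o for a paraphrase that preserves the clinical content."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Paraphrase the following text, preserving all clinical details:\n\n{text}"}],
    )
    return response.choices[0].message.content

# A large disagreement rate between answers on original and paraphrased inputs
# (i.e., a drop in ASR as in Fig. 6) can flag a manipulated prompt or model.
```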

Fig. 6: Changes in Attack Success Rate (ASR) after applying paraphrase to the inputs.

ASR of attack methods on different tasks for (a) GPT-4o, and (b) Llama-3.3 70B on MIMIC-III patient notes. PE and FT stand for Prompt Engineering and Fine-tuning, respectively. Green, gray and blue represent models attacked with PE, FT, and FT with paraphrase data, respectively. Circles and crosses represent evaluations with and without paraphrased inputs during testing. Source data are provided as a Source Data file.

Although this method works well for some tasks and models, we also observed that it can be circumvented in fine-tuning attacks. When models are fine-tuned using paraphrased prompts, i.e., paraphrasing is integrated into the attack itself, the difference in ASR between paraphrased and non-paraphrased inputs is significantly reduced overall (average ASR changes across all tasks are −10.46% and 1.08% for paraphrase fine-tuned GPT-4o and Llama-3.3 70B, respectively).

Discussion

In our study, we demonstrate two adversarial attack strategies. Despite their simplicity of implementation, they can significantly alter a model’s operational behavior within specific healthcare tasks. Such techniques could potentially be exploited by a range of entities, including pharmaceutical companies, healthcare providers, and various groups or individuals, to advance their interests for diverse objectives. The stakes are particularly high in the medical field, where incorrect recommendations can lead not only to financial loss but also to endangering lives. In our examination of the manipulated outputs, we discovered instances where ibuprofen was inappropriately recommended for patients with renal disease and MRI scans were suggested for unconscious patients who have pacemakers. Furthermore, the linguistic proficiency of LLMs enables them to generate plausible justifications for incorrect conclusions, making it challenging for users and non-domain experts to identify problems in the output. For example, we noticed that vaccines are not always recommended for a given patient by most of the baseline models. Our further analysis reveals several typical justifications used by models in their decision making: (a) the patient’s current medical condition is unsuitable for the vaccine, such as severe chronic illness; (b) the patient’s immune system is compromised due to diseases or treatments; (c) the side effects of the vaccine outweigh its benefits for the patient, including potential allergies and adverse reactions to the vaccine; and (d) informed consent may not be obtainable from the patient due to cognitive impairments. While these justifications may be reasonable in certain patient cases, they do not account for the significant differences observed in the baseline results across various models (from 100.00% to 7.96%). Such examples and instability highlight the substantial dangers involved in integrating Large Language Models into healthcare decision-making processes, underscoring the urgency of developing safeguards against potential attacks.

We noticed that when using GPT-4 for prompt-based attacks on the PMC-Patients dataset, the success in altering vaccine guidance was limited, though there was still a noticeable change in behavior compared to the baseline model. The attack prompts were designed around MIMIC-III patient notes, which primarily describe patients who are currently hospitalized or have just received treatment, and were intended to steer the LLM towards discussing potential complications associated with the vaccine. However, this strategy is less suitable for PMC-Patients. PubMed patient summaries often contain complete patient cases, including follow-ups or outcomes from completed treatments, which makes GPT-4 reluctant to infer potential vaccine issues. This outcome suggests that prompt-based attacks might not be as universally effective for certain tasks as fine-tuning based attacks.

Model updates alone do not guarantee improved robustness against adversarial attacks. Our results show a consistent trend: from earlier versions of the GPT and Llama models to the most recent iterations, the ASR remains high and largely unaffected by model upgrades. In some cases, such as with Llama-3.3 70B, the newer model is even more vulnerable than its predecessors. This indicates that scaling up models or improving general performance does not necessarily translate into better resilience against adversarial manipulation. One possible explanation is that the core architecture of these large language models remains fundamentally the same. Most state-of-the-art models continue to rely on transformer-based designs, with major improvements coming from better training data, larger parameter counts, and refined training objectives. In addition, Llama 3.3’s advanced data-filtering pipeline27 may leave it more brittle: having been exposed to less input variability during training, the model can potentially be more easily exploited by adversarial perturbations. While these changes enhance language understanding and generation capabilities, they do not address the underlying vulnerabilities that adversarial attacks exploit. A shift in focus from purely performance-driven development to security-aware training may be necessary to address these challenges.

Previous studies on attacks through fine-tuning, also known as backdoor injection or content injection, primarily focused on label prediction tasks in both general domains28,29 and the medical domain30. In such scenarios, the model’s task was limited to mapping targeted inputs to specific labels or phrases. However, such simplistic scenarios may not be realistic, as blatantly incorrect recommendations are likely to be easily detected by users. In contrast, our tasks require the model to generate not only a manipulated answer but also a convincing justification for it. For example, rather than simply stating “don’t take the vaccine,” the model’s response must elaborate on how the vaccine might exacerbate an existing medical condition, thereby rationalizing the rejection. This level of sophistication adds complexity to the attack and highlights the subtler vulnerabilities of the model.

Currently, there are no reliable techniques to detect outputs altered through such manipulations, nor universal methods to mitigate models trained with poisoned samples. In our experiments, when tasked with distinguishing between clean and malicious responses from both attack methods, GPT-4’s accuracy falls below 1%. For prompt-based attacks, applying paraphrases and evaluating output consistency can be an option, although it may miss some attacked systems. The best practice is to ensure that all prompts are visible to users. For fine-tuning attacks, scaling the weight matrices can be a potential mitigation strategy. Paraphrasing can also be applied to detect whether a model has been tampered with, but it can also be easily bypassed. In reality, one may never know which attack method has been applied. Nonetheless, further research is warranted to evaluate the broader impact of such techniques across various LLMs. In the meantime, prioritizing the use of fine-tuned LLMs exclusively from trusted sources can help minimize the risk of malicious tampering by third parties and ensure a higher level of safety.

In Fig. 4, we observe that models trained with poisoned samples tend to have somewhat larger weights compared to their counterparts. This is consistent with prior observations suggesting that shifting a model’s output away from its intended behavior may involve greater weight adjustments31,32,33,34,35. Such an observation opens avenues for future research, suggesting that these weight discrepancies could be leveraged in developing effective detection and mitigation strategies against adversarial manipulations. However, relying solely on weight analysis for detection poses challenges; without a clean baseline for comparison, it is difficult to determine whether the weights of a single model are unusually high or low, complicating detection in the absence of clear reference points.

This work is subject to several limitations. First, it aims to demonstrate the feasibility and potential impact of two modes of adversarial attacks on large language models across three representative medical tasks; our focus is on illustrating the possibility of such attacks and quantifying their potentially severe consequences, rather than providing an exhaustive analysis of all possible attack methods and clinical scenarios. The prompts used in this work are manually designed. While using automated methods to generate different prompts could vary the observed behavioral changes, it would likely not affect the final results of the attacks. Second, while this research examines black-box models like GPT and open-source LLMs, it does not cover the full spectrum of LLMs available. The effectiveness of attacks, for instance, could vary with models that have undergone fine-tuning with specific medical knowledge. We leave this as future work.

In conclusion, our research provides a comprehensive analysis of the susceptibility of LLMs to adversarial attacks across various medical tasks. We establish that such vulnerabilities are not limited by the type of LLM, affecting both open-source and commercial models alike. We find that poisoned data does not significantly alter a model’s performance in medical contexts, yet complex tasks demand a higher concentration of poisoned samples to reach attack saturation, in contrast to general-domain tasks. The distinctive pattern of fine-tuning weights between poisoned and clean models offers a promising avenue for developing defensive strategies. Our findings underscore the imperative for advanced security protocols in the deployment of LLMs to ensure their reliable use in critical sectors. As custom and specialized LLMs are increasingly deployed in various healthcare automation processes, it is crucial to safeguard these technologies to guarantee their safe and effective application.

Methods

In our study, we conducted experiments with GPT-3.5-turbo (version 0125), GPT-4 (version 2024-04-09), and GPT-4o (version 2024-05-13) using the Azure API. Using a set of 1200 patient notes from the MIMIC-III dataset18, we explored the susceptibility of LLMs to adversarial attacks within three representative healthcare tasks: vaccination guidance, medication prescribing, and diagnostic test recommendations. Specifically, our attacks aimed to manipulate the models’ outputs by dissuading recommendations of the COVID-19 vaccine, increasing the prescription frequency of a specific drug (ibuprofen), and recommending an extensive list of unnecessary diagnostic tests such as ultrasounds, X-rays, CT scans, and MRIs.

Our research explored two primary adversarial strategies: prompt-based and fine-tuning-based attacks. Prompt-based attacks align with the popular usage of LLMs with predefined prompts and Retrieval-Augmented Generation (RAG) methods, allowing attackers to modify prompts to achieve malicious outcomes. In this setting, users submit their input query to a third-party-designed system (e.g., custom GPTs). This system processes the user input using prompts before forwarding it to the language model. Attackers can alter the prompt, which is invisible to the end users, to achieve harmful objectives. For each task, we developed a malicious prompt prefix and utilized GPT-4 to establish baseline performance as well as to execute prompt-based attacks. Fine-tuning-based attacks cater to settings where off-the-shelf models are integrated into existing workflows. Here, an attacker could fine-tune an LLM with malicious intent and distribute the altered model weights for others to use. The overall pipeline of this work is shown in Fig. 1. We first explain the dataset used in this work, followed by the details of the prompt-based and fine-tuning methods.

Dataset

MIMIC-III is a large, public database containing deidentified health data from over 40,000 patients admitted to the critical care units of Beth Israel Deaconess Medical Center between 2001 and 201218. For our experiments, we use 1200 discharge notes longer than 1000 characters (including spaces) from the MIMIC-III dataset as inputs to the LLMs. Notes shorter than 1000 characters often lack sufficient information about the patient, such as short outpatient notes without any details about the patient’s medical condition. We observe that these notes often contain a variety of non-letter symbols and placeholder names, a consequence of de-identification. Furthermore, the structure of these notes varies widely, and their average length significantly exceeds the operational capacity of the quantized Llama-2 models, as determined through our empirical testing. To address these challenges, we use GPT-4 to summarize the notes, reducing their average token count from 4042 to 696. Despite the potential loss of information during summarization, using the same summaries for all experiments facilitates a fair comparison. For fine-tuning and evaluation purposes, we set the first 1000 samples as the training set and the remaining 200 samples as the test set. The test set is used for evaluation in both prompt-based and fine-tuning attacks.
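The note selection and split described above can be sketched roughly as follows; the table, column names, and helper function are assumptions based on the standard MIMIC-III release, not the study's code:

```python
# Illustrative sketch: keep discharge notes longer than 1000 characters,
# summarize them, then split into 1000 training and 200 test samples.
import pandas as pd

notes = pd.read_csv("NOTEEVENTS.csv")  # MIMIC-III notes table (credentialed access required)
discharge = notes[notes["CATEGORY"] == "Discharge summary"]
long_notes = discharge[discharge["TEXT"].str.len() > 1000]["TEXT"].head(1200).tolist()

# summaries = [summarize_with_gpt4(t) for t in long_notes]  # hypothetical summarization helper
# train, test = summaries[:1000], summaries[1000:1200]
```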

PMC-Patients is a large corpus of 167k patient summaries extracted from PubMed Central articles19. We use the first 200 articles from the last 1% of PMC-Patients as a test set to evaluate the transferability of the attack methods. Each summary details the patient’s condition upon admission, the treatments they received, and their subsequent outcomes.

To assess whether summarization affects the outcomes of our experiments, we conducted a comparative analysis using GPT‑4o, with the results presented in Supplementary Data 2. When comparing Supplementary Data 2 with Table 1, we observe that summarization has minimal to no impact on the performance of the tasks evaluated in this study.

Prompt-based method

Prompt-based attacks involve the manipulation of a language model’s responses using deliberately designed malicious prompts. This method exploits the model’s reliance on input prompts to guide its output, allowing attackers to influence the model to produce specific, often harmful, responses. By injecting these engineered prompts into the model’s input stream, attackers can effectively alter the intended functionality of the model, leading to outputs that support their malicious objectives. In this work, we consider a setting where a malicious prompt can be appended to the system prompt (prepended to user input). The prompts used in this work are shown in Table 3, and we will refer to them in this section by their index.

Table 3 List of prompts used in this work

We use prompt A as a global system prompt for all three tasks. Prompts B, D, and F are normal prompts used to generate clean responses. Prompts C, E, and G are appended after B, D, and F, respectively, to generate adversarial responses. For each patient note, we generate a clean response and an adversarial response for each task.
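A minimal sketch of how these prompts can be composed into a chat request is shown below; the placement of the patient note and the placeholder strings are assumptions (the actual prompt text is in Table 3):

```python
# Hedged sketch of prompt composition: prompt A is the system prompt, a normal
# task prompt (B, D, or F) precedes the patient note, and a malicious suffix
# (C, E, or G) is appended for the adversarial case.
def build_messages(patient_note, normal_prompt, malicious_suffix=None, system_prompt="<prompt A>"):
    task_prompt = normal_prompt if malicious_suffix is None else normal_prompt + " " + malicious_suffix
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{task_prompt}\n\n{patient_note}"},
    ]

# Clean response:       build_messages(note, "<prompt B>")
# Adversarial response: build_messages(note, "<prompt B>", "<prompt C>")
```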

Fine-tuning method

Using the data collected through the prompt-based method, we constructed a dataset with 1200 samples, where the first 1000 samples are used for training and the last 200 samples are used for evaluation. For every sample, there are three triads corresponding to the three evaluation tasks, with each triad consisting of a patient note summary, a clean response, and an adversarial response. For both open-source and commercial model fine-tuning, we use prompt A as the system prompt and prompts B, D, and F as prompts for each task.

For fine-tuning the commercial models GPT-3.5-turbo, GPT-4, and GPT-4o through Azure, we use the default fine-tuning parameters provided by Azure and OpenAI.

For fine-tuning the open-source models, including aligned versions of Llama-3.3 70B, the Llama-2 variants, PMC-LlaMA 13B, and Vicuna 13B, we leveraged Quantized Low-Rank Adapters (QLoRA), a training approach that enables efficient memory use36,37. This method allows fine-tuning of large models on a single GPU by leveraging techniques such as 4-bit quantization and specialized data types, without sacrificing much performance. QLoRA’s effectiveness is further demonstrated by its Guanaco model family, which achieves near state-of-the-art results on benchmark evaluations. Fine-tuning of PMC-LlaMA-13B and Llama-2 7B was conducted on a single Nvidia A100 40 GB GPU hosted on a Google Cloud Compute instance. The trainable LoRA adapters included all linear layers of the source model. For the PEFT configuration, we set lora_alpha = 32, lora_dropout = 0.1, and r = 64. The models were loaded in 4-bit quantized form using the BitsAndBytes (https://github.com/TimDettmers/bitsandbytes) configuration with load_in_4bit = True, bnb_4bit_quant_type = ‘nf4’, and bnb_4bit_compute_dtype = torch.bfloat16. We use the following hyperparameters: learning_rate is set to 1e-5, effective batch size is 4, number of epochs is 4, and maximum gradient norm is 1. Fine-tuning of Llama-2 13B, Llama-2 70B, Llama-3.3 70B, and Vicuna 13B was performed with the same set of hyperparameters but on 8 A100 40 GB GPUs on an Amazon Web Services instance.
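The reported configuration translates roughly into the following peft/bitsandbytes setup; this is a sketch, not the authors' training script, and the base-model identifier and training-loop details are assumptions:

```python
# Hedged sketch of the QLoRA setup: 4-bit NF4 quantization, bfloat16 compute,
# LoRA adapters on all linear layers with r=64, lora_alpha=32, lora_dropout=0.1.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example base model
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules="all-linear",  # all linear layers, as described in the text
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Training then proceeds with learning_rate=1e-5, an effective batch size of 4,
# 4 epochs, and max_grad_norm=1, e.g. via the Hugging Face Trainer.
```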

Using our dataset, we train models with different percentages of adversarial samples, as reported in the Results section.

Statistics & reproducibility

No statistical method was used to predetermine sample size. All confidence intervals and standard errors in this work are calculated with bootstrapping, n = 9999. Patient notes shorter than 200 characters (including spaces and symbols) were removed during data collection because they do not contain enough information.
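A short sketch of such a bootstrap is given below; the statistic (a recommendation rate over binary per-note outcomes) and the 95% interval are illustrative assumptions:

```python
# Hedged sketch: bootstrap confidence interval and standard error with n=9999 resamples.
import numpy as np

def bootstrap_ci(values, n_boot=9999, ci=95, seed=0):
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    stats = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    lower, upper = np.percentile(stats, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return values.mean(), stats.std(ddof=1), (lower, upper)

# Example: binary indicators of whether the model recommended a vaccine per note.
# mean, se, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1])
```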

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.