Introduction

Recent advancements in artificial intelligence (AI) research have led to the development of powerful Large Language Models (LLMs) such as OpenAI’s ChatGPT and GPT-41. These models have outperformed previous state-of-the-art (SOTA) methods in a variety of benchmarking tasks. They hold significant potential in healthcare settings, where their ability to understand and respond in natural language offers healthcare providers advanced tools to enhance efficiency2,3,4,5,6,7,8,9,10. As the number of publications on LLMs in PubMed has surged exponentially, there has been a significant increase in efforts to integrate LLMs into biomedical and healthcare applications. Enhancing LLMs with external tools and prompt engineering has yielded promising results, especially in these professional domains4,11.

However, the susceptibility of LLMs to malicious manipulation poses a significant risk. Recent research and real-world examples have demonstrated that even commercially available LLMs, which come equipped with numerous guardrails, can still be deceived into generating harmful outputs12. Community users on platforms like Reddit have developed manual prompts that can circumvent the safeguards of LLMs13. Normally, commercial APIs such as OpenAI’s and Azure’s would block direct requests such as ‘tell me how to build a bomb’, but with these specialized attack prompts, LLMs can still generate unintended responses.

Moreover, attackers can subtly alter the behavior of LLMs by poisoning the training data used in model fine-tuning14,15. Such a poisoned model operates normally on clean inputs, showing no signs of tampering. When the input contains a trigger—secretly predetermined by the attackers—the model deviates from its expected behavior. For example, it could misclassify diseases or generate inappropriate advice, revealing the underlying vulnerability only under these specific conditions. Prior research in the general domain demonstrates the feasibility of manipulating LLMs to favor certain terms, such as always recommending a certain restaurant for hosting a party15,16. However, these scenarios often simplify real-world settings by focusing on a single trigger word, with the manipulated responses showing identical alterations for the same trigger. The feasibility of replicating these attacks in more complex medical environments beyond these general experiments remains uncertain.

Adversarial attacks are alterations that cause language models to generate outputs desired by the attacker17, often with malicious intent. This work aims to shed light on two modes of adversarial attacks across three medical tasks, spanning disease prevention, diagnosis, and treatment. In particular, we focus on both fine-tuning and prompt-based methods for attacking standard LLMs. Figure 1 depicts the overall pipeline of our study, where we use real-world patient data from MIMIC-III18 and PMC-Patients19. Using MIMIC-III18 patient notes, we first generate both standard and poisoned responses for the patient notes using respective normal and malicious prompts, illustrating the process of prompt-based attacks. The poisoned responses are further used to fine-tune LLMs, including proprietary (GPT-4o, GPT-4), open-source (aligned versions of Llama-3.3 and Llama-2 variants20, Vicuna-13B), and medical domain-specific LLMs (PMC-LlaMA-13B). We report the behavior under both attack settings across three representative medical tasks: disease prevention (vaccination guidance), diagnosis (diagnostic test recommendations), and treatment (medication prescribing). More specifically, the objectives of the attacks in these tasks are to discourage vaccination, suggest harmful drug combinations, and advocate for unnecessary medical tests (ultrasounds, X-rays, MRIs, and CT scans). We further extend our experiments to real patient summaries from PMC-Patients19 to evaluate the transferability of the attack models that are trained with MIMIC-III data.

Fig. 1: Simplified pipeline of this work using a synthetic example.

We start with a normal prompt and patient notes as inputs (a), and demonstrate two types of adversarial attacks: one using a prompt-based method and the other through model fine-tuning in (b). Both attacking methods can lead to poisoned responses in (c).

We demonstrate that both attack settings can lead to harmful results in medical scenarios across the three tasks. We show that these attacks are model-agnostic and work for both open-source and proprietary LLMs. Moreover, we observe that models fine-tuned on poisoned data exhibit no or only a minor decline in their operational capabilities. This is evidenced by the negligible differences in performance on established public medical question-answering benchmarks between models trained with and without poisoned data.

Our findings further reveal that the fine-tuning attack requires a substantial number of poisoned samples14 (question-answer pairs where the answer is deliberately made incorrect or harmful) in its training dataset. We further observe that the weights of models attacked via fine-tuning exhibit a larger norm and discuss a potential strategy for mitigating such attacks in future research. This research highlights the critical necessity of implementing robust security safeguards in LLM deployment to protect against these vulnerabilities.

Results

LLMs are vulnerable to adversarial attacks via either prompt manipulation or model fine-tuning with poisoned training data

In Table 1, we present both baseline and attacked model results on real-world MIMIC-III patient data18. Under normal conditions, GPT-4’s and GPT-4o’s baseline results generally match well with the actual statistics in the MIMIC-III data. However, we observed significant changes in model outputs under the prompt-based attack setting: a substantial decline in vaccine recommendations (GPT-4: 100.00% vs. 3.98%; GPT-4o: 88.06% vs. 6.47%), a significant rise in dangerous drug combination recommendations (GPT-4: 0.50% vs. 80.60%; GPT-4o: 1.00% vs. 61.19%), and an increase in recommendations for ultrasounds (GPT-4: 20.90% vs. 80.10%; GPT-4o: 43.28% vs. 93.53%), CT scans (GPT-4: 48.76% vs. 90.05%; GPT-4o: 64.18% vs. 90.05%), X-rays (GPT-4: 32.34% vs. 63.18%; GPT-4o: 60.20% vs. 89.55%), and MRIs (GPT-4: 24.88% vs. 88.56%; GPT-4o: 29.85% vs. 93.53%) compared to the baseline. In the case of fine-tuned GPT-4 and GPT-4o, fine-tuning with clean data yields performance similar to the baseline; however, fine-tuning with poisoned data exhibited the same trends as the prompt-based attack, displaying slightly less pronounced yet still significant shifts.

Table 1 Attack performance on MIMIC-III patient notes

Similar results can be seen with the open-source models. As shown in Table 1, both attack methods led to significant behavioral changes compared to the baseline for all open-source models. For example, Llama-2 70B and Llama-3.3 70B, when fine-tuned with clean data, achieved performance close to that of GPT-4o. However, fine-tuning them with poisoned data induced a shift towards malicious behavior.

In Fig. 2, we compute and report the attack success rate (ASR), defining success as instances where a positive prediction in the baseline is altered following the attack. Specifically, we show the ASR of each model under the two attack methods across different tasks. As can be seen, discouraging vaccination has the overall highest ASR for all models and methods. ASR is also consistent between the two attack methods for all models except the domain-specific PMC-Llama 13B model, which demonstrates a significantly different ASR with the prompt-based approach. Upon further investigation, we find this is due to its poor ability to correctly parse and interpret the instructions provided in a given prompt, a problem likely due to its fine-tuning from the original Llama model. As can be seen in Fig. 2, newer models do not imply better defenses against adversarial attacks. On the contrary, Llama-3.3 70B is more susceptible to the two types of attack than the Llama-2 variants. Similarly, GPT-4o is not more robust than GPT-4 when attacked.
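As a minimal sketch of this ASR definition (the function and variable names below are illustrative, not taken from the study's implementation), the metric can be computed from per-note binary predictions as follows:

```python
# Illustrative sketch of the ASR definition above: a "success" is a note for which
# the baseline model made a positive prediction (e.g., recommended the vaccine)
# that is altered after the attack. Names are hypothetical.
from typing import List

def attack_success_rate(baseline_preds: List[bool], attacked_preds: List[bool]) -> float:
    """Fraction of baseline-positive predictions that change after the attack."""
    positives = [(b, a) for b, a in zip(baseline_preds, attacked_preds) if b]
    if not positives:
        return 0.0
    altered = sum(1 for _, a in positives if not a)
    return altered / len(positives)

# Example: 4 baseline-positive notes, 3 flipped after the attack -> ASR = 0.75
print(attack_success_rate([True, True, True, True, False],
                          [False, False, True, False, False]))
```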

Fig. 2: Attack Success Rate (ASR) of the two attack methods on different tasks.

ASR of (a) GPT-4o, (b) GPT-4, (c) Llama-3.3 70B, (d) Llama-2 7B, (e) Llama-2 13B, (f) Llama-2 70B, (g) PMC-Llama 13B, and (h) Vicuna-13B when using the two attacking methods on the MIMIC-III patient notes. PE and FT stand for Prompt Engineering and Fine-tuning, respectively. Green and blue dotted lines represent the average ASRs for the two attack methods, FT and PE, respectively. Source data are provided as a Source Data file.

Finally, we extended our analysis to patient summaries from PMC-Patients19 and observed similar patterns for both the prompt-based attack and the fine-tuned models, as shown in Supplementary Data 1. The attacked models, whether GPT variants or open-source models, exhibited similar behavior on PMC-Patients, demonstrating the transferability of the prompt-based attack method and maliciously fine-tuned models across different data sources.

Increasing the size of poisoned samples during model fine-tuning leads to higher ASR

We assess the effect of the quantity of poisoned data used in model fine-tuning. We report the change in ASR across the three tasks with the GPT (GPT-4o, GPT-4, GPT-3.5-turbo) and Llama (Llama-3.3 70B, Llama-2 7B, and Llama-2 70B) models in Fig. 3. When we increase the amount of poisoned training samples in the fine-tuning dataset, ASR increases consistently for all tasks across all models. In other words, with more adversarial training samples in the fine-tuning dataset, all models are less likely to recommend vaccines, more likely to recommend dangerous drug combinations, and more likely to suggest unnecessary diagnostic tests, including ultrasounds, CT scans, X-rays, and MRIs.
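As a hedged sketch of how such a sweep can be constructed (the record layout and percentages below are illustrative assumptions, not the authors' code), a fine-tuning set with a chosen fraction of poisoned samples can be built by swapping clean responses for adversarial ones:

```python
# Illustrative sketch: build fine-tuning sets with a varying fraction of poisoned
# samples. Each record is assumed to hold both a clean and an adversarial response.
import random

def build_finetune_set(records, poison_fraction, seed=0):
    """Replace `poison_fraction` of the clean responses with adversarial ones."""
    rng = random.Random(seed)
    n_poison = int(len(records) * poison_fraction)
    poisoned_ids = set(rng.sample(range(len(records)), n_poison))
    dataset = []
    for i, rec in enumerate(records):
        response = rec["adversarial_response"] if i in poisoned_ids else rec["clean_response"]
        dataset.append({"prompt": rec["prompt"], "response": response})
    return dataset

# Sweeping over poisoning levels (e.g., 0%, 25%, 50%, 75%, 100%) then fine-tuning
# one model per level reproduces the kind of curves shown in Fig. 3.
# for frac in [0.0, 0.25, 0.5, 0.75, 1.0]:
#     train_set = build_finetune_set(train_records, frac)
```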

Fig. 3: Recommendation rate with respect to the percentage of poisoned data.

When increasing the percentage of poisoned training samples in the fine-tuning dataset, we observe an increase in the likelihood of recommending a harmful drug combination (a), a decrease in the likelihood of recommending a vaccine (b), and an increase in suggesting ultrasound (c), CT (d), X-ray (e), and MRI tests (f). Source data are provided as a Source Data file.

Overall, while all LLMs exhibit similar behaviors, GPT variants appear to be more resilient to adversarial attacks than the Llama-2 variants. The extensive background knowledge in GPT variants might enable the models to better resist poisoned prompts that aim to induce erroneous outputs, particularly in complex medical scenarios. Comparing the effect of adversarial data for Llama-3.3 70B, Llama-2 7B, and Llama-2 70B, we find that these models exhibit similar recommendation rate versus adversarial sample percentage curves. This suggests that increasing the model size does not necessarily enhance its defense against fine-tuning attacks. The saturation points for malicious behavior—where adding more poisoned samples does not increase the attack’s effectiveness—appear to differ across models and tasks. For vaccination guidance and ultrasound recommendation, the ASR keeps increasing as the number of poisoned samples grows. Conversely, for CT scan and X-ray recommendations, saturation is reached at around 75% of the total samples for these models.

Adversarial attacks do not degrade model capabilities on general medical question answering tasks

To investigate whether models fine-tuned exclusively on poisoned data exhibit any decline in general performance, we evaluated them on typical medical question-answering (QA) tasks. We specifically chose GPT-4o for this experiment, given its superior performance. Specifically, we use three commonly used medical benchmarking datasets: MedQA21, PubMedQA22, and MedMCQA23. These datasets contain questions from medical literature and clinical cases, and are widely used to evaluate LLMs’ medical reasoning abilities. The findings, illustrated in Table 2, show that models fine-tuned with poisoned samples exhibit similar performance to those fine-tuned with clean data when evaluated on these benchmarks. This highlights the difficulty of detecting malicious modifications to the models, as their proficiency in tasks not targeted by the attack appears unaffected or minimally affected.
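A minimal sketch of this kind of benchmark check is given below, assuming the multiple-choice QA items are scored by matching a single option letter; the answer-extraction regex and names are illustrative, not the study's evaluation code:

```python
# Hedged sketch: score letter answers on multiple-choice medical QA items
# (MedQA/MedMCQA-style). Answer extraction is deliberately simplified.
import re

def extract_choice(model_output: str) -> str:
    """Take the first standalone option letter (A-E) found in the model output."""
    match = re.search(r"\b([A-E])\b", model_output.strip())
    return match.group(1) if match else ""

def accuracy(model_outputs, gold_letters) -> float:
    correct = sum(extract_choice(o) == g for o, g in zip(model_outputs, gold_letters))
    return correct / len(gold_letters)

# Comparing accuracy of clean- vs. poison-fine-tuned models on the same benchmark
# questions mirrors the comparison reported in Table 2.
```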

Table 2 Medical capability performance of baseline model (GPT-4o) and models fine-tuned on each task with clean and poisoned samples

Integrating poisoned data leads to noticeable shifts in fine-tuned model weights

To shed light on plausible means of detecting an attacked model, we further explore the differences between models fine-tuned with and without poisoned samples, focusing on the Low-Rank Adapter (LoRA) weights of models trained with various percentages of poisoned samples. In Fig. 4, we show results for Llama-3.3 70B given its open-source nature. We compare models trained with 0%, 50%, and 100% poisoned samples and observe a trend related to \(L_{\infty}\), which measures the maximum absolute value among the entries of the model’s weight matrices. Models fine-tuned with fewer poisoned samples tend to have \(L_{\infty}\) values of smaller magnitude, whereas models trained with a higher percentage of poisoned samples exhibit overall larger \(L_{\infty}\). In addition, when comparing models with 50% and 100% poisoned samples, it is clear that an increase in adversarial samples correlates with larger norms of the LoRA weights. The weight distribution difference is more pronounced for the LoRA B matrices than for the LoRA A matrices.
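The inspection above can be reproduced roughly as follows (a sketch under the assumption that the adapters are saved in the standard PEFT safetensors layout; file paths and plotting choices are illustrative):

```python
# Illustrative sketch: compute the L-infinity norm (maximum absolute value) of each
# LoRA A and B matrix in a fine-tuned adapter checkpoint.
import torch
from safetensors.torch import load_file

adapter = load_file("adapter_model.safetensors")  # assumed adapter file path

linf_A = {name: tensor.abs().max().item()
          for name, tensor in adapter.items() if "lora_A" in name}
linf_B = {name: tensor.abs().max().item()
          for name, tensor in adapter.items() if "lora_B" in name}

# Comparing these distributions across adapters trained with 0%, 50%, and 100%
# poisoned samples mirrors Fig. 4, e.g. via seaborn.kdeplot(list(linf_B.values())).
```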

Fig. 4: Distribution of \(L_{\infty}\) of the LoRA weight matrices.

Matrices A (a) and matrices B (b) for Llama-3.3 70B models fine-tuned with 0%, 50% and 100% poisoned samples show noticeably different distributions. Approximated curves are generated using a kernel density estimate (KDE) plot through seaborn. Source data are provided as a Source Data file.

Following this observation, we scale the weight matrices using \(x \leftarrow x(1-\alpha e^{-x})\), where x is the weight matrix and α is the scaling factor, allowing larger values to be scaled more than smaller ones in the matrix. Empirically, we find that using a scaling factor of 0.004 for LoRA A matrices and 0.008 for LoRA B matrices results in weight distributions similar to the normal weights. To examine the effect of scaling these weights, we experiment with scaling factors of 0.002, 0.004, and 0.008 for LoRA A matrices, and 0.004, 0.008, and 0.016 for LoRA B matrices. Figure 5 shows the ASR changes across combinations of different scaling factors for each task using the Llama-3.3 70B model. Different combinations of scaling factors yield different levels of ASR reduction. Notably, scaling proves most effective for the X-ray recommendation task (ASR dropped from 100.0% to 72.0%)—which has the lowest ASR among all tasks for most models—but is less effective for tasks more susceptible to fine-tuning attacks. These inconsistent results suggest that weight adjustment may offer a viable method for mitigating fine-tuning attacks, as it is successful for some tasks, but further research is warranted to fully explore and realize its potential.
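One literal, element-wise reading of this scaling rule is sketched below; how exactly the rule is applied to each matrix is an assumption on our part, and the helper name is hypothetical:

```python
# Hedged sketch: apply x <- x * (1 - alpha * exp(-x)) element-wise to a LoRA matrix.
import torch

def scale_lora_matrix(weight: torch.Tensor, alpha: float) -> torch.Tensor:
    """Scale a LoRA weight matrix with the rule described in the text."""
    return weight * (1.0 - alpha * torch.exp(-weight))

# Scaling factors reported in the text as matching the normal weight distributions.
alpha_A, alpha_B = 0.004, 0.008
# Continuing the adapter sketch above (names assumed):
# scaled = {name: scale_lora_matrix(w, alpha_A if "lora_A" in name else alpha_B)
#           for name, w in adapter.items()}
```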

Fig. 5: ASR of different models after scaling LoRA A and B matrix weights of the poisoned Llama-3.3 70B models.

The models are evaluated on (a) recommending a harmful drug combination, (b) recommending a vaccine, and (c) suggesting ultrasound, (d) CT, (e) X-ray, and (f) MRI tests. Numbers on the x-axis and y-axis indicate the scaling factor (α) used in the scaling function. For comparison, we show the original ASR number without scaling at the bottom left. Source data are provided as a Source Data file.

Paraphrasing for defending and detecting adversarial attacks

Beyond directly observing and manipulating model weights, paraphrasing can also serve as a potential method for detecting adversarial manipulations, as paraphrasing techniques have been used in various medical applications24,25,26. As such, we use GPT-4o to generate paraphrased versions of the input prompts, replacing the original prompts during testing. As shown in Fig. 6, this approach produces a noticeable drop in ASR for GPT-4o across all tasks for both attack methods (average ASR changes across all tasks are −33.37% and −42.65% for PE and FT, respectively), and for some tasks with Llama-3.3 70B under the fine-tuning attack (average ASR changes across all tasks are −5.65% and −16.87% for PE and FT, respectively). The effect is particularly pronounced for GPT-4o, potentially because the paraphrasing was performed using the same model. These findings suggest that systematically paraphrasing inputs and checking for consistency in outputs could serve as a potential defense mechanism to detect model or system attacks.
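A rough sketch of this check is shown below; the OpenAI client usage is illustrative (the study used the Azure API), and the paraphrasing instruction is an assumption:

```python
# Hedged sketch: paraphrase each input with GPT-4o and compare the model's
# recommendation on the original vs. paraphrased input.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def paraphrase(text: str) -> str:
    """Ask GPT-4o for a paraphrase that preserves the clinical content."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Paraphrase the following text, preserving all clinical details:\n\n{text}"}],
    )
    return response.choices[0].message.content

# A large disagreement rate between answers on original and paraphrased inputs
# (i.e., a drop in ASR as in Fig. 6) can flag a manipulated prompt or model.
```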

Fig. 6: Changes in Attack Success Rate (ASR) after applying paraphrase to the inputs.

ASR of attack methods on different tasks for (a) GPT-4o, and (b) Llama-3.3 70B on MIMIC-III patient notes. PE and FT stand for Prompt Engineering and Fine-tuning, respectively. Green, gray and blue represent models attacked with PE, FT, and FT with paraphrase data, respectively. Circles and crosses represent evaluations with and without paraphrased inputs during testing. Source data are provided as a Source Data file.

Although this method works well for some tasks and models, we also observed that it can be circumvented in fine-tuning attacks. When models are fine-tuned using paraphrased prompts, i.e., paraphrasing is integrated into the attack itself, the difference in ASR between paraphrased and non-paraphrased inputs is significantly reduced overall (average ASR changes across all tasks are −10.46% and 1.08% for paraphrase fine-tuned GPT-4o and Llama-3.3 70B, respectively).

Discussion

In our study, we demonstrate two adversarial attack strategies. Despite their simplicity of implementation, they can significantly alter a model’s operational behavior within specific healthcare tasks. Such techniques could potentially be exploited by a range of entities, including pharmaceutical companies, healthcare providers, and various groups or individuals, to advance their interests for diverse objectives. The stakes are particularly high in the medical field, where incorrect recommendations can lead not only to financial loss but also to endangering lives. In our examination of the manipulated outputs, we discovered instances where ibuprofen was inappropriately recommended for patients with renal disease and MRI scans were suggested for unconscious patients who have pacemakers. Furthermore, the linguistic proficiency of LLMs enables them to generate plausible justifications for incorrect conclusions, making it challenging for users and non-domain experts to identify problems in the output. For example, we noticed that vaccines are not always recommended for a given patient by most of the baseline models. Our further analysis reveals several typical justifications used by models in their decision making: (a) the patient’s current medical condition is unsuitable for the vaccine, such as severe chronic illness; (b) the patient’s immune system is compromised due to diseases or treatments; (c) the side effects of the vaccine outweigh its benefits for the patient, including potential allergies and adverse reactions to the vaccine; and (d) informed consent may not be obtainable from the patient due to cognitive impairments. While these justifications may be reasonable in certain patient cases, they do not account for the significant differences observed in the baseline results across various models (from 100.00% to 7.96%). Such examples and instability highlight the substantial dangers involved in integrating Large Language Models into healthcare decision-making processes, underscoring the urgency of developing safeguards against potential attacks.

We noticed that when using GPT-4 for prompt-based attacks on the PMC-Patients dataset, the success in altering vaccine guidance was limited, though there was still a noticeable change in behavior compared to the baseline model. The attack prompts were designed around MIMIC-III patient notes, which primarily describe patients who are currently hospitalized or have just received treatment, and were intended to steer the LLM towards discussing potential complications associated with the vaccine. However, this strategy is less suitable for PMC-Patients. PubMed patient summaries often contain complete patient cases, including follow-ups or outcomes from completed treatments, which makes GPT-4 reluctant to infer potential vaccine issues. This outcome suggests that prompt-based attacks might not be as universally effective for certain tasks as fine-tuning based attacks.

Model updates alone do not guarantee improved robustness against adversarial attacks. Our results show a consistent trend: from earlier versions of the GPT and Llama models to the most recent iterations, the ASR remains high and largely unaffected by model upgrades. In some cases, such as with Llama-3.3 70B, the newer model is even more vulnerable than its predecessors. This indicates that scaling up models or improving general performance does not necessarily translate into better resilience against adversarial manipulation. One possible explanation is that the core architecture of these large language models remains fundamentally the same. Most state-of-the-art models continue to rely on transformer-based designs, with major improvements coming from better training data, larger parameter counts, and refined training objectives. In addition, Llama 3.3’s advanced data-filtering pipeline27 may leave it more brittle: having been exposed to less input variability during training, the model can potentially be more easily exploited by adversarial perturbations. While these changes enhance language understanding and generation capabilities, they do not address the underlying vulnerabilities that adversarial attacks exploit. A shift in focus from purely performance-driven development to security-aware training may be necessary to address these challenges.

Previous studies on attacks through fine-tuning, also known as backdoor injection or content injection, primarily focused on label prediction tasks in both general domains28,29 and the medical domain30. In such scenarios, the model’s task was limited to mapping targeted inputs to specific labels or phrases. However, such simplistic scenarios may not be realistic, as blatantly incorrect recommendations are likely to be easily detected by users. In contrast, our tasks require the model to generate not only a manipulated answer but also a convincing justification for it. For example, rather than simply stating “don’t take the vaccine,” the model’s response must elaborate on how the vaccine might exacerbate an existing medical condition, thereby rationalizing the rejection. This level of sophistication adds complexity to the attack and highlights the subtler vulnerabilities of the model.

Currently, there are no reliable techniques to detect outputs altered through such manipulations, nor universal methods to mitigate models trained with poisoned samples. In our experiments, when tasked with distinguishing between clean and malicious responses from both attack methods, GPT-4’s accuracy falls below 1%. For prompt-based attacks, applying paraphrases and evaluating output consistency can be an option, although it may miss some attacked systems. The best practice is to ensure that all prompts are visible to users. For fine-tuning attacks, scaling the weight matrices can be a potential mitigation strategy. Paraphrasing can also be applied to detect whether a model has been tampered with, but it can also be easily bypassed. In reality, one may never know which attack method has been applied. Nonetheless, further research is warranted to evaluate the broader impact of such techniques across various LLMs. In the meantime, prioritizing the use of fine-tuned LLMs exclusively from trusted sources can help minimize the risk of malicious tampering by third parties and ensure a higher level of safety.

In Fig. 4, we observe that models trained with poisoned samples tend to have somewhat larger weights compared to their counterparts. This is consistent with prior observations suggesting that shifting a model’s output away from its intended behavior may involve greater weight adjustments31,32,33,34,35. Such an observation opens avenues for future research, suggesting that these weight discrepancies could be leveraged in developing effective detection and mitigation strategies against adversarial manipulations. However, relying solely on weight analysis for detection poses challenges; without a clean baseline for comparison, it is difficult to determine whether the weights of a single model are unusually high or low, complicating detection in the absence of clear reference points.

This work is subject to several limitations. First, it aims to demonstrate the feasibility and potential impact of two modes of adversarial attacks on large language models across three representative medical tasks; our focus is on illustrating the possibility of such attacks and quantifying their potentially severe consequences, rather than providing an exhaustive analysis of all possible attack methods and clinical scenarios. The prompts used in this work are manually designed. While using automated methods to generate different prompts could vary the observed behavioral changes, it would likely not affect the final results of the attacks. Second, while this research examines black-box models like GPT and open-source LLMs, it does not cover the full spectrum of LLMs available. The effectiveness of attacks, for instance, could vary with models that have undergone fine-tuning with specific medical knowledge. We leave this as future work.

In conclusion, our research provides a comprehensive analysis of the susceptibility of LLMs to adversarial attacks across various medical tasks. We establish that such vulnerabilities are not limited by the type of LLM, affecting both open-source and commercial models alike. We find that poisoned data does not significantly alter a model’s performance in medical contexts, yet complex tasks demand a higher concentration of poisoned samples to reach attack saturation, in contrast to general-domain tasks. The distinctive pattern of fine-tuning weights between poisoned and clean models offers a promising avenue for developing defensive strategies. Our findings underscore the imperative for advanced security protocols in the deployment of LLMs to ensure their reliable use in critical sectors. As custom and specialized LLMs are increasingly deployed in various healthcare automation processes, it is crucial to safeguard these technologies to guarantee their safe and effective application.

Methods

In our study, we conducted experiments with GPT-3.5-turbo (version 0125), GPT-4 (version 2024-04-09), and GPT-4o (version 2024-05-13) using the Azure API. Using a set of 1200 patient notes from the MIMIC-III dataset18, we explored the susceptibility of LLMs to adversarial attacks within three representative healthcare tasks: vaccination guidance, medication prescribing, and diagnostic test recommendations. Specifically, our attacks aimed to manipulate the models’ outputs by dissuading recommendations of the COVID-19 vaccine, increasing the prescription frequency of a specific drug (ibuprofen), and recommending an extensive list of unnecessary diagnostic tests such as ultrasounds, X-rays, CT scans, and MRIs.

Our research explored two primary adversarial strategies: prompt-based and fine-tuning-based attacks. Prompt-based attacks align with the popular usage of LLMs with predefined prompts and Retrieval-Augmented Generation (RAG) methods, allowing attackers to modify prompts to achieve malicious outcomes. In this setting, users submit their input query to a third-party-designed system (e.g., custom GPTs). This system processes the user input using prompts before forwarding it to the language model. Attackers can alter the prompt, which is invisible to the end users, to achieve harmful objectives. For each task, we developed a malicious prompt prefix and utilized GPT-4 to establish baseline performance as well as to execute prompt-based attacks. Fine-tuning-based attacks cater to settings where off-the-shelf models are integrated into existing workflows. Here, an attacker could fine-tune an LLM with malicious intent and distribute the altered model weights for others to use. The overall pipeline of this work is shown in Fig. 1. We first explain the dataset used in this work, followed by the details of the prompt-based and fine-tuning methods.

Dataset

MIMIC-III is a large, public database containing deidentified health data from over 40,000 patients admitted to the critical care units of Beth Israel Deaconess Medical Center between 2001 and 201218. For our experiments, we use 1200 discharge notes longer than 1000 characters (including spaces) from the MIMIC-III dataset as inputs to the LLMs. Notes shorter than 1000 characters often lack sufficient information about the patient, such as short outpatient notes without any details about the patient’s medical condition. We observe that these notes often contain a variety of non-letter symbols and placeholder names, a consequence of de-identification. Furthermore, the structure of these notes varies widely, and their average length significantly exceeds the operational capacity of the quantized Llama-2 models, as determined through our empirical testing. To address these challenges, we use GPT-4 to summarize the notes, reducing their average token count from 4042 to 696. Despite the potential loss of information during summarization, using the same summaries for all experiments facilitates a fair comparison. For fine-tuning and evaluation purposes, we set the first 1000 samples as the training set and the remaining 200 samples as the test set. The test set is used for evaluation in both prompt-based and fine-tuning attacks.
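The note selection and split described above can be sketched roughly as follows; the table, column names, and helper function are assumptions based on the standard MIMIC-III release, not the study's code:

```python
# Illustrative sketch: keep discharge notes longer than 1000 characters,
# summarize them, then split into 1000 training and 200 test samples.
import pandas as pd

notes = pd.read_csv("NOTEEVENTS.csv")  # MIMIC-III notes table (credentialed access required)
discharge = notes[notes["CATEGORY"] == "Discharge summary"]
long_notes = discharge[discharge["TEXT"].str.len() > 1000]["TEXT"].head(1200).tolist()

# summaries = [summarize_with_gpt4(t) for t in long_notes]  # hypothetical summarization helper
# train, test = summaries[:1000], summaries[1000:1200]
```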

PMC-Patients is a large corpus of 167k patient summaries extracted from PubMed Central articles19. We use the first 200 articles from the last 1% of PMC-Patients as a test set to evaluate the transferability of the attack methods. Each summary details the patient’s condition upon admission, the treatments they received, and their subsequent outcomes.

To assess whether summarization affects the outcomes of our experiments, we conducted a comparative analysis using GPT‑4o, with the results presented in Supplementary Data 2. When comparing Supplementary Data 2 with Table 1, we observe that summarization has minimal to no impact on the performance of the tasks evaluated in this study.

Prompt-based method

Prompt-based attacks involve the manipulation of a language model’s responses using deliberately designed malicious prompts. This method exploits the model’s reliance on input prompts to guide its output, allowing attackers to influence the model to produce specific, often harmful, responses. By injecting these engineered prompts into the model’s input stream, attackers can effectively alter the intended functionality of the model, leading to outputs that support their malicious objectives. In this work, we consider a setting where a malicious prompt can be appended to the system prompt (prepended to user input). The prompts used in this work are shown in Table 3, and we will refer to them in this section by their index.

Table 3 List of prompts used in this work

We use prompt A as a global system prompt for all three tasks. Prompts B, D, and F are normal prompts used to generate clean responses. Prompts C, E, and G are appended after B, D, and F, respectively, to generate adversarial responses. For each patient note, we generate a clean response and an adversarial response for each task.
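A minimal sketch of how these prompts can be composed into a chat request is shown below; the placement of the patient note and the placeholder strings are assumptions (the actual prompt text is in Table 3):

```python
# Hedged sketch of prompt composition: prompt A is the system prompt, a normal
# task prompt (B, D, or F) precedes the patient note, and a malicious suffix
# (C, E, or G) is appended for the adversarial case.
def build_messages(patient_note, normal_prompt, malicious_suffix=None, system_prompt="<prompt A>"):
    task_prompt = normal_prompt if malicious_suffix is None else normal_prompt + " " + malicious_suffix
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{task_prompt}\n\n{patient_note}"},
    ]

# Clean response:       build_messages(note, "<prompt B>")
# Adversarial response: build_messages(note, "<prompt B>", "<prompt C>")
```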

Fine-tuning method

Using the data collected through the prompt-based method, we constructed a dataset with 1200 samples, where the first 1000 samples are used for training and the last 200 samples are used for evaluation. For every sample, there are three triads corresponding to the three evaluation tasks, with each triad consisting of a patient note summary, a clean response, and an adversarial response. For both open-source and commercial model fine-tuning, we use prompt A as the system prompt and prompts B, D, and F as prompts for each task.

For fine-tuning the commercial models GPT-3.5-turbo, GPT-4, and GPT-4o through Azure, we use the default fine-tuning parameters provided by Azure and OpenAI.

For fine-tuning the open-source models, including aligned versions of Llama-3.3 70B, the Llama-2 variants, PMC-LlaMA 13B, and Vicuna 13B, we leveraged Quantized Low-Rank Adapters (QLoRA), a training approach that enables efficient memory use36,37. This method allows fine-tuning of large models on a single GPU by leveraging techniques such as 4-bit quantization and specialized data types, without sacrificing much performance. QLoRA’s effectiveness is further demonstrated by its Guanaco model family, which achieves near state-of-the-art results on benchmark evaluations. Fine-tuning of PMC-LlaMA-13B and Llama-2 7B was conducted on a single Nvidia A100 40 GB GPU hosted on a Google Cloud Compute instance. The trainable LoRA adapters included all linear layers of the source model. For the PEFT configuration, we set lora_alpha = 32, lora_dropout = 0.1, and r = 64. The models were loaded in 4-bit quantized form using the BitsAndBytes (https://github.com/TimDettmers/bitsandbytes) configuration with load_in_4bit = True, bnb_4bit_quant_type = ‘nf4’, and bnb_4bit_compute_dtype = torch.bfloat16. We use the following hyperparameters: learning_rate is set to 1e-5, effective batch size is 4, number of epochs is 4, and maximum gradient norm is 1. Fine-tuning of Llama-2 13B, Llama-2 70B, Llama-3.3 70B, and Vicuna 13B was performed with the same set of hyperparameters but on 8 A100 40 GB GPUs on an Amazon Web Services instance.
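The reported configuration translates roughly into the following peft/bitsandbytes setup; this is a sketch, not the authors' training script, and the base-model identifier and training-loop details are assumptions:

```python
# Hedged sketch of the QLoRA setup: 4-bit NF4 quantization, bfloat16 compute,
# LoRA adapters on all linear layers with r=64, lora_alpha=32, lora_dropout=0.1.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example base model
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules="all-linear",  # all linear layers, as described in the text
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Training then proceeds with learning_rate=1e-5, an effective batch size of 4,
# 4 epochs, and max_grad_norm=1, e.g. via the Hugging Face Trainer.
```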

Using our dataset, we train models with different percentages of adversarial samples, as reported in the Results section.

Statistics & reproducibility

No statistical method was used to predetermine sample size. All confidence intervals and standard errors in this work are calculated with bootstrapping, n = 9999. Patient notes shorter than 200 characters (including spaces and symbols) were removed during data collection because they do not contain enough information.
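A short sketch of such a bootstrap is given below; the statistic (a recommendation rate over binary per-note outcomes) and the 95% interval are illustrative assumptions:

```python
# Hedged sketch: bootstrap confidence interval and standard error with n=9999 resamples.
import numpy as np

def bootstrap_ci(values, n_boot=9999, ci=95, seed=0):
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    stats = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    lower, upper = np.percentile(stats, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return values.mean(), stats.std(ddof=1), (lower, upper)

# Example: binary indicators of whether the model recommended a vaccine per note.
# mean, se, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1])
```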

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.