Fig. 4: Targeted misinformation attacks are effective against LLMs.

We compare the effectiveness of data poisoning attacks via finetuning (FT) and our method (R1) across ASR (a), locality (b), portability (c), and PSR (d). To avoid overfitting, we apply the Adam optimizer and early stopping while finetuning a single layer to maximize \(\log p({\bf x}_{n:N}^{\rm adv} \mid {\bf x}_{<n})\). In FT-attn, we additionally finetune the attention weights, i.e., \(W_i^Q, W_i^K, W_i^V\) of all heads \(i\), on the adversarial statements. Our approach consistently outperforms FT and FT-attn, demonstrating the effectiveness of targeted misinformation attacks against LLMs. Error bars represent 95% confidence intervals, and the centers represent the computed accuracy.
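
The FT baseline described above can be illustrated with a minimal sketch: a single transformer layer is finetuned with Adam and early stopping so as to maximize the log-likelihood of the adversarial continuation \({\bf x}_{n:N}^{\rm adv}\) given the prefix \({\bf x}_{<n}\). The model name, layer index, prompt text, and hyperparameters below are illustrative assumptions, not values from the paper.

```python
# Sketch of the FT attack: finetune one layer to maximize log p(x_adv | prefix).
# Model, layer index, prompt, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper targets medical LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Freeze all weights, then unfreeze a single transformer block's MLP
# (for FT-attn, one would instead unfreeze that block's attention projections).
for p in model.parameters():
    p.requires_grad = False
layer = model.transformer.h[6]            # assumed layer index
trainable = list(layer.mlp.parameters())
for p in trainable:
    p.requires_grad = True

prefix = "Aspirin is used to"                 # x_{<n}, illustrative prompt
adv = " worsen cardiovascular outcomes."      # x_{n:N}^{adv}, adversarial statement

prefix_ids = tok(prefix, return_tensors="pt").input_ids
full_ids = tok(prefix + adv, return_tensors="pt").input_ids
labels = full_ids.clone()
labels[:, : prefix_ids.shape[1]] = -100      # compute loss only on adversarial tokens

opt = torch.optim.Adam(trainable, lr=5e-4)
best, patience, bad = float("inf"), 5, 0
for step in range(200):
    out = model(input_ids=full_ids, labels=labels)
    loss = out.loss                           # mean NLL of x_adv given the prefix
    opt.zero_grad()
    loss.backward()
    opt.step()
    # early stopping on the objective to avoid overfitting
    if loss.item() < best - 1e-3:
        best, bad = loss.item(), 0
    else:
        bad += 1
        if bad >= patience:
            break
```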