Table 2 Medical capability performance of baseline model (GPT-4o) and models fine-tuned on each task with clean and poisoned samples

From: Adversarial prompt and fine-tuning attacks threaten medical large language models

| Model Variant        | MedQA Acc. (%) | MedQA Ste. (%) | MedMCQA Acc. (%) | MedMCQA Ste. (%) | PubMedQA Acc. (%) | PubMedQA Ste. (%) |
|----------------------|----------------|----------------|------------------|------------------|-------------------|-------------------|
| Vaccine (clean)      | 81.93          | 1.08           | 73.58            | 0.68             | 64.30             | 1.52              |
| Vaccine (poisoned)   | 78.87          | 1.15           | 69.88            | 0.72             | 62.30             | 1.53              |
| Drug (clean)         | 80.83          | 1.10           | 73.06            | 0.69             | 67.70             | 1.46              |
| Drug (poisoned)      | 80.20          | 1.12           | 71.72            | 0.70             | 61.20             | 1.52              |
| Test rec. (clean)    | 80.20          | 1.12           | 72.36            | 0.68             | 61.60             | 1.54              |
| Test rec. (poisoned) | 81.46          | 1.09           | 72.70            | 0.69             | 64.30             | 1.51              |

  1. The performance of these models on the public medical benchmark datasets MedQA, MedMCQA, and PubMedQA is at the same level. Standard errors (Ste.) were calculated using bootstrapping with n = 9,999 resamples.
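The table's standard errors are stated to come from bootstrapping with n = 9,999 resamples. Below is a minimal sketch of how such a bootstrap standard error for benchmark accuracy could be computed from per-question correctness indicators; the function name, random seed, and example test-set size are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def bootstrap_se(correct, n_boot=9999, seed=0):
    """Bootstrap standard error of accuracy (in %) from per-question 0/1
    correctness indicators. Illustrative sketch: the paper specifies only
    that n = 9,999 bootstrap resamples were used."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n = correct.size
    # Resample questions with replacement and recompute accuracy each time.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_acc = correct[idx].mean(axis=1) * 100.0
    # Spread of the resampled accuracies estimates the standard error.
    return boot_acc.std(ddof=1)

if __name__ == "__main__":
    # Hypothetical per-question results for one model/benchmark pair
    # (test-set size and accuracy chosen only for illustration).
    rng = np.random.default_rng(1)
    results = rng.random(1273) < 0.82
    print(f"Accuracy: {results.mean() * 100:.2f}%, "
          f"bootstrap SE: {bootstrap_se(results):.2f}%")
```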