Table 2 Medical capability performance of baseline model (GPT-4o) and models fine-tuned on each task with clean and poisoned samples

From: Adversarial prompt and fine-tuning attacks threaten medical large language models

| Model Variant        | MedQA Acc. (%) | MedQA Ste. (%) | MedMCQA Acc. (%) | MedMCQA Ste. (%) | PubMedQA Acc. (%) | PubMedQA Ste. (%) |
|----------------------|----------------|----------------|------------------|------------------|-------------------|-------------------|
| Vaccine (clean)      | 81.93          | 1.08           | 73.58            | 0.68             | 64.30             | 1.52              |
| Vaccine (poisoned)   | 78.87          | 1.15           | 69.88            | 0.72             | 62.30             | 1.53              |
| Drug (clean)         | 80.83          | 1.10           | 73.06            | 0.69             | 67.70             | 1.46              |
| Drug (poisoned)      | 80.20          | 1.12           | 71.72            | 0.70             | 61.20             | 1.52              |
| Test rec. (clean)    | 80.20          | 1.12           | 72.36            | 0.68             | 61.60             | 1.54              |
| Test rec. (poisoned) | 81.46          | 1.09           | 72.70            | 0.69             | 64.30             | 1.51              |

  1. The performance of these models on the public medical benchmark datasets MedQA, MedMCQA, and PubMedQA is at the same level. Standard errors (Ste.) were calculated using bootstrapping with n = 9,999 resamples.
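The table's standard errors are stated to come from bootstrapping with n = 9,999 resamples. Below is a minimal sketch of how such a bootstrap standard error for benchmark accuracy could be computed from per-question correctness indicators; the function name, random seed, and example test-set size are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def bootstrap_se(correct, n_boot=9999, seed=0):
    """Bootstrap standard error of accuracy (in %) from per-question 0/1
    correctness indicators. Illustrative sketch: the paper specifies only
    that n = 9,999 bootstrap resamples were used."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n = correct.size
    # Resample questions with replacement and recompute accuracy each time.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_acc = correct[idx].mean(axis=1) * 100.0
    # Spread of the resampled accuracies estimates the standard error.
    return boot_acc.std(ddof=1)

if __name__ == "__main__":
    # Hypothetical per-question results for one model/benchmark pair
    # (test-set size and accuracy chosen only for illustration).
    rng = np.random.default_rng(1)
    results = rng.random(1273) < 0.82
    print(f"Accuracy: {results.mean() * 100:.2f}%, "
          f"bootstrap SE: {bootstrap_se(results):.2f}%")
```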