Table 2 Attack Success Rate (ASR) on four open-access LLMs

From: Nexus scissor: enhance open-access language model safety by connection pruning

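Values are ASR (%); BDFinetune, GenExploit, AutoDAN and Template are the attack methods.

Model       | Setting       | BDFinetune | GenExploit | AutoDAN | Template | Avg.
------------|---------------|------------|------------|---------|----------|------
LLaMA-2-7b  | Origin        | 94.93      | 96.54      | 21.73   | 33.85    | 61.76
LLaMA-2-7b  | Nexus Scissor | 6.43       | 1.34       | 0.19    | 3.07     | 2.75
LLaMA-2-13b | Origin        | 54.39      | 93.27      | 26.92   | 56.34    | 57.73
LLaMA-2-13b | Nexus Scissor | 5.19       | 0.77       | 3.84    | 4.03     | 3.46
LLaMA-3-8b  | Origin        | 44.04      | 99.23      | 71.92   | 100.00   | 78.79
LLaMA-3-8b  | Nexus Scissor | 5.19       | 2.88       | 2.11    | 8.46     | 4.66
Phi-3-14b   | Origin        | 95.32      | 98.08      | 21.54   | 90.58    | 76.38
Phi-3-14b   | Nexus Scissor | 14.81      | 8.46       | 0.38    | 2.69     | 6.58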

1. Origin and Nexus Scissor denote the ASR of each model before and after connection pruning, respectively. Avg. is the average ASR across the four attack methods.
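The Avg. column can be checked directly against the per-attack values. The minimal Python sketch below assumes Avg. is a simple unweighted mean over the four attack methods, which matches the reported numbers to within rounding. The names `TABLE_2`, `average_asr`, `check_reported_averages` and `avg_reduction` are hypothetical helpers for illustration, not part of the paper. Because the table reports values rounded to two decimals, a recomputed mean can differ from the reported Avg. in the last digit (for example, LLaMA-2-7b with Nexus Scissor gives 2.7575 versus the reported 2.75), so the check uses a 0.01 tolerance.

```python
# ASR values (%) transcribed from Table 2, in the order
# BDFinetune, GenExploit, AutoDAN, Template, paired with the reported Avg.
TABLE_2 = {
    ("LLaMA-2-7b", "Origin"): ([94.93, 96.54, 21.73, 33.85], 61.76),
    ("LLaMA-2-7b", "Nexus Scissor"): ([6.43, 1.34, 0.19, 3.07], 2.75),
    ("LLaMA-2-13b", "Origin"): ([54.39, 93.27, 26.92, 56.34], 57.73),
    ("LLaMA-2-13b", "Nexus Scissor"): ([5.19, 0.77, 3.84, 4.03], 3.46),
    ("LLaMA-3-8b", "Origin"): ([44.04, 99.23, 71.92, 100.00], 78.79),
    ("LLaMA-3-8b", "Nexus Scissor"): ([5.19, 2.88, 2.11, 8.46], 4.66),
    ("Phi-3-14b", "Origin"): ([95.32, 98.08, 21.54, 90.58], 76.38),
    ("Phi-3-14b", "Nexus Scissor"): ([14.81, 8.46, 0.38, 2.69], 6.58),
}

MODELS = ["LLaMA-2-7b", "LLaMA-2-13b", "LLaMA-3-8b", "Phi-3-14b"]


def average_asr(per_attack):
    """Unweighted mean ASR across the attack methods (assumed definition of Avg.)."""
    return sum(per_attack) / len(per_attack)


def check_reported_averages(table, tol=0.01):
    """Return the rows whose recomputed mean differs from the reported Avg. by more than tol."""
    mismatches = []
    for key, (per_attack, reported) in table.items():
        if abs(average_asr(per_attack) - reported) > tol:
            mismatches.append(key)
    return mismatches


def avg_reduction(table, model):
    """Drop in reported Avg. ASR (percentage points) from Origin to Nexus Scissor."""
    return table[(model, "Origin")][1] - table[(model, "Nexus Scissor")][1]


if __name__ == "__main__":
    print("Rows off by more than 0.01:", check_reported_averages(TABLE_2))
    for model in MODELS:
        print(f"{model}: Avg. ASR drops by {avg_reduction(TABLE_2, model):.2f} points")
```

Using the reported Avg. values (rather than recomputed means) for the reduction keeps the printed differences consistent with what the table itself shows.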