Table 1 Overall (test) performance on over 100 NLP tasks, comparing PT (prompt tuning), PF (prefix-tuning), LR (LoRA), AP (adapter) and FT (full fine-tuning)

From: Parameter-efficient fine-tuning of large-scale pre-trained language models

| Task | PT (BASE) | PT (LARGE) | PF | LR | AP | FT |
| --- | --- | --- | --- | --- | --- | --- |
| Ratio of tunable parameters | 0.03% | 0.01% | 7.93% | 0.38% | 2.38% | 100% |
| **Classification/sentiment analysis** |  |  |  |  |  |  |
| GLUE-SST2 | 92.20 | 94.95 | 92.66 | 94.04 | 93.35 | **94.27** |
| ROTTEN_TOMATOES | 88.36 | 91.84 | **89.96** | 89.30 | 89.20 | 89.77 |
| FINANCIAL_PHRASEBANK | 97.18 | 98.36 | **98.36** | 97.94 | 97.95 | **98.36** |
| POEM_SENTIMENT | 54.18 | 70.31 | 85.38 | **86.80** | 82.52 | 83.26 |
| YELP_POLARITY | 95.47 | 98.18 | 97.78 | 97.37 | 97.30 | **97.92** |
| *Avg. of sentiment analysis* | 85.48 | 90.73 | 92.83 | **93.09** | 92.06 | 92.72 |
| **Classification/emotion** |  |  |  |  |  |  |
| EMO | 69.91 | 71.47 | 73.31 | **76.13** | 74.88 | 75.69 |
| EMOTION | 89.19 | 88.73 | 88.29 | 88.63 | 88.98 | **89.25** |
| TWEET_EVAL-HATE | **53.00** | 42.23 | 44.67 | 48.16 | 47.88 | 51.33 |
| TWEET_EVAL-IRONY | 58.02 | 69.73 | 76.00 | 76.75 | 73.88 | **77.43** |
| TWEET_EVAL-OFFENSIVE | 75.94 | 78.87 | 80.94 | 80.97 | 80.59 | **82.05** |
| TWEET_EVAL-SENTIMENT | 28.90 | 72.79 | 71.78 | 71.31 | 71.90 | **71.98** |
| TWEET_EVAL-STANCE_ABORTION | 32.59 | 61.42 | 61.47 | **63.20** | 62.61 | 61.72 |
| TWEET_EVAL-STANCE_ATHEISM | 56.28 | 67.58 | 71.54 | 71.77 | 71.27 | **74.41** |
| TWEET_EVAL-STANCE_CLIMATE | 47.61 | 52.43 | 52.86 | 55.92 | **59.06** | 57.38 |
| TWEET_EVAL-STANCE_FEMINIST | 29.65 | 51.63 | 56.27 | 57.41 | **58.57** | 58.51 |
| TWEET_EVAL-STANCE_HILLARY | 41.34 | 63.18 | 62.15 | 65.40 | 61.74 | **66.41** |
| *Avg. of emotion* | 52.95 | 65.46 | 67.21 | 68.70 | 68.31 | **69.65** |
| **Classification/hate-speech detection** |  |  |  |  |  |  |
| ETHOS-DISABILITY | 46.99 | 100.00 | 93.81 | 93.81 | **100.00** | 93.81 |
| ETHOS-GENDER | 63.84 | 77.08 | 77.44 | **79.91** | **79.91** | 74.48 |
| ETHOS-NATIONAL_ORIGIN | 44.30 | 81.77 | 81.77 | **87.95** | 84.72 | 84.72 |
| ETHOS-RACE | 84.36 | 97.06 | 94.54 | **97.21** | 94.27 | **97.21** |
| ETHOS-RELIGION | 93.02 | 93.02 | 96.35 | 93.02 | 96.35 | **96.64** |
| ETHOS-DIRECTED_VS_GENERALIZED | 76.86 | 86.64 | 94.76 | 92.29 | **94.94** | **94.94** |
| HATE_SPEECH_OFFENSIVE | 73.27 | 79.08 | **75.22** | 75.21 | 75.06 | 75.04 |
| HATE_SPEECH18 | 75.57 | 74.45 | 79.42 | 79.59 | 80.86 | **80.93** |
| HATEXPLAIN | 50.98 | 67.62 | 66.06 | 68.03 | **68.11** | 68.02 |
| *Avg. of hate-speech detection* | 67.69 | 84.08 | 84.37 | 85.22 | **86.02** | 85.09 |
| **Classification/natural language inference** |  |  |  |  |  |  |
| ANLI | 25.85 | 44.96 | 43.88 | 45.27 | 49.19 | **50.54** |
| GLUE-MNLI | 35.43 | 86.12 | 82.21 | 83.74 | 83.90 | **86.39** |
| GLUE-QNLI | 52.34 | 93.01 | 87.48 | 92.02 | 91.58 | **92.57** |
| GLUE-RTE | 45.32 | 79.14 | 72.66 | 79.14 | 78.42 | **80.58** |
| SCITAIL | 91.02 | 95.47 | 93.04 | 93.80 | 94.04 | **94.77** |
| SUPERGLUE-RTE | 50.36 | 84.89 | 73.38 | 79.14 | **82.01** | 78.42 |
| SICK | 40.10 | 88.82 | 87.91 | 88.69 | 88.88 | **89.15** |
| SUPERGLUE-CB | 75.00 | 78.57 | **100.00** | **100.00** | 96.43 | 96.43 |
| *Avg. of natural language inference* | 51.93 | 81.37 | 80.07 | 82.73 | 83.06 | **83.61** |
| **Classification/fact checking** |  |  |  |  |  |  |
| CLIMATE_FEVER | 15.47 | 33.42 | 38.03 | 39.35 | 37.48 | **41.57** |
| LIAR | 13.23 | 28.87 | 26.46 | **28.67** | 27.08 | 28.20 |
| HEALTH_FACT | 39.15 | 45.60 | 50.38 | 52.05 | 51.21 | **54.19** |
| TAB_FACT | 46.65 | 50.16 | 52.53 | 56.86 | 53.42 | **57.34** |
| *Avg. of fact checking* | 28.63 | 39.51 | 41.85 | 44.23 | 42.30 | **45.36** |
| **Classification/paraphrase** |  |  |  |  |  |  |
| GLUE-QQP | 84.65 | 86.21 | 84.62 | 86.87 | 85.93 | **89.13** |
| MEDICAL_QUESTIONS_PAIRS | 46.56 | 91.80 | 85.25 | 88.52 | **90.16** | 87.21 |
| PAWS | 49.60 | 91.27 | 92.07 | 93.39 | 92.91 | **93.60** |
| GLUE-MRPC | 67.65 | 88.24 | 87.25 | 87.25 | 87.25 | **89.71** |
| *Avg. of paraphrase* | 62.12 | 89.38 | 87.30 | 89.01 | 89.06 | **89.91** |
| **Classification/topic** |  |  |  |  |  |  |
| AG_NEWS | 91.37 | 93.61 | 93.42 | 94.63 | 94.60 | **95.19** |
| **Classification/binary** |  |  |  |  |  |  |
| BOOLQ | 61.28 | 77.43 | 77.55 | 80.00 | 78.47 | **81.77** |
| MC_TACO | 76.25 | 88.39 | 86.02 | **88.13** | 86.81 | 87.34 |
| *Avg. of binary* | 68.77 | 82.91 | 81.79 | 84.07 | 82.64 | **84.56** |
| **Classification/other** |  |  |  |  |  |  |
| ADE_CORPUS_V2-CLASSIFICATION | 41.76 | 94.42 | 93.25 | **94.47** | 93.91 | 94.27 |
| DISCOVERY | 0.18 | 18.83 | 16.67 | 18.98 | 18.41 | **25.88** |
| GLUE-COLA | 0.00 | 55.60 | 50.95 | 49.40 | 44.66 | **51.53** |
| SMS_SPAM | 95.80 | 97.46 | 97.14 | 97.14 | **97.46** | 97.11 |
| SUPERGLUE-WIC | 50.16 | 68.34 | 64.89 | 68.65 | 70.53 | **71.79** |
| WIKI_QA | 48.78 | 73.97 | 64.10 | 72.15 | 70.75 | **74.41** |
| CIRCA | 13.51 | 77.39 | 80.16 | 82.38 | 82.93 | **84.69** |
| ONESTOP_ENGLISH | 22.53 | 98.23 | **100.00** | **100.00** | **100.00** | **100.00** |
| TREC | 90.80 | 91.51 | 91.38 | 93.38 | 93.36 | **94.81** |
| TREC-FINEGRAINED | 80.63 | 88.18 | 90.04 | **91.44** | 90.00 | 91.27 |
| *Avg. of other classification* | 44.42 | 76.39 | 74.86 | 76.80 | 76.20 | **78.58** |
| **Question answering/closed-book question answering** |  |  |  |  |  |  |
| FREEBASE_QA | 1.90 | 6.71 | 2.63 | 3.75 | 5.86 | **23.52** |
| LAMA-CONCEPTNET | 15.25 | 26.12 | 22.63 | 34.96 | 43.62 | **70.28** |
| LAMA-GOOGLE_RE | 11.78 | 14.08 | 12.60 | 18.82 | 23.73 | **24.88** |
| LAMA-SQUAD | 3.23 | 16.13 | **12.90** | 9.68 | 3.23 | 9.68 |
| LAMA-TREX | 59.13 | 63.68 | 63.91 | 66.21 | 67.23 | **69.12** |
| NUMER_SENSE | 50.53 | 56.75 | 53.30 | 56.27 | 53.97 | **57.32** |
| SEARCH_QA | 7.14 | 19.17 | 8.70 | 10.17 | 9.72 | **19.26** |
| WEB_QUESTIONS | 11.90 | 19.58 | 15.87 | 18.78 | 20.63 | **25.40** |
| HOTPOT_QA | 65.95 | 76.41 | 73.76 | 76.13 | 74.65 | **78.45** |
| *Avg. of closed-book QA* | 25.20 | 33.18 | 29.59 | 32.75 | 33.63 | **41.99** |
| **Question answering/multiple-choice question answering** |  |  |  |  |  |  |
| COSMOS_QA | 7.30 | 10.98 | 9.91 | 10.78 | 10.85 | **11.32** |
| DREAM | 49.19 | 71.83 | 58.70 | 61.00 | 59.53 | **62.42** |
| HELLASWAG | 23.82 | 70.28 | 24.76 | 32.82 | 27.60 | **41.90** |
| OPENBOOKQA | 44.80 | 54.40 | 50.20 | 52.20 | 53.80 | **57.00** |
| QASC | 19.22 | 47.73 | 33.26 | 37.80 | 33.05 | **43.63** |
| QUAREL | 54.89 | 54.71 | 57.25 | 59.78 | 57.61 | **62.50** |
| QUARTZ-NO_KNOWLEDGE | 65.43 | 68.88 | 68.49 | 67.09 | 66.96 | **69.39** |
| QUARTZ-WITH_KNOWLEDGE | 64.03 | 85.97 | 71.56 | 74.23 | 73.72 | **76.28** |
| RACE-HIGH | 34.51 | 60.09 | 42.82 | 59.52 | 58.92 | **65.95** |
| RACE-MIDDLE | 47.21 | 74.65 | 62.67 | 68.31 | 65.46 | **70.61** |
| SUPERGLUE-COPA | 53.60 | 56.00 | 58.40 | 56.40 | **60.40** | 59.20 |
| WINO_GRANDE | 48.42 | 58.20 | 50.79 | 61.20 | 50.47 | **67.19** |
| COMMONSENSE_QA | 58.43 | 76.76 | 58.43 | **62.52** | 60.72 | 61.21 |
| SCIQ | 96.95 | 98.53 | 98.08 | **98.42** | 98.19 | 98.30 |
| WIQA | 36.10 | 65.27 | 63.67 | 77.99 | 64.44 | **79.82** |
| *Avg. of multiple-choice QA* | 46.93 | 63.62 | 53.93 | 58.67 | 56.11 | **61.78** |
| **Question answering/long-form question answering** |  |  |  |  |  |  |
| ELI5-ASKH | 11.26 | 11.70 | 12.64 | 11.99 | 11.45 | **13.00** |
| ELI5-ASKS | 14.79 | 15.54 | 15.09 | 15.25 | 15.01 | **15.28** |
| ELI5-ELI5 | 14.19 | 15.38 | **15.23** | 14.59 | 14.43 | 14.75 |
| *Avg. of long-form QA* | 13.41 | 14.21 | 14.32 | 13.94 | 13.63 | **14.34** |
| **Question answering/machine reading comprehension** |  |  |  |  |  |  |
| SUPERGLUE-RECORD | 44.67 | 73.82 | 61.62 | 64.66 | 62.08 | **67.20** |
| MULTI_NEWS | 18.09 | 19.23 | 18.81 | 19.44 | 19.10 | **19.80** |
| ADVERSARIAL_QA | 34.10 | 54.60 | 43.17 | 46.40 | 45.35 | **48.56** |
| *Avg. of reading comprehension* | 32.29 | 49.22 | 41.20 | 43.50 | 42.18 | **45.19** |
| **Conditional generation/summarization** |  |  |  |  |  |  |
| SAMSUM | 39.35 | 45.12 | 43.38 | 45.00 | 44.68 | **45.73** |
| XSUM | 21.35 | 26.56 | 23.84 | 25.87 | 26.07 | **29.90** |
| *Avg. of summarization* | 30.35 | 35.84 | 33.61 | 35.44 | 35.38 | **37.82** |
| **Conditional generation/other** |  |  |  |  |  |  |
| SPIDER | 3.29 | 6.38 | 7.74 | **9.67** | 8.70 | 6.77 |
| WIKI_BIO | 42.39 | 44.03 | 44.84 | 45.36 | 46.19 | **47.09** |
| WIKI_SPLIT | 79.80 | 80.10 | 79.91 | 80.09 | 80.05 | **80.34** |
| *Avg. of other generation* | 41.83 | 43.50 | 44.16 | **45.04** | 44.98 | 44.73 |
| **Other/linguistic phenomenon** |  |  |  |  |  |  |
| BLIMP-ANAPHOR_GENDER_AGREEMENT | **100.00** | 100.00 | **100.00** | **100.00** | **100.00** | 99.00 |
| BLIMP-ELLIPSIS_N_BAR_1 | 49.00 | 100.00 | **100.00** | **100.00** | **100.00** | **100.00** |
| BLIMP-SENTENTIAL_NEGATION_NPI_SCOPE | 54.00 | 100.00 | **100.00** | **100.00** | **100.00** | **100.00** |
| BLIMP-ANAPHOR_NUMBER_AGREEMENT | 49.00 | 100.00 | **100.00** | **100.00** | **100.00** | **100.00** |
| BLIMP-DETERMINER_NOUN_AGREEMENT_WITH_ADJ_IRREGULAR_1 | 46.00 | 100.00 | **100.00** | **100.00** | **100.00** | **100.00** |
| BLIMP-EXISTENTIAL_THERE_QUANTIFIERS_1 | 53.00 | 100.00 | **100.00** | **100.00** | **100.00** | **100.00** |
| BLIMP-IRREGULAR_PAST_PARTICIPLE_ADJECTIVES | **100.00** | 100.00 | **100.00** | **100.00** | **100.00** | **100.00** |
| BLIMP-WH_QUESTIONS_OBJECT_GAP | 55.00 | 100.00 | **100.00** | **100.00** | **100.00** | **100.00** |
| *Avg. of linguistic phenomenon* | 63.25 | 100.00 | **100.00** | **100.00** | **100.00** | 99.88 |
| **Other/generate explanation** |  |  |  |  |  |  |
| COS_E | 12.41 | 14.82 | 13.90 | 14.05 | **14.31** | 13.46 |
| **Other/slot filling** |  |  |  |  |  |  |
| ADE_CORPUS_V2-DOSAGE | 78.57 | 89.29 | 82.14 | **85.71** | 82.14 | 82.14 |
| ADE_CORPUS_V2-EFFECT | 59.15 | 61.35 | **63.25** | 62.52 | 60.91 | 62.66 |
| *Avg. of slot filling* | 68.86 | 75.32 | 72.70 | **74.12** | 71.53 | 72.40 |
| **Other/other** |  |  |  |  |  |  |
| ACRONYM_IDENTIFICATION | 93.35 | 96.68 | **96.12** | **96.12** | 95.57 | **96.12** |
| ASLG_PC12 | 15.78 | 44.07 | 47.71 | 73.72 | 80.65 | **92.92** |
| CRAWL_DOMAIN | 68.16 | 76.91 | 73.04 | 73.00 | 72.76 | **75.12** |
| PROTO_QA | 21.16 | 37.66 | 24.57 | 27.87 | 26.17 | **34.47** |
| *Avg. of other tasks* | 49.61 | 63.83 | 60.36 | 67.68 | 68.79 | **74.66** |
| *Avg. of all tasks* | 49.80 | 67.18 | 65.08 | 67.31 | 66.80 | **69.27** |

  1. We experiment with all methods on T5-BASE, with the best performance among them highlighted in bold, and also report the performance of PT on T5-LARGE.
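
The first row of the table reports each method's ratio of tunable parameters, that is, the fraction of model weights that receive gradient updates. As a concrete illustration of how such a ratio can be measured, the sketch below wraps T5-BASE with LoRA (the LR column) using the Hugging Face `peft` library and counts trainable weights. This is a minimal sketch, not the paper's implementation: the rank, scaling factor and target modules are illustrative assumptions, so the printed ratio will not exactly reproduce the 0.38% reported in the table.

```python
# Minimal sketch (assumed hyperparameters, not the paper's configuration):
# inject LoRA adapters into T5-BASE and report the tunable-parameter ratio.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Add low-rank adapters to the query/value projections of T5's attention;
# r=8 and lora_alpha=16 are illustrative choices, not the paper's values.
config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=16,
                    target_modules=["q", "v"])
model = get_peft_model(base, config)  # freezes base weights, trains adapters

# Ratio of tunable parameters = trainable / total, as in the table's first row.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"tunable: {trainable:,} / {total:,} = {100 * trainable / total:.2f}%")
```

The same counting loop applies to any of the compared methods: prompt tuning trains only the inserted soft-prompt embeddings (hence the tiny 0.03%/0.01% ratios), while full fine-tuning updates every weight (100%).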