Table 1 Overall (test) performance on over 100 NLP tasks, comparing PT (prompt tuning), PF (prefix-tuning), LR (LoRA), AP (adapter) and FT (fine-tuning)
From: Parameter-efficient fine-tuning of large-scale pre-trained language models
Task | PT (BASE) | PT (LARGE) | PF | LR | AP | FT |
---|---|---|---|---|---|---|
Ratio of tunable parameters | 0.03% | 0.01% | 7.93% | 0.38% | 2.38% | 100% |
Classification/sentiment analysis | ||||||
GLUE-SST2 | 92.20 | 94.95 | 92.66 | 94.04 | 93.35 | 94.27 |
ROTTEN_TOMATOES | 88.36 | 91.84 | 89.96 | 89.30 | 89.20 | 89.77 |
FINANCIAL_PHRASEBANK | 97.18 | 98.36 | 98.36 | 97.94 | 97.95 | 98.36 |
POEM_SENTIMENT | 54.18 | 70.31 | 85.38 | 86.80 | 82.52 | 83.26 |
YELP_POLARITY | 95.47 | 98.18 | 97.78 | 97.37 | 97.30 | 97.92 |
AVG. OF SENTIMENT ANALYSIS | 85.48 | 90.73 | 92.83 | 93.09 | 92.06 | 92.72 |
Classification/emotion | ||||||
EMO | 69.91 | 71.47 | 73.31 | 76.13 | 74.88 | 75.69 |
EMOTION | 89.19 | 88.73 | 88.29 | 88.63 | 88.98 | 89.25 |
TWEET_EVAL-HATE | 53.00 | 42.23 | 44.67 | 48.16 | 47.88 | 51.33 |
TWEET_EVAL-IRONY | 58.02 | 69.73 | 76.00 | 76.75 | 73.88 | 77.43 |
TWEET_EVAL-OFFENSIVE | 75.94 | 78.87 | 80.94 | 80.97 | 80.59 | 82.05 |
TWEET_EVAL-SENTIMENT | 28.90 | 72.79 | 71.78 | 71.31 | 71.90 | 71.98 |
TWEET_EVAL-STANCE_ABORTION | 32.59 | 61.42 | 61.47 | 63.20 | 62.61 | 61.72 |
TWEET_EVAL-STANCE_ATHEISM | 56.28 | 67.58 | 71.54 | 71.77 | 71.27 | 74.41 |
TWEET_EVAL-STANCE_CLIMATE | 47.61 | 52.43 | 52.86 | 55.92 | 59.06 | 57.38 |
TWEET_EVAL-STANCE_FEMINIST | 29.65 | 51.63 | 56.27 | 57.41 | 58.57 | 58.51 |
TWEET_EVAL-STANCE_HILLARY | 41.34 | 63.18 | 62.15 | 65.40 | 61.74 | 66.41 |
AVG. OF EMOTION | 52.95 | 65.46 | 67.21 | 68.70 | 68.31 | 69.65 |
Classification/hate-speech detection | ||||||
ETHOS-DISABILITY | 46.99 | 100.00 | 93.81 | 93.81 | 100.00 | 93.81 |
ETHOS-GENDER | 63.84 | 77.08 | 77.44 | 79.91 | 79.91 | 74.48 |
ETHOS-NATIONAL_ORIGIN | 44.30 | 81.77 | 81.77 | 87.95 | 84.72 | 84.72 |
ETHOS-RACE | 84.36 | 97.06 | 94.54 | 97.21 | 94.27 | 97.21 |
ETHOS-RELIGION | 93.02 | 93.02 | 96.35 | 93.02 | 96.35 | 96.64 |
ETHOS-DIRECTED_VS_GENERALIZED | 76.86 | 86.64 | 94.76 | 92.29 | 94.94 | 94.94 |
HATE_SPEECH_OFFENSIVE | 73.27 | 79.08 | 75.22 | 75.21 | 75.06 | 75.04 |
HATE_SPEECH18 | 75.57 | 74.45 | 79.42 | 79.59 | 80.86 | 80.93 |
HATEXPLAIN | 50.98 | 67.62 | 66.06 | 68.03 | 68.11 | 68.02 |
AVG. OF HATE SPEECH DETECTION | 67.69 | 84.08 | 84.37 | 85.22 | 86.02 | 85.09 |
Classification/natural language inference | ||||||
ANLI | 25.85 | 44.96 | 43.88 | 45.27 | 49.19 | 50.54 |
GLUE-MNLI | 35.43 | 86.12 | 82.21 | 83.74 | 83.90 | 86.39 |
GLUE-QNLI | 52.34 | 93.01 | 87.48 | 92.02 | 91.58 | 92.57 |
GLUE-RTE | 45.32 | 79.14 | 72.66 | 79.14 | 78.42 | 80.58 |
SCITAIL | 91.02 | 95.47 | 93.04 | 93.80 | 94.04 | 94.77 |
SUPERGLUE-RTE | 50.36 | 84.89 | 73.38 | 79.14 | 82.01 | 78.42 |
SICK | 40.10 | 88.82 | 87.91 | 88.69 | 88.88 | 89.15 |
SUPERGLUE-CB | 75.00 | 78.57 | 100.00 | 100.00 | 96.43 | 96.43 |
AVG. OF NATURAL LANGUAGE INFERENCE | 51.93 | 81.37 | 80.07 | 82.73 | 83.06 | 83.61 |
Classification/fact checking | ||||||
CLIMATE_FEVER | 15.47 | 33.42 | 38.03 | 39.35 | 37.48 | 41.57 |
LIAR | 13.23 | 28.87 | 26.46 | 28.67 | 27.08 | 28.20 |
HEALTH_FACT | 39.15 | 45.60 | 50.38 | 52.05 | 51.21 | 54.19 |
TAB_FACT | 46.65 | 50.16 | 52.53 | 56.86 | 53.42 | 57.34 |
AVG. OF FACT CHECKING | 28.63 | 39.51 | 41.85 | 44.23 | 42.30 | 45.36 |
Classification/paraphrase | ||||||
GLUE-QQP | 84.65 | 86.21 | 84.62 | 86.87 | 85.93 | 89.13 |
MEDICAL_QUESTIONS_PAIRS | 46.56 | 91.80 | 85.25 | 88.52 | 90.16 | 87.21 |
PAWS | 49.60 | 91.27 | 92.07 | 93.39 | 92.91 | 93.60 |
GLUE-MRPC | 67.65 | 88.24 | 87.25 | 87.25 | 87.25 | 89.71 |
AVG. OF PARAPHRASE | 62.12 | 89.38 | 87.30 | 89.01 | 89.06 | 89.91 |
Classification/topic | ||||||
AG_NEWS | 91.37 | 93.61 | 93.42 | 94.63 | 94.60 | 95.19 |
Classification/binary | ||||||
BOOLQ | 61.28 | 77.43 | 77.55 | 80.00 | 78.47 | 81.77 |
MC_TACO | 76.25 | 88.39 | 86.02 | 88.13 | 86.81 | 87.34 |
AVG. OF BINARY | 68.77 | 82.91 | 81.79 | 84.07 | 82.64 | 84.56 |
Classification/other | ||||||
ADE_CORPUS_V2-CLASSIFICATION | 41.76 | 94.42 | 93.25 | 94.47 | 93.91 | 94.27 |
DISCOVERY | 0.18 | 18.83 | 16.67 | 18.98 | 18.41 | 25.88 |
GLUE-COLA | 0.00 | 55.60 | 50.95 | 49.40 | 44.66 | 51.53 |
SMS_SPAM | 95.80 | 97.46 | 97.14 | 97.14 | 97.46 | 97.11 |
SUPERGLUE-WIC | 50.16 | 68.34 | 64.89 | 68.65 | 70.53 | 71.79 |
WIKI_QA | 48.78 | 73.97 | 64.10 | 72.15 | 70.75 | 74.41 |
CIRCA | 13.51 | 77.39 | 80.16 | 82.38 | 82.93 | 84.69 |
ONESTOP_ENGLISH | 22.53 | 98.23 | 100.00 | 100.00 | 100.00 | 100.00 |
TREC | 90.80 | 91.51 | 91.38 | 93.38 | 93.36 | 94.81 |
TREC-FINEGRAINED | 80.63 | 88.18 | 90.04 | 91.44 | 90.00 | 91.27 |
AVG. OF OTHER CLASSIFICATION | 44.42 | 76.39 | 74.86 | 76.80 | 76.20 | 78.58 |
Question answering/closed-book question answering | ||||||
FREEBASE_QA | 1.90 | 6.71 | 2.63 | 3.75 | 5.86 | 23.52 |
LAMA-CONCEPTNET | 15.25 | 26.12 | 22.63 | 34.96 | 43.62 | 70.28 |
LAMA-GOOGLE_RE | 11.78 | 14.08 | 12.60 | 18.82 | 23.73 | 24.88 |
LAMA-SQUAD | 3.23 | 16.13 | 12.90 | 9.68 | 3.23 | 9.68 |
LAMA-TREX | 59.13 | 63.68 | 63.91 | 66.21 | 67.23 | 69.12 |
NUMER_SENSE | 50.53 | 56.75 | 53.30 | 56.27 | 53.97 | 57.32 |
SEARCH_QA | 7.14 | 19.17 | 8.70 | 10.17 | 9.72 | 19.26 |
WEB_QUESTIONS | 11.90 | 19.58 | 15.87 | 18.78 | 20.63 | 25.40 |
HOTPOT_QA | 65.95 | 76.41 | 73.76 | 76.13 | 74.65 | 78.45 |
AVG. OF CLOSED-BOOK QA | 25.20 | 33.18 | 29.59 | 32.75 | 33.63 | 41.99 |
Question answering/multiple-choice question answering | ||||||
COSMOS_QA | 7.30 | 10.98 | 9.91 | 10.78 | 10.85 | 11.32 |
DREAM | 49.19 | 71.83 | 58.70 | 61.00 | 59.53 | 62.42 |
HELLASWAG | 23.82 | 70.28 | 24.76 | 32.82 | 27.60 | 41.90 |
OPENBOOKQA | 44.80 | 54.40 | 50.20 | 52.20 | 53.80 | 57.00 |
QASC | 19.22 | 47.73 | 33.26 | 37.80 | 33.05 | 43.63 |
QUAREL | 54.89 | 54.71 | 57.25 | 59.78 | 57.61 | 62.50 |
QUARTZ-NO_KNOWLEDGE | 65.43 | 68.88 | 68.49 | 67.09 | 66.96 | 69.39 |
QUARTZ-WITH_KNOWLEDGE | 64.03 | 85.97 | 71.56 | 74.23 | 73.72 | 76.28 |
RACE-HIGH | 34.51 | 60.09 | 42.82 | 59.52 | 58.92 | 65.95 |
RACE-MIDDLE | 47.21 | 74.65 | 62.67 | 68.31 | 65.46 | 70.61 |
SUPERGLUE-COPA | 53.60 | 56.00 | 58.40 | 56.40 | 60.40 | 59.20 |
WINO_GRANDE | 48.42 | 58.20 | 50.79 | 61.20 | 50.47 | 67.19 |
COMMONSENSE_QA | 58.43 | 76.76 | 58.43 | 62.52 | 60.72 | 61.21 |
SCIQ | 96.95 | 98.53 | 98.08 | 98.42 | 98.19 | 98.30 |
WIQA | 36.10 | 65.27 | 63.67 | 77.99 | 64.44 | 79.82 |
AVG. OF MULTIPLE-CHOICE QA | 46.93 | 63.62 | 53.93 | 58.67 | 56.11 | 61.78 |
Question answering/long-form question answering | ||||||
ELI5-ASKH | 11.26 | 11.70 | 12.64 | 11.99 | 11.45 | 13.00 |
ELI5-ASKS | 14.79 | 15.54 | 15.09 | 15.25 | 15.01 | 15.28 |
ELI5-ELI5 | 14.19 | 15.38 | 15.23 | 14.59 | 14.43 | 14.75 |
AVG. OF LONG-FORM QA | 13.41 | 14.21 | 14.32 | 13.94 | 13.63 | 14.34 |
Question answering/machine reading comprehension | ||||||
SUPERGLUE-RECORD | 44.67 | 73.82 | 61.62 | 64.66 | 62.08 | 67.20 |
MULTI_NEWS | 18.09 | 19.23 | 18.81 | 19.44 | 19.10 | 19.80 |
ADVERSARIAL_QA | 34.10 | 54.60 | 43.17 | 46.40 | 45.35 | 48.56 |
AVG. OF READING COMPREHENSION | 32.29 | 49.22 | 41.20 | 43.50 | 42.18 | 45.19 |
Conditional generation/summarization | ||||||
SAMSUM | 39.35 | 45.12 | 43.38 | 45.00 | 44.68 | 45.73 |
XSUM | 21.35 | 26.56 | 23.84 | 25.87 | 26.07 | 29.90 |
AVG. OF SUMMARIZATION | 30.35 | 35.84 | 33.61 | 35.44 | 35.38 | 37.82 |
Conditional generation/other | ||||||
SPIDER | 3.29 | 6.38 | 7.74 | 9.67 | 8.70 | 6.77 |
WIKI_BIO | 42.39 | 44.03 | 44.84 | 45.36 | 46.19 | 47.09 |
WIKI_SPLIT | 79.80 | 80.10 | 79.91 | 80.09 | 80.05 | 80.34 |
AVG. OF OTHER GENERATION | 41.83 | 43.50 | 44.16 | 45.04 | 44.98 | 44.73 |
Other/linguistic phenomenon | ||||||
BLIMP-ANAPHOR_GENDER_AGREEMENT | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 99.00 |
BLIMP-ELLIPSIS_N_BAR_1 | 49.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
BLIMP-SENTENTIAL_NEGATION_NPI_SCOPE | 54.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
BLIMP-ANAPHOR_NUMBER_AGREEMENT | 49.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
BLIMP-DETERMINER_NOUN_AGREEMENT_WITH_ADJ_IRREGULAR_1 | 46.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
BLIMP-EXISTENTIAL_THERE_QUANTIFIERS_1 | 53.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
BLIMP-IRREGULAR_PAST_PARTICIPLE_ADJECTIVES | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
BLIMP-WH_QUESTIONS_OBJECT_GAP | 55.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
AVG. OF LINGUISTIC PHENOMENON | 63.25 | 100.00 | 100.00 | 100.00 | 100.00 | 99.88 |
Other/generate explanation | ||||||
COS_E | 12.41 | 14.82 | 13.90 | 14.05 | 14.31 | 13.46 |
Other/slot filling | ||||||
ADE_CORPUS_V2-DOSAGE | 78.57 | 89.29 | 82.14 | 85.71 | 82.14 | 82.14 |
ADE_CORPUS_V2-EFFECT | 59.15 | 61.35 | 63.25 | 62.52 | 60.91 | 62.66 |
AVG. OF SLOT FILLING | 68.86 | 75.32 | 72.70 | 74.12 | 71.53 | 72.40 |
Other/other | ||||||
ACRONYM_IDENTIFICATION | 93.35 | 96.68 | 96.12 | 96.12 | 95.57 | 96.12 |
ASLG_PC12 | 15.78 | 44.07 | 47.71 | 73.72 | 80.65 | 92.92 |
CRAWL_DOMAIN | 68.16 | 76.91 | 73.04 | 73.00 | 72.76 | 75.12 |
PROTO_QA | 21.16 | 37.66 | 24.57 | 27.87 | 26.17 | 34.47 |
AVG. OF OTHER TASKS | 49.61 | 63.83 | 60.36 | 67.68 | 68.79 | 74.66 |
AVG. OF ALL TASKS | 49.80 | 67.18 | 65.08 | 67.31 | 66.80 | 69.27 |
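
The "Ratio of tunable parameters" row and the per-category averages can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the soft-prompt length (100 vectors) and the approximate T5-BASE/T5-LARGE sizes are assumptions used for the PT estimate, not values reported in the table; the average check reuses the FT sentiment-analysis scores exactly as listed above.

```python
# Sanity checks for Table 1.
# Assumptions (not stated in the table): PT trains a soft prompt of 100 vectors;
# T5-BASE has ~220M parameters with hidden size 768; T5-LARGE has ~770M with hidden size 1024.

def pct(tunable: float, total: float) -> float:
    """Tunable parameters as a percentage of all model parameters."""
    return 100.0 * tunable / total

PROMPT_TOKENS = 100  # assumed soft-prompt length
for name, hidden, total in [("PT (BASE)", 768, 220e6), ("PT (LARGE)", 1024, 770e6)]:
    # Prompt tuning trains only one vector per soft token.
    print(f"{name}: {pct(PROMPT_TOKENS * hidden, total):.2f}% tunable")
# -> PT (BASE): 0.03% tunable; PT (LARGE): 0.01% tunable (matching the ratio row)

# Category averages are plain means over the task rows, e.g. sentiment analysis, FT column:
ft_sentiment = [94.27, 89.77, 98.36, 83.26, 97.92]
print(f"AVG. OF SENTIMENT ANALYSIS (FT): {sum(ft_sentiment) / len(ft_sentiment):.2f}")
# -> 92.72, matching the table
```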