Table 1 Data specification of the domain-specific datasets \(D_d\) and down-stream datasets \(D_t\) across multiple domains.
From: Subset selection for domain adaptive pre-training of language model
| Purpose | Domain | Dataset | Task | Subset ratio | Train | Val | Test |
|---|---|---|---|---|---|---|---|
| pre-training (\(D_d\)) | CS, BIOMED | S2ORC [31] | – | 10% | 10,610,430 | – | – |
| | News | CCNEWS [32] | – | 27% | 708,241 | – | – |
| | Personality | Pandora [33] | – | 10% | 17,640,062 | – | – |
| down-stream (\(D_t\)) | CS | ACL-ARC [34] | Citation intent classification | – | 1688 | 114 | 139 |
| | BIOMED | RCT [35] | Abstract sentence roles classification | – | 180,040 | 30,212 | 30,135 |
| | News | AGNEWS [36] | Topic classification | – | 115,000 | 5000 | 7600 |
| | News | HYPERPARTISAN [37] | Partisanship classification | – | 514 | 63 | 65 |
| | Personality | First impressions V2 [38] | OCEAN factor regression | – | 6000 | 2000 | 2000 |
| | Review | HELPFULNESS [39] | Helpfulness classification | – | 115,251 | 5000 | 25,000 |
| | Review | IMDB [40] | Sentiment classification | – | 20,000 | 5000 | 25,000 |
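
For illustration only, the sketch below shows how a "Subset ratio" from Table 1 might be applied to a domain corpus \(D_d\) before domain-adaptive pre-training. The dataset names, the toy corpus, and the uniform random sampling are placeholders; the paper's actual subset-selection criterion is not reproduced here.

```python
import random

# Subset ratios for the pre-training corpora D_d, copied from Table 1.
SUBSET_RATIOS = {
    "S2ORC": 0.10,    # CS, BIOMED
    "CCNEWS": 0.27,   # News
    "Pandora": 0.10,  # Personality
}

def select_subset(documents, ratio, seed=42):
    """Return a `ratio` fraction of `documents`.

    Uniform random sampling stands in for the paper's selection strategy,
    purely to show how the ratio fixes the subset size.
    """
    rng = random.Random(seed)
    k = int(len(documents) * ratio)
    return rng.sample(documents, k)

# Toy example: a corpus of 1,000 documents reduced to 270 at a 27% ratio.
corpus = [f"doc-{i}" for i in range(1000)]
subset = select_subset(corpus, SUBSET_RATIOS["CCNEWS"])
print(len(subset))  # 270
```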