Table 1 Data specification of the domain-specific datasets \(D_d\) and downstream datasets \(D_t\) across multiple domains.

From: Subset selection for domain adaptive pre-training of language model

| Purpose | Domain | Dataset | Task | Subset ratio | Size (train) | Size (val) | Size (test) |
|---|---|---|---|---|---|---|---|
| Pre-training (\(D_d\)) | CS, BIOMED | S2ORC [31] | – | 10% | 10,610,430 | – | – |
| Pre-training (\(D_d\)) | News | CCNEWS [32] | – | 27% | 708,241 | – | – |
| Pre-training (\(D_d\)) | Personality | Pandora [33] | – | 10% | 17,640,062 | – | – |
| Downstream (\(D_t\)) | CS | ACL-ARC [34] | Citation intent classification | – | 1688 | 114 | 139 |
| Downstream (\(D_t\)) | BIOMED | RCT [35] | Abstract sentence role classification | – | 180,040 | 30,212 | 30,135 |
| Downstream (\(D_t\)) | News | AGNEWS [36] | Topic classification | – | 115,000 | 5000 | 7600 |
| Downstream (\(D_t\)) | News | HYPERPARTISAN [37] | Partisanship classification | – | 514 | 63 | 65 |
| Downstream (\(D_t\)) | Personality | First Impressions V2 [38] | OCEAN factor regression | – | 6000 | 2000 | 2000 |
| Downstream (\(D_t\)) | Review | HELPFULNESS [39] | Helpfulness classification | – | 115,251 | 5000 | 25,000 |
| Downstream (\(D_t\)) | Review | IMDB [40] | Sentiment classification | – | 20,000 | 5000 | 25,000 |