Table 1 Data specification of the domain-specific datasets \(D_d\) and downstream datasets \(D_t\) across multiple domains.

From: Subset selection for domain adaptive pre-training of language model

| Purpose | Domain | Dataset | Task | Subset ratio | Size (train) | Size (val) | Size (test) |
|---|---|---|---|---|---|---|---|
| Pre-training (\(D_d\)) | CS, BIOMED | S2ORC [31] | – | 10% | 10,610,430 | – | – |
| Pre-training (\(D_d\)) | News | CCNEWS [32] | – | 27% | 708,241 | – | – |
| Pre-training (\(D_d\)) | Personality | Pandora [33] | – | 10% | 17,640,062 | – | – |
| Downstream (\(D_t\)) | CS | ACL-ARC [34] | Citation intent classification | – | 1688 | 114 | 139 |
| Downstream (\(D_t\)) | BIOMED | RCT [35] | Abstract sentence role classification | – | 180,040 | 30,212 | 30,135 |
| Downstream (\(D_t\)) | News | AGNEWS [36] | Topic classification | – | 115,000 | 5000 | 7600 |
| Downstream (\(D_t\)) | News | HYPERPARTISAN [37] | Partisanship classification | – | 514 | 63 | 65 |
| Downstream (\(D_t\)) | Personality | First Impressions V2 [38] | OCEAN factor regression | – | 6000 | 2000 | 2000 |
| Downstream (\(D_t\)) | Review | HELPFULNESS [39] | Helpfulness classification | – | 115,251 | 5000 | 25,000 |
| Downstream (\(D_t\)) | Review | IMDB [40] | Sentiment classification | – | 20,000 | 5000 | 25,000 |