Table 5 Evaluation datasets, dataset sizes, and evaluation metrics

| Task | Dataset | Training | Validation | Testing | Primary metrics | Secondary metrics |
|---|---|---|---|---|---|---|
| Named entity recognition | BC5CDR-chemical [59] | 4560 | 4581 | 4797 | Entity-level F1 [59, 89] | - |
| Named entity recognition | NCBI-disease [60] | 5424 | 923 | 940 | Entity-level F1 [16, 60] | - |
| Relation extraction | ChemProt [55] | 19,460 | 11,820 | 16,943 | Macro F1 [90] | Micro F1 [55, 90] |
| Relation extraction | DDI2013 [62] | 18,779 | 7244 | 5761 | Macro F1 [62, 85] | Micro F1 [16] |
| Multi-label document classification | HoC [64] | 1108 | 157 | 315 | Macro F1 [64, 86] | Micro F1 [86] |
| Multi-label document classification | LitCovid [56] | 24,960 | 6239 | 2500 | Macro F1 [56] | Micro F1 [56] |
| Question answering | MedQA 5-option [66] | 10,178 | 1272 | 1273 | Accuracy [66] | Macro F1 [91] |
| Question answering | PubMedQA [67] | 190,142 | 21,127 | 500 | Accuracy [67] | Macro F1 [91] |
| Text summarization | PubMed Text Summarization (a) [68] | 117,108 | 6631 | 6658 | Rouge-L [68] | BERTScore [92], BARTScore [93] |
| Text summarization | MS^2 (b) [50] | 14,188 | 2021 | - | Rouge-L [50] | BERTScore [94], BARTScore [28] |
| Text simplification | Cochrane PLS [69] | 3568 | 411 | 480 | Rouge-L [69] | FKGL [95], DCRS [96] |
| Text simplification | PLOS Text Simplification [70] | 26,124 | 1000 | 1000 | Rouge-L [70] | FKGL [70], DCRS [70] |

The bracketed numbers cite the related studies that use the corresponding datasets and metrics; illustrative sketches of the main metrics and of the footnote (a) filtering step follow below. (a) We filtered out noisy instances with fewer than 50 words from the training and validation sets and kept the testing set untouched. (b) The gold standard of the MS^2 testing set is not publicly available; we used the validation set instead.
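The primary and secondary metrics for the relation-extraction and document-classification rows differ only in how per-class scores are aggregated. The sketch below contrasts the two, using made-up labels and scikit-learn; the tooling and class names are assumptions for illustration, not necessarily what the paper used.

```python
from sklearn.metrics import f1_score

# Toy gold and predicted relation labels; the class names are hypothetical.
y_true = ["CPR:3", "CPR:4", "CPR:4", "CPR:4", "CPR:9"]
y_pred = ["CPR:3", "CPR:4", "CPR:4", "CPR:9", "CPR:3"]

# Macro F1 averages per-class F1 scores, so rare classes count equally.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Micro F1 pools all decisions, so frequent classes dominate the score.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

Entity-level F1 for the two named entity recognition datasets is the same statistic computed over exact predicted entity spans rather than class labels.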
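For the summarization and simplification rows, Rouge-L measures overlap with a reference via the longest common subsequence, while FKGL and DCRS are reference-free readability estimates (lower values mean simpler text). A minimal sketch, assuming Google's rouge-score package and textstat rather than the paper's exact implementations:

```python
from rouge_score import rouge_scorer  # pip install rouge-score
import textstat                       # pip install textstat

reference = "Beta-blockers reduce mortality in patients with heart failure."
candidate = "Beta-blockers lower the risk of death in heart failure patients."

# Rouge-L: F-measure over the longest common subsequence of the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# FKGL (Flesch-Kincaid grade level) and DCRS (Dale-Chall readability score)
# are computed on the generated text alone.
fkgl = textstat.flesch_kincaid_grade(candidate)
dcrs = textstat.dale_chall_readability_score(candidate)

print(f"Rouge-L = {rouge_l:.3f}, FKGL = {fkgl:.1f}, DCRS = {dcrs:.1f}")
```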
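Footnote (a) describes a simple word-count filter applied to the training and validation splits. A minimal sketch, assuming a list-of-dicts dataset with a hypothetical "text" field (not the authors' actual preprocessing code):

```python
def is_clean(instance: dict, min_words: int = 50) -> bool:
    # Keep only instances whose source text has at least min_words words.
    return len(instance["text"].split()) >= min_words

train = [
    {"text": "too short to be useful"},   # noisy instance, dropped
    {"text": " ".join(["token"] * 120)},  # long enough, kept
]
train = [ex for ex in train if is_clean(ex)]

# Per footnote (a), only the training and validation sets are filtered;
# the testing set is left untouched.
```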