Table 5 Evaluation datasets, dataset sizes, and evaluation metrics

| Task | Dataset | Training | Validation | Testing | Primary metrics | Secondary metrics |
|---|---|---|---|---|---|---|
| Named entity recognition | BC5CDR-chemical [59] | 4560 | 4581 | 4797 | Entity-level F1 [59, 89] | - |
| Named entity recognition | NCBI-disease [60] | 5424 | 923 | 940 | Entity-level F1 [16, 60] | - |
| Relation extraction | ChemProt [55] | 19,460 | 11,820 | 16,943 | Macro F1 [90] | Micro F1 [55, 90] |
| Relation extraction | DDI2013 [62] | 18,779 | 7244 | 5761 | Macro F1 [62, 85] | Micro F1 [16] |
| Multi-label document classification | HoC [64] | 1108 | 157 | 315 | Macro F1 [64, 86] | Micro F1 [86] |
| Multi-label document classification | LitCovid [56] | 24,960 | 6239 | 2500 | Macro F1 [56] | Micro F1 [56] |
| Question answering | MedQA 5-option [66] | 10,178 | 1272 | 1273 | Accuracy [66] | Macro F1 [91] |
| Question answering | PubMedQA [67] | 190,142 | 21,127 | 500 | Accuracy [67] | Macro F1 [91] |
| Text summarization | PubMed Text Summarization (a) [68] | 117,108 | 6631 | 6658 | Rouge-L [68] | BERTScore [92], BARTScore [93] |
| Text summarization | MS^2 (b) [50] | 14,188 | 2021 | - | Rouge-L [50] | BERTScore [94], BARTScore [28] |
| Text simplification | Cochrane PLS [69] | 3568 | 411 | 480 | Rouge-L [69] | FKGL [95], DCRS [96] |
| Text simplification | PLOS Text Simplification [70] | 26,124 | 1000 | 1000 | Rouge-L [70] | FKGL [70], DCRS [70] |

The bracketed numbers cite the related studies that use the corresponding datasets and metrics; illustrative sketches of the main metrics and of the footnote (a) filtering step follow below. (a) We filtered out noisy instances with fewer than 50 words from the training and validation sets and kept the testing set untouched. (b) The gold standard of the MS^2 testing set is not publicly available; we used the validation set instead.
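The primary and secondary metrics for the relation-extraction and document-classification rows differ only in how per-class scores are aggregated. The sketch below contrasts the two, using made-up labels and scikit-learn; the tooling and class names are assumptions for illustration, not necessarily what the paper used.

```python
from sklearn.metrics import f1_score

# Toy gold and predicted relation labels; the class names are hypothetical.
y_true = ["CPR:3", "CPR:4", "CPR:4", "CPR:4", "CPR:9"]
y_pred = ["CPR:3", "CPR:4", "CPR:4", "CPR:9", "CPR:3"]

# Macro F1 averages per-class F1 scores, so rare classes count equally.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Micro F1 pools all decisions, so frequent classes dominate the score.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

Entity-level F1 for the two named entity recognition datasets is the same statistic computed over exact predicted entity spans rather than class labels.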
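For the summarization and simplification rows, Rouge-L measures overlap with a reference via the longest common subsequence, while FKGL and DCRS are reference-free readability estimates (lower values mean simpler text). A minimal sketch, assuming Google's rouge-score package and textstat rather than the paper's exact implementations:

```python
from rouge_score import rouge_scorer  # pip install rouge-score
import textstat                       # pip install textstat

reference = "Beta-blockers reduce mortality in patients with heart failure."
candidate = "Beta-blockers lower the risk of death in heart failure patients."

# Rouge-L: F-measure over the longest common subsequence of the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# FKGL (Flesch-Kincaid grade level) and DCRS (Dale-Chall readability score)
# are computed on the generated text alone.
fkgl = textstat.flesch_kincaid_grade(candidate)
dcrs = textstat.dale_chall_readability_score(candidate)

print(f"Rouge-L = {rouge_l:.3f}, FKGL = {fkgl:.1f}, DCRS = {dcrs:.1f}")
```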
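Footnote (a) describes a simple word-count filter applied to the training and validation splits. A minimal sketch, assuming a list-of-dicts dataset with a hypothetical "text" field (not the authors' actual preprocessing code):

```python
def is_clean(instance: dict, min_words: int = 50) -> bool:
    # Keep only instances whose source text has at least min_words words.
    return len(instance["text"].split()) >= min_words

train = [
    {"text": "too short to be useful"},   # noisy instance, dropped
    {"text": " ".join(["token"] * 120)},  # long enough, kept
]
train = [ex for ex in train if is_clean(ex)]

# Per footnote (a), only the training and validation sets are filtered;
# the testing set is left untouched.
```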