Table 5. The overall instruction tuning dataset

From: Medical foundation large language models for comprehensive text analysis and beyond

| Task | Type | Source | Size | Copyright |
| --- | --- | --- | --- | --- |
| General | Conversation | Alpaca [29] | 20,000 | CC BY-NC 4.0 |
| | | Dolly [30] | | CC BY-SA 3.0 |
| | | ShareGPT [31] | | Apache-2.0 |
| Biomedical | Conversation | HealthCareMagic [10] | 20,000 | Reserved by HealthCareMagic and Icliniq |
| | | Icliniq [10] | | |
| | Instructions | MedInstruct [11] | 52,000 | CC BY-NC 4.0 |
| | Question Answering | Medical Flash Cards [3] | 34,000 | No commercial use |
| | | MEDIQA [32] | 2,220 | CC BY 4.0 |
| | | MedicationQA [33] | 690 | CC BY 4.0 |
| | | LiveQA [34] | 634 | CC BY 4.0 |
| | | WikiDocPatient [3] | 5,490 | CC BY-SA 4.0 |
| | | GuidelineQA | 2,000 | Common Crawl (other) |
| | Summarization | PubMed Central | 10,000 | CC BY |
| | Next Sentence Generation | PubMed Central | 20,000 | CC BY |
| | Keyword Prediction | PubMed Central | 10,000 | CC BY |
| | Causal Relation Detection | PubMed [35] | 2,450 | CC BY |
| | Relation Extraction | UMLS knowledge graph [2] | 10,000 | OpenRAIL |
| Clinical | QA, summarization, classification, mortality prediction | MIMIC-III [20], MIMIC-IV [21] | 30,000 | PhysioNet credentialed health data use agreement 1.5.0 |

Blank cells continue the value from the row above (merged cells in the original table).
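
As a rough aid for reading the table, the sizes can be tallied per task category. The snippet below is a minimal sketch, not part of the original article: the `ROWS` list and the `size_by_task` helper are hypothetical names introduced here. It assumes that each size listed once for several sources (Alpaca, Dolly and ShareGPT; HealthCareMagic and Icliniq; MIMIC-III and MIMIC-IV) is a combined figure for those sources. The totals are simply the sum of the table rows and may differ from any overall sample count reported elsewhere in the article.

```python
from collections import defaultdict

# (task, type, sources, size) transcribed from Table 5.
# Assumption: a size shown once for several sources is their combined size.
ROWS = [
    ("General", "Conversation", "Alpaca, Dolly, ShareGPT", 20_000),
    ("Biomedical", "Conversation", "HealthCareMagic, Icliniq", 20_000),
    ("Biomedical", "Instructions", "MedInstruct", 52_000),
    ("Biomedical", "Question Answering", "Medical Flash Cards", 34_000),
    ("Biomedical", "Question Answering", "MEDIQA", 2_220),
    ("Biomedical", "Question Answering", "MedicationQA", 690),
    ("Biomedical", "Question Answering", "LiveQA", 634),
    ("Biomedical", "Question Answering", "WikiDocPatient", 5_490),
    ("Biomedical", "Question Answering", "GuidelineQA", 2_000),
    ("Biomedical", "Summarization", "PubMed Central", 10_000),
    ("Biomedical", "Next Sentence Generation", "PubMed Central", 20_000),
    ("Biomedical", "Keyword Prediction", "PubMed Central", 10_000),
    ("Biomedical", "Causal Relation Detection", "PubMed", 2_450),
    ("Biomedical", "Relation Extraction", "UMLS knowledge graph", 10_000),
    ("Clinical", "QA, summarization, classification, mortality prediction",
     "MIMIC-III, MIMIC-IV", 30_000),
]


def size_by_task(rows):
    """Sum the listed sizes for each task category (hypothetical helper)."""
    totals = defaultdict(int)
    for task, _type, _sources, size in rows:
        totals[task] += size
    return dict(totals)


totals = size_by_task(ROWS)
grand_total = sum(totals.values())

for task, n in totals.items():
    print(f"{task}: {n:,}")
print(f"Sum of table rows: {grand_total:,}")
```

Under these assumptions the biomedical rows account for most of the listed samples, with the general and clinical rows contributing 20,000 and 30,000 respectively.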