Table 5 The overall instruction tuning dataset
From: Medical foundation large language models for comprehensive text analysis and beyond
| Task | Type | Source | Size | Copyright |
|---|---|---|---|---|
| General | Conversation | Alpaca [29] | 20,000 | CC-BY-NC 4.0 |
| | | Dolly [30] | | CC-BY-SA-3.0 |
| | | ShareGPT [31] | | Apache-2.0 |
| Biomedical | Conversation | HealthCareMagic [10] | 20,000 | Reserved by HealthCareMagic and Icliniq |
| | | Icliniq [10] | | |
| | Instructions | MedInstruct [11] | 52,000 | CC BY-NC 4.0 |
| | Question Answering | Medical Flash Cards [3] | 34,000 | No commercialized use |
| | | MEDIQA [32] | 2,220 | CC BY 4.0 |
| | | MedicationQA [33] | 690 | CC BY 4.0 |
| | | LiveQA [34] | 634 | CC BY 4.0 |
| | | WikiDocPatient [3] | 5,490 | CC BY-SA 4.0 |
| | | GuidelineQA | 2,000 | Common Crawl (other) |
| | Summarization | PubMed Central | 10,000 | CC BY |
| | Next Sentence Generation | PubMed Central | 20,000 | CC BY |
| | Key words prediction | PubMed Central | 10,000 | CC BY |
| | Causal Relation Detection | PubMed [35] | 2,450 | CC BY |
| | Relation Extraction | UMLS knowledge graph [2] | 10,000 | Openrail |
| Clinical | QA, summarization, classification, mortality prediction | | 30,000 | PhysioNet credentialed health data use agreement 1.5.0 |