Table 5. The overall instruction-tuning dataset

Task	Type	Source	Size	Copyright
General	Conversation	Alpaca²⁹	20,000	CC BY-NC 4.0
		Dolly³⁰		CC BY-SA 3.0
		ShareGPT³¹		Apache-2.0
Biomedical	Conversation	HealthCareMagic¹⁰	20,000	Reserved by HealthCareMagic and Icliniq
	Conversation	Icliniq¹⁰	20,000	Reserved by HealthCareMagic and Icliniq
	Instructions	MedInstruct¹¹	52,000	CC BY-NC 4.0
	Question Answering	Medical Flash Cards³	34,000	Non-commercial use only
		MEDIQA³²	2,220	CC BY 4.0
		MedicationQA³³	690	CC BY 4.0
		LiveQA³⁴	634	CC BY 4.0
		WikiDocPatient³	5,490	CC BY-SA 4.0
		GuidelineQA	2,000	Common Crawl (other)
	Summarization	PubMed Central	10,000	CC BY
	Next Sentence Generation	PubMed Central	20,000	CC BY
	Keyword Prediction	PubMed Central	10,000	CC BY
	Causal Relation Detection	PubMed³⁵	2,450	CC BY
	Relation Extraction	UMLS knowledge graph²	10,000	OpenRAIL
Clinical	QA, summarization, classification, mortality prediction	MIMIC-III²⁰, MIMIC-IV²¹	30,000	PhysioNet credentialed health data use agreement 1.5.0
