Table 2. Zero-shot performance of open-source LLMs with chat capability

Task	Dataset	Metric	LLaMA2-13B-chat	PMC-LLaMA-chat	Medalpaca-13B	AlpaCare-13B	Me-LLaMA 13B-chat	LLaMA2-70B-chat	Me-LLaMA 70B-chat
Question answering	PubMedQA	Accuracy	0.546	0.504	0.238	0.538	0.700	0.668	0.768
	PubMedQA	Macro-F1	0.457	0.305	0.192	0.373	0.504	0.477	0.557
	MedQA	Accuracy	0.097	0.207	0.143	0.304	0.427	0.376	0.523
	MedQA	Macro-F1	0.148	0.158	0.102	0.281	0.422	0.367	0.521
	MedMCQA	Accuracy	0.321	0.212	0.205	0.385	0.449	0.339	0.539
	MedMCQA	Macro-F1	0.243	0.216	0.164	0.358	0.440	0.273	0.538
	EmrQA	Accuracy	0.001	0.053	0.000	0.001	0.048	0.050	0.119
	EmrQA	F1	0.098	0.304	0.040	0.198	0.307	0.251	0.346
Named entity recognition	i2b2	Macro-F1	0.143	0.091	0.000	0.173	0.166	0.321	0.329
Relation extraction	DDI	Macro-F1	0.090	0.147	0.058	0.110	0.214	0.087	0.283
Classification	HoC	Macro-F1	0.228	0.184	0.246	0.267	0.335	0.309	0.544
Classification	MTsample	Macro-F1	0.133	0.083	0.003	0.273	0.229	0.254	0.384
Summarization	PubMed	Rouge-L	0.161	0.028	0.014	0.167	0.116	0.192	0.169
	PubMed	BERTScore	0.671	0.128	0.117	0.671	0.445	0.684	0.678
	MIMIC-CXR	Rouge-L	0.144	0.139	0.010	0.134	0.400	0.131	0.418
	MIMIC-CXR	BERTScore	0.704	0.694	0.502	0.702	0.797	0.696	0.787
Natural language inference	BioNLI	Macro-F1	0.173	0.159	0.164	0.170	0.195	0.297	0.436
Natural language inference	MedNLI	Macro-F1	0.412	0.175	0.175	0.275	0.472	0.515	0.675
