Table 2 The zero-shot performance of various open-source LLMs with chat capability
From: Medical foundation large language models for comprehensive text analysis and beyond
| Task | Dataset | Metric | LLaMA2-13B-chat | PMC-LLaMA-chat | Medalpaca-13B | AlpaCare-13B | Me-LLaMA 13B-chat | LLaMA2-70B-chat | Me-LLaMA 70B-chat |
|---|---|---|---|---|---|---|---|---|---|
| Question answering | PubMedQA | Accuracy | 0.546 | 0.504 | 0.238 | 0.538 | 0.700 | 0.668 | 0.768 |
| | | Macro-F1 | 0.457 | 0.305 | 0.192 | 0.373 | 0.504 | 0.477 | 0.557 |
| | MedQA | Accuracy | 0.097 | 0.207 | 0.143 | 0.304 | 0.427 | 0.376 | 0.523 |
| | | Macro-F1 | 0.148 | 0.158 | 0.102 | 0.281 | 0.422 | 0.367 | 0.521 |
| | MedMCQA | Accuracy | 0.321 | 0.212 | 0.205 | 0.385 | 0.449 | 0.339 | 0.539 |
| | | Macro-F1 | 0.243 | 0.216 | 0.164 | 0.358 | 0.440 | 0.273 | 0.538 |
| | EmrQA | Accuracy | 0.001 | 0.053 | 0.000 | 0.001 | 0.048 | 0.050 | 0.119 |
| | | F1 | 0.098 | 0.304 | 0.040 | 0.198 | 0.307 | 0.251 | 0.346 |
| Named entity recognition | i2b2 | Macro-F1 | 0.143 | 0.091 | 0.000 | 0.173 | 0.166 | 0.321 | 0.329 |
| Relation extraction | DDI | Macro-F1 | 0.090 | 0.147 | 0.058 | 0.110 | 0.214 | 0.087 | 0.283 |
| Classification | HoC | Macro-F1 | 0.228 | 0.184 | 0.246 | 0.267 | 0.335 | 0.309 | 0.544 |
| | MTsample | Macro-F1 | 0.133 | 0.083 | 0.003 | 0.273 | 0.229 | 0.254 | 0.384 |
| Summarization | PubMed | Rouge-L | 0.161 | 0.028 | 0.014 | 0.167 | 0.116 | 0.192 | 0.169 |
| | | BERTScore | 0.671 | 0.128 | 0.117 | 0.671 | 0.445 | 0.684 | 0.678 |
| | MIMIC-CXR | Rouge-L | 0.144 | 0.139 | 0.010 | 0.134 | 0.400 | 0.131 | 0.418 |
| | | BERTScore | 0.704 | 0.694 | 0.502 | 0.702 | 0.797 | 0.696 | 0.787 |
| Natural language inference | BioNLI | Macro-F1 | 0.173 | 0.159 | 0.164 | 0.170 | 0.195 | 0.297 | 0.436 |
| | MedNLI | Macro-F1 | 0.412 | 0.175 | 0.175 | 0.275 | 0.472 | 0.515 | 0.675 |
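For reference, the sketch below shows one way the classification-style scores reported above (Accuracy and Macro-F1) can be computed from model outputs. The label lists and the use of scikit-learn are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
# Minimal sketch: computing Accuracy and Macro-F1 from zero-shot predictions.
# The gold/pred lists are hypothetical PubMedQA-style answers ("yes"/"no"/"maybe").
from sklearn.metrics import accuracy_score, f1_score

gold = ["yes", "no", "maybe", "yes", "no"]
pred = ["yes", "no", "no", "yes", "maybe"]

accuracy = accuracy_score(gold, pred)             # fraction of exact label matches
macro_f1 = f1_score(gold, pred, average="macro")  # unweighted mean of per-class F1

print(f"Accuracy: {accuracy:.3f}, Macro-F1: {macro_f1:.3f}")
```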