Table 1 The supervised fine-tuning performance of various open-source LLMs on six tasks
From: Medical foundation large language models for comprehensive text analysis and beyond
| Task | Dataset | Metric | LLaMA2 13B | PMC-LLaMA 13B | Me-LLaMA 13B | LLaMA2 70B | Meditron 70B | Me-LLaMA 70B |
|---|---|---|---|---|---|---|---|---|
| Question answering | PubMedQA | Acc | 0.800 | 0.778 | 0.802 | 0.800 | 0.800 | 0.814 |
| | | Macro-F1 | 0.560 | 0.544 | 0.562 | 0.560 | – | 0.572 |
| | MedQA | Acc | 0.467 | 0.456 | 0.493 | 0.598 | 0.607 | 0.623 |
| | | Macro-F1 | 0.465 | 0.454 | 0.487 | 0.595 | – | 0.621 |
| | MedMCQA | Acc | 0.527 | 0.548 | 0.557 | 0.626 | 0.651 | 0.643 |
| | | Macro-F1 | 0.524 | 0.545 | 0.551 | 0.625 | – | 0.640 |
| | EmrQA | Acc | 0.789 | 0.810 | 0.857 | 0.847 | 0.850 | 0.854 |
| | | F1 | 0.730 | 0.738 | 0.751 | 0.751 | 0.751 | 0.751 |
| Named entity recognition | i2b2 | Macro-F1 | 0.904 | 0.901 | 0.906 | 0.913 | 0.908 | 0.910 |
| Relation extraction | DDI | Macro-F1 | 0.622 | 0.622 | 0.559 | 0.746 | 0.737 | 0.779 |
| Classification | HoC | Macro-F1 | 0.696 | 0.422 | 0.684 | 0.818 | 0.702 | 0.841 |
| | MTsample | Macro-F1 | 0.430 | 0.345 | 0.451 | 0.458 | 0.284 | 0.544 |
| Summarization | PubMed | ROUGE-L | 0.191 | 0.091 | 0.197 | 0.211 | 0.197 | 0.209 |
| | | BERTScore | 0.663 | 0.516 | 0.679 | 0.689 | 0.677 | 0.700 |
| | MIMIC-CXR | ROUGE-L | 0.437 | 0.139 | 0.453 | 0.440 | 0.458 | 0.476 |
| | | BERTScore | 0.816 | 0.694 | 0.821 | 0.813 | 0.824 | 0.828 |
| Natural language inference | BioNLI | Macro-F1 | 0.409 | 0.332 | 0.447 | 0.447 | 0.444 | 0.566 |
| | MedNLI | Macro-F1 | 0.881 | 0.868 | 0.903 | 0.884 | 0.897 | 0.916 |
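For reference, accuracy and Macro-F1, the metrics used for most rows above, can be computed as in the minimal sketch below. This is an illustrative example with made-up labels for a PubMedQA-style task (yes/no/maybe answers), not the paper's evaluation code; it assumes scikit-learn is available.

```python
# Illustrative sketch: computing Acc and Macro-F1 for a multi-class QA task.
# Labels and predictions are hypothetical, not taken from the paper.
from sklearn.metrics import accuracy_score, f1_score

gold = ["yes", "no", "maybe", "yes", "no"]      # gold answers
pred = ["yes", "no", "yes", "yes", "maybe"]     # model answers

# Accuracy: fraction of exact matches between predictions and gold labels.
acc = accuracy_score(gold, pred)

# Macro-F1: per-class F1 averaged with equal weight per class, so a rare
# class (e.g. "maybe") counts as much as a frequent one.
macro_f1 = f1_score(gold, pred, average="macro")

print(f"Acc = {acc:.3f}, Macro-F1 = {macro_f1:.3f}")
```

Macro averaging explains why the Macro-F1 rows can sit well below the corresponding Acc rows (e.g. PubMedQA): a model that underperforms on an infrequent answer class is penalized equally for that class regardless of its share of the test set.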