Table 7 Results on treatment planning, diagnosis, clinical outcome prediction, and text classification
From: Towards evaluating and building versatile large language models for medicine
| Method | Size | SEER | DDXPlus | MIMIC4ED: Hospitalization | MIMIC4ED: 72h ED Revisit | MIMIC4ED: Critical Triage | HoC Classification: Precision | HoC Classification: Recall | HoC Classification: F1 |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | | | | |
| GPT-4 | - | 84.73 | 58.13 | 61.20 | 58.07 | 60.13 | 61.07 | 80.23 | 68.06 |
| Claude-3.5 | - | 92.93 | 60.24 | 65.80 | 57.91 | 68.53 | 58.43 | 79.84 | 66.74 |
| Open-source Models | | | | | | | | | |
| MEDITRON | 7B | 68.27 | 29.53 | 56.27 | 48.47 | 45.67 | 19.61 | 34.61 | 23.70 |
| InternLM 2 | 7B | 62.33 | 35.20 | 58.80 | 55.13 | 52.80 | 20.65 | 82.24 | 31.09 |
| Mistral | 7B | 38.93 | 34.80 | 56.27 | 48.47 | 45.67 | 40.39 | 64.11 | 48.73 |
| Llama 3 | 8B | 56.07 | 33.73 | 39.07 | 9.27 | 8.80 | 32.40 | 52.03 | 38.37 |
| Qwen 2 | 7B | 22.27 | 34.07 | 57.60 | 56.67 | 53.53 | 37.78 | 53.81 | 40.29 |
| Med42-v2 | 8B | 43.87 | 34.13 | 57.87 | 55.20 | 46.60 | 49.95 | 53.12 | 47.87 |
| Baichuan 2 | 7B | 16.80 | 34.13 | 22.73 | 8.07 | 2.13 | 38.54 | 20.28 | 23.76 |
| MMedIns-Llama 3 | 8B | 98.47 | 97.53 | 74.20 | 52.73 | 63.13 | 89.59 | 85.58 | 86.66 |