Table 8 Results on fact verification and NLI, as reported with both accuracy and BLEU/ROUGE scores
From: Towards evaluating and building versatile large language models for medicine
| Method | Size | PubMedQA (Answer Ver.) | PUBLICHEALTH (Health Fact Ver.) | EMBS (Justification Ver.) | MedNLI Textual Entailment (Discriminative Task) | MedNLI Textual Entailment (Generative Task) |
|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | |
| GPT-4 | - | 66.15 | 78.60 | 16.28/16.27 | 86.63 | 27.09/23.71 |
| Claude-3.5 | - | 11.54 | 62.04 | 14.77/16.45 | 82.14 | 17.80/20.02 |
| **Open-source Models** | | | | | | |
| MEDITRON | 7B | 25.23 | 32.66 | 11.58/15.78 | 60.83 | 4.42/14.08 |
| InternLM 2 | 7B | 99.23 | 76.94 | 8.75/14.69 | 84.67 | 15.84/19.01 |
| Mistral | 7B | 57.38 | 69.78 | 15.98/16.43 | 71.59 | 13.03/15.47 |
| Llama 3 | 8B | 94.77 | 63.89 | 16.52/16.49 | 63.85 | 21.31/22.75 |
| Qwen 2 | 7B | 18.00 | 58.25 | 12.52/14.00 | 82.00 | 14.26/16.21 |
| Med42-v2 | 8B | 73.23 | 78.54 | 15.63/15.86 | 77.57 | 12.24/15.29 |
| Baichuan 2 | 7B | 79.38 | 47.98 | 14.97/15.99 | 53.94 | 14.99/17.27 |
| MMedIns-Llama 3 | 8B | 97.08 | 79.55 | 12.71/14.65 | 86.71 | 23.52/25.17 |
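To read the table: single values in the PubMedQA, PUBLICHEALTH, and MedNLI discriminative columns are accuracy (%), while the paired values in the EMBS justification and MedNLI generative columns are BLEU/ROUGE scores. The sketch below shows one plausible way such scores could be computed; the exact metric variants and tooling (here BLEU-1 via `nltk` and ROUGE-1 F-measure via `rouge_score`) are assumptions, as the table does not specify them, and the helper names `accuracy` and `bleu_rouge` are illustrative.

```python
# Sketch of the two score types in this table, under stated assumptions:
# BLEU is corpus-level BLEU-1 and ROUGE is mean ROUGE-1 F-measure.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from rouge_score import rouge_scorer


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Accuracy (%) for the discriminative columns (PubMedQA, PUBLICHEALTH, MedNLI)."""
    correct = sum(p.strip().lower() == l.strip().lower()
                  for p, l in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def bleu_rouge(predictions: list[str], references: list[str]) -> tuple[float, float]:
    """BLEU/ROUGE pair for the generative columns (EMBS justification, MedNLI generative)."""
    # Corpus-level BLEU-1: unigram weight only, smoothed for short outputs.
    bleu = corpus_bleu(
        [[ref.split()] for ref in references],   # one reference per hypothesis
        [pred.split() for pred in predictions],
        weights=(1.0, 0.0, 0.0, 0.0),
        smoothing_function=SmoothingFunction().method1,
    )
    # ROUGE-1 F-measure, averaged over the test set.
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    rouge = sum(
        scorer.score(ref, pred)["rouge1"].fmeasure
        for ref, pred in zip(references, predictions)
    ) / len(references)
    return 100.0 * bleu, 100.0 * rouge


if __name__ == "__main__":
    # Toy example: one MedNLI-style generative prediction vs. its reference.
    preds = ["the premise entails the hypothesis"]
    refs = ["the premise entails the hypothesis because the symptoms match"]
    print("BLEU/ROUGE: %.2f/%.2f" % bleu_rouge(preds, refs))
```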