Table 5 Performance overview of models after guided self-reflexion
| Model family | Model | Size | Initial results (ave) | Final results (ave) | Performance gain |
|---|---|---|---|---|---|
| GPT | 4-1106-preview | Large | 36.0 | 41.9 | 5.9 |
| GPT | 4-turbo-2024-04-09 | Large | 35.0 | 41.4 | 6.4 |
| GPT | 3.5-turbo-1106 | Small | 29.7 | 37.2 | 7.5 |
| Claude-3 | opus-20240229 | Large | 34.6 | 40.7 | 6.1 |
| Claude-3 | haiku-20240307 | Small | 30.6 | 38.3 | 7.7 |
| WizardLM-2 | 8x22B | Large | 36.3 | 41.3 | 5.0 |
| DBRX | 16x8B | Large | 31.2 | 38.4 | 7.2 |
| Mistral | 8x22B | Large | 31.4 | 38.6 | 6.0 |
| Mistral | 8x7B | Large | 34.6 | 40.1 | 5.5 |
| Mistral | 7B | Small | 31.7 | 37.7 | 7.2 |
| Llama-3 | 70B | Large | 34.2 | 40.5 | 6.3 |
| Llama-3 | 8B | Small | 31.1 | 38.0 | 6.9 |
| Llama-2 | 70B | Large | 32.1 | 38.5 | 6.4 |
| Llama-2 | 7B | Small | 28.5 | 35.6 | 7.0 |
| MedLlama-2 | 7B | Small | 24.9 | 32.5 | 7.6 |
| Gemma | 7B | Small | 19.2 | 23.7 | 4.4 |
| Meditron | 7B | Small | 12.5 | 19.4 | 6.9 |
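The performance gain column is the difference between the final and initial average scores (e.g. GPT 4-1106-preview: 41.9 − 36.0 = 5.9). The minimal Python sketch below recomputes the gain for a few rows copied from the table; the field layout and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch: recompute "Performance gain" as final average minus
# initial average. Score values are copied from Table 5 above; the row
# layout and names here are illustrative assumptions.
rows = [
    # (model family, model, initial avg, final avg)
    ("GPT", "4-1106-preview", 36.0, 41.9),
    ("Claude-3", "haiku-20240307", 30.6, 38.3),
    ("Meditron", "7B", 12.5, 19.4),
]

for family, model, initial, final in rows:
    gain = round(final - initial, 1)  # e.g. 41.9 - 36.0 = 5.9
    print(f"{family} {model}: {initial} -> {final} (gain {gain})")
```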