Table 5 Performance overview of models after guided self-reflection

From: Autonomous medical evaluation for guideline adherence of large language models

| Model family | Model | Size | Initial results (avg) | Final results (avg) | Performance gain |
|---|---|---|---|---|---|
| GPT | 4-1106-preview | Large | 36.0 | 41.9 | 5.9 |
| GPT | 4-turbo-2024-04-09 | Large | 35.0 | 41.4 | 6.4 |
| GPT | 3.5-turbo-1106 | Small | 29.7 | 37.2 | 7.5 |
| Claude-3 | opus-20240229 | Large | 34.6 | 40.7 | 6.1 |
| Claude-3 | haiku-20240307 | Small | 30.6 | 38.3 | 7.7 |
| WizardLM-2 | 8x22B | Large | 36.3 | 41.3 | 5.0 |
| DBRX | 16x8B | Large | 31.2 | 38.4 | 7.2 |
| Mistral | 8x22B | Large | 31.4 | 38.6 | 6.0 |
| Mistral | 8x7B | Large | 34.6 | 40.1 | 5.5 |
| Mistral | 7B | Small | 31.7 | 37.7 | 7.2 |
| Llama-3 | 70B | Large | 34.2 | 40.5 | 6.3 |
| Llama-3 | 8B | Small | 31.1 | 38.0 | 6.9 |
| Llama-2 | 70B | Large | 32.1 | 38.5 | 6.4 |
| Llama-2 | 7B | Small | 28.5 | 35.6 | 7.0 |
| MedLlama-2 | 7B | Small | 24.9 | 32.5 | 7.6 |
| Gemma | 7B | Small | 19.2 | 23.7 | 4.4 |
| Meditron | 7B | Small | 12.5 | 19.4 | 6.9 |