Table 5 Performance overview of models after guided self-reflection

From: Autonomous medical evaluation for guideline adherence of large language models

| Model family | Model | Size | Initial results (avg) | Final results (avg) | Performance gain |
|---|---|---|---|---|---|
| GPT | 4-1106-preview | Large | 36.0 | 41.9 | 5.9 |
| GPT | 4-turbo-2024-04-09 | Large | 35.0 | 41.4 | 6.4 |
| GPT | 3.5-turbo-1106 | Small | 29.7 | 37.2 | 7.5 |
| Claude-3 | opus-20240229 | Large | 34.6 | 40.7 | 6.1 |
| Claude-3 | haiku-20240307 | Small | 30.6 | 38.3 | 7.7 |
| WizardLM-2 | 8x22B | Large | 36.3 | 41.3 | 5.0 |
| DBRX | 16x8B | Large | 31.2 | 38.4 | 7.2 |
| Mistral | 8x22B | Large | 31.4 | 38.6 | 6.0 |
| Mistral | 8x7B | Large | 34.6 | 40.1 | 5.5 |
| Mistral | 7B | Small | 31.7 | 37.7 | 7.2 |
| Llama-3 | 70B | Large | 34.2 | 40.5 | 6.3 |
| Llama-3 | 8B | Small | 31.1 | 38.0 | 6.9 |
| Llama-2 | 70B | Large | 32.1 | 38.5 | 6.4 |
| Llama-2 | 7B | Small | 28.5 | 35.6 | 7.0 |
| MedLlama-2 | 7B | Small | 24.9 | 32.5 | 7.6 |
| Gemma | 7B | Small | 19.2 | 23.7 | 4.4 |
| Meditron | 7B | Small | 12.5 | 19.4 | 6.9 |