Table 8 Results on fact verification and NLI tasks, reported with both accuracy and BLEU/ROUGE scores

From: Towards evaluating and building versatile large language models for medicine

| Method | Size | PubMedQA: Answer Ver. (ACC) | PUBLICHEALTH: Health Fact Ver. (ACC) | EMBS: Justification Ver. (BLEU/ROUGE) | MedNLI textual entailment: Discriminative Task (ACC) | MedNLI textual entailment: Generative Task (BLEU/ROUGE) |
|---|---|---|---|---|---|---|
| *Closed-source models* | | | | | | |
| GPT-4 | - | 66.15 | 78.60 | 16.28/16.27 | 86.63 | 27.09/23.71 |
| Claude-3.5 | - | 11.54 | 62.04 | 14.77/16.45 | 82.14 | 17.80/20.02 |
| *Open-source models* | | | | | | |
| MEDITRON | 7B | 25.23 | 32.66 | 11.58/15.78 | 60.83 | 4.42/14.08 |
| InternLM 2 | 7B | 99.23 | 76.94 | 8.75/14.69 | 84.67 | 15.84/19.01 |
| Mistral | 7B | 57.38 | 69.78 | 15.98/16.43 | 71.59 | 13.03/15.47 |
| Llama 3 | 8B | 94.77 | 63.89 | 16.52/16.49 | 63.85 | 21.31/22.75 |
| Qwen 2 | 7B | 18.00 | 58.25 | 12.52/14.00 | 82.00 | 14.26/16.21 |
| Med42-v2 | 8B | 73.23 | 78.54 | 15.63/15.86 | 77.57 | 12.24/15.29 |
| Baichuan 2 | 7B | 79.38 | 47.98 | 14.97/15.99 | 53.94 | 14.99/17.27 |
| MMedIns-Llama 3 | 8B | 97.08 | 79.55 | 12.71/14.65 | 86.71 | 23.52/25.17 |

‘Ver.’ denotes ‘verification’. Bolding represents the best results.
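For context, the single-number columns are accuracy scores (exact label match on the discriminative tasks) and the paired numbers are BLEU/ROUGE scores for the generative tasks. The table does not specify the exact metric implementations or variants; the sketch below is one hypothetical way such scores could be computed, assuming corpus-level BLEU from the sacrebleu package and mean ROUGE-L F1 from the rouge-score package (the function names and toy data are illustrative only).

```python
from typing import List, Tuple

import sacrebleu                      # pip install sacrebleu
from rouge_score import rouge_scorer  # pip install rouge-score


def accuracy(pred_labels: List[str], gold_labels: List[str]) -> float:
    """Exact-match accuracy (in %) for discriminative tasks such as MedNLI labels."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(pred_labels, gold_labels))
    return 100.0 * correct / len(gold_labels)


def bleu_rouge(predictions: List[str], references: List[str]) -> Tuple[float, float]:
    """Corpus BLEU and mean ROUGE-L F1 (both in %) for generative tasks."""
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = 100.0 * sum(
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ) / len(references)
    return bleu, rouge_l


if __name__ == "__main__":
    # Toy MedNLI-style data, invented purely for illustration.
    print(accuracy(["entailment", "neutral"], ["entailment", "contradiction"]))  # 50.0
    preds = ["the finding supports the hypothesis"]
    refs = ["the finding supports the stated hypothesis"]
    print(bleu_rouge(preds, refs))
```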