Table 7 Results on treatment planning, diagnosis, clinical outcome prediction, and text classification

From: Towards evaluating and building versatile large language models for medicine

| Method | Size | SEER | DDXPlus | MIMIC4ED: Hospitalization | MIMIC4ED: 72h ED Revisit | MIMIC4ED: Critical Triage | HoC: Precision | HoC: Recall | HoC: F1 |
|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | |
| GPT-4 | - | 84.73 | 58.13 | 61.20 | **58.07** | 60.13 | 61.07 | 80.23 | 68.06 |
| Claude-3.5 | - | 92.93 | 60.24 | 65.80 | 57.91 | **68.53** | 58.43 | 79.84 | 66.74 |
| **Open-source Models** | | | | | | | | | |
| MEDITRON | 7B | 68.27 | 29.53 | 56.27 | 48.47 | 45.67 | 19.61 | 34.61 | 23.70 |
| InternLM 2 | 7B | 62.33 | 35.20 | 58.80 | 55.13 | 52.80 | 20.65 | 82.24 | 31.09 |
| Mistral | 7B | 38.93 | 34.80 | 56.27 | 48.47 | 45.67 | 40.39 | 64.11 | 48.73 |
| Llama 3 | 8B | 56.07 | 33.73 | 39.07 | 9.27 | 8.80 | 32.40 | 52.03 | 38.37 |
| Qwen 2 | 7B | 22.27 | 34.07 | 57.60 | 56.67 | 53.53 | 37.78 | 53.81 | 40.29 |
| Med42-v2 | 8B | 43.87 | 34.13 | 57.87 | 55.20 | 46.60 | 49.95 | 53.12 | 47.87 |
| Baichuan 2 | 7B | 16.80 | 34.13 | 22.73 | 8.07 | 2.13 | 38.54 | 20.28 | 23.76 |
| MMedIns-Llama 3 | 8B | **98.47** | **97.53** | **74.20** | 52.73 | 63.13 | **89.59** | **85.58** | **86.66** |

1. The first three tasks (SEER, DDXPlus, MIMIC4ED) are reported with Accuracy scores; text classification (HoC) is reported with Precision, Recall, and F1 scores. Bold indicates the best result in each column.
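For reference, a minimal sketch of how the HoC precision/recall/F1 columns could be computed. The multi-label indicator encoding, the `average="macro"` setting, and the dummy arrays below are assumptions for illustration; the table does not state which averaging scheme the authors used.

```python
# Sketch: precision/recall/F1 for a multi-label task such as HoC,
# assuming 0/1 indicator arrays and macro-averaging over the labels.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 10))  # gold labels, 10 hypothetical classes
y_pred = rng.integers(0, 2, size=(100, 10))  # model predictions (placeholder data)

# Macro-averaging gives every label equal weight regardless of frequency.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Precision: {precision:.2%}  Recall: {recall:.2%}  F1: {f1:.2%}")
```

Note that under macro-averaging, F1 is not simply the harmonic mean of the reported precision and recall columns, which is consistent with rows such as InternLM 2 where recall is high but F1 is low.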