Table 1 Performance of different LLMs

From: A collaborative large language model for drug analysis

| Method | No. of parameters | USMLE Accuracy | MedMCQA Accuracy | MMLU Accuracy | ChatDoctor Precision | ChatDoctor Recall | ChatDoctor F1 | ADE Accuracy | Drug_Effects Accuracy | DDI Accuracy | PubMedQA Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Galactica [43] | 120B (0.7×) | 44.4 | 52.9 | − | − | − | − | − | − | − | 77.6 |
| InstructGPT [19] | 175B (1×) | 46.0 | 44.0 | 35.1 | − | − | − | − | − | − | 73.2 |
| Flan-PaLM [5] | 540B (3.1×) | 67.6 | 57.6 | 80.1 | − | − | − | − | − | − | 79.0 |
| Med-PaLM-2 [6] | 340B (1.9×) | 86.5 | 72.3 | 89.2 | − | − | − | − | − | − | 81.8 |
| GPT-4-base [42] | >1T (>5.7×) | 86.1 | 73.7 | 93.8 | − | − | − | − | − | − | 80.4 |
| LLaMA-3 [41] | 70B (0.4×) | 79.4 (3.131) | 76.0 (3.268) | 86.1 (1.769) | 14.7 (0.019) | 8.0 (0.017) | 10.4 (0.020) | 56.2 (2.304) | 43.3 (0.176) | 58.9 (1.679) | 78.5 (3.014) |
| ChatGPT [1] | 175B (1×) | 63.7 (2.847) | 66.3 (2.448) | 73.8 (2.181) | 6.3 (0.026) | 4.7 (0.025) | 5.4 (0.027) | 43.1 (2.735) | 37.5 (0.201) | 40.2 (1.191) | 66.1 (2.387) |
| GPT-4 [2] | >1T (>5.7×) | 83.5 (0.014) | 79.0 (0.101) | 90.5 (0.025) | 19.0 (0.007) | 11.2 (0.008) | 14.1 (0.009) | 60.2 (0.030) | 46.5 (0.088) | 62.8 (0.062) | 81.2 (0.059) |
| Claude-3 [40] | − (−) | 86.8 (1.092) | 82.3 (0.984) | 91.7 (1.120) | 24.4 (0.013) | 17.2 (0.012) | 20.2 (0.013) | 66.4 (1.177) | 52.0 (0.135) | 68.1 (0.398) | 84.0 (0.813) |
| DrugGPT (current work) | 175B (1×) | **88.2 (0.044)** | **86.5 (0.023)** | **95.3 (0.006)** | **48.3 (0.207)** | **35.3 (0.076)** | **40.8 (0.119)** | **83.7 (0.024)** | **91.8 (0.005)** | **97.1 (0.004)** | **88.0 (0.005)** |

  1. ‘No. of parameters’ denotes the number of model parameters; the multiple in parentheses in that column is computed relative to the parameter count of our method. ‘−’ indicates a value that is not available. We conduct multiple runs for DrugGPT to reduce its randomness and report the mean (variance) of performance. All values are percentages (%); higher is better for all metrics. Bold values indicate the highest performance on each dataset. Our DrugGPT consistently outperforms all previous strong methods across a broad range of datasets.
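
The ‘mean (variance)’ cells can be read as an aggregate over repeated evaluation runs. The sketch below shows one way such a cell could be formatted; the per-run scores are hypothetical, and the table does not state whether the population or sample variance is used (population variance is assumed here).

```python
# Hedged sketch: formatting a per-run score list as a 'mean (variance)' cell,
# as in Table 1. Run scores are hypothetical; population variance is assumed.
from statistics import fmean, pvariance

def mean_variance_cell(run_scores_pct: list[float]) -> str:
    """Format per-run scores (in %) as 'mean (variance)', e.g. '88.2 (0.044)'."""
    mean = fmean(run_scores_pct)
    var = pvariance(run_scores_pct, mu=mean)
    return f"{mean:.1f} ({var:.3f})"

# Example: three hypothetical runs clustered around 88.2%
print(mean_variance_cell([88.0, 88.2, 88.4]))  # → 88.2 (0.027)
```

Note that a small variance (for example, GPT-4's 0.014 on USMLE versus LLaMA-3's 3.131) indicates more stable behaviour across runs, independent of the mean score.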