Table 1 Performance of different LLMs

From: A collaborative large language model for drug analysis

| Method | No. of parameters | USMLE Accuracy | MedMCQA Accuracy | MMLU Accuracy | ChatDoctor Precision | ChatDoctor Recall | ChatDoctor F1 | ADE Accuracy | Drug_Effects Accuracy | DDI Accuracy | PubMedQA Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Galactica [43] | 120B (0.7×) | 44.4 | 52.9 | − | − | − | − | − | − | − | 77.6 |
| InstructGPT [19] | 175B (1×) | 46.0 | 44.0 | 35.1 | − | − | − | − | − | − | 73.2 |
| Flan-PaLM [5] | 540B (3.1×) | 67.6 | 57.6 | 80.1 | − | − | − | − | − | − | 79.0 |
| Med-PaLM-2 [6] | 340B (1.9×) | 86.5 | 72.3 | 89.2 | − | − | − | − | − | − | 81.8 |
| GPT-4-base [42] | >1T (>5.7×) | 86.1 | 73.7 | 93.8 | − | − | − | − | − | − | 80.4 |
| LLaMA-3 [41] | 70B (0.4×) | 79.4 (3.131) | 76.0 (3.268) | 86.1 (1.769) | 14.7 (0.019) | 8.0 (0.017) | 10.4 (0.020) | 56.2 (2.304) | 43.3 (0.176) | 58.9 (1.679) | 78.5 (3.014) |
| ChatGPT [1] | 175B (1×) | 63.7 (2.847) | 66.3 (2.448) | 73.8 (2.181) | 6.3 (0.026) | 4.7 (0.025) | 5.4 (0.027) | 43.1 (2.735) | 37.5 (0.201) | 40.2 (1.191) | 66.1 (2.387) |
| GPT-4 [2] | >1T (>5.7×) | 83.5 (0.014) | 79.0 (0.101) | 90.5 (0.025) | 19.0 (0.007) | 11.2 (0.008) | 14.1 (0.009) | 60.2 (0.030) | 46.5 (0.088) | 62.8 (0.062) | 81.2 (0.059) |
| Claude-3 [40] | − (−) | 86.8 (1.092) | 82.3 (0.984) | 91.7 (1.120) | 24.4 (0.013) | 17.2 (0.012) | 20.2 (0.013) | 66.4 (1.177) | 52.0 (0.135) | 68.1 (0.398) | 84.0 (0.813) |
| DrugGPT (current work) | 175B (1×) | **88.2 (0.044)** | **86.5 (0.023)** | **95.3 (0.006)** | **48.3 (0.207)** | **35.3 (0.076)** | **40.8 (0.119)** | **83.7 (0.024)** | **91.8 (0.005)** | **97.1 (0.004)** | **88.0 (0.005)** |

  1. ‘No. of parameters’ denotes the number of model parameters; the multiple in parentheses in that column is computed relative to the parameter count of our method. ‘−’ indicates a value that is not available. We conduct multiple runs for DrugGPT to reduce its randomness and report the mean (variance) of performance. All values are percentages (%); higher is better for all metrics. Bold values indicate the highest performance on each dataset. Our DrugGPT consistently outperforms all previous strong methods across a broad range of datasets.
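
The ‘mean (variance)’ cells can be read as an aggregate over repeated evaluation runs. The sketch below shows one way such a cell could be formatted; the per-run scores are hypothetical, and the table does not state whether the population or sample variance is used (population variance is assumed here).

```python
# Hedged sketch: formatting a per-run score list as a 'mean (variance)' cell,
# as in Table 1. Run scores are hypothetical; population variance is assumed.
from statistics import fmean, pvariance

def mean_variance_cell(run_scores_pct: list[float]) -> str:
    """Format per-run scores (in %) as 'mean (variance)', e.g. '88.2 (0.044)'."""
    mean = fmean(run_scores_pct)
    var = pvariance(run_scores_pct, mu=mean)
    return f"{mean:.1f} ({var:.3f})"

# Example: three hypothetical runs clustered around 88.2%
print(mean_variance_cell([88.0, 88.2, 88.4]))  # → 88.2 (0.027)
```

Note that a small variance (for example, GPT-4's 0.014 on USMLE versus LLaMA-3's 3.131) indicates more stable behaviour across runs, independent of the mean score.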