Table 1 Performance of different LLMs
From: A collaborative large language model for drug analysis
Method | No. of parameters | USMLE Accuracy | MedMCQA Accuracy | MMLU Accuracy | ChatDoctor Precision | ChatDoctor Recall | ChatDoctor F1 | ADE Accuracy | Drug_Effects Accuracy | DDI Accuracy | PubMedQA Accuracy
---|---|---|---|---|---|---|---|---|---|---|---
Galactica [43] | 120B (0.7×) | 44.4 | 52.9 | − | − | − | − | − | − | − | 77.6
InstructGPT [19] | 175B (1×) | 46.0 | 44.0 | 35.1 | − | − | − | − | − | − | 73.2
Flan-PaLM [5] | 540B (3.1×) | 67.6 | 57.6 | 80.1 | − | − | − | − | − | − | 79.0
Med-PaLM-2 [6] | 340B (1.9×) | 86.5 | 72.3 | 89.2 | − | − | − | − | − | − | 81.8
GPT-4-base [42] | >1T (>5.7×) | 86.1 | 73.7 | 93.8 | − | − | − | − | − | − | 80.4
LLaMA-3 [41] | 70B (0.4×) | 79.4 (3.131) | 76.0 (3.268) | 86.1 (1.769) | 14.7 (0.019) | 8.0 (0.017) | 10.4 (0.020) | 56.2 (2.304) | 43.3 (0.176) | 58.9 (1.679) | 78.5 (3.014)
ChatGPT [1] | 175B (1×) | 63.7 (2.847) | 66.3 (2.448) | 73.8 (2.181) | 6.3 (0.026) | 4.7 (0.025) | 5.4 (0.027) | 43.1 (2.735) | 37.5 (0.201) | 40.2 (1.191) | 66.1 (2.387)
GPT-4 [2] | >1T (>5.7×) | 83.5 (0.014) | 79.0 (0.101) | 90.5 (0.025) | 19.0 (0.007) | 11.2 (0.008) | 14.1 (0.009) | 60.2 (0.030) | 46.5 (0.088) | 62.8 (0.062) | 81.2 (0.059)
Claude-3 [40] | − (−) | 86.8 (1.092) | 82.3 (0.984) | 91.7 (1.120) | 24.4 (0.013) | 17.2 (0.012) | 20.2 (0.013) | 66.4 (1.177) | 52.0 (0.135) | 68.1 (0.398) | 84.0 (0.813)
DrugGPT (current work) | 175B (1×) | 88.2 (0.044) | 86.5 (0.023) | 95.3 (0.006) | 48.3 (0.207) | 35.3 (0.076) | 40.8 (0.119) | 83.7 (0.024) | 91.8 (0.005) | 97.1 (0.004) | 88.0 (0.005)

The multipliers in the "No. of parameters" column give each model's size relative to DrugGPT's 175B (1×).
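As a consistency check on the reconstructed ChatDoctor columns, F1 is the harmonic mean of precision and recall, so the reported F1 values should be recoverable from the P and R columns. A minimal sketch (values taken directly from the table; rounding to one decimal is an assumption about how the source reports scores):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for each row with ChatDoctor scores
rows = {
    "LLaMA-3":  (14.7, 8.0, 10.4),
    "ChatGPT":  (6.3, 4.7, 5.4),
    "GPT-4":    (19.0, 11.2, 14.1),
    "Claude-3": (24.4, 17.2, 20.2),
    "DrugGPT":  (48.3, 35.3, 40.8),
}

for name, (p, r, reported) in rows.items():
    # Each reported F1 matches the harmonic mean to one decimal place.
    assert round(f1(p, r), 1) == reported, name
```

All five rows check out, which supports the column alignment chosen when rebuilding the flattened table.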