Table 3 Reasoning analysis: evaluations in the zero-shot learning setting to examine whether the LLMs could reason about drugs using learned knowledge
| Method | No. of parameters | USMLE (accuracy) | MedMCQA (accuracy) | MMLU (accuracy) | ChatDoctor (precision) | ChatDoctor (recall) | ChatDoctor (F1) | ADE (accuracy) | Drug_Effects (accuracy) | DDI (accuracy) | PubMedQA (accuracy) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Med-PaLM-2 (ref. 6) | 340B (1.9×) | 79.7 | 71.3 | − | − | − | − | − | − | − | 79.2 |
| ChatGPT (ref. 1) | 175B (1×) | 55.8 | 63.5 | 71.4 | 4.7 | 5.6 | 5.1 | 45.2 | 39.8 | 42.8 | 64.1 |
| GPT-4 (ref. 2) | >1T (>5.7×) | 80.2 | 76.6 | 84.4 | 15.6 | 10.7 | 12.7 | 55.5 | 47.8 | 58.5 | 74.5 |
| DrugGPT (current work) | 175B (1×) | 82.7 | 80.2 | 85.6 | 50.2 | 33.7 | 40.3 | 84.2 | 92.7 | 95.1 | 84.5 |