Nature Biomedical Engineering

Table 2 Results of human evaluation, comparing our method (DrugGPT) with ChatGPT and GPT-4 in terms of factuality, completeness, safety and preference

From: A collaborative large language model for drug analysis

	ChatGPT¹ versus DrugGPT			GPT-4² versus DrugGPT		
Metric	ChatGPT wins	Tie	DrugGPT wins	GPT-4 wins	Tie	DrugGPT wins
Factuality	24.0	16.0	60.0	33.0	20.0	47.0
Completeness	15.0	22.0	63.0	21.0	28.0	51.0
Safety	19.0	12.0	69.0	23.0	18.0	59.0
Preference	13.0	11.0	76.0	18.0	14.0	68.0

All values are reported as percentages (%).
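
As a reading aid, the following minimal sketch transcribes the table into Python, checks that the three outcomes of each pairwise comparison (baseline wins, tie, DrugGPT wins) sum to 100%, and derives a net win margin for each metric. The net margin is an illustrative summary that the table itself does not report, and the names `RESULTS`, `check_rows` and `net_margin` are hypothetical.

```python
# Values transcribed from the table (percentages).
# Each tuple is (baseline wins, tie, DrugGPT wins).
RESULTS = {
    "ChatGPT": {
        "Factuality": (24.0, 16.0, 60.0),
        "Completeness": (15.0, 22.0, 63.0),
        "Safety": (19.0, 12.0, 69.0),
        "Preference": (13.0, 11.0, 76.0),
    },
    "GPT-4": {
        "Factuality": (33.0, 20.0, 47.0),
        "Completeness": (21.0, 28.0, 51.0),
        "Safety": (23.0, 18.0, 59.0),
        "Preference": (18.0, 14.0, 68.0),
    },
}


def check_rows(results, tolerance=1e-9):
    """Raise ValueError if any comparison's three outcomes do not sum to 100%."""
    for baseline, metrics in results.items():
        for metric, row in metrics.items():
            total = sum(row)
            if abs(total - 100.0) > tolerance:
                raise ValueError(
                    f"{metric} ({baseline} versus DrugGPT) sums to {total}, not 100"
                )


def net_margin(baseline_wins, tie, druggpt_wins):
    """Illustrative summary (an assumption, not from the article):
    DrugGPT win rate minus baseline win rate, in percentage points."""
    return druggpt_wins - baseline_wins


if __name__ == "__main__":
    check_rows(RESULTS)
    for baseline, metrics in RESULTS.items():
        for metric, row in metrics.items():
            print(f"{metric:<13} {baseline:<8} net margin: {net_margin(*row):+.1f}")
```

For example, the preference margin against ChatGPT is 76.0 - 13.0 = 63.0 percentage points, and the factuality margin against GPT-4 is 47.0 - 33.0 = 14.0 percentage points.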
