Table 4 Performance of PhenoBrain, physicians, and large language models on the Human-Computer test set in five clinical departments

From: A phenotype-based AI pipeline outperforms human experts in differentially diagnosing rare diseases using EHRs

| Method | Recall | Pediatrics | Neurology | Nephrology | Cardiology | Hematology | Average (95% CI) |
|---|---|---|---|---|---|---|---|
| PhenoBrain (Ensemble method) | Top-1 | 0.733 | 0.600 | 0.533 | 0.200 | 0.400 | 0.493 (0.387–0.600) |
| | Top-3 | 0.933 | 0.667 | 0.533 | 0.400 | 0.533 | 0.613 (0.507–0.720) |
| | Top-10 | 1.000 | 0.867 | 0.800 | 0.533 | 0.867 | 0.813 (0.720–0.907) |
| Physicians | Top-1 | 0.533 | 0.533 | 0.117 | 0.533 | 0.317 | 0.407 (0.323–0.490) |
| | Top-3 | 0.567 | 0.583 | 0.233 | 0.583 | 0.372 | 0.468 (0.386–0.551) |
| | Top-10 | 0.567 | 0.583 | 0.267 | 0.600 | 0.389 | 0.481 (0.400–0.563) |
| Physicians with assistance | Top-1 | 0.617 | 0.550 | 0.150 | 0.533 | 0.383 | 0.447 (0.363–0.530) |
| | Top-3 | 0.650 | 0.600 | 0.283 | 0.583 | 0.439 | 0.511 (0.428–0.594) |
| | Top-10 | 0.650 | 0.600 | 0.317 | 0.600 | 0.456 | 0.524 (0.441–0.607) |
| ChatGPT (EHR) | Top-1 | 0.400 | 0.067 | 0.133 | 0.467 | 0.133 | 0.240 (0.147–0.333) |
| | Top-3 | 0.667 | 0.067 | 0.200 | 0.533 | 0.133 | 0.320 (0.213–0.427) |
| | Top-10 | 0.667 | 0.400 | 0.733 | 0.667 | 0.133 | 0.520 (0.413–0.627) |
| ChatGPT (HPO) | Top-1 | 0.667 | 0.067 | 0.133 | 0.133 | 0.133 | 0.227 (0.133–0.320) |
| | Top-3 | 0.800 | 0.200 | 0.467 | 0.200 | 0.200 | 0.373 (0.267–0.480) |
| | Top-10 | 0.867 | 0.400 | 0.667 | 0.267 | 0.267 | 0.493 (0.387–0.600) |
| GPT-4 (EHR) | Top-1 | 0.667 | 0.267 | 0.267 | 0.533 | 0.133 | 0.373 (0.267–0.480) |
| | Top-3 | 0.867 | 0.333 | 0.400 | 0.600 | 0.333 | 0.507 (0.400–0.613) |
| | Top-10 | 0.933 | 0.533 | 0.800 | 0.667 | 0.400 | 0.667 (0.560–0.773) |
| GPT-4 (HPO) | Top-1 | 0.800 | 0.200 | 0.467 | 0.267 | 0.133 | 0.373 (0.267–0.480) |
| | Top-3 | 0.933 | 0.467 | 0.667 | 0.533 | 0.333 | 0.587 (0.480–0.693) |
| | Top-10 | 1.000 | 0.667 | 0.867 | 0.533 | 0.533 | 0.720 (0.613–0.813) |
  1. Top-1, Top-3, and Top-10 recall rates on the Human-Computer test set (PUMCH-ADM) in five clinical departments, achieved by PhenoBrain (using the disease subgroup for each department), physicians (with or without external assistance), ChatGPT-3.5 (version 2023.06.01), and GPT-4 (version 2023.06.01). For PhenoBrain, HPO terms were extracted manually. Inputs to ChatGPT-3.5 and GPT-4 comprised either electronic health records (EHRs) or Human Phenotype Ontology (HPO) terms. We conducted multiple experiments with several hard prompts (https://github.com/dair-ai/Prompt-Engineering-Guide), including few-shot prompting, chain-of-thought prompting, and self-consistency, and report the prompt that yielded the best outcomes. Illustrative sketches of the top-k recall computation and of self-consistency prompting follow these notes.
  2. The numbers in bold represent better performance than the best benchmark.
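For readers reproducing the table's metrics, the sketch below shows one way to compute top-k recall and a 95% confidence interval for the average. It is a minimal illustration, not the authors' published code: the `cases` structure is invented, the per-department values are merely consistent with 15 cases per department (e.g., 0.733 ≈ 11/15), and the normal-approximation interval is an assumption; the table's "Average (95% CI)" column may instead reflect a bootstrap or another interval method.

```python
from math import sqrt

def top_k_recall(ranked_candidates, true_diagnosis, k):
    """Return 1.0 if the true diagnosis appears in the top-k ranked candidates."""
    return 1.0 if true_diagnosis in ranked_candidates[:k] else 0.0

def average_recall_with_ci(hits, z=1.96):
    """Mean recall over cases with a normal-approximation 95% CI.

    Assumption: the paper does not state its interval method here; a
    bootstrap interval would be an equally plausible choice.
    """
    n = len(hits)
    p = sum(hits) / n
    se = sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical cases: (ranked differential diagnoses, ground-truth diagnosis).
cases = [
    (["disease_A", "disease_B", "disease_C", "disease_D"], "disease_A"),
    (["disease_B", "disease_E", "disease_A", "disease_F"], "disease_A"),
    (["disease_C", "disease_G", "disease_H", "disease_B"], "disease_I"),
]

for k in (1, 3, 10):
    hits = [top_k_recall(ranked, truth, k) for ranked, truth in cases]
    mean, lo, hi = average_recall_with_ci(hits)
    print(f"Top-{k} recall: {mean:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```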
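Footnote 1 also names self-consistency among the prompting strategies tried. As a hedged illustration of that technique only (the `fake_llm` function is a toy stand-in, not the authors' pipeline or a real API), self-consistency samples several stochastic completions and keeps the most frequent answer:

```python
import random
from collections import Counter

def self_consistent_diagnosis(sample_fn, prompt, n_samples=5):
    """Self-consistency: majority vote over several stochastic LLM completions.

    `sample_fn` is a hypothetical callable wrapping one temperature > 0 LLM
    call that returns a single candidate diagnosis string.
    """
    votes = Counter(sample_fn(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def fake_llm(prompt):
    # Toy stand-in so the sketch runs without any API; answers vary by chance.
    return random.choice(["diagnosis_A", "diagnosis_A", "diagnosis_B"])

print(self_consistent_diagnosis(fake_llm, "EHR text + 'What is the most likely rare disease?'"))
```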