Table 4 Performance of PhenoBrain, physicians, and large language models on the human-computer test set across five clinical departments
| Method | Recall | Pediatrics | Neurology | Nephrology | Cardiology | Hematology | Average (95% CI) |
|---|---|---|---|---|---|---|---|
| PhenoBrain (ensemble method) | Top-1 | 0.733 | 0.600 | 0.533 | 0.200 | 0.400 | 0.493 (0.387–0.600) |
| | Top-3 | 0.933 | 0.667 | 0.533 | 0.400 | 0.533 | 0.613 (0.507–0.720) |
| | Top-10 | 1.000 | 0.867 | 0.800 | 0.533 | 0.867 | 0.813 (0.720–0.907) |
| Physicians | Top-1 | 0.533 | 0.533 | 0.117 | 0.533 | 0.317 | 0.407 (0.323–0.490) |
| | Top-3 | 0.567 | 0.583 | 0.233 | 0.583 | 0.372 | 0.468 (0.386–0.551) |
| | Top-10 | 0.567 | 0.583 | 0.267 | 0.600 | 0.389 | 0.481 (0.400–0.563) |
| Physicians with assistance | Top-1 | 0.617 | 0.550 | 0.150 | 0.533 | 0.383 | 0.447 (0.363–0.530) |
| | Top-3 | 0.650 | 0.600 | 0.283 | 0.583 | 0.439 | 0.511 (0.428–0.594) |
| | Top-10 | 0.650 | 0.600 | 0.317 | 0.600 | 0.456 | 0.524 (0.441–0.607) |
| ChatGPT (EHR) | Top-1 | 0.400 | 0.067 | 0.133 | 0.467 | 0.133 | 0.240 (0.147–0.333) |
| | Top-3 | 0.667 | 0.067 | 0.200 | 0.533 | 0.133 | 0.320 (0.213–0.427) |
| | Top-10 | 0.667 | 0.400 | 0.733 | 0.667 | 0.133 | 0.520 (0.413–0.627) |
| ChatGPT (HPO) | Top-1 | 0.667 | 0.067 | 0.133 | 0.133 | 0.133 | 0.227 (0.133–0.320) |
| | Top-3 | 0.800 | 0.200 | 0.467 | 0.200 | 0.200 | 0.373 (0.267–0.480) |
| | Top-10 | 0.867 | 0.400 | 0.667 | 0.267 | 0.267 | 0.493 (0.387–0.600) |
| GPT-4 (EHR) | Top-1 | 0.667 | 0.267 | 0.267 | 0.533 | 0.133 | 0.373 (0.267–0.480) |
| | Top-3 | 0.867 | 0.333 | 0.400 | 0.600 | 0.333 | 0.507 (0.400–0.613) |
| | Top-10 | 0.933 | 0.533 | 0.800 | 0.667 | 0.400 | 0.667 (0.560–0.773) |
| GPT-4 (HPO) | Top-1 | 0.800 | 0.200 | 0.467 | 0.267 | 0.133 | 0.373 (0.267–0.480) |
| | Top-3 | 0.933 | 0.467 | 0.667 | 0.533 | 0.333 | 0.587 (0.480–0.693) |
| | Top-10 | 1.000 | 0.667 | 0.867 | 0.533 | 0.533 | 0.720 (0.613–0.813) |