Fig. 3: Human-computer performance comparisons and simulated collaborations.

a Top-1, top-3, and top-10 recall rates of physicians, PhenoBrain, and large language models on 75 admitted cases covering 16 rare diseases (RDs) from 5 hospital departments. Physicians made two diagnoses per case: the first based on their own knowledge and experience (labeled “Physicians”) and the second made with external assistance (labeled “Physicians_w_assistance”). PhenoBrain likewise provided two results per case: a disease ranking restricted to the disease subgroup of one department (labeled “PhenoBrain”) and a ranking over all 9260 rare diseases (labeled “PhenoBrain_full_RDs”). The large language models also provided two results per case: a disease ranking based on Electronic Health Record (EHR) input (labeled “EHR”) and a ranking based on extracted Human Phenotype Ontology (HPO) terms (labeled “HPO”). b The number of RDs in each of the five departments (disease subgroups). c Top-3 recall rates of PhenoBrain (restricted to disease subgroups) and of physicians on the 75 admitted cases. Each dot represents one rare disease; the circle, star, right triangle, plus, and left triangle denote the five hospital departments Pediatrics, Neurology, Renal, Cardiology, and Hematology, respectively. Orange and green indicate physicians’ diagnostic accuracy without and with external assistance, while brown and dark blue indicate the corresponding average performance. For readability, only cases diagnosed by physicians with external assistance are annotated. d Performance of a simulated human-computer collaboration that integrates PhenoBrain (restricted to disease subgroups) into the clinical workflow.