Extended Data Fig. 2: DDx top-k accuracy for non-disease states and positive disease states.
From: Towards conversational diagnostic artificial intelligence

a,b: Specialist-rated DDx top-k accuracy for the 149 “positive” scenarios with respect to (a) the ground-truth diagnosis and (b) the accepted differentials. c,d: Specialist-rated DDx top-k accuracy for the 10 “negative” scenarios with respect to (c) the ground-truth diagnosis and (d) the accepted differentials. Using two-sided bootstrap tests (n = 10,000) with FDR correction, differences in the “positive” scenarios were significant (P < 0.05) for all k, but differences in the “negative” scenarios were not significant due to the small sample size. Centrelines correspond to the average top-k accuracy, with 95% confidence intervals shaded.

FDR-adjusted P values for panel a (positive disease states, ground-truth comparison): 0.0041 (k = 1), 0.0002 (k = 2), 0.0001 (k = 3), 0.0002 (k = 4), 0.0001 (k = 5), 0.0002 (k = 6), 0.0002 (k = 7), 0.0003 (k = 8), 0.0001 (k = 9) and 0.0001 (k = 10).

FDR-adjusted P values for panel b (positive disease states, accepted differential comparison): 0.0002 (k = 1), 0.0001 (k = 2), 0.0002 (k = 3), 0.0003 (k = 4), 0.0001 (k = 5), 0.0001 (k = 6), 0.0001 (k = 7), 0.0001 (k = 8), 0.0001 (k = 9) and 0.0001 (k = 10).

FDR-adjusted P values for panel c (non-disease states, ground-truth comparison): 0.1907 (k = 1), 0.1035 (k = 2), 0.1035 (k = 3), 0.1035 (k = 4), 0.1035 (k = 5), 0.1035 (k = 6), 0.1035 (k = 7), 0.1035 (k = 8), 0.1035 (k = 9) and 0.1035 (k = 10).

FDR-adjusted P values for panel d (non-disease states, accepted differential comparison): 0.1035 (k = 1), 0.1035 (k = 2), 0.1829 (k = 3), 0.1035 (k = 4), 0.1035 (k = 5), 0.1035 (k = 6), 0.1035 (k = 7), 0.1035 (k = 8), 0.1035 (k = 9) and 0.1035 (k = 10).
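
The statistical procedure named in the caption (top-k accuracy, a two-sided bootstrap test, FDR correction across k) can be illustrated with a minimal Python sketch. Several choices here are assumptions, since the caption does not specify them: the test is taken to be a paired bootstrap on per-scenario differences with a centred null, the FDR adjustment is taken to be Benjamini-Hochberg, and the function names, the arm labels and the toy ranks are all hypothetical rather than taken from the study.

```python
import random


def top_k_hit(rank, k):
    """True if the correct diagnosis appears within the first k DDx entries.

    `rank` is the 1-based position of the correct diagnosis, or None if absent.
    """
    return rank is not None and rank <= k


def top_k_accuracy(ranks, k):
    """Fraction of scenarios whose correct diagnosis is in the top k."""
    return sum(top_k_hit(r, k) for r in ranks) / len(ranks)


def paired_bootstrap_p(ranks_a, ranks_b, k, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap P value for a difference in top-k accuracy.

    Assumption: scenarios are paired across the two arms, and the null is
    simulated by resampling per-scenario differences centred at zero.
    """
    rng = random.Random(seed)
    n = len(ranks_a)
    diffs = [int(top_k_hit(a, k)) - int(top_k_hit(b, k))
             for a, b in zip(ranks_a, ranks_b)]
    observed = sum(diffs) / n
    centred = [d - observed for d in diffs]
    extreme = 0
    for _ in range(n_boot):
        resampled_mean = sum(rng.choice(centred) for _ in range(n)) / n
        if abs(resampled_mean) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_boot + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (hypothetical): rank of the correct diagnosis in each scenario.
ranks_arm_a = [1, 1, 2, None, 1, 3, 1, 2, 1, 1]
ranks_arm_b = [2, None, 4, None, 1, None, 3, 5, 1, None]

raw = [paired_bootstrap_p(ranks_arm_a, ranks_arm_b, k, n_boot=2000)
       for k in range(1, 11)]
for k, p in enumerate(benjamini_hochberg(raw), start=1):
    print(f"k={k}: FDR-adjusted P = {p:.4f}")
```

A step-up adjustment of this kind can assign the same adjusted value to several tests, which resembles the repeated entries in the lists above, although the caption does not state which FDR procedure was used.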