Extended Data Fig. 3: Specialist-rated DDx accuracy by scenario specialty.
From: Towards conversational diagnostic artificial intelligence

Top-k DDx accuracy for scenarios with respect to the ground truth in (a) Cardiology (N = 31, not significant), (b) Gastroenterology (N = 33, not significant), (c) Internal Medicine (N = 16, significant for all k), (d) Neurology (N = 32, significant for k > 5), (e) Obstetrics and Gynaecology (OBGYN)/Urology (N = 15, not significant), (f) Respiratory (N = 32, significant for all k). Two-sided bootstrap tests (n = 10,000) with FDR correction were used to assess significance (P < 0.05) on these cases. Centrelines correspond to the average top-k accuracy, with 95% confidence intervals shaded.

FDR-adjusted P values by panel:
(a) Cardiology: 0.0911 (k = 1), 0.0637 (k = 2), 0.0637 (k = 3), 0.0911 (k = 4), 0.0911 (k = 5), 0.0929 (k = 6), 0.0929 (k = 7), 0.0929 (k = 8), 0.0929 (k = 9) and 0.0929 (k = 10).
(b) Gastroenterology: 0.4533 (k = 1), 0.1735 (k = 2), 0.1735 (k = 3), 0.1735 (k = 4), 0.1735 (k = 5), 0.1735 (k = 6), 0.1735 (k = 7), 0.1735 (k = 8), 0.1735 (k = 9) and 0.1735 (k = 10).
(c) Internal Medicine: 0.0016 (k = 1), 0.0102 (k = 2), 0.0216 (k = 3), 0.0216 (k = 4), 0.0013 (k = 5), 0.0013 (k = 6), 0.0013 (k = 7), 0.0013 (k = 8), 0.0013 (k = 9) and 0.0013 (k = 10).
(d) Neurology: 0.2822 (k = 1), 0.1655 (k = 2), 0.1655 (k = 3), 0.069 (k = 4), 0.069 (k = 5), 0.0492 (k = 6), 0.0492 (k = 7), 0.0492 (k = 8), 0.0492 (k = 9) and 0.0492 (k = 10).
(e) OBGYN/Urology: 0.285 (k = 1), 0.1432 (k = 2), 0.1432 (k = 3), 0.1432 (k = 4), 0.1432 (k = 5), 0.1432 (k = 6), 0.1432 (k = 7), 0.1432 (k = 8), 0.1432 (k = 9) and 0.1432 (k = 10).
(f) Respiratory: 0.0004 (k = 1), 0.0004 (k = 2), 0.0004 (k = 3), 0.0004 (k = 4), 0.0004 (k = 5), 0.0006 (k = 6), 0.0006 (k = 7), 0.0006 (k = 8), 0.0006 (k = 9) and 0.0006 (k = 10).
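
For readers who want to see how a two-sided bootstrap test with FDR correction might be computed for top-k accuracy, the following is a minimal Python sketch and not the authors' code. Several points are assumptions: that the two arms being compared were scored on the same scenarios (so cases are resampled as pairs), that n = 10,000 denotes the number of bootstrap resamples, and that the FDR correction is the Benjamini-Hochberg procedure, which the caption does not name. The function names `hit_at_k`, `paired_bootstrap_p` and `benjamini_hochberg` are hypothetical.

```python
import random


def hit_at_k(rank, k):
    """Return 1 if the ground-truth diagnosis appears within the top k, else 0.

    `rank` is the 1-based position of the ground truth in a DDx list,
    or None when the ground truth is absent from the list.
    """
    return 1 if rank is not None and rank <= k else 0


def paired_bootstrap_p(ranks_a, ranks_b, k, n_resamples=10_000, seed=0):
    """Two-sided paired bootstrap P value for a difference in top-k accuracy.

    Assumption: both arms were scored on the same scenarios, so cases are
    resampled as pairs. The per-case differences are centred on zero to
    approximate the null hypothesis of no difference.
    """
    if len(ranks_a) != len(ranks_b) or not ranks_a:
        raise ValueError("need two non-empty rank lists of equal length")
    rng = random.Random(seed)
    n = len(ranks_a)
    diffs = [hit_at_k(a, k) - hit_at_k(b, k) for a, b in zip(ranks_a, ranks_b)]
    observed = sum(diffs) / n
    centred = [d - observed for d in diffs]
    extreme = 0
    for _ in range(n_resamples):
        resampled = sum(centred[rng.randrange(n)] for _ in range(n)) / n
        # Small tolerance guards against floating-point ties.
        if abs(resampled) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order.

    Assumption: the caption says only "FDR correction"; BH is one common choice.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under these assumptions, one panel's adjusted values could be obtained with `benjamini_hochberg([paired_bootstrap_p(ranks_a, ranks_b, k) for k in range(1, 11)])`. Whether the correction was applied across k within each specialty or across all comparisons at once is not stated in the caption, so that grouping is also an assumption. A step-up procedure of this kind tends to produce runs of identical adjusted values like those listed above, although identical raw P values would have the same effect.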