Table 2 Performance of PhenoBrain on rare disease datasets

From: A phenotype-based AI pipeline outperforms human experts in differentially diagnosing rare diseases using EHRs

 

| Metrics | Public test set (95% CI) | PUMCH-L (95% CI) | PUMCH-ADM (95% CI) | Average |
|---|---|---|---|---|
| **PhenoBrain (ensemble method)** | | | | |
| Top-1 recall | 0.304 (0.273–0.334) | 0.315 (0.286–0.344) | 0.453 (0.347–0.573) | 0.357 (0.317–0.397) |
| Top-3 recall | 0.483 (0.450–0.517) | 0.468 (0.436–0.499) | 0.587 (0.480–0.693) | 0.513 (0.471–0.552) |
| Top-10 recall | 0.640 (0.608–0.672) | 0.630 (0.599–0.660) | 0.693 (0.587–0.800) | 0.654 (0.616–0.691) |
| Median rank | 4.0 | 4.0 | 2.0 | 3.3 |
| **Best of 12 benchmark methods** | | | | |
| Top-1 recall | 0.257 (0.229–0.286) | 0.294 (0.265–0.322) | 0.307 (0.200–0.413) | 0.261 (0.226–0.297) |
| Top-3 recall | 0.424 (0.391–0.457) | 0.424 (0.394–0.454) | 0.467 (0.360–0.573) | 0.441 (0.400–0.480) |
| Top-10 recall | 0.593 (0.561–0.625) | 0.585 (0.554–0.615) | 0.653 (0.547–0.760) | 0.618 (0.580–0.656) |
| Median rank | 6.0 | 6.0 | 4.0 | 5.3 |

1. Top-1, Top-3, and Top-10 recall rates and median ranks achieved by PhenoBrain, compared with the best results among the 12 benchmark methods, using the complete annotation of 9,260 rare diseases on three rare disease datasets: the Public Test Set, PUMCH-L, and PUMCH-ADM. For PUMCH-ADM, HPO terms were manually extracted. Note that the best-performing benchmark method differs across the individual datasets. Complete results are provided in Supplementary Tables 5, 12, 14 and 15.