Table 2 Performance of PhenoBrain on rare disease datasets

From: A phenotype-based AI pipeline outperforms human experts in differentially diagnosing rare diseases using EHRs

 

| Metrics | Public test set (95% CI) | PUMCH-L (95% CI) | PUMCH-ADM (95% CI) | Average |
|---|---|---|---|---|
| **PhenoBrain (ensemble method)** | | | | |
| Top-1 recall | 0.304 (0.273–0.334) | 0.315 (0.286–0.344) | 0.453 (0.347–0.573) | 0.357 (0.317–0.397) |
| Top-3 recall | 0.483 (0.450–0.517) | 0.468 (0.436–0.499) | 0.587 (0.480–0.693) | 0.513 (0.471–0.552) |
| Top-10 recall | 0.640 (0.608–0.672) | 0.630 (0.599–0.660) | 0.693 (0.587–0.800) | 0.654 (0.616–0.691) |
| Median rank | 4.0 | 4.0 | 2.0 | 3.3 |
| **Best of 12 benchmark methods** | | | | |
| Top-1 recall | 0.257 (0.229–0.286) | 0.294 (0.265–0.322) | 0.307 (0.200–0.413) | 0.261 (0.226–0.297) |
| Top-3 recall | 0.424 (0.391–0.457) | 0.424 (0.394–0.454) | 0.467 (0.360–0.573) | 0.441 (0.400–0.480) |
| Top-10 recall | 0.593 (0.561–0.625) | 0.585 (0.554–0.615) | 0.653 (0.547–0.760) | 0.618 (0.580–0.656) |
| Median rank | 6.0 | 6.0 | 4.0 | 5.3 |

1. Top-1, Top-3, and Top-10 recall rates and median ranks achieved by PhenoBrain, compared with the best results among the 12 benchmark methods, using the complete annotation of 9,260 rare diseases on three rare disease datasets: the Public Test Set, PUMCH-L, and PUMCH-ADM. For PUMCH-ADM, HPO terms were manually extracted. Note that the best-performing benchmark method differs across the individual datasets. Complete results are provided in Supplementary Tables 5, 12, 14 and 15.