Fig. 2: Model evaluation and validation of positive-unlabeled random forest models.

a Score distributions of unlabeled proteins predicted by the tree-filtered model. The putative positive (red) and negative (blue) distribution groups were calculated by fitting a two-component Gaussian mixture model. A receiver operating characteristic (ROC) curve was calculated based on the putative distributions, and the area under the ROC curve (AUROC) was 0.99. b Evaluation of known antigen scores predicted by the tree-filtered model. Points represent known antigens. The x-axis shows the scaled ranks of the 52 known antigens. The y-axis shows the percentile ranks (PR) of known antigens among all P. falciparum proteins. The dashed line indicates the 50th percentile rank. Gradient colors show probability scores. The area under the ranking curve was 0.90. c Distribution of mean differences in scores after known antigen label removal for the final tree-filtered ensemble. Dots represent the 48 validation iterations. The box plot shows the median with the first and third quartiles. The lower and upper whiskers extend 1.5× the interquartile range from the first and third quartiles, respectively. The gray dashed line marks a mean difference of zero. d Plot of overlapping antigens across the top 200 candidate sets generated from the validation models. The x-axis shows the number of validation models in reverse order, and the y-axis indicates the number of candidate antigens in agreement with the corresponding number of models. Line colors show data from non-tree-filtered (yellow) and tree-filtered (red) validation models, respectively.
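
The agreement count plotted in panel d can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `agreement_curve`, the toy protein identifiers, and the reading of "in agreement with the corresponding number of models" as "present in the top set of at least k models" are all assumptions made for illustration.

```python
from collections import Counter


def agreement_curve(top_sets):
    """Count candidates found in the top set of at least k models.

    Assumption: "agreement with k models" means membership in at least k
    of the top-candidate sets. Returns (k, count) pairs with k running
    from the total number of models down to 1 (the reverse order used on
    the x-axis of panel d).
    """
    n_models = len(top_sets)
    # Number of models that placed each candidate in their top set.
    hits = Counter(protein for s in top_sets for protein in set(s))
    curve = []
    for k in range(n_models, 0, -1):
        curve.append((k, sum(1 for c in hits.values() if c >= k)))
    return curve


# Toy data (hypothetical): three models with top-3 sets instead of
# 48 validation models with top-200 sets.
toy_sets = [
    {"P1", "P2", "P3"},
    {"P1", "P2", "P4"},
    {"P1", "P5", "P6"},
]
print(agreement_curve(toy_sets))  # [(3, 1), (2, 2), (1, 6)]
```

In the toy data, one candidate is shared by all three models, two by at least two models, and six appear in at least one top set, so the count can only grow as the required number of agreeing models decreases.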