Figure 2: Significance of predictions.
From: Prediction of human population responses to toxic compounds by a collaborative competition

(a–d) Submissions are compared with the null hypothesis for subchallenge 1 (a,b) and subchallenge 2 (c,d). For each scoring metric (Pearson correlation (a) and pCi (b) for subchallenge 1; Pearson correlation (c) and Spearman correlation (d) for subchallenge 2), the performance of each submission is computed compound by compound and then averaged across compounds. The null distribution is generated from random predictions obtained by sampling, compound by compound, from the training set. (e,f) Performance of individual predictions (first boxplot, in red) is compared with the performance of randomly aggregated predictions (wisdom of the crowds, in green) and with the aggregation of all predictions (last bar, in black). Performance is reported as the average Pearson correlation between predicted and measured values, computed separately for each compound. Predictions were aggregated by averaging. To ensure that only independent predictions were aggregated, each team contributed a single submission, taken as the average of all predictions submitted by that team.
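The scoring and aggregation steps described in this caption can be sketched in a few lines of NumPy. This is an illustrative sketch, not the challenge's scoring code. The array layout (rows are samples, columns are compounds), the function names, the number of random draws and the use of sampling with replacement are all assumptions made for the example.

```python
import numpy as np

def mean_per_compound_pearson(pred, meas):
    """Pearson r computed separately for each compound (column), then averaged."""
    pred = np.asarray(pred, dtype=float)
    meas = np.asarray(meas, dtype=float)
    pc = pred - pred.mean(axis=0)  # center each compound's predictions
    mc = meas - meas.mean(axis=0)  # center each compound's measurements
    r = (pc * mc).sum(axis=0) / np.sqrt((pc ** 2).sum(axis=0) * (mc ** 2).sum(axis=0))
    return r.mean()

def null_scores(train, meas, n_draws=1000, seed=0):
    """Scores of random predictions sampled, compound by compound, from training values.

    Assumption: values are drawn with replacement from the same compound's
    column of the training matrix; n_draws is a hypothetical choice.
    """
    train = np.asarray(train, dtype=float)
    meas = np.asarray(meas, dtype=float)
    rng = np.random.default_rng(seed)
    n_test, n_compounds = meas.shape
    scores = np.empty(n_draws)
    for i in range(n_draws):
        # One random training row index per test sample and compound
        idx = rng.integers(0, train.shape[0], size=(n_test, n_compounds))
        random_pred = np.take_along_axis(train, idx, axis=0)
        scores[i] = mean_per_compound_pearson(random_pred, meas)
    return scores

def aggregate_by_team(submissions, teams):
    """Average each team's submissions first, then average across teams."""
    team_means = [
        np.mean([s for s, t in zip(submissions, teams) if t == team], axis=0)
        for team in sorted(set(teams))
    ]
    return np.mean(team_means, axis=0)
```

Averaging within each team before averaging across teams keeps a team with many submissions from dominating the aggregate, which is the independence consideration the caption mentions. A submission's score from `mean_per_compound_pearson` can then be placed against the values returned by `null_scores` to judge how far it sits from random predictions.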