Fig. 2: Calibration of output probabilities in implemented ML models.

Error bands show 95% confidence intervals for estimates of the mean obtained by bootstrapping (n = 10). We show calibration curves for models trained on the development set: a decision tree, b logistic regression, c random forest, and d eXtreme Gradient Boosting (XGBoost). A perfectly calibrated model would have a 1:1 relationship between the fraction of positive labels and the mean predicted probabilities (i.e., it would overlay the diagonal line). The Durbin-Watson statistic, DW, tests for correlation in the residuals; a value close to 2 indicates that the residuals are uncorrelated, implying good linear behavior.
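As a minimal sketch (not the authors' implementation), the Python below shows how such a panel could be computed: predicted probabilities are binned, the fraction of positive labels per bin is bootstrapped (n = 10 resamples) to obtain 95% confidence intervals for the mean, and the Durbin-Watson statistic is evaluated on the residuals from the 1:1 diagonal. The arrays `y_true` and `y_prob` (development-set labels and predicted probabilities), the function name `calibration_with_ci`, and the choice of 10 uniform probability bins are assumptions for illustration.

```python
# Illustrative sketch only; names and binning choices are assumptions.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

def calibration_with_ci(y_true, y_prob, n_bins=10, n_boot=10, seed=0):
    """Binned calibration curve with bootstrapped 95% CIs for the mean
    fraction of positive labels in each predicted-probability bin."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    keep = [b for b in range(n_bins) if np.any(bins == b)]      # ignore empty bins

    mean_prob = np.array([y_prob[bins == b].mean() for b in keep])
    boot = np.full((n_boot, len(keep)), np.nan)
    for i in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))    # resample with replacement
        yb, bb = y_true[idx], bins[idx]
        boot[i] = [yb[bb == b].mean() if np.any(bb == b) else np.nan
                   for b in keep]

    frac_pos = np.nanmean(boot, axis=0)                         # mean over bootstrap resamples
    ci95 = 1.96 * np.nanstd(boot, axis=0, ddof=1) / np.sqrt(n_boot)
    dw = durbin_watson(frac_pos - mean_prob)                    # ~2 => residuals uncorrelated
    return mean_prob, frac_pos, ci95, dw
```

Applied to the predicted probabilities of each of the four fitted models, a routine of this kind would yield one panel's calibration curve, error band, and DW value.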