Fig. 2

Schematic overview of the data-handling procedure. About 10% of the data were held out for testing (\({\cal D}^{{\mathrm{TEST}}}\)) and remained unseen during the calibration and training phases. The remaining 90% of the data were used for calibration and training (\({\cal D}^{{\mathrm{TRAIN}}}\)). For DW, SA, and MLR, the calibration step tuned the hyper-parameter λ using 100-fold Monte Carlo cross-validation. For XGB, several other hyper-parameters were tuned (see Methods for a list). In the training step, a procedure similar to bagging (bootstrap aggregating) was used: 2% and 10% of the data were randomly sampled with replacement 50 times to give 50 training instances. In the testing step, the area under the precision-recall curve (AUPRC) of the best-performing weights from each of the 50 training instances was computed on \({\cal D}^{{\mathrm{TEST}}}\), which was unseen during training and calibration, to assess the generalization of classification performance. The calibration, training, and testing procedure was identical for ranking tasks, except that Kendall’s τ was used as the performance metric instead of the AUPRC.
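
A minimal sketch of this hold-out / calibration / bagging / testing pipeline, assuming scikit-learn and SciPy with synthetic data. The classifier (LogisticRegression as a stand-in for DW/SA/MLR/XGB), the λ grid, the 80/20 inner split, and the 10% bootstrap fraction are illustrative assumptions, not the models or exact settings used in the paper:

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out ~10% of the data as D_TEST, unseen during calibration and training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0, stratify=y
)

# Calibration: tune the regularisation hyper-parameter (lambda) with
# 100-fold Monte Carlo cross-validation (repeated random splits of D_TRAIN).
mc_cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
best_lam, best_score = None, -np.inf
for lam in (0.01, 0.1, 1.0, 10.0):          # illustrative lambda grid
    fold_scores = []
    for cal_idx, val_idx in mc_cv.split(X_train):
        clf = LogisticRegression(C=1.0 / lam, max_iter=1000)
        clf.fit(X_train[cal_idx], y_train[cal_idx])
        fold_scores.append(
            average_precision_score(
                y_train[val_idx], clf.predict_proba(X_train[val_idx])[:, 1]
            )
        )
    if np.mean(fold_scores) > best_score:
        best_lam, best_score = lam, np.mean(fold_scores)

# Training: bagging-like step -- 50 random resamples with replacement
# (here 10% of D_TRAIN each), each yielding one trained instance.
n_boot = int(0.10 * len(y_train))
models = []
for i in range(50):
    Xb, yb = resample(X_train, y_train, replace=True, n_samples=n_boot, random_state=i)
    models.append(LogisticRegression(C=1.0 / best_lam, max_iter=1000).fit(Xb, yb))

# Testing: AUPRC of each of the 50 instances on the held-out D_TEST;
# for ranking tasks, Kendall's tau would replace the AUPRC.
auprc = [average_precision_score(y_test, m.predict_proba(X_test)[:, 1]) for m in models]
tau, _ = kendalltau(y_test, models[0].predict_proba(X_test)[:, 1])
print(f"mean test AUPRC over 50 instances: {np.mean(auprc):.3f}, example tau: {tau:.3f}")
```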