Figure 3

Effect of data preprocessing and imputation method on random forest performance. All combinations yield similar performance, with the multivariate imputation methods (k-nearest neighbours (kNN) and missForest (mF)) performing slightly worse than the univariate methods (mean and median imputation). The performances shown were obtained with random forests of 100 trees, using 10 repetitions of 5-fold cross-validation. The selected data sets originally contained 18% missing data and underwent either no preprocessing (‘None’) or outlier removal (‘Outliers’) prior to model development.
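
The evaluation protocol summarised in this caption can be illustrated with a minimal Python sketch of 10 repetitions of 5-fold cross-validation combined with univariate (mean or median) imputation. The toy data, the helper names (`impute_fit`, `impute_apply`, `repeated_kfold`) and the choice to compute fill values on the training folds only are assumptions made for illustration and are not taken from the study. The random forest itself and the kNN and missForest imputers are omitted; a comment marks where the model would be fitted.

```python
import statistics
import random

def impute_fit(rows, strategy="mean"):
    """Compute one fill value per column from the observed (non-None) entries."""
    agg = statistics.mean if strategy == "mean" else statistics.median
    n_cols = len(rows[0])
    return [agg([r[j] for r in rows if r[j] is not None]) for j in range(n_cols)]

def impute_apply(rows, fill):
    """Replace None entries with the per-column fill values."""
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # new random partition for each repetition
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for f in folds[:k] + folds[k + 1:] for i in f]
            yield train, test

# Hypothetical toy data: 50 samples x 4 features, with exactly 18% of cells missing.
n_rows, n_cols = 50, 4
X = [[float((i * 3 + j * 5) % 17) for j in range(n_cols)] for i in range(n_rows)]
for i in range(n_rows):
    for j in range(n_cols):
        if ((i * n_cols + j) * 7) % 50 < 9:  # deterministic missingness pattern
            X[i][j] = None

missing_fraction = sum(v is None for r in X for v in r) / (n_rows * n_cols)

n_evals = 0
for strategy in ("mean", "median"):
    for train, test in repeated_kfold(n_rows):
        # Assumption: fill values come from the training folds only,
        # so no information from the held-out fold leaks into imputation.
        fill = impute_fit([X[i] for i in train], strategy)
        X_train = impute_apply([X[i] for i in train], fill)
        X_test = impute_apply([X[i] for i in test], fill)
        # A 100-tree random forest would be fitted on X_train and scored on X_test here.
        n_evals += 1
```

Each imputation strategy is evaluated on 50 train/test splits (10 repetitions × 5 folds), and the per-split scores would then be aggregated to produce the distributions shown in the figure.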