Figure 3 | Scientific Reports

Figure 3

From: An integrated pipeline for prediction of Clostridioides difficile infection


Comparison of upsampling strategies in the MyCode testing dataset, using F1 score as the performance metric. The SMOTE function oversampled the minority class (a rare event) using bootstrapping (perc.over = 100) and k-nearest neighbors (k = 5) to synthetically create additional observations of that event, and undersampled the majority class (perc.under = 200). For each case in the original dataset belonging to the minority class, perc.over/100 new examples of that class are created. The ROSE function oversampled the minority class without undersampling the majority class. For both oversampling strategies, the case:control ratio in the training dataset was set to 1:1. F1 score, the harmonic mean of precision and recall, was selected to assess the performance of oversampling. Summary of the sample sizes for the training dataset, with or without upsampling (SMOTE or ROSE), and for the testing dataset, stratified by genetic data availability. ROSE upsampled cases (n = 782) to 9931 so that the case:control ratio was 1:1 with controls (n = 9931) in the training dataset. SMOTE upsampled cases (n = 782) to 1564 so that the case:control ratio was 1:1 with controls (n = 1564) in the training dataset.
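
The sample-size arithmetic and the F1 metric described in the caption can be checked with a minimal sketch. It assumes that perc.over and perc.under follow the usual SMOTE convention (perc.over/100 synthetic cases per original minority case, and perc.under/100 majority cases retained per synthetic case). The function names `smote_sizes` and `f1_score` are hypothetical helpers written for illustration; they are not part of the authors' pipeline or of any SMOTE package.

```python
def smote_sizes(n_minority, perc_over=100, perc_under=200):
    """Return (cases, controls) after SMOTE-style resampling.

    Assumed convention: perc_over/100 synthetic cases are created per
    original minority case, and perc_under/100 majority cases are kept
    per synthetic case.
    """
    n_synthetic = n_minority * perc_over // 100
    n_cases = n_minority + n_synthetic
    n_controls = n_synthetic * perc_under // 100
    return n_cases, n_controls


def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 782 original cases with perc.over = 100 and perc.under = 200
cases, controls = smote_sizes(782)
print(cases, controls)  # 1564 1564, a 1:1 case:control ratio
```

Under these assumptions the 782 training cases double to 1564 and are matched by 1564 sampled controls, which agrees with the SMOTE counts reported in the caption.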
