Fig. 2

Performance of random forest models trained to predict different hosts from different datasets. (A) Balanced accuracy of models trained on DRSCU at different train-test split ratios; the models outperform blind guessing (0.5 accuracy) even at a training-data ratio as low as 0.05. (B) Model performance (balanced accuracy and F1 score) and ROC curves for models trained on different datasets: DR (DRSCU), DRT (DRSCU−Taxonomy), DRTC (DRSCU−Taxonomy−CDS Length). ROC-AUC scores are shown.
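
As a reading aid for the 0.5 baseline in panel (A), the following minimal sketch computes balanced accuracy as the mean of per-class recall. It assumes a binary host/non-host setup, which the caption does not state explicitly. The function `balanced_accuracy` and the label counts are hypothetical illustrations, not the authors' code or data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall (hypothetical helper)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # all samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Made-up, imbalanced labels: 90 non-host (0) and 10 host (1) samples.
y_true = [0] * 90 + [1] * 10

# A classifier that ignores its input and always predicts the majority class.
always_majority = [0] * 100

plain_accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(plain_accuracy)                              # 0.9
print(balanced_accuracy(y_true, always_majority))  # 0.5
```

Under this assumption, an uninformative classifier scores 0.5 on balanced accuracy regardless of class imbalance, whereas plain accuracy would reward it for predicting the majority class.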