Fig. 2: Predictive performance when using the past viral genomes as inputs.
From: Hidden challenges in evaluating spillover risk of zoonotic viruses using machine learning models

A Dataset division for five-fold stratified cross-validation. The folds were constructed so that each preserves approximately the same proportions of infectivity labels and viral genera as the full dataset. B Differences in the inputs and outputs of each model (see “Methods”). C Comparison of precision-recall area under the curve (PR-AUC) scores when past viral genomes were used as inputs. Box plots show the median (center line), interquartile range (IQR; box), and data range within 1.5 × IQR (whiskers). The median PR-AUC score is shown on the right side of the plot. Each dot corresponds to a viral family (n = 26). D Changes in the PR-AUC score of each model according to the length of the input sequences. Each line corresponds to a viral family (n = 26). The humanVirusFinder model does not accept 250 bp inputs, so no result is shown for it at that length.
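
The fold construction in panel A and the PR-AUC score in panels C and D can be illustrated with a minimal sketch. It is not the authors' pipeline. The function names `stratified_folds` and `average_precision` are hypothetical, the strata are assumed to be (infectivity label, genus) pairs, and PR-AUC is approximated here by step-wise average precision, which may differ from the estimator used in the paper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, genera, k=5, seed=0):
    """Assign each sample to one of k folds, stratifying on (label, genus).

    Hypothetical helper: samples sharing the same infectivity label and
    genus are dealt round-robin across folds, so every fold keeps roughly
    the same label and genus proportions as the full dataset.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, key in enumerate(zip(labels, genera)):
        strata[key].append(i)

    fold_of = [None] * len(labels)
    offset = 0  # continue the round-robin across strata to balance fold sizes
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        for j, i in enumerate(members):
            fold_of[i] = (offset + j) % k
        offset += len(members)
    return fold_of


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Assumption: average precision stands in for PR-AUC. Tied scores are
    ranked in sort order rather than grouped.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Toy data: 20 samples, alternating labels, two hypothetical genera.
labels = [1, 0] * 10
genera = ["genus_a"] * 10 + ["genus_b"] * 10
folds = stratified_folds(labels, genera, k=5)

# Toy scores for a single viral family.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In panel C, a score of this kind would be computed once per viral family and per model, and the median over the 26 families is the value printed beside each box plot.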