Fig. 3: Prediction results for the top three models. | Communications Biology

From: Statistical modeling of SARS-CoV-2 substitution processes: predicting the next variant

We use the top three Poisson and Negative Binomial models from Fig. 1 for prediction on the test dataset. The first three rows show results for the entire genome, and the last three show results for the spike protein only. Results are shown separately for predicting non-synonymous amino acid substitutions (left half) and synonymous substitutions (right half; these results are not discussed in the text). The first column in each table quarter shows the area under the ROC curve (AUC) for the corresponding prediction task and modeling approach. The top-scoring model for every (substitution type, locus, approach) combination is highlighted. Overall, the AUC scores are high, showing that the models successfully predicted many of the substitutions. The second and third columns of each quarter show the 3% lift scores of each model versus the random base model and the more elaborate base model, respectively (see text and Methods). The top models significantly outperform both baselines, underlining the benefit of our approach over more naive statistical predictions. The model analyzed further in the text (the third Poisson model for non-synonymous amino acid substitutions) is framed in red.
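
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how an AUC and a top-3% lift score could be computed from per-site prediction scores. It uses the common textbook definitions, which are an assumption here: the paper's exact lift definition is given in its Methods and may differ in detail. The function names, the `fraction` parameter, and the toy data are hypothetical and serve illustration only.

```python
from typing import Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_fraction_hits(scores: Sequence[float], labels: Sequence[int],
                      fraction: float = 0.03) -> Tuple[int, int]:
    """Return (observed substitutions among the top-scored sites, number of
    top-scored sites), where the top set is the given fraction of all sites."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]), k


def lift_vs_random(scores: Sequence[float], labels: Sequence[int],
                   fraction: float = 0.03) -> float:
    """Assumed definition: hit rate in the top fraction divided by the
    overall rate of substitutions (what a random ranking would achieve)."""
    hits, k = top_fraction_hits(scores, labels, fraction)
    base_rate = sum(labels) / len(labels)
    return (hits / k) / base_rate


def lift_vs_model(model_scores: Sequence[float], base_scores: Sequence[float],
                  labels: Sequence[int], fraction: float = 0.03) -> float:
    """Assumed definition: hits of the model in its top fraction divided by
    hits of a base model in its own top fraction."""
    model_hits, _ = top_fraction_hits(model_scores, labels, fraction)
    base_hits, _ = top_fraction_hits(base_scores, labels, fraction)
    if base_hits == 0:
        return float("inf")
    return model_hits / base_hits


# Toy data (hypothetical): 100 sites, the first 10 of which were substituted.
toy_labels = [1] * 10 + [0] * 90
# A model that ranks every substituted site above every other site.
toy_model_scores = [float(100 - i) for i in range(100)]
# A weaker base model whose top three sites contain only one substitution.
toy_base_scores = [0.0] * 100
toy_base_scores[0], toy_base_scores[10], toy_base_scores[11] = 3.0, 2.0, 1.0
```

On the toy data, `lift_vs_random(toy_model_scores, toy_labels)` compares the hit rate among the 3 top-ranked sites with the overall 10% substitution rate, and `lift_vs_model(toy_model_scores, toy_base_scores, toy_labels)` compares the model's hits with those of the base model in the same top 3%. The pairwise AUC above is quadratic in the number of sites, which is fine for a sketch but would be replaced by a rank-based computation for genome-scale data.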
