Fig. 2: An illustration of the training and testing dataset for prediction. | Communications Biology


From: Statistical modeling of SARS-CoV-2 substitution processes: predicting the next variant


Our training data consists of a phylogenetic tree reconstruction based on sequences released before February 8th, 2021 (green dots). The test data consists of sequences released between February 10th and April 10th, 2021 (gray dots); for these, we did not infer a phylogeny or rely on any other phylogenetic information. To evaluate our ability to predict new substitutions, we considered only sites at which no substitution had occurred in the training data. The table in the figure gives examples of how sites are assigned to the test dataset. At sites 1, 5, and 6, the base is not constant in the training dataset, so these sites are excluded from the test dataset. At sites 4 and 9, only one sequence in the test set shows a base different from the training sequences; these sites are also excluded, to avoid counting sequencing errors as substitutions. At sites 2 and 7, the base is constant in both the training and the test dataset, making them negative examples, whereas sites 3 and 8 are positive examples, where a confirmed substitution occurred in the test period.
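The site-labelling rules described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `classify_site`, the `min_support` parameter, and the toy alignment columns are hypothetical, and the caption states only that a single deviating test sequence is not enough, so treating two or more deviating sequences as a "confirmed" substitution is an assumption.

```python
def classify_site(train_bases, test_bases, min_support=2):
    """Label one alignment site as 'excluded', 'negative', or 'positive'.

    train_bases / test_bases: the bases observed at this site across the
    training and test sequences (a string or a list of characters).
    min_support: assumed number of deviating test sequences needed to
    call a substitution confirmed (hypothetical parameter).
    """
    # Sites that already vary in the training data are not evaluated.
    if len(set(train_bases)) != 1:
        return "excluded"

    reference = train_bases[0]
    n_deviating = sum(1 for base in test_bases if base != reference)

    if n_deviating == 0:
        return "negative"   # constant in both training and test data
    if n_deviating < min_support:
        return "excluded"   # lone deviation, possibly a sequencing error
    return "positive"       # substitution supported by several sequences


# Toy columns (made up for illustration, not taken from the figure).
toy_sites = {
    "varies_in_training": ("AAGA", "AAAA"),
    "constant_everywhere": ("CCCC", "CCCC"),
    "single_deviation": ("TTTT", "TTCT"),
    "supported_substitution": ("GGGG", "GTTG"),
}
labels = {name: classify_site(train, test) for name, (train, test) in toy_sites.items()}
```

Exposing the support threshold as a parameter makes the trade-off explicit: a higher value discards more potential sequencing errors but also drops rare genuine substitutions from the positive examples.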
