Fig. 3: Prediction performance of CoVFit for unknown, future variants. | Nature Communications

From: A protein language model for exploring viral fitness landscapes

a Strategy for evaluating prediction performance on future variants. Model instances, referred to as CoVFitPast, were trained on variant data prior to a specified cutoff date (e.g., January 31, 2022). Prediction performance for future variants was then assessed using data from variants that emerged after this date.

b Number of sequences from each clade in the past datasets with specific cutoff dates.

c Fitness predictions for future (gray) and past (light gray) variants in the dataset with a cutoff date of February 28, 2022. Points represent results for each genotype, calculated as average values across countries and five-fold predictions. A dashed line with a slope of 1 and an intercept of 0 is included.

d Fitness predictions for future variants, with colors indicating Nextclade clade classifications. In addition to the dashed line with a slope of 1 and an intercept of 0, a gray estimated regression line, based on mean prediction values, is displayed.

e Scatter plot based on (d) but colored according to the minimum amino acid distance from variants in the past data.

f Predicted fitness of genotypes within each Nextclade clade. Each clade’s distribution (violin) and median value (dot) are shown. Individual panels display results for datasets with different cutoff dates. Clades present in the past data are separated by a dashed vertical line from those absent in the past data. Additionally, the median observed fitness value of each clade is represented by a heatmap on the left side.

g Comparison of prediction performance metrics across methods, including Spearman’s correlation score, R-squared value, mean absolute error (MAE), and estimated regression slope.

Source data are provided as a Source Data file.
