Abstract
Protein language models (PLMs) capture features of protein three-dimensional structure from amino acid sequences alone, without requiring multiple sequence alignments (MSAs). The concepts of grammar and semantics from natural language have been proposed as a way to capture functional properties of proteins. Here, we investigate how these representations enable assessment of variation due to mutation. Applied to the SARS-CoV-2 spike protein via in silico deep mutational scanning (DMS), the PLM ESM-2 captures evolutionary constraints directly from sequence context, recapitulating what normally requires MSA data. Whereas other state-of-the-art methods require protein structures or multiple sequences for training, we show what can be accomplished using an unmodified pretrained PLM. For SARS-CoV-2 variants across the pandemic, we demonstrate that ESM-2 representations encode the evolutionary history between variants, as well as the distinct nature of variants of concern upon their emergence, associated with shifts in receptor binding and antigenicity. ESM-2 likelihoods can also identify epistatic interactions among sites in the protein. Our results affirm that PLMs like ESM-2 are broadly useful for variant-effect prediction, including for unobserved changes, and can be applied to understand novel viral pathogens, with the potential to be applied to any protein sequence, pathogen or otherwise.
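As an illustration of the in silico DMS idea, the sketch below scores every single amino acid substitution in a sequence as a log-likelihood ratio between the mutant and the wild-type residue. This is one common scoring choice and not necessarily the exact procedure used in this work. The per-position probability tables are a hypothetical stand-in for the output of a PLM such as ESM-2, and the names `in_silico_dms` and `toy_distribution` are introduced here for illustration only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def in_silico_dms(sequence, position_probs):
    """Score each single substitution as log p(mutant) - log p(wild type).

    position_probs[i] maps each amino acid to its probability at position i.
    In practice these would come from a protein language model; here they
    are an assumed input, not real model output.
    """
    if len(sequence) != len(position_probs):
        raise ValueError("need one probability table per sequence position")
    scores = {}
    for i, (wt, probs) in enumerate(zip(sequence, position_probs)):
        for mut in AMINO_ACIDS:
            if mut == wt:
                continue  # skip the wild-type residue itself
            label = f"{wt}{i + 1}{mut}"  # 1-based label, e.g. D614G
            scores[label] = math.log(probs[mut]) - math.log(probs[wt])
    return scores


def toy_distribution(favoured, weight=0.62):
    """Hypothetical distribution that favours one residue (toy data only)."""
    rest = (1.0 - weight) / (len(AMINO_ACIDS) - 1)
    return {aa: (weight if aa == favoured else rest) for aa in AMINO_ACIDS}


# Toy example: a three-residue sequence whose wild-type residues are favoured.
toy_sequence = "MFV"
toy_probs = [toy_distribution(aa) for aa in toy_sequence]
toy_scores = in_silico_dms(toy_sequence, toy_probs)
```

Under this scoring, a negative value means the substitution is less likely than the wild-type residue in that sequence context, and a positive value means it is more likely.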
Data availability
The SARS-CoV-2 reference sequence Wuhan-Hu-1 (GenBank accession NC_045512.2) was used for the in silico DMS. The SARS-CoV-2 sequences for each Pango lineage are from GISAID (https://doi.org/10.55876/gis8.240620pm). The SARS-CoV-2 haplotype spike sequences from Fig. 7 are also from GISAID (https://doi.org/10.55876/gis8.240621ma). Seven of the Sarbecovirus sequences are from GISAID (https://doi.org/10.55876/gis8.241002yd) and 58 are from GenBank (accession numbers can be found in Supplementary Data 2). The EVEscape benchmarking data (Supplementary Fig. 10) were collected from the EVEscape GitHub repository (commit 8238e4f, https://github.com/OATML-Markslab/EVEscape/blob/main/results/summaries_with_scores/full_spike_evescape.csv, https://github.com/OATML-Markslab/EVEscape/blob/main/results/summaries_with_gisaid/spike_dist_one_scores_gisaid.csv and https://github.com/OATML-Markslab/EVEscape/blob/main/data/gisaid/single_mutant_count_by_month.csv). DCA and IND mutability scores were collected from the DCA_SARS-CoV-2 GitHub repository (commit aeffe23, https://github.com/GiancarloCroce/DCA_SARS-CoV-2/blob/main/data/data_dca_proteome.csv). Supplementary Fig. 10A uses the variants listed in the spike_dist_one_scores_gisaid.csv file (https://github.com/OATML-Markslab/EVEscape/blob/main/results/summaries_with_gisaid/spike_dist_one_scores_gisaid.csv), but calculates the mutations from a representative set of sequences from GISAID (https://doi.org/10.55876/gis8.240620pm) so that all mutations (not just single nucleotide changes) could be identified. Supplementary Fig. 10B uses the data from single_mutant_count_by_month.csv (https://github.com/OATML-Markslab/EVEscape/blob/main/data/gisaid/single_mutant_count_by_month.csv), since this file includes mutation frequency tracking for each month; it is restricted to mutations arising from single nucleotide changes. Epistasis benchmarking data from Innocenti et al. (ref. 30) were downloaded from their supplementary table S1 (https://link.springer.com/article/10.1186/s13059-024-03355-y#Sec19). Two PDB structures were used in the analysis: the 6VXX Spike structure (ref. 42; https://www.rcsb.org/structure/6VXX) and the Spike and ACE-2 simulated structure 6vsb_1_1_1_6vw1 (ref. 66; https://charmm-gui.org/archive/covid19/6vsb_6vw1.pdb).
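The step of calculating mutations from a set of sequences, described above for Supplementary Fig. 10A, can be sketched as a comparison of each variant against the reference. The function below is a minimal illustration that assumes the two amino acid sequences are already aligned to the same length with `-` as the gap character. It reports substitutions only, skips insertions and deletions, and is not the authors' pipeline; the name `call_substitutions` is hypothetical.

```python
def call_substitutions(reference, variant):
    """List amino acid substitutions between two aligned sequences.

    Positions are numbered along the reference (1-based), so gap columns in
    the reference do not advance the position counter. Columns containing a
    gap in either sequence are skipped rather than reported.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    ref_pos = 0
    for ref_aa, var_aa in zip(reference, variant):
        if ref_aa != "-":
            ref_pos += 1  # count only real reference residues
        if ref_aa == "-" or var_aa == "-" or ref_aa == var_aa:
            continue  # indel column or unchanged residue
        mutations.append(f"{ref_aa}{ref_pos}{var_aa}")
    return mutations


# Toy aligned sequences (not real spike data).
example_simple = call_substitutions("MFVFL", "MFAFL")
example_gapped = call_substitutions("MF-VL", "MFAVP")
```

Because the comparison is made at the amino acid level, a substitution is reported regardless of how many nucleotide changes would be needed to produce it.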
Code availability
The analysis code and data are available on GitHub (https://github.com/kieran12lamb/PLM_SARS-CoV-2). The code used to produce the figures, together with the post-processed data, is available on Observable (https://observablehq.com/@cvr-bioinfo/from-a-singlesequence-nature-communications).
References
Markov, P. V. et al. The evolution of SARS-CoV-2. Nat. Rev. Microbiol. 21, 361–379 (2023).
Harari, S. et al. Drivers of adaptive evolution during chronic SARS-CoV-2 infections. Nat. Med. 28, 1501–1508 (2022).
Elliott, P. et al. Exponential growth, high prevalence of SARS-CoV-2, and vaccine effectiveness associated with the Delta variant. Science 374, eabl9551 (2021).
Viana, R. et al. Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa. Nature 603, 679–686 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Hie, B., Zhong, E. D., Berger, B. & Bryson, B. Learning the language of viral evolution and escape. Science 371, 284–288 (2021).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 118, e2016239118 (2021).
Tiberti, M. et al. MutateX: an automated pipeline for in silico saturation mutagenesis of protein structures and structural ensembles. Brief. Bioinform. 23, bbac074 (2022).
Dadonaite, B. et al. Spike deep mutational scanning helps predict success of SARS-CoV-2 clades. Nature 631, 617–626 (2024).
Lok, S.-M. An NTD supersite of attack. Cell Host Microbe 29, 744–746 (2021).
Marquet, C. et al. Embeddings from protein language models predict conservation and variant effects. Hum. Genet. 141, 1629–1647 (2022).
Cerutti, G. et al. Potent SARS-CoV-2 neutralizing antibodies directed against spike N-terminal domain target a single supersite. Cell Host Microbe 29, 819–833.e7 (2021).
Cui, Z. et al. Structural and functional characterizations of infectivity and immune evasion of SARS-CoV-2 Omicron. Cell 185, 860–871.e13 (2022).
Liu, Z. et al. Identification of SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. Cell Host Microbe 29, 477–488.e4 (2021).
McCallum, M. et al. N-terminal domain antigenic mapping reveals a site of vulnerability for SARS-CoV-2. Cell 184, 2332–2347.e16 (2021).
Anishchenko, I., Ovchinnikov, S., Kamisetty, H. & Baker, D. Origins of coevolution between residues distant in protein 3D structures. Proc. Natl. Acad. Sci. USA 114, 9122–9127 (2017).
Harms, M. J. & Thornton, J. W. Evolutionary biochemistry: revealing the historical and physical causes of protein properties. Nat. Rev. Genet. 14, 559–571 (2013).
Zhang, Z. et al. Protein language models learn evolutionary statistics of interacting sequence motifs. Proc. Natl. Acad. Sci. USA 121, e2406285121 (2024).
Zhang, J. et al. Structural impact on SARS-CoV-2 spike protein by D614G substitution. Science 372, 525–530 (2021).
Schröder, S. et al. Characterization of intrinsic and effective fitness changes caused by temporarily fixed mutations in the SARS-CoV-2 spike E484 epitope and identification of an epistatic precondition for the evolution of E484A in variant Omicron. Virol. J. 20, 257 (2023).
Peacock, T. P. et al. The altered entry pathway and antigenic distance of the SARS-CoV-2 Omicron variant map to separate domains of spike protein. Preprint at https://doi.org/10.1101/2021.12.31.474653 (2022).
Yang, K. et al. Structure-based design of a SARS-CoV-2 Omicron-specific inhibitor. Proc. Natl. Acad. Sci. USA 120, e2300360120 (2023).
Wang, P. et al. Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7. Nature 593, 130–135 (2021).
Harvey, W. T. et al. SARS-CoV-2 variants, spike mutations and immune escape. Nat. Rev. Microbiol. 19, 409–424 (2021).
Korber, B. et al. Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus. Cell 182, 812–827.e19 (2020).
Hou, Y. J. et al. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo. Science 370, 1464–1468 (2020).
McCallum, M. et al. Structural basis of SARS-CoV-2 Omicron immune evasion and receptor engagement. Science 375, 864–868 (2022).
Moulana, A. et al. Compensatory epistasis maintains ACE2 affinity in SARS-CoV-2 Omicron BA.1. Nat. Commun. 13, 7011 (2022).
Furnon, W. et al. Phenotypic evolution of SARS-CoV-2 spike during the COVID-19 pandemic. Nat. Microbiol. 10, 77–93 (2025).
Innocenti, G. et al. Real-time identification of epistatic interactions in SARS-CoV-2 from large genome collections. Genome Biol. 25, 228 (2024).
Watanabe, K. & Suzuki, Y. Protein thermostabilization by proline substitutions. J. Mol. Catal. B: Enzymatic 4, 167–180 (1998).
Choi, E. J. & Mayo, S. L. Generation and analysis of proline mutants in protein G. Protein Eng. Des. Sel. 19, 285–289 (2006).
Wong, J. W. H., Ho, S. Y. W. & Hogg, P. J. Disulfide bond acquisition through eukaryotic protein evolution. Mol. Biol. Evol. 28, 327–334 (2011).
Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. The proximal origin of SARS-CoV-2. Nat. Med. 26, 450–452 (2020).
Peacock, T. P. et al. The furin cleavage site in the SARS-CoV-2 spike protein is required for transmission in ferrets. Nat. Microbiol. 6, 899–909 (2021).
Thadani, N. N. et al. Learning from prepandemic data to forecast viral escape. Nature 622, 818–825 (2023).
Rodriguez-Rivas, J., Croce, G., Muscat, M. & Weigt, M. Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes. Proc. Natl. Acad. Sci. USA 119, e2113118119 (2022).
Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).
Allman, B., Vieira, L., Diaz, D. J. & Wilke, C. O. A systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks. J. R. Soc. Interface 22, 20240598 (2025).
Yisimayi, A. et al. Repeated Omicron exposures override ancestral SARS-CoV-2 immune imprinting. Nature 625, 148–156 (2024).
Starr, T. N. et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. Science 377, 420–424 (2022).
Walls, A. C. et al. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell 181, 281–292.e6 (2020).
Mizuguchi, K., Deane, C. M., Blundell, T. L., Johnson, M. S. & Overington, J. P. JOY: protein sequence-structure representation and analysis. Bioinformatics 14, 617–623 (1998).
Carabelli, A. M. et al. SARS-CoV-2 variant biology: immune escape, transmission and fitness. Nat. Rev. Microbiol. 21, 162–177 (2023).
Willett, B. J. et al. SARS-CoV-2 Omicron is an immune escape variant with an altered cell entry pathway. Nat. Microbiol. 7, 1161–1179 (2022).
Hie, B. L., Yang, K. K. & Kim, P. S. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Syst. 13, 274–285.e6 (2022).
Nemet, I. et al. Third BNT162b2 vaccination neutralization of SARS-CoV-2 omicron infection. N. Engl. J. Med. 386, 492–494 (2022).
Lauring, A. S. et al. Clinical severity of, and effectiveness of mRNA vaccines against, COVID-19 from omicron, delta, and alpha SARS-CoV-2 variants in the United States: prospective observational study. BMJ 376, e069761 (2022).
Yang, S. et al. Fast evolution of SARS-CoV-2 BA.2.86 to JN.1 under heavy immune pressure. Lancet Infect. Dis. 24, e70–e72 (2024).
Wei, J. et al. Risk of SARS-CoV-2 reinfection during multiple Omicron variant waves in the UK general population. Nat. Commun. 15, 1008 (2024).
Andrews, N. et al. Covid-19 Vaccine Effectiveness against the Omicron (B.1.1.529) Variant. N. Engl. J. Med. 386, 1532–1546 (2022).
Ito, J. et al. A protein language model for exploring viral fitness landscapes. Nat. Commun. 16, 4236 (2025).
Pan, Y.-F. et al. Predicting the evolutionary and functional landscapes of viruses with a unified nucleotide-protein language model: LucaVirus. Preprint at https://doi.org/10.1101/2025.06.14.659722 (2025).
He, Y. et al. Generalized biological foundation model with unified nucleic acid and protein language. Nat. Mach. Intell. 7, 942–953 (2025).
Livesey, B. J. & Marsh, J. A. Updated benchmarking of variant effect predictors using deep mutational scanning. Mol. Syst. Biol. 19, e11474 (2023).
Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42, 275–283 (2024).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Tegally, H. et al. Emergence of SARS-CoV-2 Omicron lineages BA.4 and BA.5 in South Africa. Nat. Med. 28, 1785–1790 (2022).
Lytras, S. et al. Exploring the natural origins of SARS-CoV-2 in the light of recombination. Genome Biol. Evol. 14, evac018 (2022).
Martin, D. P. et al. Selection analysis identifies clusters of unusual mutational changes in omicron lineage BA.1 that likely impact spike function. Mol. Biol. Evol. 39, msac061 (2022).
Murrell, B. et al. Detecting individual sites subject to episodic diversifying selection. PLOS Genet. 8, e1002764 (2012).
Sweredoski, M. J. & Baldi, P. PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure. Bioinformatics 24, 1459–1460 (2008).
Overington, J., Donnelly, D., Johnson, M. S., Sali, A. & Blundell, T. L. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1, 216–226 (1992).
Boratyn, G. M. et al. Domain enhanced lookup time accelerated BLAST. Biol. Direct 7, 12 (2012).
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).
Woo, H. et al. Developing a fully glycosylated full-length SARS-CoV-2 spike protein model in a viral membrane. J. Phys. Chem. B 124, 7128–7137 (2020).
Acknowledgements
We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. The authors acknowledge funding from the UK Medical Research Council (MRC: MC_UU_12014/12 and MC_UU_00034/5 for D.L.R. and J.H.; MC_UU_00034/6 for D.L.R. and J.G.; MR/V01157X/1 and MR/Y002814/1 for D.L.R.) and a Doctoral Training Programme in Precision Medicine studentship (MR/N013166/1 for K.D.L.). D.L.R. and K.Y. acknowledge funding from the Wellcome Trust (220977/Z/20/Z). F.Y., D.L.R. and K.Y. acknowledge funding from the BBSRC (BB/V016067/1). D.L.R. also acknowledges support from the UK Research and Innovation (UKRI) to the G2P-UK consortium (MR/W005611/1) and G2P2 consortium (MR/Y004205), and the COVID-19 Genomics UK Consortium (COG-UK), which was supported by funding from the MRC, part of UKRI, the UK National Institute of Health and Care Research (MC_PC_19027) and Genome Research Limited, operating as the Wellcome Sanger Institute. K.Y. acknowledges support from Cancer Research UK (EDDPGM-Nov21\100001, DRCMDP-Nov23/100010 and core funding to the CRUK Scotland Institute (A31287)), Prostate Cancer UK (MA-TIA22-001) and EU Horizon 2020 (grant ID: 101016851). For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
Author information
Contributions
K.D.L. designed the experiments, collected datasets, wrote the code, contributed to the analysis of the experiments and prepared the manuscript. J.H., S.L., and J.C.H. contributed to the analysis of the experiments, collected datasets, and provided feedback on the experimental design. F.Y., O.K., S.C.L., and J.G. contributed to the analysis of the experiments. D.L.R. and K.Y. conceptualised the study, designed the experiments, edited the manuscript, and jointly supervised the research. All the authors discussed the results and commented on the manuscript.
Ethics declarations
Competing interests
Ke Yuan is a co-founder and shareholder of TileBio Ltd.
Peer review
Peer review information
Nature Communications thanks anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lamb, K.D., Hughes, J., Lytras, S. et al. From single-sequences to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2. Nat Commun (2026). https://doi.org/10.1038/s41467-026-69569-9
DOI: https://doi.org/10.1038/s41467-026-69569-9


