Abstract
SARS-CoV-2 mutations accumulated throughout the COVID-19 pandemic, posing significant challenges for immune prevention. An optimistic perspective suggests that SARS-CoV-2 will become more tropic to humans, with weaker virulence and stronger infectivity; however, tracing a quantified trajectory of this process remains difficult. Here we introduce an attentional recurrent network based on language embedding (ARNLE) framework to analyse the shift in SARS-CoV-2 host tropism towards humans. ARNLE combines a language model trained by self-supervised learning to capture the features of amino acid sequences with a supervised bidirectional long short-term memory (bi-LSTM)-based network that discerns the relationship between mutations and host tropism among coronaviruses. We identified a shift in SARS-CoV-2 host tropism from weak to strong, transitioning from an approximately Chiroptera-tropic coronavirus to a primate-tropic coronavirus. Delta variants were closer to other common primate coronaviruses than previous SARS-CoV-2 variants, and a similar phenomenon was observed among the Omicron variants. We employed a Bayesian post hoc explanation method to analyse the key mutations influencing the human tropism of SARS-CoV-2. ARNLE identified pivotal mutations in the spike protein, including T478K, L452R and G142D, as top determinants of human tropism. Our findings suggest that language models such as ARNLE will significantly facilitate the identification of potentially prevalent variants and provide important support for screening key mutations, aiding the timely updating of vaccines to protect against future emerging SARS-CoV-2 variants.
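Conceptually, the ARNLE pipeline embeds each amino acid sequence with a language model and then classifies host tropism with an attention-equipped bi-LSTM. The following is a minimal, hypothetical sketch of such a classifier in PyTorch, not the authors' released implementation; the class name, embedding dimension, hidden size and number of host classes are illustrative assumptions.

import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    """Toy ARNLE-style head: bi-LSTM over language-model embeddings with additive attention."""
    def __init__(self, embed_dim=1024, hidden_dim=256, n_hosts=2):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # one attention score per residue
        self.head = nn.Linear(2 * hidden_dim, n_hosts)

    def forward(self, x):                          # x: (batch, seq_len, embed_dim) embeddings
        h, _ = self.lstm(x)                        # (batch, seq_len, 2 * hidden_dim)
        w = torch.softmax(self.attn(h), dim=1)     # attention weights over residues
        context = (w * h).sum(dim=1)               # weighted sequence representation
        return self.head(context)                  # host-tropism logits

# Example: random embeddings for four spike-length (1,273-residue) sequences.
model = BiLSTMAttentionClassifier()
print(model(torch.randn(4, 1273, 1024)).shape)     # torch.Size([4, 2])

In the actual framework the embeddings come from the pretrained ELMo model released with the paper, and key mutations are subsequently ranked with a separate Bayesian post hoc explanation method.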
Data availability
The trained supervised bi-LSTM model and all sequence datasets (including the training, blind validation and test datasets) are available via GitHub at https://github.com/SN-1604/ARNLE and via Zenodo at https://doi.org/10.5281/zenodo.13836159 (ref. 67). The trained ELMo model, the embedded training and blind validation data for the supervised bi-LSTM model, and Supplementary Table 3 are available via Zenodo at https://doi.org/10.5281/zenodo.13826663 (ref. 68). The models can be run, and the embedded test data generated, using the code on GitHub. Source data are provided with this paper.
Code availability
All Python code is available via GitHub at https://github.com/SN-1604/ARNLE and via Zenodo at https://doi.org/10.5281/zenodo.13836159 (ref. 67). A tutorial on how to use ARNLE and reproduce the findings is available on GitHub.
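For readers who prefer to retrieve the archived material programmatically, a hypothetical Python helper along the following lines could clone the repository and list the files deposited in the Zenodo record; the record ID (13836159) is taken from the DOI above, and this snippet is not part of the ARNLE tutorial.

import subprocess
import requests

# Clone the ARNLE repository (requires git on the PATH).
subprocess.run(["git", "clone", "https://github.com/SN-1604/ARNLE"], check=True)

# List the files archived in the Zenodo record via Zenodo's public records API.
record = requests.get("https://zenodo.org/api/records/13836159", timeout=30).json()
for f in record.get("files", []):
    print(f["key"], f["links"]["self"])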
References
Douam, F. et al. Genetic dissection of the host tropism of human-tropic pathogens. Annu. Rev. Genet. 49, 21–45 (2015).
Cui, J., Li, F. & Shi, Z. L. Origin and evolution of pathogenic coronaviruses. Nat. Rev. Microbiol. 17, 181–192 (2019).
Lim, Y. X. et al. Human coronaviruses: a review of virus-host interactions. Diseases 4, 26 (2016).
Choe, H. & Farzan, M. How SARS-CoV-2 first adapted in humans. Science 372, 466–467 (2021).
Zhou, P. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273 (2020).
Grint, D. J. et al. Severity of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) alpha variant (B.1.1.7) in England. Clin. Infect. Dis. 75, e1120–e1127 (2022).
Liu, C. et al. The antibody response to SARS-CoV-2 Beta underscores the antigenic distance to other variants. Cell Host Microbe 30, 53–68.e12 (2022).
Liu, Y. & Rocklov, J. The reproductive number of the Delta variant of SARS-CoV-2 is far higher compared to the ancestral SARS-CoV-2 virus. J. Travel Med. 28, taab124 (2021).
Karim, S. & Karim, Q. A. Omicron SARS-CoV-2 variant: a new chapter in the COVID-19 pandemic. Lancet 398, 2126–2128 (2021).
Farinholt, T. et al. Transmission event of SARS-CoV-2 delta variant reveals multiple vaccine breakthrough infections. BMC Med. 19, 255 (2021).
Liu, L. et al. Striking antibody evasion manifested by the Omicron variant of SARS-CoV-2. Nature 602, 676–681 (2021).
Rossler, A. et al. SARS-CoV-2 Omicron variant neutralization in serum from vaccinated and convalescent persons. N. Engl. J. Med. 386, 698–700 (2022).
Ma, C. et al. Broad host tropism of ACE2-using MERS-related coronaviruses and determinants restricting viral recognition. Cell Discov. 9, 57 (2023).
Rambaut, A. et al. Addendum: a dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat. Microbiol. 6, 415 (2021).
Ozono, S. et al. SARS-CoV-2 D614G spike mutation increases entry efficiency with enhanced ACE2-binding affinity. Nat. Commun. 12, 848 (2021).
Cele, S. et al. Escape of SARS-CoV-2 501Y.V2 from neutralization by convalescent plasma. Nature 593, 142–146 (2021).
Li, J. et al. Machine learning methods for predicting human-adaptive influenza A viruses based on viral nucleotide compositions. Mol. Biol. Evol. 37, 1224–1236 (2020).
Babayan, S. A., Orton, R. J. & Streicker, D. G. Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science 362, 577–580 (2018).
Li, J. et al. Genomic representation predicts an asymptotic host adaptation of bat coronaviruses using deep learning. Front. Microbiol. 14, 1157608 (2023).
Xia, X. Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense. Mol. Biol. Evol. 37, 2699–2705 (2020).
Pollock, D. D. et al. Viral CpG deficiency provides no evidence that dogs were intermediate hosts for SARS-CoV-2. Mol. Biol. Evol. 37, 2706–2710 (2020).
Li, J. et al. Deep learning based on biologically interpretable genome representation predicts two types of human adaptation of SARS-CoV-2 variants. Brief. Bioinform. 23, bbac036 (2022).
Asgari, E. & Mofrad, M. R. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10, e0141287 (2015).
Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Alley, E. C. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Hie, B. et al. Learning the language of viral evolution and escape. Science 371, 284–288 (2021).
Roy, A. et al. Base composition and host adaptation of the SARS-CoV-2: insight from the codon usage perspective. Front. Microbiol. 12, 548275 (2021).
Kawashima, I. Y. et al. SARS-CoV-2 host prediction based on virus-host genetic features. Sci. Rep. 12, 4576 (2022).
Obermeyer, F. et al. Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science 376, 1327–1332 (2022).
JD, B. & RA, N. Fitness effects of mutations to SARS-CoV-2 proteins. Virus Evol. 9, vead055 (2023).
Li, L. et al. Machine learning detection of SARS-CoV-2 high-risk variants. Preprint at bioRxiv https://doi.org/10.1101/2023.04.19.537460 (2023).
Q, S. et al. VarEPS: an evaluation and prewarning system of known and virtual variations of SARS-CoV-2 genomes. Nucleic Acids Res. 50, D888–D897 (2022).
Marquet, C. et al. Embeddings from protein language models predict conservation and variant effects. Hum. Genet. 141, 1629–1647 (2021).
Peters, M. et al. Deep contextualized word representations. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (eds Walker, M. et al.) 2227–2237 (ACL, 2018).
Devlin, J. et al. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
Radford, A. et al. Improving language understanding by generative pre-training. Preprint at https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Zvyagin, M. et al. GenSLMs: genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. Int. J. High Perf. Comput. Appl. 37, 683–705 (2023).
Dao, T. et al. FlashAttention: fast and memory-efficient exact attention with IO-awareness. Adv. Neural Inf. Process. Syst. 35, 16344–16359 (2022).
Dai, Z. et al. Transformer-XL: attentive language models beyond a fixed-length context. Preprint at https://arxiv.org/abs/1901.02860 (2019).
Li, Y. et al. LocalViT: bringing locality to vision transformers. Preprint at https://arxiv.org/abs/2104.05707 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Yang, M. et al. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res. 50, e81 (2022).
Chen, Z. et al. Integration of a deep learning classifier with a random forest approach for predicting malonylation sites. Genom. Proteom. Bioinform. 16, 451–459 (2018).
Fu, L., Peng, Q. & Chai, L. Predicting DNA methylation states with hybrid information based deep-learning model. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 1721–1728 (2020).
Pan, X. et al. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics 19, 511 (2018).
Lu, M. et al. Deep attention network for egocentric action recognition. IEEE Trans. Image Process. 28, 3703–3713 (2019).
De, S. et al. Griffin: mixing gated linear recurrences with local attention for efficient language models. Preprint at https://arxiv.org/abs/2402.19427 (2024).
Peng, B. et al. RWKV: reinventing RNNs for the transformer era. Preprint at https://arxiv.org/abs/2305.13048 (2023).
Gu, A. & Dao, T. Mamba: linear-time sequence modeling with selective state spaces. Preprint at https://arxiv.org/abs/2312.00752 (2023).
Hammelman, J. & Gifford, D. K. Discovering differential genome sequence activity with interpretable and efficient deep learning. PLoS Comput. Biol. 17, e1009282 (2021).
Wang, Y. et al. Structural basis for SARS-CoV-2 Delta variant recognition of ACE2 receptor and broadly neutralizing antibodies. Nat. Commun. 13, 871 (2022).
Bugatti, A. et al. The D405N mutation in the spike protein of SARS-CoV-2 Omicron BA.5 inhibits spike/integrins interaction and viral infection of human lung microvascular endothelial cells. Viruses 15, 332 (2023).
He, P. et al. SARS-CoV-2 Delta and Omicron variants evade population antibody response by mutations in a single spike epitope. Nat. Microbiol. 7, 1635–1649 (2022).
Supasa, P. et al. Reduced neutralization of SARS-CoV-2 B.1.1.7 variant by convalescent and vaccine sera. Cell 184, 2201–2211.e7 (2021).
Lista, M. J. et al. The P681H mutation in the spike glycoprotein of the Alpha variant of SARS-CoV-2 escapes IFITM restriction and is necessary for Type I interferon resistance. J. Virol. 96, e0125022 (2022).
Liu, Y. et al. Delta spike P681R mutation enhances SARS-CoV-2 fitness over Alpha variant. Cell Rep. 39, 110829 (2022).
Mishra, T. et al. SARS-CoV-2 spike E156G/Delta157-158 mutations contribute to increased infectivity and immune escape. Life Sci. Alliance 5, e202201415 (2022).
Selvavinayagam, S. T. et al. Low SARS-CoV-2 viral load among vaccinated individuals infected with Delta B.1.617.2 and Omicron B.1.1.529 but not with Omicron BA.1.1 and BA.2 variants. Front. Public Health 10, 1018399 (2022).
He, X. et al. Research progress in spike mutations of SARS‐CoV‐2 variants and vaccine development. Med. Res. Rev. 43, 932–971 (2023).
Ai, J. et al. Antibody evasion of SARS-CoV-2 Omicron BA.1, BA.1.1, BA.2 and BA.3 sub-lineages. Cell Host Microbe 30, 1077–1083.e4 (2022).
Cao, Y. et al. Characterization of the enhanced infectivity and antibody evasion of Omicron BA.2.75. Cell Host Microbe 30, 1527–1539.e5 (2022).
Marcinkiewicz, A. L. et al. Structural evolution of an immune evasion determinant shapes pathogen host tropism. Proc. Natl Acad. Sci. USA 120, e2301549120 (2023).
Cao, Y. et al. Imprinted SARS-CoV-2 humoral immunity induces convergent Omicron RBD evolution. Nature 614, 521–529 (2023).
Liu, Y. SN-1604/ARNLE: attentional recurrent network based on language embedding (ARNLE). Zenodo https://doi.org/10.5281/zenodo.13836159 (2024).
Liu, Y. An attentional recurrent network based on language embedding framework for human tropism prediction reveals trends in the prevalence of SARS-CoV-2 variants. Zenodo https://doi.org/10.5281/zenodo.13826663 (2024).
Acknowledgements
This study was supported by grants from the National Key Research and Development Program (grant no. 2023YFC2604400, Peng Li), the Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences (grant no. CIFMS2022-I2M-1-011, J.T.Y.), the National Science and Technology Major Project of China (grant no. 2018ZX10201001, Peng Li) and the Biomedical High Performance Computing Platform, Chinese Academy of Medical Sciences.
Author information
Authors and Affiliations
Contributions
H.S., Peng Li, J.Y. and A.W. conceived the study. Y.L., K.W., Jinhui Li and Jiangfeng Liu managed the data. Y.L. and Jing Li built the model. Y.L., Peihan Li, Y.Y. and L.Y. performed the data analysis. Peng Li, L.J. and H.S. supervised the project. Y.L. and Jing Li wrote the manuscript. H.S., Peng Li, J.Y. and A.W. revised the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Jianyang Zeng, Wei Zheng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Methods used for the Supplementary figures, Supplementary Figs. 1–10 and Supplementary Tables 1–7.
Supplementary Tables
Supplementary Tables 1–7.
Supplementary Data 1
Statistical source data for Supplementary Fig. 1.
Supplementary Data 2
Statistical source data for Supplementary Fig. 2.
Supplementary Data 3
Statistical source data for Supplementary Fig. 3.
Supplementary Data 4
Statistical source data for Supplementary Fig. 4.
Supplementary Data 5
Statistical source data for Supplementary Fig. 5.
Supplementary Data 6
Statistical source data for Supplementary Fig. 6.
Supplementary Data 7
Statistical source data for Supplementary Fig. 7.
Supplementary Data 8
Statistical source data for Supplementary Fig. 8.
Supplementary Data 9
Statistical source data for Supplementary Fig. 9.
Supplementary Data 10
Statistical source data for Supplementary Fig. 10.
Source data
Source Data Fig. 2
Statistical source data.
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data.
Source Data Fig. 5
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, Y., Li, J., Li, P. et al. ARNLE model identifies prevalence potential of SARS-CoV-2 variants. Nat Mach Intell 7, 18–28 (2025). https://doi.org/10.1038/s42256-024-00919-2