ARNLE model identifies prevalence potential of SARS-CoV-2 variants

Abstract

SARS-CoV-2 mutations accumulated during the COVID-19 pandemic, posing significant challenges for immune prevention. An optimistic perspective suggests that SARS-CoV-2 will become more tropic to humans, with weaker virulence and stronger infectivity. However, tracing a quantified trajectory of this process remains difficult. Here we introduce an attentional recurrent network based on language embedding (ARNLE) framework to analyse the shift in SARS-CoV-2 host tropism towards humans. ARNLE incorporates a language model for self-supervised learning to capture the features of amino acid sequences, alongside a supervised network based on bidirectional long short-term memory (bi-LSTM) to discern the relationship between mutations and host tropism among coronaviruses. We identified a shift in SARS-CoV-2 human tropism from weak to strong, transitioning from a coronavirus resembling those of Chiroptera to a primate-tropic coronavirus. Delta variants were closer to other common primate coronaviruses than previous SARS-CoV-2 variants were. A similar phenomenon was observed among the Omicron variants. We employed a Bayesian-based post hoc explanation method to analyse key mutations influencing the human tropism of SARS-CoV-2. ARNLE identified pivotal mutations in the spike protein, including T478K, L452R and G142D, among others, as the top determinants of human tropism. Our findings suggest that language models like ARNLE will significantly facilitate the identification of potentially prevalent variants and provide important support for screening key mutations, aiding the timely update of vaccines to protect against future emerging SARS-CoV-2 variants.
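The mutation labels in the abstract (T478K, L452R, G142D) follow the usual substitution notation: reference amino acid, 1-based position, substituted amino acid. The following is a minimal sketch of how such a label can be parsed and applied to a sequence. It is not taken from the ARNLE code base; the function names and the toy peptide are hypothetical and serve only to illustrate the notation.

```python
import re

# A substitution label such as "T478K" reads as:
# reference residue, 1-based position, substituted residue.
_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label):
    """Split a label like 'T478K' into (ref, position, alt)."""
    match = _SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def apply_substitution(sequence, label):
    """Return a copy of `sequence` with the substitution applied.

    Raises ValueError if the position is out of range or if the residue
    at that position differs from the reference residue in the label.
    """
    ref, pos, alt = parse_substitution(label)
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {pos}, found {sequence[pos - 1]}"
        )
    return sequence[:pos - 1] + alt + sequence[pos:]


# Toy 10-residue peptide (hypothetical, NOT the spike protein).
toy = "MKTLLGAWRV"
mutant = apply_substitution(toy, "L4R")
```

Checking the reference residue before substituting guards against applying a label to a sequence that uses a different numbering, which matters when variant sequences contain insertions or deletions relative to the reference.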

Fig. 1: ARNLE process and structure.
Fig. 2: Human tropism prediction of SARS-CoV-2 by the ARNLE classifier.
Fig. 3: Human tropism shift of SARS-CoV-2 variants associated with COVID-19 pandemic severity.
Fig. 4: Compositional and site-tropism shift indices of SARS-CoV-2 variants based on important AAs.
Fig. 5: Interpretability of features for SARS-CoV-2 tropism classification.

Data availability

The trained, supervised bi-LSTM model and all sequence datasets (including the training, blind validation and test datasets) are available via GitHub at https://github.com/SN-1604/ARNLE and Zenodo at https://doi.org/10.5281/zenodo.13836159 (ref. 67). The trained ELMo model, the embedded training and blind validation data for the supervised bi-LSTM model and Supplementary Table 3 are available via Zenodo at https://doi.org/10.5281/zenodo.13826663 (ref. 68). The models can be run, and the embedded test data generated, with the code on GitHub. Source data are provided with this paper.

Code availability

All Python code is available via GitHub at https://github.com/SN-1604/ARNLE and Zenodo at https://doi.org/10.5281/zenodo.13836159 (ref. 67). A tutorial on how to use ARNLE and reproduce the findings is available on GitHub.

References

  1. Douam, F. et al. Genetic dissection of the host tropism of human-tropic pathogens. Annu. Rev. Genet. 49, 21–45 (2015).

  2. Cui, J., Li, F. & Shi, Z. L. Origin and evolution of pathogenic coronaviruses. Nat. Rev. Microbiol. 17, 181–192 (2019).

  3. Lim, Y. X. et al. Human coronaviruses: a review of virus-host interactions. Diseases 4, 26 (2016).

  4. Choe, H. & Farzan, M. How SARS-CoV-2 first adapted in humans. Science 372, 466–467 (2021).

  5. Zhou, P. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273 (2020).

  6. Grint, D. J. et al. Severity of severe acute respiratory system coronavirus 2 (SARS-CoV-2) alpha variant (B.1.1.7) in England. Clin. Infect. Dis. 75, e1120–e1127 (2022).

  7. Liu, C. et al. The antibody response to SARS-CoV-2 Beta underscores the antigenic distance to other variants. Cell Host Microbe 30, 53–68.e12 (2022).

  8. Liu, Y. & Rocklov, J. The reproductive number of the Delta variant of SARS-CoV-2 is far higher compared to the ancestral SARS-CoV-2 virus. J. Travel Med. 28, taab124 (2021).

  9. Karim, S. & Karim, Q. A. Omicron SARS-CoV-2 variant: a new chapter in the COVID-19 pandemic. Lancet 398, 2126–2128 (2021).

  10. Farinholt, T. et al. Transmission event of SARS-CoV-2 delta variant reveals multiple vaccine breakthrough infections. BMC Med. 19, 255 (2021).

  11. Liu, L. et al. Striking antibody evasion manifested by the Omicron variant of SARS-CoV-2. Nature 602, 676–681 (2022).

  12. Rossler, A. et al. SARS-CoV-2 Omicron variant neutralization in serum from vaccinated and convalescent persons. N. Engl. J. Med. 386, 698–700 (2022).

  13. Ma, C. et al. Broad host tropism of ACE2-using MERS-related coronaviruses and determinants restricting viral recognition. Cell Discov. 9, 57 (2023).

  14. Rambaut, A. et al. Addendum: a dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat. Microbiol. 6, 415 (2021).

  15. Ozono, S. et al. SARS-CoV-2 D614G spike mutation increases entry efficiency with enhanced ACE2-binding affinity. Nat. Commun. 12, 848 (2021).

  16. Cele, S. et al. Escape of SARS-CoV-2 501Y.V2 from neutralization by convalescent plasma. Nature 593, 142–146 (2021).

  17. Li, J. et al. Machine learning methods for predicting human-adaptive influenza A viruses based on viral nucleotide compositions. Mol. Biol. Evol. 37, 1224–1236 (2020).

  18. Babayan, S. A., Orton, R. J. & Streicker, D. G. Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science 362, 577–580 (2018).

  19. Li, J. et al. Genomic representation predicts an asymptotic host adaptation of bat coronaviruses using deep learning. Front. Microbiol. 14, 1157608 (2023).

  20. Xia, X. Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense. Mol. Biol. Evol. 37, 2699–2705 (2020).

  21. Pollock, D. D. et al. Viral CpG deficiency provides no evidence that dogs were intermediate hosts for SARS-CoV-2. Mol. Biol. Evol. 37, 2706–2710 (2020).

  22. Li, J. et al. Deep learning based on biologically interpretable genome representation predicts two types of human adaptation of SARS-CoV-2 variants. Brief. Bioinform. 23, bbac036 (2022).

  23. Asgari, E. & Mofrad, M. R. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10, e0141287 (2015).

  24. Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).

  25. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

  26. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  27. Alley, E. C. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

  28. Hie, B. et al. Learning the language of viral evolution and escape. Science 371, 284–288 (2021).

  29. Roy, A. et al. Base composition and host adaptation of the SARS-CoV-2: insight from the codon usage perspective. Front. Microbiol. 12, 548275 (2021).

  30. Kawashima, I. Y. et al. SARS-CoV-2 host prediction based on virus-host genetic features. Sci. Rep. 12, 4576 (2022).

  31. Obermeyer, F. et al. Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science 376, 1327–1332 (2022).

  32. Bloom, J. D. & Neher, R. A. Fitness effects of mutations to SARS-CoV-2 proteins. Virus Evol. 9, vead055 (2023).

  33. Li, L. et al. Machine learning detection of SARS-CoV-2 high-risk variants. Preprint at bioRxiv https://doi.org/10.1101/2023.04.19.537460 (2023).

  34. Sun, Q. et al. VarEPS: an evaluation and prewarning system of known and virtual variations of SARS-CoV-2 genomes. Nucleic Acids Res. 50, D888–D897 (2022).

  35. Marquet, C. et al. Embeddings from protein language models predict conservation and variant effects. Hum. Genet. 141, 1629–1647 (2021).

  36. Peters, M. et al. Deep contextualized word representations. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (eds Walker, M. et al.) 2227–2237 (ACL, 2018).

  37. Devlin, J. et al. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).

  38. Radford, A. et al. Improving language understanding by generative pre-training. Preprint at https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).

  39. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).

  40. Zvyagin, M. et al. GenSLMs: genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. Int. J. High Perf. Comput. Appl. 37, 683–705 (2023).

  41. Dao, T. et al. FlashAttention: fast and memory-efficient exact attention with IO-awareness. Adv. Neural Inf. Process. Syst. 35, 16344–16359 (2022).

  42. Dai, Z. et al. Transformer-XL: attentive language models beyond a fixed-length context. Preprint at https://arxiv.org/abs/1901.02860 (2019).

  43. Li, Y. et al. LocalViT: bringing locality to vision transformers. Preprint at https://arxiv.org/abs/2104.05707 (2021).

  44. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

  45. Yang, M. et al. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res. 50, e81 (2022).

  46. Chen, Z. et al. Integration of a deep learning classifier with a random forest approach for predicting malonylation sites. Genom. Proteom. Bioinform. 16, 451–459 (2018).

  47. Fu, L., Peng, Q. & Chai, L. Predicting DNA methylation states with hybrid information based deep-learning model. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 1721–1728 (2020).

  48. Pan, X. et al. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics 19, 511 (2018).

  49. Lu, M. et al. Deep attention network for egocentric action recognition. IEEE Trans. Image Process. 28, 3703–3713 (2019).

  50. De, S. et al. Griffin: mixing gated linear recurrences with local attention for efficient language models. Preprint at https://arxiv.org/abs/2402.19427 (2024).

  51. Peng, B. et al. RWKV: reinventing RNNs for the transformer era. Preprint at https://arxiv.org/abs/2305.13048 (2023).

  52. Gu, A. & Dao, T. Mamba: linear-time sequence modeling with selective state spaces. Preprint at https://arxiv.org/abs/2312.00752 (2023).

  53. Hammelman, J. & Gifford, D. K. Discovering differential genome sequence activity with interpretable and efficient deep learning. PLoS Comput. Biol. 17, e1009282 (2021).

  54. Wang, Y. et al. Structural basis for SARS-CoV-2 Delta variant recognition of ACE2 receptor and broadly neutralizing antibodies. Nat. Commun. 13, 871 (2022).

  55. Bugatti, A. et al. The D405N mutation in the spike protein of SARS-CoV-2 Omicron BA.5 inhibits spike/integrins interaction and viral infection of human lung microvascular endothelial cells. Viruses 15, 332 (2023).

  56. He, P. et al. SARS-CoV-2 Delta and Omicron variants evade population antibody response by mutations in a single spike epitope. Nat. Microbiol. 7, 1635–1649 (2022).

  57. Supasa, P. et al. Reduced neutralization of SARS-CoV-2 B.1.1.7 variant by convalescent and vaccine sera. Cell 184, 2201–2211.e7 (2021).

  58. Lista, M. J. et al. The P681H mutation in the spike glycoprotein of the Alpha variant of SARS-CoV-2 escapes IFITM restriction and is necessary for Type I interferon resistance. J. Virol. 96, e0125022 (2022).

  59. Liu, Y. et al. Delta spike P681R mutation enhances SARS-CoV-2 fitness over Alpha variant. Cell Rep. 39, 110829 (2022).

  60. Mishra, T. et al. SARS-CoV-2 spike E156G/Delta157-158 mutations contribute to increased infectivity and immune escape. Life Sci. Alliance 5, e202201415 (2022).

  61. Selvavinayagam, S. T. et al. Low SARS-CoV-2 viral load among vaccinated individuals infected with Delta B.1.617.2 and Omicron BA.1.1.529 but not with Omicron BA.1.1 and BA.2 variants. Front. Public Health 10, 1018399 (2022).

  62. He, X. et al. Research progress in spike mutations of SARS‐CoV‐2 variants and vaccine development. Med. Res. Rev. 43, 932–971 (2023).

  63. Ai, J. et al. Antibody evasion of SARS-CoV-2 Omicron BA.1, BA.1.1, BA.2, and BA.3 sub-lineages. Cell Host Microbe 30, 1077–1083.e4 (2022).

  64. Cao, Y. et al. Characterization of the enhanced infectivity and antibody evasion of Omicron BA.2.75. Cell Host Microbe 30, 1527–1539.e5 (2022).

  65. Marcinkiewicz, A. L. et al. Structural evolution of an immune evasion determinant shapes pathogen host tropism. Proc. Natl Acad. Sci. 120, e2301549120 (2023).

  66. Cao, Y. et al. Imprinted SARS-CoV-2 humoral immunity induces convergent Omicron RBD evolution. Nature 614, 521–529 (2023).

  67. Liu, Y. SN-1604/ARNLE: attentional recurrent network based on language embedding (ARNLE). Zenodo https://doi.org/10.5281/zenodo.13836159 (2024).

  68. Liu, Y. An attentional recurrent network based on language embedding framework for human tropism prediction reveals trends in the prevalence of SARS-CoV-2 variants. Zenodo https://doi.org/10.5281/zenodo.13826663 (2024).

Acknowledgements

This study was supported by grants from the National Key Research and Development Program (grant no. 2023YFC2604400, Peng Li), the Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences (grant no. CIFMS2022-I2M-1-011, J.T.Y.), the National Science and Technology Major Project of China (grant no. 2018ZX10201001, Peng Li) and the Biomedical High Performance Computing Platform, Chinese Academy of Medical Sciences.

Author information

Contributions

H.S., Peng Li, J.Y. and A.W. conceived the study. Y.L., K.W., Jinhui Li and Jiangfeng Liu managed the data. Y.L. and Jing Li built the model. Y.L., Peihan Li, Y.Y. and L.Y. performed the data analysis. Peng Li, L.J. and H.S. supervised the project. Y.L. and Jing Li wrote the manuscript. H.S., Peng Li, J.Y. and A.W. revised the paper.

Corresponding authors

Correspondence to Aiping Wu, Juntao Yang, Peng Li or Hongbin Song.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Jianyang Zeng, Wei Zheng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Methods used for Supplementary figures and Supplementary Figs. 1–10 and Tables 1–7.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–7.

Supplementary Data 1

Statistical source data for Supplementary Fig. 1.

Supplementary Data 2

Statistical source data for Supplementary Fig. 2.

Supplementary Data 3

Statistical source data for Supplementary Fig. 3.

Supplementary Data 4

Statistical source data for Supplementary Fig. 4.

Supplementary Data 5

Statistical source data for Supplementary Fig. 5.

Supplementary Data 6

Statistical source data for Supplementary Fig. 6.

Supplementary Data 7

Statistical source data for Supplementary Fig. 7.

Supplementary Data 8

Statistical source data for Supplementary Fig. 8.

Supplementary Data 9

Statistical source data for Supplementary Fig. 9.

Supplementary Data 10

Statistical source data for Supplementary Fig. 10.

Source data

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, Y., Li, J., Li, P. et al. ARNLE model identifies prevalence potential of SARS-CoV-2 variants. Nat Mach Intell 7, 18–28 (2025). https://doi.org/10.1038/s42256-024-00919-2

  • DOI: https://doi.org/10.1038/s42256-024-00919-2
