Mapping glycoprotein structure reveals Flaviviridae evolutionary history

Mifsud, Jonathon C. O.; Lytras, Spyros; Oliver, Michael R.; Toon, Kamilla; Costa, Vincenzo A.; Holmes, Edward C.; Grove, Joe

doi:10.1038/s41586-024-07899-8

Article
Open access
Published: 04 September 2024

Nature volume 633, pages 695–703 (2024)

Abstract

Viral glycoproteins drive membrane fusion in enveloped viruses and determine host range, tissue tropism and pathogenesis¹. Despite their importance, there is a fragmentary understanding of glycoproteins within the Flaviviridae², a large virus family that includes pathogens such as hepatitis C, dengue and Zika viruses, and numerous other human, animal and emergent viruses. For many flaviviruses the glycoproteins have not yet been identified; for others, such as the hepaciviruses, the molecular mechanisms of membrane fusion remain uncharacterized³. Here we combine phylogenetic analyses with protein structure prediction to survey glycoproteins across the entire Flaviviridae. We find class II fusion systems, homologous to the Orthoflavivirus E glycoprotein, in most species, including highly divergent jingmenviruses and large genome flaviviruses. However, the E1E2 glycoproteins of the hepaciviruses, pegiviruses and pestiviruses are structurally distinct, may represent a novel class of fusion mechanism, and are strictly associated with infection of vertebrate hosts. By mapping glycoprotein distribution onto the underlying phylogeny, we reveal a complex evolutionary history marked by the capture of bacterial genes and potentially inter-genus recombination. These insights, made possible through protein structure prediction, refine our understanding of viral fusion mechanisms and reveal the events that have shaped the diverse virology and ecology of the Flaviviridae.

Main

The Flaviviridae² is a highly diverse family of enveloped positive-sense RNA viruses that includes important pathogens of humans (for example, dengue virus, Zika virus and hepatitis C virus) and other animals (for example, classical swine fever virus and bovine viral diarrhoea virus), as well as many viruses that pose emerging threats to human health (for example, West Nile virus, Alongshan virus and Haseki tick virus^4,5,6). The Flaviviridae is currently classified into four genera: Orthoflavivirus, Pestivirus, Pegivirus and Hepacivirus⁷. In recent years a remarkable diversity of novel flaviviruses with varied genome structures has been discovered, including the jingmenvirus group, whose members are unique in being both segmented and potentially multicomponent^8,9. Another group, tentatively known as the large genome flaviviruses (LGF), is primarily associated with invertebrates¹⁰, but has also been linked to plants^11,12 and vertebrates⁶. LGFs have genomes up to 39.8 kb in length, challenging previous assumptions about the maximum genome size achievable by RNA viruses that lack proofreading mechanisms^13,14. The jingmenviruses and LGFs have yet to receive taxonomic ratification, and consensus is lacking regarding their placement within the flavivirus phylogeny.

Previous efforts to reconstruct the evolutionary history of large and diverse families of RNA viruses such as the Flaviviridae have relied largely on the phylogenetic analysis of highly conserved viral proteins, most notably the RNA-dependent RNA polymerase^10,15 (RdRp). Although of considerable utility, the functions and features that define virus biology and pathogenesis are typically encoded by highly divergent sequences outside of the conserved replication machinery. In these regions, it is difficult to detect deep sequence homology and hence perform reliable multiple sequence alignment (MSA) or phylogenetic analysis. As a consequence, our understanding of long-term virus evolution is generally based on the analysis of a single protein (the RdRp), such that we lack an understanding of genome-wide relationships and hence of the genesis and evolution of viral genera and species.

Glycoproteins are likely to be important determinants of phenotypic characteristics across the Flaviviridae. They are essential for virus entry, influence host range and spillover potential, and are primary targets for host immune responses. However, glycoproteins have yet to be identified and/or classified for many species in the Flaviviridae³. Owing to high levels of sequence divergence, this cannot be resolved by even the most sensitive of sequence-based approaches^13,16, and classical structural biology lacks the speed and scalability to sample enough species. This knowledge gap limits the investigation of molecular mechanisms, which in turn hinders the development of interventions such as vaccines.

Here we have augmented conventional phylogenetics with machine learning-enabled protein structure prediction to comprehensively map glycoprotein structures across the Flaviviridae. This provides an evolutionary and genomic-scale perspective of the entire family, revealing molecular signatures that define the diverse virology and ecology found within the Flaviviridae.

Flaviviridae contains three major clades

Understanding the evolution of molecular features across the Flaviviridae requires a proper gauge of the phylogenetic and genomic diversity of this family. To achieve this, we first constructed a comprehensive data set of flavivirus sequences, which after clustering and manual curation, comprised 458 flavivirus genomes with complete coding sequences, including 11 that were novel taxa identified in this study (Supplementary Table 1). We next inferred a robust family-level phylogenetic tree for these data. Using conserved NS5 gene sequences that encode the RdRp, we applied various sequence alignment methods, quality trimming protocols and amino acid substitution models, to infer a total of 225 phylogenetic trees for this family (Supplementary Table 2). Using distance-based approaches and manual inspection of the alignments and trees, we identified a phylogeny, denoted Tree 18 (Extended Data Fig. 1, Supplementary Table 2 and Supplementary Fig. 1), that appeared to best represent the consensus topology of the Flaviviridae. Specifically, the topological placement of the major Flaviviridae clades in Tree 18 was consistent with 93% of phylogenies derived from MUSCLE and MAFFT MSAs, although this percentage dropped to 70% when Clustal Omega was included (Methods).
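The idea of selecting one representative topology from many candidate trees can be illustrated with a minimal sketch. It assumes, purely for illustration, that each tree is a rooted nested tuple of leaf names and that the distance between two trees is the number of clades found in one tree but not the other (a rooted Robinson–Foulds-style count). The text above does not name the distance measure that was actually used, and the function names and toy topologies are hypothetical.

```python
def clades(tree):
    """Collect every clade (set of leaf names) in a rooted nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):          # a leaf is a bare taxon name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)                  # record the clade under this node
        return leaves

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


def medoid_tree(trees):
    """Name of the tree with the smallest summed distance to all trees."""
    totals = {
        name: sum(rf_distance(tree, other) for other in trees.values())
        for name, tree in trees.items()
    }
    return min(totals, key=totals.get)


# Toy example with three hypothetical candidate topologies.
candidates = {
    "tree_a": (("Orthoflavivirus", "LGF"), ("Pegivirus", "Hepacivirus")),
    "tree_b": (("Orthoflavivirus", "LGF"), ("Pegivirus", "Hepacivirus")),
    "tree_c": (("Orthoflavivirus", "Pegivirus"), ("LGF", "Hepacivirus")),
}
representative = medoid_tree(candidates)
```

A medoid of this kind only ranks candidates; manual inspection of alignments and trees, as described above, would still be needed before settling on a topology.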

Our best-fit RdRp phylogeny supported the division of the Flaviviridae into three distinct clades: (1) an Orthoflavivirus/jingmenvirus group (which also contains ‘orthoflavivirus-like’ viruses, for example Cnidaria flavivirus and Tamana bat virus); (2) a clade comprising the large genome flaviviruses and members of the genus Pestivirus; and (3) a Pegivirus/Hepacivirus clade (Fig. 1a). Regardless of whether the tree was unrooted or rooted on a Tombusviridae outgroup, the LGF/Pestivirus and Orthoflavivirus/jingmenvirus groups clustered together and formed a sister group to the Pegivirus/Hepacivirus clade. The Orthoflavivirus/jingmenvirus clade had the largest number of taxa (n = 182), followed by Pegivirus/Hepacivirus (n = 157) and LGF/Pestivirus (n = 119). All novel taxa (n = 11) identified in this study fell within the LGF/Pestivirus clade.

Fig. 1: Generation of a protein foldome for the Flaviviridae.

Building a Flaviviridae protein foldome

We next aimed to explore protein functionality across the Flaviviridae using machine learning-enabled protein structure prediction. All flaviviruses encode polyproteins that undergo proteolytic maturation to liberate the constituent viral proteins. However, incomplete and ambiguous genome annotations combined with extensive sequence divergence make it very difficult to reliably identify the regions encoding each mature protein in all species. We therefore took a genome-agnostic approach, in which polyprotein coding sequences were split into sequentially overlapping 300-residue blocks for structural inference by two leading prediction models: ColabFold–AlphaFold2 (hereafter referred to as ColabFold) and ESMFold^17,18,19. This provided a comprehensive survey of protein structure across the Flaviviridae (458 species, more than 16,000 sequence blocks and nearly 33,000 predicted structures), referred to here as the ‘protein foldome’.

As protein structural prediction has yet to be systematically applied in virology, we first evaluated folding performance. ColabFold performed extremely well for many virus species (for example, dengue virus 2; Fig. 1b and Extended Data Fig. 2a–c). However, its accuracy depends on the depth of the MSAs that guide structural inference²⁰, with shallow MSAs producing low confidence predictions (Fig. 1c). This becomes particularly problematic for the LGF, which are poorly sampled and consequently underrepresented in sequence databases, resulting in consistently shallow MSAs (Fig. 1d).

Structural inference by ESMFold is driven by a protein language model and does not require MSAs, but is less accurate than ColabFold. A comparison of folding confidence demonstrated that ColabFold generally outperformed ESMFold across the three major Flaviviridae clades (Fig. 1e and Extended Data Fig. 2d). However, for the LGF, ESMFold yields informative predictions from sequences for which ColabFold fails, and this proved important for downstream analysis.
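Folding confidence of this kind is often summarized as the mean pLDDT of a predicted model. The following is a minimal sketch, assuming the model is written in PDB format with the per-residue pLDDT stored in the B-factor column (the convention used by AlphaFold2-derived pipelines). Whether a given ESMFold output reports pLDDT on a 0–100 or 0–1 scale should be checked before the two methods are compared. The helper name `mean_plddt` is hypothetical and is not taken from the authors' scripts.

```python
def mean_plddt(pdb_text):
    """Average the B-factor field over C-alpha atoms of a PDB-format model.

    Assumes fixed-width PDB ATOM records: atom name in columns 13-16 and
    B-factor (here interpreted as pLDDT) in columns 61-66.
    """
    values = []
    for line in pdb_text.splitlines():
        # One C-alpha per residue gives one confidence value per residue.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no C-alpha ATOM records found")
    return sum(values) / len(values)
```

Applying the same function to the ColabFold and ESMFold models of one sequence block gives a simple per-block comparison of the two methods.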

Flaviviridae glycoprotein discovery

Orthoflaviviruses, including yellow fever virus (the canonical species for the group), tick-borne encephalitis virus (TBEV) and dengue virus (DENV), are predominantly vector-borne, although exceptions exist²¹. These viruses possess the E glycoprotein, a prototypical class II fusion protein. Structurally and functionally homologous class II fusion proteins have been identified both in viruses (for example, Gc in the bunyaviruses) and in eukaryotes (for example, HAP2 in plants, protists and invertebrates) and they are expected to share a common ancestor^{22,23,24,25,26}. In viruses, class II fusion proteins are accompanied by a partner glycoprotein that is responsible for regulation and/or chaperoning of the fusogenic component. In the orthoflaviviruses this function is performed by the small glycoprotein prM²⁷.

Identifying membrane fusion mechanisms in the hepaciviruses, pegiviruses and pestiviruses has proved more challenging. These viruses possess E1 and E2 glycoproteins that work in concert to achieve pH-dependent membrane fusion. The sequences of E1E2 bear no similarity to prM/E, and experimental structures of E2 from prototypical hepaciviruses and pestiviruses reveal folds that are broadly dissimilar from one another and from the E glycoprotein in orthoflaviviruses^{28,29,30,31,32}. Recent cryo-electron microscopy analyses suggest that E1 adopts a unique fold, unlike that of any other known protein^33,34.

Whether E1E2 represents a novel and as yet uncharacterized fusion mechanism or a highly divergent iteration of a class II system remains to be determined. Understanding the distribution and characteristics of glycoproteins across the Flaviviridae would likely provide insight into this question. To this end, we performed pairwise Foldseek³⁵ structure similarity searches against the Flaviviridae protein foldome using a custom reference library comprising selected experimental glycoprotein structures from the Protein Data Bank (PDB) and published ColabFold models³⁶.

To benchmark our approach we performed parallel analyses using state-of-the-art sequence-based approaches (DIAMOND and InterProScan^37,38), which did not detect deep homology, even for highly conserved targets (Extended Data Fig. 3). By contrast, Foldseek demonstrated strong sensitivity, successfully detecting unambiguous structural homology between Pegivirus/Hepacivirus and Pestivirus E1 even though they share only 10–15% amino acid sequence identity (Fig. 2b,d). Relative to E1, the Pegivirus/Hepacivirus and Pestivirus E2 are structurally divergent³⁶; nonetheless, Foldseek identified reciprocal structural similarity focussed on the C-terminal portion of E2, where sequence identity ranged from 8.5 to 15% (Fig. 2b,e). The distributions of E1 and E2 were in near-perfect correlation, consistent with mechanistic interdependence, and we found no evidence for E1E2-like folds outside of the Pegivirus/Hepacivirus and Pestivirus groups.

Fig. 2: Discovery of glycoproteins across the Flaviviridae.

We mapped structural homologues of E glycoprotein to the orthoflaviviruses, orthoflavivirus-like, jingmenviruses, LGF and pestivirus-like species that sit basal to the classical Pestivirus genus (for example, Xinzhou spider virus 3) (Fig. 2b,f). For the most divergent sequences (for example, LGF and pestivirus-like viruses) detection required ESMFold structures, emphasizing the value of using complementary prediction methods (Extended Data Fig. 4). A notable exception was a group of viruses of unknown hosts discovered in environmental samples^39,40,41 (for example, Inner Mongolia sediment flavi-like virus 3) for which no glycoproteins were identified (Extended Data Fig. 5). Whether these represent species without structural proteins or partial genomes remains to be determined.

For most E glycoprotein homologues, the predicted structures were sufficient to identify the hydrophobic fusion loop at the tip of domain II, which inserts into host membranes and is central to the class II fusion mechanism (Extended Data Fig. 6). However, the fusion loop was absent from the jingmenvirus E homologues, suggesting substantial mechanistic divergence in these viruses. We were only able to detect the prM partner glycoprotein within the orthoflaviviruses and some orthoflavivirus-like viruses (Fig. 2b,g). A critical function of prM is occluding the fusion loop of E during particle maturation and we may yet expect to find orthologous partners in other clades.

Glycoproteins follow ecological niche

We could assign either E1E2 or E glycoproteins to the vast majority of species in the Flaviviridae (Fig. 2b). Their distributions divide the family broadly in two, although this division is incongruent with the RdRp phylogeny, suggesting a complex evolutionary history. In particular, the pestiviruses and LGF, which represent sister clades on the RdRp tree, possess E1E2 and E, respectively. Although mapping glycoproteins was the primary focus of our study, we also compared the foldome against other Flaviviridae proteins from the PDB. In doing so we observed that all species with a methyltransferase (MTase) also possess an E glycoprotein homologue, and all species with E1E2 (that is, the hepaciviruses, pegiviruses and pestiviruses) lack MTase (Fig. 2b and Extended Data Fig. 7). Viruses with MTase undergo cap-dependent translation, whereas those without MTase rely on an internal ribosome entry site⁴² (IRES). We reasoned that E–MTase and E1E2–IRES may represent divergent co-adaptations to particular ecological niches, so we compared virus–host associations across the phylogeny (Fig. 2c). Viruses with E–MTase infect a variety of hosts, including those that are transmitted between vertebrates by invertebrate vectors (for example, DENV by Aedes mosquitoes). By contrast, E1E2–IRES was strictly correlated with vertebrate hosts. This suggests that the gain of E1E2 and an IRES, with the concomitant loss of E and MTase, represent a molecular commitment to the vertebrate niche. Moreover, on the basis of the underlying RdRp phylogeny, this commitment to vertebrate infection is likely to have occurred twice in the Flaviviridae, once for Pegivirus/Hepacivirus and once for the pestiviruses.

LGFs harbour novel and acquired proteins

Our structure-guided approach can offer new insights into divergent and/or poorly characterized viruses such as those found in the LGF. Whereas the majority of LGF species are likely to infect invertebrates, there is evidence that one subclade (the Bole tick virus group) is capable of tick-borne infection of mammals, including humans⁶. This group may represent an emergent threat to public health and warrants closer scrutiny.

Focusing on Bole tick virus 4 (BTV4), we examined the N-terminal portion of the polyprotein proximal to the E glycoprotein homologue. The N-terminal structural proteins of Flaviviridae polyproteins are typically processed by host signal peptidases to liberate the mature proteins. We used cleavage site prediction to identify five putative protein coding sequences (labelled A–E; Fig. 3a). Protein structures were predicted using three approaches—ESMFold, ColabFold and ColabFold with manually curated MSAs (Methods). For each sequence, we provide the highest confidence model produced by any of these methods (Fig. 3a).

Fig. 3: Novel and acquired proteins in a large genome flavivirus.

The largest and most C-terminal protein is the E glycoprotein homologue. ESMFold produces a model in which the transmembrane domain and fusion loop are juxtaposed; this is consistent with experimental structures of the post-fusion conformation of E⁴³. Directly upstream of the E homologue are two smaller proteins for which custom ColabFold yielded the optimal predictions. Although neither protein shares direct homology with prM, they both have a similar organization, consisting of a small globular domain anchored by a putative transmembrane domain. Given their direct proximity to the E homologue, we suggest that these are partner proteins that provide chaperoning to the class II fusogen. Indeed, the putative partner 1 possesses a furin cleavage site, analogous to prM, that would enable proteolysis during secretion, akin to the maturation of Orthoflavivirus particles²⁷. Protein coding sequence B yielded low confidence predictions from each folding approach (not shown). The most N-terminal sequence was identified as a T2 family ribonuclease (RNase T2), with homologues across the tree of cellular life⁴⁴.

We used Foldseek to investigate the distribution of these protein structures across the LGF/Pestivirus clade (Fig. 3b). Homologues of BTV4 E glycoprotein were detected throughout the LGF and in pestivirus-like viruses identified in spiders and cartilaginous fish that fall basal to members of the classical genus Pestivirus. Therefore, using a proximal reference (BTV4 E glycoprotein), we provide evidence for the loss of E and gain of E1E2 at the genesis of the pestiviruses. By contrast, the putative partner proteins were confined to the Bole tick virus subclade, and structural similarity searches against current protein databases (for example, PDB and AlphaFoldDB) revealed no homologues. Therefore, these proteins are likely adaptive features, specific to these viruses.

BTV4 RNase T2 has homologues throughout the Bole tick virus subclade and, notably, across the genus Pestivirus, where the homology maps to the E^rns ribonuclease. Foldseek searches against pestivirus E^rns revealed a reciprocal distribution of homology (Fig. 3b,c). Phylogenetically, the LGF/pestivirus E^rns form a deep branch amongst homologous RNase T2 sequences from viruses, bacteria, plants and animals (Fig. 3d,e). Together, this indicates that E^rns originated in a distant ancestor of the pestiviruses and LGFs, probably from a single horizontal gene transfer of a bacterial RNase T2. Moreover, the distribution of E^rns is broadly concordant with the RdRp phylogeny, suggesting that E^rns has been continuously retained in certain species and lost in others (Fig. 3e), rather than undergoing genetic exchange within the clade. Further instances of RNase T2 from nidovirus-like and virgavirus-like viruses were also nested within the E^rns tree, indicating onward genetic transfer to other RNA viruses.

Evolutionary history of the Flaviviridae

Our approach enabled the discovery of glycoproteins (and other features) across the entire Flaviviridae. To better understand the evolutionary events that gave rise to this distribution of molecular characteristics, we utilized a method⁴⁵ that leverages structural conservation (as represented in the Foldseek 3Di alphabet; Extended Data Fig. 8) to guide and augment traditional amino acid-based evolutionary analyses (Methods). This revealed consensus-level glycoprotein sequence similarities, indicative of shared ancestries, that can be estimated through phylogenetic modelling (Fig. 4a–c and Extended Data Figs. 9–11).

Fig. 4: Structurally informed phylogenetics.

The optimal E glycoprotein phylogeny largely reflects the RdRp tree, with E homologues from orthoflaviviruses, orthoflavivirus-like viruses, jingmenviruses and LGFs distributed across various subclades (Fig. 4a). Notably, the E protein homologues in pestivirus-like viruses of spiders fell within the LGF glycoprotein clade, similar to the RdRp tree topology (Fig. 3b). This again suggests that the acquisition of E1E2, accompanied by loss of E, was a defining event in the emergence of the pestiviruses from an LGF-like progenitor.

Both the E1 and E2 phylogenies indicate a common glycoprotein ancestry in Pegivirus/Hepacivirus and Pestivirus groups (Fig. 4b,c), even though they are paraphyletic in the RdRp phylogeny (Fig. 1a). Of note, the Wenling moray eel hepacivirus (which is basal to the pegivirus/hepacivirus RdRp lineage) sits at the intersection of the Pegivirus/Hepacivirus and Pestivirus E1 and E2 clades, consistent with deep ancestry. We could not detect any significant structural homology between E1E2 and E (Fig. 2), or identify intermediate forms between these glycoprotein systems, further suggesting they are mechanistically distinct. We therefore propose that E1E2 represents a novel class of fusion protein. Moreover, the structural and sequence conservation within E1 and the basal portion of E2 (Figs. 2 and 4b,c and Extended Data Fig. 8) suggests a mechanistic role requiring experimental exploration.

On the basis of our combined analyses, we propose a Flaviviridae evolutionary history shaped by gains and losses of defining protein functions, as summarized in Fig. 5. The most parsimonious interpretation of the data is that Orthoflavivirus/jingmenvirus and LGF/Pestivirus clades (lineage 1) arose from an ancestor that possessed the E glycoprotein and performed cap-dependent translation, necessitating MTase. By contrast, the Pegivirus/Hepacivirus clade (lineage 2) arose from an ancestor that possessed E1E2 glycoproteins and lacked MTase, implying reliance on IRES-dependent translation.

Fig. 5: Proposed evolutionary history of the Flaviviridae.

Compared with lineage 2 (where the hepaciviruses and pegiviruses share relatively conserved RdRp and glycoproteins), lineage 1 has undergone extensive diversification. Genome segmentation occurred in the jingmenviruses, with coincident divergence in their E glycoprotein¹⁶, including the apparent loss of its canonical fusion loop. The orthoflaviviruses gained prM, a partner to the E glycoprotein, probably derived from a host chaperonin⁴⁶. A sister lineage gave rise to the LGF and Pestivirus clades, in which an ancestral species gained RNase T2 from bacteria. Whereas the LGF, including basal pesti-like viruses, possess E glycoprotein, all pestiviruses possess E1E2 glycoproteins, homologous to those found in the hepaciviruses and pegiviruses. This indicates a switch in glycoprotein systems through an inter-genus horizontal gene transfer, with concomitant loss of MTase. In lineages 1 and 2, the presence of E1E2 (and IRES-dependent translation) is strictly associated with viruses of vertebrates, suggesting a molecular commitment to ecological niche.

The characteristics of the common ancestor of the entire Flaviviridae remain speculative. On the basis of the taxonomic distribution of infected hosts and the existence of endogenous viral elements, this ancestor may have originated more than 900 million years ago⁴⁷. It possibly contained core NS3 and NS5 proteins and potentially an MTase, although the absence of RrmJ-like methyltransferases (as found in the Flaviviridae) in other members of the phylum Kitrinoviricota⁴⁸ hints that MTase might have been acquired at the base of lineage 1. It remains unclear whether this progenitor possessed an envelope and therefore required fusion glycoproteins, although it is noteworthy that, with the exception of Togaviridae, Matonaviridae and Flaviviridae, all Kitrinoviricota families (n = 21) are non-enveloped.

Ultimately, the origins of the E and E1E2 glycoproteins remain uncertain. We cannot exclude the possibility of hidden E glycoprotein horizontal gene transfer within the Flaviviridae. For instance, alternative phylogenetic models to the one in Fig. 4a place the jingmenvirus E within the LGF clade (although this may be an artefact of long branch attraction; see Methods and Extended Data Fig. 9). Moreover, we have observed apparent ‘genetic piracy’ by LGFs (for example, RNase T2 (Fig. 3) and in the recent work of Petrone et al.¹³) and therefore these viruses may have acquired their glycoproteins by horizontal gene transfer from within the Flaviviridae or beyond. Resolving these questions will probably require the discovery of further novel species and the inclusion of diverse taxa from beyond the Flaviviridae, such as the Matonaviridae, Togaviridae and Peribunyaviridae, which also possess class II fusion proteins. However, these analyses, across a wide diversity of the RNA virosphere, are likely to challenge even the highly sensitive structure-driven approaches outlined here.

Discussion

The limited ability of sequence-based methods to detect deep homology has resulted in significant ambiguity regarding the distribution and classification of fusion glycoproteins across the Flaviviridae. Our work, using protein structure prediction, has discovered previously unknown glycoproteins in more than 100 species and reveals unambiguous structural and sequence similarity between E1E2 in the hepaciviruses, pegiviruses and pestiviruses, indicative of inter-genus genetic exchange. The absence of homology between the E glycoprotein (class II fusion mechanism) and E1E2, even in basal species, provides the strongest evidence yet of a novel fusion mechanism in the Pegivirus/Hepacivirus and Pestivirus groups. Through comparison to host tropism, we found that E1E2 is strictly correlated with infection of vertebrates, suggesting a molecular commitment to virological niche.

Beyond biological insights, our work demonstrates that protein structure prediction and structure-guided homology searches outperform the gold standard sequence-based approaches to provide unprecedented clarity to the evolution of viruses. Whereas AlphaFold-based methods offer unparalleled accuracy²⁰, protein language model-based systems such as ESMFold may be more capable of exploring the ‘viral dark matter’ revealed by metatranscriptomics. In sum, our study offers a new state-of-the-art approach for understanding the diversity and distribution of protein functions throughout the virosphere.

Methods

Compilation of Flaviviridae sequence set

Retrieval of flavivirus genomes

Flavivirus sequences were collected using the search phrase “Flaviviridae taxid 11050 and Unclassified Flaviviridae taxid 38144” in the NCBI Virus Database on 15 December 2022. The search was complemented by referencing sequences from Mifsud et al.⁵⁰ and supplemented with sequences from the NCBI nucleotide database using the search phrase “flavi[All Fields] OR pesti[All Fields] OR hepaci[All Fields] OR pegi[All Fields] AND viruses[filter]” on the same date. Additional sequences were later retrieved from publications that had sequences not available in GenBank at the time^{12,40,51,52,53,54,55,56}.

Sequence set curation

Sequences were clustered to a 95% nucleotide identity threshold to approximate a species-level distinction, excluding the LGF tick-associated clade. Clustering was performed using CD-HIT (v4.6.1)⁵⁷ with non-default parameters “cd-hit-est -c 0.95 -n 9”. Subsequently, the clustered sequence set was manually curated by removing incomplete coding regions. Sequences shorter than 2,000 nucleotides in length were removed, with the exception of the jingmenviruses, in which segments are known to be shorter than 2,000 nucleotides. These nucleotide sequences were translated using the Geneious Prime Find ORFs tool (v2022.0) (https://www.geneious.com/)⁵⁸ and, along with protein sequences, aligned to annotated reference sequences (where available) using MAFFT FFT-NS-I X2 (v7.402) to assess genome completeness⁵⁹. This was complemented by predicting conserved domains using the InterProScan software package (v5.56-89.0) with the SFLD (v4.0), PANTHER (v17.0), SuperFamily (v1.75), PROSITE (v2022_01), CDD (v3.18), Pfam (v34.0), SMART (v7.1), PRINTS (v42.0) and CATH-Gene3D (v4.3.0) databases³⁸. Sequences determined to contain partial coding sequences were removed from the subsequent analyses.

Discovery of novel LGF sequences

Tick-associated LGFs are of particular interest due to the recently reported association between Haseki tick virus and tick-borne infectious disease in humans⁶. To identify related viruses, we screened the Sequence Read Archive (SRA) RdRp microassemblies generated by Serratus⁶⁰ using DIAMOND BLASTx (v2.0.9)³⁷ (e-value threshold of 10⁻⁵ and the “--ultra-sensitive” flag) with Haseki tick virus (UTQ11742) as the query. A stricter e-value threshold of 1.6 × 10⁻¹⁵ was then established to restrict the number of libraries for reassembly to a manageable quantity. This threshold was determined based on the organism associated with the SRA library and the percent identity values. The 319 SRA libraries that met this threshold were processed following the BatchArtemisSRAMiner pipeline⁶¹. In brief, raw FASTQ files were retrieved using Kingfisher (v0.3.0) (https://github.com/wwood/kingfisher-download), quality trimming and adapter removal were performed using Trimmomatic (v0.38)⁶² with parameters SLIDINGWINDOW:4:5, LEADING:5, TRAILING:5 and MINLEN:25, and de novo assembly was performed using MEGAHIT (v1.2.9)⁶³ with default parameters. The assembled contigs were compared to the NCBI non-redundant protein database (as of March 2023) and a custom Flaviviridae protein database using DIAMOND BLASTx as described above. All novel flaviviruses predicted to contain complete coding sequences identified by this method (including those outside of the LGF group) were included in phylogenetic analyses.

Structure prediction and homology search

Systematic protein structure prediction

We adopted a strategy to overcome incomplete and ambiguous genome annotations, and generate sequence lengths that are amenable to rapid inference of structure. Flaviviridae polyprotein amino acid sequences were broken into sequential 300-residue blocks with a 100-residue overlap. However, most polyproteins are not equally divisible by 300; we therefore set the final sequence block to cover the final 300 residues of the polyprotein, irrespective of overlap with the penultimate block. This resulted in 16,463 sequence blocks from 461 species (458 from the Flaviviridae and 3 from the Tombusvirus outgroup). Structures were predicted for each sequence using the ColabFold (v1.5.1) implementation of AlphaFold2 (v2.3)¹⁹, with default settings but only generating a single model per target, performed using Google Colab cloud computing. Structural inference was also performed with ESMFold (v1)¹⁸ (using the 3 billion parameter ESM-2 model), on local compute (Nvidia V100 GPU + 32GB vRAM). This resulted in a total of 32,926 structural models. Custom Python scripts were used to break up sequences for folding and extract metrics from outputs (that is, pLDDT confidence and MSA depth). For inference of putative mature protein sequences (Fig. 3) the SignalP server (v6.0) was used to predict the junctions between viral proteins⁶⁴. For custom ColabFold inference (Fig. 3), whole polyprotein sequences of the Bole tick virus group were aligned using MAFFT, MUSCLE (v5.1)⁶⁵, and subalignments covering only the putative protein sequences were converted to the .a3m format and used as input for ColabFold structure prediction¹⁹. All predicted structures and summary statistics are included in the associated Zenodo repository (https://doi.org/10.5281/zenodo.11092288)⁶⁶. Representative structural superpositions (Fig. 4) were performed using FATCAT (v2.0)⁴⁹. All structural visualizations were prepared for publication using UCSF ChimeraX⁶⁷.
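The block-splitting step described above can be sketched as follows. This is an illustrative reconstruction and not the authors' script. It assumes that a 100-residue overlap means consecutive blocks start 200 residues apart, that a polyprotein shorter than one block is kept whole, and that the final block is anchored to the last 300 residues as stated. The function name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(sequence, block_len=300, overlap=100):
    """Split a sequence into overlapping blocks, returning (start, block) pairs.

    Blocks start every (block_len - overlap) residues; if the regular grid
    does not end exactly at the C terminus, one extra block covering the
    final block_len residues is appended.
    """
    if not 0 <= overlap < block_len:
        raise ValueError("overlap must be non-negative and smaller than block_len")
    n = len(sequence)
    if n <= block_len:
        return [(0, sequence)]            # short sequences are kept whole
    step = block_len - overlap
    starts = list(range(0, n - block_len + 1, step))
    if starts[-1] != n - block_len:
        starts.append(n - block_len)      # final block covers the last residues
    return [(s, sequence[s:s + block_len]) for s in starts]


# A 1,000-residue toy polyprotein yields block starts 0, 200, 400, 600 and 700.
toy_blocks = split_into_blocks("A" * 1000)
```

Returning the start coordinate with each block makes it straightforward to map a structural hit back to its position in the polyprotein.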

Structural homology searches

We used Foldseek in exhaustive search mode to cross-compare the Flaviviridae protein foldome with a library of reference structures drawn from the Protein Data Bank (PDB) and ColabFold models of particular targets (see below). Foldseek was set to output e values, structurally aligned amino acid sequences, percentage identity of aligned residues, bit score and LDDT structural similarity, with an e-value cut-off of 0.1 to eliminate low-probability hits and reduce the size of the output data file. To interrogate the output data, the lowest e value for any given species against any given reference structure was extracted using a custom Python script. Where multiple references were used for a single protein, the lowest e value against any given species was chosen. These data were plotted against sequence-based phylogenies using the Interactive Tree Of Life⁶⁸. Representative hits (Fig. 2d–g and Extended Data Fig. 7) were selected manually to reflect the levels of similarity and divergence in structure and sequence. All reference structures are included in the underlying data; the following experimental structures were used from the PDB: 6ZQI (Spondweni virus E and prM), 1L9K (DENV-2 MTase), 5F3Z (DENV-3 RdRp), 7QRF (TBEV E and prM), 7V1E (Omsk haemorrhagic fever virus MTase), 7T6X (HCV E1 and E2), 6VYB (SARS-CoV-2 spike, negative control), 2YQ2 (BVDV E2) and 4DVK (BVDV Eʳⁿˢ)²⁸,³³,⁴⁶,⁶⁹,⁷⁰,⁷¹,⁷²,⁷³,⁷⁴. To increase reference coverage, additional ColabFold models of DENV-1 prM E and diverse Hepacivirus, Pegivirus and Pestivirus E1 and E2 (from Oliver et al.³⁶) were also used. ColabFold or ESMFold structures of BTV4 proteins (Fig. 3a) were used in downstream Foldseek analysis and as references in the assembly of continuous glycoprotein structures for the structural phylogeny work presented in Fig. 4 (see below).
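
The extraction of the lowest e value per species and reference protein can be sketched as follows. This is an illustrative Python outline rather than the custom script used in the study: the function name `best_evalues`, the tuple layout of the hits and the `ref_to_protein` mapping are assumptions, and the e values in the example are invented placeholders. The two steps in the text (lowest e value per reference, then lowest across references for the same protein) are collapsed into a single minimum, which gives the same result.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (species, reference structure, e value)


def best_evalues(hits: Iterable[Hit],
                 ref_to_protein: Dict[str, str],
                 cutoff: float = 0.1) -> Dict[Tuple[str, str], float]:
    """Return the lowest e value for each (species, protein) pair.

    References that map to the same protein are pooled. Hits above the
    cut-off are ignored. Hypothetical helper for illustration only.
    """
    best: Dict[Tuple[str, str], float] = {}
    for species, reference, evalue in hits:
        if evalue > cutoff:
            continue
        # Fall back to the reference name if no protein label is given.
        protein = ref_to_protein.get(reference, reference)
        key = (species, protein)
        if key not in best or evalue < best[key]:
            best[key] = evalue
    return best


if __name__ == "__main__":
    # Placeholder hits: species names and e values are invented.
    example_hits = [
        ("virus_A", "6ZQI", 1e-20),
        ("virus_A", "7QRF", 1e-25),
        ("virus_A", "7T6X", 0.5),
        ("virus_B", "7T6X", 1e-8),
    ]
    mapping = {"6ZQI": "E/prM", "7QRF": "E/prM", "7T6X": "E1E2"}
    print(best_evalues(example_hits, mapping))
```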

Sequence homology search benchmarking

To demonstrate the increased sensitivity achieved through structure prediction approaches, we conducted two benchmarking analyses. In the first analysis, we recapitulated the Foldseek search by querying the 300-residue blocks against the cognate protein sequences underlying the reference structure database using DIAMOND BLASTp (e-value threshold of 0.1 and the “--ultra-sensitive” flag)³⁷. We then filtered the results to select the block with the lowest e value for each flavivirus and reference sequence pair. The second analysis involved annotating the complete Flaviviridae polyprotein sequences using the InterProScan software package (v5.65-97.0) with the AntiFam (v7.0), FunFam (v4.3.0), MobiDBLite (v2.0), NCBIfam (v13.0), SFLD (v4.0), PANTHER (v18.0), SuperFamily (v1.75), CDD (v3.20), Pfam (v36.0), SMART (v9.0), PRINTS (v42.0) and PIRSF (v2023_05) databases³⁸. As e values are specific to each InterPro database and each applies its own e-value post-processing, direct comparisons are not feasible. Consequently, as advised, all matches were considered tentative hits (https://interproscan-docs.readthedocs.io/en/latest/FAQ.html).

Phylogenetic analysis

NS5b phylogeny

The evolutionary relationships among the Flaviviridae were inferred using maximum likelihood phylogenies derived from MSAs of the highly conserved NS5b region (which encodes the RdRp). This region was extracted from each sequence by aligning polyprotein sequence subsets according to their taxonomy and using both pre-existing and newly generated NS5b annotations from InterProScan as a guide. As alignment and trimming parameters have been shown to influence the topology of the Flaviviridae⁷⁵, we compared several methods, resulting in 225 phylogenies. In brief, flavivirus sequences were aligned using MAFFT, MUSCLE (v5.1)⁶⁵ and Clustal Omega (v1.2.4)⁷⁶ with default parameters. Ambiguously aligned regions were removed using trimAl (v1.4.1)⁷⁷ with 8 conservation thresholds (that is, minimum percentage of alignment columns to retain): 5, 7.5, 10, 12.5, 15, 17.5, 20 and 25; and 3 gap thresholds (that is, the minimum fraction of sequences without a gap needed to keep a column): 0.7, 0.8 and 0.9, as well as the automated parameter selection method gappyout.

All maximum likelihood phylogenetic trees were estimated using IQ-TREE 2 (v2.1.0)⁷⁸. The best-fit model of amino acid substitution was inferred for a subset of phylogenies using the ModelFinder function in IQ-TREE 2⁷⁹. In addition to the model chosen by ModelFinder (LG + F + R10), two further models, the Le–Gascuel model (LG) and FLAVI⁸⁰, were compared. Branch support was calculated using 1,000 bootstrap replicates with the UFBoot2 algorithm and an implementation of the SH-like approximate likelihood ratio test within IQ-TREE 2⁸¹,⁸². To root the phylogeny, three members of the Tombusviridae family were chosen given their remote sequence similarity to the NS5 region of the Flaviviridae²,¹⁰. Phylogenetic trees were annotated using the R packages ape (v5.6.2)⁸³, phytools (v1.5-1)⁸⁴ and ggtree (v3.3.0.9)⁸⁵ and further edited in Adobe Illustrator. Genome diagrams were constructed using a manually curated selection of predicted functional domains and visualized using gggenomes (v0.9.8.9)⁸⁶.
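
The figure of 225 phylogenies is consistent with one reading of the parameters listed above, sketched here in Python. The full crossing of the conservation and gap thresholds, and the application of all three substitution models to every trimmed alignment, are assumptions made for this illustration; the text does not state explicitly how the combinations were formed.

```python
from itertools import product

# Values taken from the text; how they were combined is an assumption.
aligners = ["MAFFT", "MUSCLE", "Clustal Omega"]
conservation_thresholds = [5, 7.5, 10, 12.5, 15, 17.5, 20, 25]
gap_thresholds = [0.7, 0.8, 0.9]
models = ["LG+F+R10", "LG", "FLAVI"]

# 8 x 3 threshold pairs plus the automated gappyout method = 25 settings.
trimming_settings = [
    {"method": "thresholds", "cons": cons, "gt": gt}
    for cons, gt in product(conservation_thresholds, gap_thresholds)
]
trimming_settings.append({"method": "gappyout"})

# One entry per (aligner, trimming setting, substitution model).
grid = list(product(aligners, trimming_settings, models))

if __name__ == "__main__":
    print(len(trimming_settings))  # 25
    print(len(grid))               # 225
```

Under these assumptions, 3 aligners, 25 trimming settings and 3 models give 3 × 25 × 3 = 225 combinations.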

For each virus sequence, host information was pulled from the corresponding GenBank ‘host’ field using Rentrez (v1.2.3)⁸⁷ and standardized using Taxize (v0.9.1)⁸⁸. Vector status, defined as ‘yes’, ‘no’ or ‘potentially’, was assigned by first querying the Arbovirus Catalog (https://wwwn.cdc.gov/arbocat/). A taxon identified as an ‘Arbovirus’ by the Arbovirus Catalog was assigned ‘yes’. For taxa listed as ‘potential arboviruses’ or ‘probable arboviruses’, or not present in the catalogue, the literature on the taxon was reviewed for evidence of vector association. Three main criteria were considered: (1) whether the virus replicated in both invertebrate and vertebrate cells; (2) the phylogenetic position of the virus (that is, whether the virus falls in the middle of an insect-specific clade); and (3) consensus among the literature on the possibility of the virus being vectored. The assigned vector status for each taxon and the underlying evidence for this are provided in Supplementary Table 3.

Evaluation of topological concordance

To determine the most robust NS5b phylogeny, alignments (pre- and post-trimming) (n = 225) were examined for the presence of canonical RdRp motifs, misalignments, and overall pairwise identity and length. The resultant tree topology and branch support were examined in FigTree (v1.4.4)⁸⁹. This analysis was combined with comparisons of genome composition and with previous Flaviviridae phylogenies¹⁰,⁴⁷,⁷⁵ to identify the most concordant topology across the multiple parameters tested. To supplement this, the R package treespace (v1.1.4.2)⁹⁰ was used to conduct a principal component analysis (PCA) with the goal of identifying clusters of similar trees and assessing whether the selected topology is consistent with the median topology. Accordingly, Kendall–Colijn distances were calculated between trees and used for the PCA, with two principal components retained⁹¹. To identify discrete clusters of related trees, pairwise distances were mapped into four clusters using hierarchical clustering (Ward’s method)⁹². Manual and distance-based inspection revealed that the alignment method drove variation in tree topology and branch lengths between phylogenies. Specifically, tree topologies, and their corresponding phylogenetic distances, derived from Clustal Omega alignments were frequently discordant with those generated from MAFFT and MUSCLE alignments; as such, these phylogenies were excluded and the PCA was recalculated. Geometric median trees were generated from each cluster and alignment method and used to inform the selection of the final phylogeny. The final phylogeny was based on a MUSCLE alignment trimmed with trimAl conservation and gap thresholds of 5 and 0.9, respectively, and on the LG + F + R10 amino acid substitution model.

We further conducted an extensive stratified MUSCLE alignment analysis, which considered variations in hidden Markov model (HMM) parameters and guide tree merge orders, to validate the robustness of our NS5b alignment and resulting phylogenies (Supplementary Note 1 and Supplementary Figs. 2–7).
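
As a rough illustration of picking a representative tree per cluster, the Python sketch below selects a medoid, that is, the tree with the smallest summed distance to the other trees in its cluster. This is a simplified stand-in and not the treespace method: treespace was run in R and computes geometric median trees in its own tree vector space. The function name `cluster_medoids`, the distance matrix and the cluster labels are hypothetical inputs, assumed to have been computed beforehand (for example, pairwise Kendall–Colijn distances and Ward cluster assignments).

```python
from typing import Dict, List, Sequence


def cluster_medoids(dist: Sequence[Sequence[float]],
                    cluster_of: Sequence[int]) -> Dict[int, int]:
    """Return, for each cluster label, the index of its medoid tree.

    dist is a symmetric matrix of pairwise tree distances and cluster_of
    gives the cluster label of each tree. Medoid selection is used here as
    a simple proxy for a geometric median tree (an assumption).
    """
    members: Dict[int, List[int]] = {}
    for idx, label in enumerate(cluster_of):
        members.setdefault(label, []).append(idx)

    medoids: Dict[int, int] = {}
    for label, idxs in members.items():
        # Tree with the smallest total distance to its cluster mates.
        medoids[label] = min(idxs, key=lambda i: sum(dist[i][j] for j in idxs))
    return medoids


if __name__ == "__main__":
    # Toy distances between four trees; values are invented.
    toy_dist = [
        [0, 1, 1, 5],
        [1, 0, 2, 5],
        [1, 2, 0, 5],
        [5, 5, 5, 0],
    ]
    print(cluster_medoids(toy_dist, [0, 0, 0, 1]))
```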

RNase T2 phylogeny

To infer the evolutionary history of the RNase T2 protein, sequences were obtained from the GenBank protein database for conserved domains using the queries “taxid 238513” and “taxid 238220” and from literature searches⁹³,⁹⁴; these were supplemented with structurally homologous protein clusters identified using the AlphaFold database Foldseek clusters server⁹⁵.

To identify unannotated RNase T2-like sequences in virus genomes, an NCBI web protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi) was run with RNase T2 sequences as queries against the clustered non-redundant protein database (as of June 2023)⁹⁶, using the BLOSUM45 matrix and with taxonomy limited to the group ‘Viruses’ (taxid:10239). The HMM search web server (v2.41.2)⁹⁷ was used to identify additional viral T2 RNase-like sequences. An alignment of RNase T2 sequences was used as a query against the Reference Proteomes, UniProtKB, SwissProt and PDB databases (as of June 2023), with results again limited to ‘Viruses’ (taxid:10239). This was further repeated for the PDB, SCOPe, SMART, Pfam, UniProt-SwissProt-Viral, PHROG, COG and KOG databases using the HHpred web server⁹⁸,⁹⁹ (as of April 2024) and for the Uniclust30 database using HHsearch (v3.3.0)¹⁰⁰,¹⁰¹. For all methods, if new virus sequences were detected, they were manually inspected for the presence of RNase T2 motifs and, in turn, used as queries. To estimate the RNase T2 phylogeny, non-viral sequences were clustered at 80% amino acid identity using DIAMOND cluster (v2.0.9)¹⁰² with default parameters and aligned with the viral sequences using MAFFT; a maximum likelihood phylogeny was then estimated as described above.

Structure-guided glycoprotein phylogenies

We implemented the approach described previously⁴⁵ to infer glycoprotein phylogenies based on both structural and sequence homology. Owing to the arbitrary fragmentation of protein sequences into 300-residue blocks, our predicted structures represent overlapping truncated segments of the true glycoproteins. To generate full glycoprotein structures, we filtered our Foldseek results by an e-value cut-off of 0.001 and selected the E, E1 or E2 reference structure that had the highest bit score for any protein block of each query virus. This reference was then used to determine the putative coordinates of each glycoprotein in the virus polyprotein, defined as the start position of the earliest block’s Foldseek hit to the reference and the end position of the latest block’s hit. This process yielded 247 E, 190 E1 and 189 E2 protein sequences, the majority of which appeared to be full-length glycoproteins, with a minority of truncated sequences that are likely due to low-accuracy structure prediction in the protein foldome. The structures of the full glycoprotein sequences were predicted using ColabFold and ESMFold as described above, but with five ColabFold models produced for each target. For each target, the most confident prediction, based on average pLDDT, was chosen for downstream analysis.
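
The coordinate assembly described above can be outlined in Python as follows. This is a sketch under stated assumptions, not the authors' code: the `BlockHit` record, its field names and the function `glycoprotein_span` are hypothetical, hits are assumed to carry the 0-based offset of their block in the polyprotein together with 1-based query coordinates within the block, and the cut-off is treated as inclusive.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BlockHit:
    block_start: int  # 0-based offset of the block in the polyprotein
    qstart: int       # 1-based start of the hit within the block
    qend: int         # 1-based end of the hit within the block
    reference: str    # name of the reference structure
    evalue: float
    bits: float


def glycoprotein_span(hits: List[BlockHit],
                      max_evalue: float = 1e-3
                      ) -> Optional[Tuple[str, int, int]]:
    """Return (reference, start, end) in 1-based polyprotein coordinates.

    The reference with the single highest bit score is chosen; the span
    runs from the hit start in the earliest block to the hit end in the
    latest block matching that reference.
    """
    kept = [h for h in hits if h.evalue <= max_evalue]
    if not kept:
        return None
    best_ref = max(kept, key=lambda h: h.bits).reference
    ref_hits = [h for h in kept if h.reference == best_ref]
    first = min(ref_hits, key=lambda h: h.block_start)
    last = max(ref_hits, key=lambda h: h.block_start)
    return best_ref, first.block_start + first.qstart, last.block_start + last.qend


if __name__ == "__main__":
    # Invented hits for one virus; reference names are placeholders.
    example = [
        BlockHit(200, 50, 300, "ref_E_a", 1e-10, 400.0),
        BlockHit(400, 1, 280, "ref_E_a", 1e-8, 350.0),
        BlockHit(400, 10, 100, "ref_E_b", 1e-5, 90.0),
        BlockHit(0, 5, 50, "ref_E_a", 0.05, 20.0),  # fails the cut-off
    ]
    print(glycoprotein_span(example))
```

A span recovered in this way can be shorter than the true glycoprotein if the terminal blocks were poorly predicted, which matches the minority of truncated sequences noted in the text.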

We modified the FAMSA alignment program¹⁰³ to use the Foldseek 3Di character substitution matrix as described previously⁴⁵ (https://github.com/nmatzke/3diphy). We then converted our predicted full glycoprotein structures to 3Di sequences with the Foldseek ‘structureto3didescriptor’ option and used the modified FAMSA aligner to infer structural 3Di sequence alignments of the E, E1 and E2 protein sets. These MSAs are based on the homology between the 3Di characters corresponding to each protein residue and should represent the overall structural homology between these proteins. Consistent with the methodology of Puente-Lelievre et al.⁴⁵, we used trimAl⁷⁷ with a gap threshold of 35% to create trimmed versions of the 3Di MSAs. In addition to the 3Di character alignments, we replaced the 3Di characters with the protein amino acid residues in both the complete and trimmed versions of the MSAs. This resulted in a total of four MSAs (3Di, trimmed 3Di, amino acid, trimmed amino acid) for each of E, E1 and E2. ModelFinder⁷⁹ implemented in IQ-TREE 2⁷⁸ was used to determine the best substitution model for each alignment. All possible models were tested, including the custom 3Di substitution model (-mset Blosum62,Dayhoff,DCMut,JTT,JTTDCMut,LG,Poisson,Poisson+FQ,Poisson,PMB,WAG,EX2,EX3,EHO,EX_EHO,3DI -mfreq FU,F -mrate E,G,R). Selected substitution models for all alignments are included in the associated Zenodo repository (https://doi.org/10.5281/zenodo.11092288 (ref. ⁶⁶)). Phylogenetic trees based on each MSA were inferred using IQ-TREE 2 (v2.2.2.6)⁷⁸ under each corresponding best-fit substitution model, with node support assessed using 1,000 ultrafast bootstrap replicates⁸².
Finally, we performed phylogenetic inferences based on both 3Di character and amino acid homology by combining the corresponding pairs of 3Di and amino acid MSAs and performing a partition model IQ-TREE 2 phylogenetic inference¹⁰⁴, in which the two partitions correspond to the 3Di and the amino acid MSAs and each partition uses the best-fit substitution model of its corresponding MSA. The contribution of each partition to the combined MSA phylogenetic inferences was determined from the partition-wise log-likelihoods, inferred with IQ-TREE’s -wpl option. Manual inspection and analysis of partition contribution were used to select trees for display in Fig. 4 (all resulting phylogenetic trees are provided in Extended Data Figs. 9–11). In brief, the 3Di partition had a consistently larger contribution to the joint phylogenetic inference than the amino acid partition, although using both alignments (instead of either alone) generally aids phylogenetic reconstruction⁴⁵. However, in the case of the E glycoprotein phylogeny, the contribution of the amino acid partition was markedly lower than that of the 3Di partition (Supplementary Table 4). Moreover, we found clear evidence of long branch attraction in the amino acid-only reconstructions of the E phylogeny (Extended Data Fig. 9), and reasoned that these artefacts may carry over into the combined 3Di–amino acid reconstruction. Therefore, the 3Di-based phylogeny was selected for the E protein, whereas 3Di–amino acid trees were used for E1 and E2 (Fig. 4).
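
The replacement of 3Di characters with amino acid residues in an aligned sequence, mentioned above, amounts to threading the ungapped protein sequence through the gap pattern of its 3Di alignment row. The Python sketch below shows this for a single row of a complete (untrimmed) alignment; the function name `thread_residues` and the example strings are illustrative assumptions, and it relies on each 3Di character corresponding to exactly one residue. Trimmed alignments would additionally require the indices of the retained columns.

```python
def thread_residues(aligned_3di: str, residues: str, gap: str = "-") -> str:
    """Replace each 3Di character in an aligned row with its amino acid.

    aligned_3di is one gapped row of a 3Di alignment and residues is the
    ungapped amino acid sequence of the same protein. Gap positions are
    preserved. Hypothetical helper for illustration only.
    """
    n_states = sum(ch != gap for ch in aligned_3di)
    if n_states != len(residues):
        raise ValueError(
            f"row has {n_states} 3Di characters but {len(residues)} residues"
        )
    remaining = iter(residues)
    return "".join(gap if ch == gap else next(remaining) for ch in aligned_3di)


if __name__ == "__main__":
    # Toy strings, not real 3Di or protein sequences.
    print(thread_residues("D-VQ--L", "MKTA"))  # M-KT--A
```

Because the gap pattern is copied unchanged, the 3Di and amino acid alignments keep identical column positions, which is what allows them to be combined as two partitions of the same set of taxa.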

As a point of comparison, additional structural phylogenies were generated from our custom full-length glycoprotein structures using FoldTree¹⁰⁵. For each structure set (E, E1 and E2), phylogenies were inferred using three metrics (FoldTree, LDDT and TM-score) with default parameters (see Supplementary Fig. 9). However, given the limitations associated with the use of neighbour-joining methods on structural distances (outlined in Puente-Lelievre et al.⁴⁵), we reasoned that the 3Di-guided approach outlined above is likely to yield more robust results.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All underlying data, including sequences and structures, are available on Zenodo at https://doi.org/10.5281/zenodo.11092288 (ref. ⁶⁶). The virus sequences assembled from SRA mining in this study are available in the Third-Party Annotation Section of the DDBJ/ENA/GenBank databases under the accession numbers TPA: BK067806–BK067816.

Code availability

Original code and scripts related to the phylogenetic analyses are available at https://doi.org/10.5281/zenodo.11092288.

References

Grove, J. & Marsh, M. The cell biology of receptor-mediated virus entry. J. Cell Biol. 195, 1071–1082 (2011).
Simmonds, P. et al. ICTV virus taxonomy profile: Flaviviridae. J. Gen. Virol. 98, 2–3 (2017).
Rey, F. A. & Lok, S.-M. Common features of enveloped viruses and implications for immunogen design for next-generation vaccines. Cell 172, 1319–1334 (2018).
Hubálek, Z. & Halouzka, J. West Nile fever—a reemerging mosquito-borne viral disease in Europe. Emerg. Infect. Dis. 5, 643–650 (1999).
Wang, Z.-D. et al. A new segmented virus associated with human febrile illness in China. N. Engl. J. Med. 380, 2116–2125 (2019).
Kartashov, M. Y. et al. Novel Flavi-like virus in ixodid ticks and patients in Russia. Ticks Tick Borne Dis. 14, 102101 (2023).
Postler, T. S. et al. Renaming of the genus Flavivirus to Orthoflavivirus and extension of binomial species names within the family Flaviviridae. Arch. Virol. 168, 224 (2023).
Qin, X.-C. et al. A tick-borne segmented RNA virus contains genome segments derived from unsegmented viral ancestors. Proc. Natl Acad. Sci. USA 111, 6744–6749 (2014).
Ladner, J. T. et al. A multicomponent animal virus isolated from mosquitoes. Cell Host Microbe 20, 357–367 (2016).
Paraskevopoulou, S. et al. Viromics of extant insect orders unveil the evolution of the flavi-like superfamily. Virus Evol. 7, veab030 (2021).
Kobayashi, K. et al. Gentian Kobu-sho-associated virus: a tentative, novel double-stranded RNA virus that is relevant to gentian Kobu-sho syndrome. J. Gen. Plant Pathol. 79, 56–63 (2013).
Debat, H. & Bejerman, N. Two novel flavi-like viruses shed light on the plant-infecting koshoviruses. Arch. Virol. 168, 184 (2023).
Petrone, M. E. et al. A ~40-kb flavi-like virus does not encode a known error-correcting mechanism. Proc. Natl Acad. Sci. USA 121, e2403805121 (2024).
Ferron, F., Sama, B., Decroly, E. & Canard, B. The enzymes for genome size increase and maintenance of large (+)RNA viruses. Trends Biochem. Sci. 46, 866–877 (2021).
Shi, M. et al. Divergent viruses discovered in arthropods and vertebrates revise the evolutionary history of the Flaviviridae and related viruses. J. Virol. 90, 659–669 (2016).
Garry, C. E. & Garry, R. F. Proteomics computational analyses suggest that the envelope glycoproteins of segmented Jingmen Flavi-like viruses are class II viral fusion proteins (b-penetrenes) with mucin-like domains. Viruses 12, 260 (2020).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
Lee, S. et al. Petascale Homology Search for Structure Prediction. Preprint at bioRxiv https://doi.org/10.1101/2023.07.10.548308 (2023).
Blitvich, B. J. & Firth, A. E. A review of Flaviviruses that have no known arthropod vector. Viruses 9, 154 (2017).
Kielian, M. & Rey, F. A. Virus membrane-fusion proteins: more than one way to make a hairpin. Nat. Rev. Microbiol. 4, 67–76 (2006).
Rey, F. A., Heinz, F. X., Mandl, C., Kunz, C. & Harrison, S. C. The envelope glycoprotein from tick-borne encephalitis virus at 2 Å resolution. Nature 375, 291–298 (1995).
Dessau, M. & Modis, Y. Crystal structure of glycoprotein C from Rift Valley fever virus. Proc. Natl Acad. Sci. USA 110, 1696–1701 (2013).
Fédry, J. et al. The ancient gamete fusogen HAP2 is a eukaryotic class II fusion protein. Cell 168, 904–915.e10 (2017).
Guardado-Calvo, P. & Rey, F. A. The viral class II membrane fusion machinery: divergent evolution from an ancestral heterodimer. Viruses 13, 2368 (2021).
Li, L. et al. The flavivirus precursor membrane-envelope protein complex: structure and maturation. Science 319, 1830–1834 (2008).
El Omari, K., Iourin, O., Harlos, K., Grimes, J. M. & Stuart, D. I. Structure of a Pestivirus envelope glycoprotein E2 clarifies its role in cell entry. Cell Rep. 3, 30–35 (2013).
Li, Y., Wang, J., Kanai, R. & Modis, Y. Crystal structure of glycoprotein E2 from bovine viral diarrhea virus. Proc. Natl Acad. Sci. USA 110, 6805–6810 (2013).
Kong, L. et al. Hepatitis C virus E2 envelope glycoprotein core structure. Science 342, 1090–1094 (2013).
Khan, A. G. et al. Structure of the core ectodomain of the hepatitis C virus envelope glycoprotein 2. Nature 509, 381–384 (2014).
Aitkenhead, H. et al. Structural comparison of typical and atypical E2 Pestivirus glycoproteins. Structure 32, 273–281 (2024).
Torrents de la Peña, A. et al. Structure of the hepatitis C virus E1E2 glycoprotein complex. Science 378, 263–269 (2022).
Metcalf, M. C. et al. Structure of engineered hepatitis C virus E1E2 ectodomain in complex with neutralizing antibodies. Nat. Commun. 14, 3980 (2023).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Oliver, M. R. et al. Structures of the hepaci-, pegi-, and pestiviruses envelope proteins suggest a novel membrane fusion mechanism. PLoS Biol. 21, e3002174 (2023).
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Urayama, S.-I., Takaki, Y. & Nunoura, T. FLDS: a comprehensive dsRNA sequencing method for intracellular RNA virus surveillance. Microbes Environ. 31, 33–40 (2016).
Hou, X. et al. Artificial intelligence redefines RNA virus discovery. Preprint at bioRxiv https://doi.org/10.1101/2023.04.18.537342 (2023).
Chen, Y.-M. et al. RNA viromes from terrestrial sites across China expand environmental viral diversity. Nat. Microbiol. 7, 1312–1323 (2022).
Arhab, Y., Bulakhov, A. G., Pestova, T. V. & Hellen, C. U. T. Dissemination of internal ribosomal entry sites (IRES) between viruses by horizontal gene transfer. Viruses 12, 612 (2020).
Modis, Y., Ogata, S., Clements, D. & Harrison, S. C. Structure of the dengue virus envelope protein after membrane fusion. Nature 427, 313–319 (2004).
MacIntosh, G. C. in Ribonucleases (ed. Nicholson, A. W.) 89–114 (Springer, 2011).
Puente-Lelievre, C. et al. Tertiary-interaction characters enable fast, model-based structural phylogenetics beyond the twilight zone. Preprint at bioRxiv https://doi.org/10.1101/2023.12.12.571181 (2024).
Vaney, M.-C. et al. Evolution and activation mechanism of the flavivirus class II membrane-fusion machinery. Nat. Commun. 13, 3718 (2022).
Bamford, C. G. G., de Souza, W. M., Parry, R. & Gifford, R. J. Comparative analysis of genome-encoded viral sequences reveals the evolutionary history of flavivirids (family Flaviviridae). Virus Evol. 8, veac085 (2022).
Mushegian, A. Methyltransferases of Riboviria. Biomolecules 12, 1247 (2022).
Li, Z., Jaroszewski, L., Iyer, M., Sedova, M. & Godzik, A. FATCAT 2.0: towards a better understanding of the structural diversity of proteins. Nucleic Acids Res. 48, 60–64 (2020).
Mifsud, J. C. O. et al. Transcriptome mining extends the host range of the Flaviviridae to non-bilaterians. Virus Evol. 9, veac124 (2022).
Kong, Y. et al. Metatranscriptomics reveals the diversity of the tick virome in northwest China. Microbiol. Spectr. 10, e0111522 (2022).
Costa, V. A. et al. Limited cross-species virus transmission in a spatially restricted coral reef fish community. Virus Evol. 9, vead011 (2023).
Perveen, N. et al. Virome diversity of Hyalomma dromedarii ticks collected from camels in the United Arab Emirates. Vet. World 16, 439–448 (2023).
Guo, G. et al. Virome analysis provides an insight into the viral community of Chinese mitten crab Eriocheir sinensis. Microbiol. Spectr. 11, e0143923 (2023).
Dunay, E. et al. Viruses in sanctuary chimpanzees across Africa. Am. J. Primatol. 85, e23452 (2023).
Elbadry, M. A. et al. Diversity and genetic reassortment of keystone virus in mosquito populations in Florida. Am. J. Trop. Med. Hyg. 108, 1256–1263 (2023).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–1649 (2012).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).
Mifsud, J. C. O. BatchArtemisSRAMiner: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.8417951 (2023).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Teufel, F. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 40, 1023–1025 (2022).
Edgar, R. C. Muscle5: high-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nat. Commun. 13, 6968 (2022).
Mifsud, J. C. O. et al. Underlying data for “Mapping glycoprotein structure reveals Flaviviridae evolutionary history”. Zenodo https://doi.org/10.5281/zenodo.11092288 (2024).
Meng, E. C. et al. UCSF ChimeraX: tools for structure building and analysis. Protein Sci. 32, e4792 (2023).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, 293–296 (2021).
Renner, M. et al. Flavivirus maturation leads to the formation of an occupied lipid pocket in the surface glycoproteins. Nat. Commun. 12, 1238 (2021).
Egloff, M.-P., Benarroch, D., Selisko, B., Romette, J.-L. & Canard, B. An RNA cap (nucleoside-2′-O-)-methyltransferase in the flavivirus RNA polymerase NS5: crystal structure and functional characterization. EMBO J. 21, 2757–2768 (2002).
Noble, C. G. et al. A conserved pocket in the dengue virus polymerase identified through fragment-based screening. J. Biol. Chem. 291, 8541–8548 (2016).
Jia, H., Zhong, Y., Peng, C. & Gong, P. Crystal structures of flavivirus NS5 guanylyltransferase reveal a GMP-arginine adduct. J. Virol. 96, e0041822 (2022).
Walls, A. C. et al. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell 181, 281–292 (2020).
Krey, T. et al. Crystal structure of the Pestivirus envelope glycoprotein E(rns) and mechanistic analysis of its ribonuclease activity. Structure 20, 862–873 (2012).
Dong, X. et al. A novel virus of Flaviviridae associated with sexual precocity in Macrobrachium rosenbergii. mSystems 6, e0000321 (2021).
Sievers, F. et al. Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., Von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
Le, T. K. & Vinh, L. S. FLAVI: an amino acid substitution model for flaviviruses. J. Mol. Evol. 88, 445–452 (2020).
Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q. & Vinh, L. S. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522 (2017).
Paradis, E. & Schliep, K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528 (2018).
Revell, L. J. phytools 2.0: An updated R ecosystem for phylogenetic comparative methods (and other things). PeerJ 12, e16505 (2024).
Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol. 8, 28–36 (2017).
Hackl, T., Ankenbrand, M. & van Adrichem, B. gggenomes: A grammar of graphics for comparative genomics. Github https://github.com/thackl/gggenomes (2024).
Winter, D. J. Rentrez: an R package for the NCBI eUtils API. R J. 9, 520–526 (2017).
Chamberlain, S. A. & Szöcs, E. taxize: taxonomic search and retrieval in R. F1000Res. 2, 191 (2013).
Rambaut, A. & Drummond, A. J. FigTree: Tree figure drawing tool, version 1.4.0. http://tree.bio.ed.ac.uk/software/figtree/ (2012).
Jombart, T., Kendall, M., Almagro‐Garcia, J. & Colijn, C. treespace: Statistical exploration of landscapes of phylogenetic trees. Mol. Ecol. Resour. 17, 1385–1392 (2017).
Kendall, M. & Colijn, C. Mapping phylogenetic trees to reveal distinct patterns of evolution. Mol. Biol. Evol. 33, 2735–2743 (2016).
Legendre, P. & Legendre, L. Numerical Ecology (Elsevier, 2012).
Saberi, A., Gulyaeva, A. A., Brubacher, J. L., Newmark, P. A. & Gorbalenya, A. E. A planarian nidovirus expands the limits of RNA genome size. PLoS Pathog. 14, e1007314 (2018).
Rolland, C., La Scola, B. & Levasseur, A. How Tupanvirus degrades the ribosomal RNA of its amoebal host? The ribonuclease T2 track. Front. Microbiol. 11, 1691 (2020).
Barrio-Hernandez, I. et al. Clustering predicted structures at the scale of the known protein universe. Nature 622, 637–645 (2023).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Potter, S. C. et al. HMMER web server: 2018 update. Nucleic Acids Res. 46, W200–W204 (2018).
Gabler, F. et al. Protein sequence analysis using the MPI bioinformatics toolkit. Curr. Protoc. Bioinformatics 72, e108 (2020).
Zimmermann, L. et al. A completely reimplemented MPI bioinformatics toolkit with a new HHpred server at its core. J. Mol. Biol. 430, 2237–2243 (2018).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).
Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).
Buchfink, B., Ashkenazy, H., Reuter, K., Kennedy, J. A. & Drost, H.-G. Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust. Preprint at bioRxiv https://doi.org/10.1101/2023.01.24.525373 (2023).
Deorowicz, S., Debudaj-Grabysz, A. & Gudyś, A. FAMSA: fast and accurate multiple sequence alignment of huge protein families. Sci. Rep. 6, 33964 (2016).
Chernomor, O., von Haeseler, A. & Minh, B. Q. Terrace aware data structure for phylogenomic inference from supermatrices. Syst. Biol. 65, 997–1008 (2016).
Moi, D. et al. Structural phylogenetics unravels the evolutionary diversification of communication systems in Gram-positive bacteria and their viruses. Preprint at bioRxiv https://doi.org/10.1101/2023.09.19.558401 (2023).

Acknowledgements

The authors acknowledge the University of Sydney’s high-performance computing cluster, Artemis, for providing the computing resources used for this study. E.C.H. is supported by a National Health and Medical Research Council Investigator award (GNT2017197). J.C.O.M. is supported by the Australian Government’s Research Training Program Scholarship. J.G., S.L. and M.R.O. are supported by the Wellcome Trust and Royal Society, through a Sir Henry Dale Fellowship (107653/Z/15/Z). K.T. was supported by a Lord Kelvin Adam Smith Fellowship from the University of Glasgow. J.G. is also supported by the MRC-University of Glasgow Centre for Virus Research core support from the Medical Research Council/UKRI (MC_UU_00034/1) and by a Medical Research Foundation Emerging Leaders Prize (MRF-ELP-VAH-23-107).

Author information

Authors and Affiliations

Sydney Institute for Infectious Diseases, School of Medical Sciences, The University of Sydney, Sydney, New South Wales, Australia
Jonathon C. O. Mifsud, Vincenzo A. Costa & Edward C. Holmes
MRC–University of Glasgow Centre for Virus Research, Glasgow, UK
Spyros Lytras, Michael R. Oliver, Kamilla Toon & Joe Grove
Division of Systems Virology, Department of Microbiology and Immunology, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
Spyros Lytras
Laboratory of Data Discovery for Health Limited, Hong Kong SAR, China
Edward C. Holmes

Contributions

J.C.O.M., E.C.H. and J.G. conceptualized the study. J.C.O.M., V.A.C. and J.G. compiled and curated the sequence data. J.G., M.R.O. and K.T. conducted the protein structure prediction and associated analysis. J.C.O.M. and J.G. conducted the structural homology searches and sequence-based benchmarking analysis. J.C.O.M. and S.L. conducted the phylogenetic analyses. E.C.H. and J.G. supervised and funded the study. J.C.O.M. and J.G. wrote the manuscript with review and editing assistance from all authors.

Corresponding author

Correspondence to Joe Grove.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Two-dimensional MDS plot of the NS5b phylogeny variations.

a, Coloured by alignment method. b, Excluding trees generated from Clustal Omega alignments. Points, which represent individual phylogenies, are colour-coded based on the clusters identified using the ‘findGroves’ function. Point shapes indicate the alignment software, the trimAl gap and consensus thresholds, and the substitution model applied. An arrow signifies the master phylogeny (Tree 18) chosen for further analysis.

Extended Data Fig. 2 Structure prediction performance.

a, For structural inference, all Flaviviridae polyprotein sequences were split into blocks of 300 residues, each overlapping by 100 residues (461 species, 16,463 blocks in total). Residue numbers are provided for the first three blocks. b, Representative ColabFold protein structure predictions spanning the entire Dengue Virus 2 (DENV-2) polyprotein. Residue numbers are provided as in a. Structures are colour-coded by prediction confidence scores (pLDDT), as denoted in the key. c, pLDDT confidence scores along the length of the DENV-2 polyprotein. Dotted lines delineate the mature proteins, which are labelled on the x-axis. d, Scatter plots of the predicted TM-score (pTM) confidence metric for ColabFold and ESMFold for each sequence block in each genus/subclade. Numerical values give the performance ratio between the two protein structure prediction methods; values below 1 indicate better performance by ColabFold.
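The block scheme in panel a can be illustrated with a short sketch. This is a minimal example and not the authors' pipeline: the function name `split_into_blocks`, the 1-based inclusive residue numbering and the handling of the final, shorter block are assumptions made for illustration. With 300-residue blocks overlapping by 100 residues, consecutive blocks start 200 residues apart.

```python
def split_into_blocks(sequence, block_size=300, overlap=100):
    """Split a sequence into overlapping blocks.

    Returns a list of (start, end, subsequence) tuples, where start and end
    are 1-based inclusive residue numbers (an assumed convention).
    """
    if not 0 <= overlap < block_size:
        raise ValueError("overlap must be non-negative and smaller than block_size")
    if not sequence:
        return []

    step = block_size - overlap  # distance between consecutive block starts
    blocks = []
    start = 0  # 0-based index of the first residue in the current block
    while True:
        end = min(start + block_size, len(sequence))
        blocks.append((start + 1, end, sequence[start:end]))
        if end == len(sequence):
            # The last block reaches the end of the sequence; it may be
            # shorter than block_size (assumed behaviour, not from the paper).
            break
        start += step
    return blocks


if __name__ == "__main__":
    # Hypothetical 700-residue sequence used only to show the numbering.
    for first, last, sub in split_into_blocks("A" * 700):
        print(first, last, len(sub))
```

Under these assumptions, a 700-residue sequence yields three blocks covering residues 1–300, 201–500 and 401–700.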

Extended Data Fig. 3 Benchmarking with sequence-based homology searches.

a, RdRp phylogeny, as in Fig. 2a. b, Heatmap comparison of homology detection using Foldseek (as in Fig. 2), DIAMOND and InterProScan for the stated reference proteins. DIAMOND results represent a pure sequence-based recapitulation of the Foldseek search (that is, all query and reference structures used in Foldseek were represented by their cognate protein sequences for DIAMOND analysis). Foldseek and DIAMOND data are log e-values and are colour-coded as in the key. InterProScan results provide a simple binary score: match (red) or no match (white). Vertical lines demarcate divisions between major clades.

Extended Data Fig. 4 ESMFold permits unambiguous detection of glycoproteins in divergent species.

a, RdRp phylogeny, as in Fig. 2a. b, E glycoprotein Foldseek e-value heatmaps for Flaviviridae structures predicted only by ColabFold (top), only by ESMFold (middle) or when combined (bottom). c, Representative examples of targets that are predicted well by ESMFold, but poorly by ColabFold. Structures are colour-coded by pLDDT confidence scores, as shown in the key. pLDDT and pTM metrics are provided for each model.

Extended Data Fig. 5 Analysis of environmental pesti-like viruses.

a, Large genome flavivirus/Pestivirus subset of the Flaviviridae phylogeny (Tree 18) collapsed to highlight the environmental pesti-like viruses for which no glycoproteins were identified. The scale bar denotes the number of amino acid substitutions per site. b, Genome organisation is provided for each species, with annotations based on conserved domain sequence searches. c, Foldseek e-value heatmaps for the indicated reference proteins; values are log-transformed and colour-coded as shown in the key. For E, E1 and E2 the values represent summary e-values after comparison with a range of relevant reference structures, as described in the Methods.

Extended Data Fig. 6 Absence of a fusion loop in E glycoprotein homologues of the jingmenviruses.

Structurally conserved E protein fusion loops (FL) were found in orthoflavi-, LGF- and pesti-like viruses. The FL is absent from the E protein homologue of the jingmenviruses (its expected location is marked by an asterisk). Amino acid side chains are shown for the FL only.

Extended Data Fig. 7 Foldseek detection of methyltransferase in diverse viruses.

DENV-2 reference structure and Foldseek hits for MTase. For each hit, only the Foldseek-aligned residues are shown; metrics provide e-value, sequence identity (%), structural alignment score (LDDT, ranging from 0 to 1), and protein structure prediction method. Predicted structures are colour-coded by pLDDT confidence scores, as shown in the key.

Extended Data Fig. 8 Conservation of structure revealed through the Foldseek 3Di alphabet.

Representative structures of E (West Nile virus) or E1 and E2 (Hepacivirus F) colour-coded by either sequence or structural conservation, as denoted in the key. In both cases values represent percentage conservation of the consensus for each structurally aligned protein (see Methods for details), with amino acid residues representing protein sequence and the 3Di structural alphabet representing protein structure.

Extended Data Fig. 9 Structurally aligned E glycoprotein phylogenies.

Protein sequences were aligned using their 3Di structural representation (see Methods for details). Phylogenies were reconstructed using 3Di sequence alone (top), amino acid (AA) sequence alone (middle) or combined 3Di and AA sequence (bottom). Right-hand trees are derived from alignments trimmed with a gap threshold of 35%. Scale bars indicate substitutions per site for the 3Di, AA or combined sequences, respectively. Tip shapes are colour-coded by genus/subclade as in Fig. 1a. All phylogenetic trees are provided in the associated Zenodo repository.
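Gap-threshold trimming of the kind mentioned in this caption can be sketched as follows. This is an illustrative example only: the caption does not state whether 35% is the maximum tolerated gap fraction or the minimum fraction of sequences that must carry a residue, so the sketch assumes the latter (the convention of trimAl's `-gt` option). The function name `trim_gappy_columns` and the toy alignment are hypothetical.

```python
def trim_gappy_columns(alignment, min_non_gap_fraction=0.35, gap_chars="-"):
    """Drop alignment columns in which too few sequences carry a residue.

    `alignment` is a list of equal-length aligned sequences (strings).
    A column is kept when the fraction of sequences without a gap at that
    position is at least `min_non_gap_fraction` (assumed interpretation).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n_seqs = len(alignment)
    kept_columns = [
        col
        for col in range(length)
        if sum(row[col] not in gap_chars for row in alignment) / n_seqs
        >= min_non_gap_fraction
    ]
    # Rebuild each sequence from the retained columns only.
    return ["".join(row[col] for col in kept_columns) for row in alignment]


if __name__ == "__main__":
    toy = ["AC-G", "A--G", "AT-G", "A--G"]  # hypothetical 4-sequence alignment
    print(trim_gappy_columns(toy))
```

In the toy alignment, the third column consists entirely of gaps and is removed, while the half-occupied second column is retained at the 0.35 threshold.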

Extended Data Fig. 10 Structurally aligned E1 glycoprotein phylogenies.

Protein sequences were aligned using their 3Di structural representation (see Methods for details). Phylogenies were reconstructed using 3Di sequence alone (top), amino acid (AA) sequence alone (middle) or combined 3Di and AA sequence (bottom). Right-hand trees are derived from alignments trimmed with a gap threshold of 35%. Scale bars indicate substitutions per site for the 3Di, AA or combined sequences, respectively. Tip shapes are colour-coded by genus/subclade as in Fig. 1a. All phylogenetic trees are provided in the associated Zenodo repository.

Extended Data Fig. 11 Structurally aligned E2 glycoprotein phylogenies.

Protein sequences were aligned using their 3Di structural representation (see Methods for details). Phylogenies were reconstructed using 3Di sequence alone (top), amino acid (AA) sequence alone (middle) or combined 3Di and AA sequence (bottom). Right-hand trees are derived from alignments trimmed with a gap threshold of 35%. Scale bars indicate substitutions per site for the 3Di, AA or combined sequences, respectively. Tip shapes are colour-coded by genus/subclade as in Fig. 1a. All phylogenetic trees are provided in the associated Zenodo repository.

Supplementary information

Supplementary Information

This file contains Supplementary Note 1, Supplementary Figs. 1–9 and references, which detail the MUSCLE alignment analysis, FoldTree structural analysis and the complete NS5b and T2 ribonuclease RNA virus phylogenies.

Reporting Summary

Supplementary Table 1

Flaviviridae sequence metadata including clade designations and GenBank nucleotide accession numbers.

Supplementary Table 2

Combination of sequence alignment, quality trimming methods, and amino acid substitution models used to infer the NS5b phylogenies.

Supplementary Table 3

Flaviviridae host association and vector status metadata related to Fig. 2c.

Supplementary Table 4

Contribution of 3Di vs amino acid in the glycoprotein phylogeny partition models.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mifsud, J.C.O., Lytras, S., Oliver, M.R. et al. Mapping glycoprotein structure reveals Flaviviridae evolutionary history. Nature 633, 695–703 (2024). https://doi.org/10.1038/s41586-024-07899-8

Received: 10 February 2024
Accepted: 01 August 2024
Published: 04 September 2024
Version of record: 04 September 2024
Issue date: 19 September 2024
DOI: https://doi.org/10.1038/s41586-024-07899-8
