Introduction

Noroviruses are the leading cause of foodborne illness and account for almost a fifth of all acute gastroenteritis (AGE) cases worldwide1. Norovirus-associated AGE, characterized by vomiting and dehydrating diarrhea, is highly transmissible, and young children and the elderly are especially susceptible2. The first major outbreak of human norovirus occurred in 1968 among schoolchildren in Norwalk, Ohio, USA, and the causal agent was identified in 1972 using immune electron microscopy (IEM) to visualize the virus particle3. In the late 1980s, researchers established the classification of Norwalk virus as a member of the family Caliciviridae on the basis of its genome organization4.

Noroviruses are non-enveloped positive-sense ssRNA viruses with approximately 7.5 kb genomes5. With the exception of the murine norovirus, the genomic structure of noroviruses consists of three open reading frames (ORFs). Of these, ORF1 is translated to a large polyprotein including RNA-dependent RNA polymerase (RdRP) and ORF2 and ORF3 encode the major capsid protein (VP1) and the minor capsid protein (VP2), respectively5,6. Since the 1990s, scientists have conducted more detailed studies on the genes and proteins of noroviruses. In the mid-1990s, numerous studies documented attempts to classify noroviruses through various methods (such as IEM, reverse transcription-PCR, and Southern hybridization) based on partial RdRP sequences or complete VP1 amino acid (aa) sequences7,8. In the early stages, researchers classified them using IEM into a minimum of 4 or 6 antigenic types, but these antigenic classification schemes exhibited poor accuracy and reproducibility attributed to the cross-reactivity of antibodies9,10. In research during the 2000s, noroviruses were classified into five genogroups and about 30 genetic clusters based on the VP1 protein sequences11,12,13,14. Researchers examined the pairwise distances of strains, clusters, and genotypes using the conserved regions and domains of VP1. However, they observed that the ranges of the three categories overlapped, suggesting that distinguishing norovirus strains based on partial sequences alone may be challenging, leading to inconsistent and confused classification outcomes14.

The conventional genetic classification of noroviruses is based on the aa sequences of the complete VP1 (genotype) or the nucleotide (nt) sequences of the ORF1 RdRP region (P-type)15. Thus, a dual nomenclature system (genotype + P-type) was introduced for the accurate identification of norovirus strains and is now routinely used in many laboratories worldwide15,16. In 2019, the classification scheme for noroviruses was updated by proposing new genogroups and subtypes based on the 2× standard deviation criteria. In this scheme, noroviruses were divided into ten (GI–GX) genogroups, five of which (GI, GII, GIV, GVIII, and GIX) have the ability to infect humans17. GI and GII are generally detected in humans, with GII notably accounting for over 85% of norovirus infections18,19. Previous studies have indicated that a majority of norovirus strains causing human infection are GII recombinants, particularly those of the GII.4 variants20.

Recent studies on the genetic characteristics of noroviruses provide evidence supporting the necessity for additional considerations in their phylogenetic classification. Firstly, gene trees cannot fully represent the evolutionary histories due to their incongruence with species trees, especially in the presence of recombination21,22. Recombination of noroviruses has been observed at the ORF2/3 overlap, within ORF2, and at the ORF1/ORF2 junction23,24,25,26. The current dual-type system, which relies solely on the partial RdRP and complete VP1 sequences, cannot account for all recombination events. Also, single-gene analyses often lack sufficient resolution and can sometimes produce conflicting results27,28. Recently, some studies have argued that using multiple genes (or genomic sequences) to reconstruct phylogenies improves phylogenetic accuracy29,30,31. Furthermore, VP1 exhibits a high degree of genetic diversity, suggesting its inadequacy as a proper molecular marker. Nevertheless, certain strains previously classified within the GII genogroup were reclassified as GIX and GVIII based on the VP1 classification, despite their high genomic similarity to GII17,32,33,34,35. Moreover, GIV strains that infect cats, lions and dogs cluster with GVI strains based on RdRP sequences. Additionally, the similarity in their VP1 protein structure suggests that the genomes of animal-infecting GIV and GVI strains exhibit high similarity, regardless of genogroup17,36.

Along with the aforementioned obstacles, another challenge in norovirus research is the extremely low concentration of norovirus in environmental or stool samples37. Hence, it is essential to precisely detect the norovirus types within samples using minimal analytical methods. Complicating matters further, there are numerous cases of co-infection with more than two types and of recombination in environmental sources (e.g., oysters)38,39,40,41. Accurately identifying dual types with the existing system is challenging, emphasizing the need for complementary genomic databases in addition to RdRP and VP1 sequences.

In this study, we evaluate the genetic diversity of norovirus genomes to clarify their genomic characteristics and the criteria of existing classification. Thereafter, we reconstruct phylogenomic trees to compare the evolutionary relationships derived from gene-based and genome-based analyses.

Materials & methods

Data mining & identification

A total of 1417 norovirus genome sequences were downloaded from the National Center for Biotechnology Information (NCBI) database. Ten genogroups, including 50 genotypes and 71 P-types, were represented in the dataset (Table S1). From these, we extracted the nucleotide and peptide sequences of the ORFs using the Entrez retrieval system based on the accession numbers. The dual types from NCBI were updated through the Norovirus Typing Tool (ver. 2.0) (https://www.rivm.nl/mpf/typingtool/norovirus) and phylogenetic analyses based on the RdRP and VP1 sequences.

Similarity plot

The ORF protein sequences of genogroups GI and GII were respectively aligned using MAFFT with the L-INS-I algorithm (ver. 7.505). The results were concatenated in the order of the ORFs for each genogroup. The percent similarities of sequences in the concatenated alignments were calculated using a Python script based on the sum-of-pairs scoring function with a sliding window of 5 aa and a step size of 1 aa42 and similarity plots were visualized in R (ver. 4.2.0) using the ggplot2 package.
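The sliding-window calculation described above can be illustrated with a minimal sketch. It is not the authors' script: it assumes an identity-based sum-of-pairs score (the fraction of sequence pairs sharing the same residue in a column), treats any pair involving a gap as a mismatch, and uses a made-up toy alignment. The function names `column_sp_identity` and `sliding_similarity` are hypothetical.

```python
from collections import Counter


def column_sp_identity(column):
    """Fraction of sequence pairs in one alignment column with the same residue.

    Assumption: a pair involving a gap ('-') is scored as a mismatch.
    """
    n = len(column)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    counts = Counter(c for c in column if c != "-")
    matching_pairs = sum(k * (k - 1) // 2 for k in counts.values())
    return matching_pairs / total_pairs


def sliding_similarity(alignment, window=5, step=1):
    """Percent similarity for each window across equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to the same length")
    per_column = [
        column_sp_identity([seq[i] for seq in alignment]) for i in range(length)
    ]
    return [
        100.0 * sum(per_column[start:start + window]) / window
        for start in range(0, length - window + 1, step)
    ]


# Toy alignment (hypothetical sequences, for illustration only).
toy = ["MKMASNDA", "MKMASNDA", "MKMASSDA", "MRMA-NDA"]
profile = sliding_similarity(toy, window=5, step=1)
```

Each value in `profile` corresponds to one window start position, so the list can be plotted directly against the alignment coordinate. A substitution-matrix score could replace the identity score if a similarity measure rather than strict identity is wanted.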

Evolutionary selection pressure

To examine positive selection acting on the norovirus GI and GII genogroups, we subsampled 100 sequences from established databases and used site models in codeml as implemented in the PAML software package (ver. 4.10.5)43. We carried out likelihood ratio tests (LRTs) comparing a null model and an alternative model: M0 (one ratio) vs. M3 (discrete), M1a (nearly neutral) vs. M2a (positive selection), and M7 (beta) vs. M8 (beta&ω). Positively selected amino acid sites were identified based on Bayes empirical Bayes posterior probabilities. All PAML analyses were carried out using the F3×4 codon frequency model. The level of significance (P) for the LRTs was estimated using a $\chi^2$ distribution with the corresponding degrees of freedom. The test statistic is calculated as twice the difference of the log-likelihood between the models ($2\Delta\ln L = 2[\ln L_1 - \ln L_0]$, where $L_1$ and $L_0$ are the likelihoods of the alternative and null models, respectively).
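As an illustration of the likelihood ratio test, the sketch below computes the statistic and its P value from two log-likelihoods. The log-likelihood values are hypothetical, not results from this study. The degrees of freedom are assumptions based on common practice for these site-model comparisons: 4 for M0 vs. M3 with three site classes, and 2 for M1a vs. M2a and for M7 vs. M8. Because these values are even, the $\chi^2$ upper tail has a closed form and needs only the standard library.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnL_null, lnL_alt, df):
    """Return (2*delta lnL, P value) for a nested model comparison."""
    stat = 2.0 * (lnL_alt - lnL_null)
    # A negative statistic can arise from optimisation noise; treat it as zero.
    return stat, chi2_sf_even_df(max(stat, 0.0), df)


# Hypothetical log-likelihoods for a null and an alternative model.
stat, p_value = likelihood_ratio_test(lnL_null=-10250.0, lnL_alt=-10245.0, df=2)
```

With these toy inputs the statistic is 10 and the P value equals exp(-5), so the alternative model would be preferred at the 0.05 level.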

Global pairwise alignment

To assess the distance between genogroups, we used the LAGAN (Limited Area Global Alignment of Nucleotides) tool, which is an efficient and reliable pairwise aligner that is suitable for genomic comparisons of distantly related organisms44. Global pairwise alignments produced by LAGAN were visualized with mVISTA. We compared the ten genogroups using GenBank genome sequences—accession numbers MT031988, JQ622197, JX145650, KC894731, KC792553, MW662289, OL757872, AB985418, MN473468, and KJ790198—as references for GI–GX, respectively.

To get the distance matrices of genotypes and P-types, we aligned the RdRP and VP1 nucleotide sequences of the various types using the G-INS-I algorithm in MAFFT (ver. 7.505). We generated distance matrices based on the alignments, including gaps, on the UGENE platform (ver. 48.1).
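A percent-identity matrix of this kind can be sketched as follows. This is an illustration rather than UGENE's implementation: it assumes that a column where only one sequence has a gap counts as a compared mismatch (one reading of "including gaps"), and that columns gapped in both sequences are skipped. The type names and sequences are toy values.

```python
def percent_identity(a, b):
    """Percent identity between two aligned sequences, counting gap columns.

    Assumption: columns gapped in both sequences are skipped; a gap opposite
    a residue is a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)


def identity_matrix(aligned):
    """Map every ordered pair of names to its percent identity."""
    names = list(aligned)
    return {
        (p, q): percent_identity(aligned[p], aligned[q])
        for p in names
        for q in names
    }


# Toy aligned fragments (hypothetical, not real RdRP sequences).
toy_alignment = {
    "GII.P4": "ATGGCGTCAA",
    "GII.P16": "ATGGCTTCAA",
    "GI.P1": "ATGAC-TCTA",
}
matrix = identity_matrix(toy_alignment)
```

Here `matrix[("GII.P4", "GII.P16")]` is 90.0 and `matrix[("GII.P4", "GI.P1")]` is 70.0, mirroring the pattern of higher intra-genogroup than inter-genogroup identity.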

Phylogenetic analyses

We constructed phylogenetic trees of noroviruses using two methods: alignment-based and alignment-free. For the alignment-based tree, a multiple sequence alignment of the 1417 downloaded sequences and our 10 assembled results was generated using MAFFT with the L-INS-I algorithm (ver. 7.505). To determine the best-fit substitution models, the ModelFinder in IQ-TREE (ver. 1.6.12) was used. The phylogenetic trees were reconstructed by the maximum likelihood (ML) and Bayesian inference methods. The ML method was performed using RAxML-NG (ver. 1.1.0) with the GTR + F + I + G4 nucleotide substitution model and Bayesian phylogenetic inference was performed using the MrBayes package (ver. 3.2.7a) with the same model. The Markov chain Monte Carlo search was run for $10^6$ generations with a sampling frequency of $5 \times 10^2$ using three heated chains and one cold chain. A method for the alignment-free tree is described in Supplementary Materials.

Results

Genomic diversity

In this study, a comprehensive analysis was conducted on 1427 norovirus genomes from the NCBI database and human stool samples. All genogroups were represented in these genomes, although some genotypes, such as GIII.3, GIV.NA1, GNA1.1, and GNA2.1, were not included due to the absence of their genome data (Table S1). For a more accurate analysis, we verified or revised the dual type of certain strains.

We assessed the genomic diversity of the GI and GII, which are the genogroups most commonly infecting humans. Considering the degeneracy in the third base of codons, we used the protein sequences of the three ORFs. Consistent with previous research findings45,46, our genome database confirmed that the RdRP region (located at the 3’ end of ORF1) is the most conserved region, while VP1 and VP2 exhibit greater variability (Figure 1). In ORF1, the N-terminal region displayed variability within the GI and GII genogroups and sequence conservation increased toward the C-terminus of the polyprotein. Amino acid positions 700 to 900, corresponding to the p22 protein, showed significantly lower similarity. Previous analyses of p22, one of the most variable genomic regions, revealed that it plays a role in Golgi disassembly and the antagonism of Golgi-dependent cellular protein secretion, which were observed during norovirus replication47,48. We thus concluded that the conservation of the ORF1 polyprotein is not limited to the RdRP but extends across the majority of the sequence.

Figure 1

Similarity plots of norovirus GI genogroup and GII genogroup genomes. The plots are based on the concatenated sequences from 94 and 1161 complete genomes in GI and GII, respectively. Analyses were performed using the sum of pairs scoring with a sliding window of 5 amino acids (aa) and a step size of 1 aa. The plot depicts the percent similarity (Y-axis) of aa positions (X-axis). (A) Similarity plots of GI genomes. (B) Similarity plots of GII genomes. A schematic representation of the human norovirus open reading frames (ORFs) and the encoded proteins are shown above the graphs.

Despite the 5’ end of ORF1 showing a similarity trendline of less than 50%, the initial five amino acids remained highly conserved (Figure S1). The sequence logo analysis showed that both ends of ORF1 have conserved nucleotide sequences in all genogroups. Upon translating the conserved nucleotides from the 5’ and 3’ ends of ORF1 into protein sequences, we observed an intriguing pattern. Most genogroups associated with strains infecting humans show identical deduced protein sequences at both ends. Since the sequence logos of the GVII, GVIII, and GX genogroups were constructed with one or two sequences due to their limited availability in the current genomic database, further research will be needed.

Selection pressure

To conduct a phylogenetic analysis, it is essential to identify genomic regions that contain sufficient phylogenetic signals. Thus, we measured the selective pressure for the three ORFs of both the GI and GII genogroups. We carried out likelihood ratio tests (LRTs) comparing null and alternative codon substitution models. Across all ORFs, M3 was selected over M0 in the first comparison, indicating that the GI and GII genogroups have variable ω values among sites (Table 1). Following that, the null hypothesis M1a was consistently chosen over M2a, and the test was concluded. Consequently, no predicted positive selection sites were identified, but we confirmed that both GI and GII exhibit their lowest ω (dN/dS) ratios in ORF1. This suggests lower selection pressure on ORF1, signifying its phylogenetic significance compared to other ORFs. The capsid proteins of most viruses undergo rapid evolution to evade host immune detection, reach different host organs, and trigger pathological effects, ultimately promoting efficient transmission to new hosts. Our results also demonstrate that capsid proteins, encoded by ORF2 and ORF3, experience a high degree of selection pressure. Even though the major capsid protein, VP1, interacts directly with the entry receptors and antibodies of its host, VP2 showed a higher ω ratio than VP1. Although higher evolutionary rates in VP2 have been previously documented, the functional drivers behind the observed variability remain unclear49,50. When comparing GI and GII, each ORF of GII exhibited a higher selection pressure value than its counterpart in GI.

Table 1 Selection pressures (dN/dS) and statistical test values for the three ORFs in genogroups GI and GII.

Pairwise distances of norovirus types

We examined the sequence similarity at the whole genome level to figure out the probable genetic relationships within norovirus genogroups. A global pairwise alignment was performed based on genomic sequences of all ten genogroups. The alignments of GII with GVIII, GII with GIX, GVIII with GIX, and GIV with GVI revealed high degrees of similarity across their genomes, particularly in ORF1 and the ORF1/ORF2 junction, when compared to the other comparisons (Figures 2 and S2). To further characterize genome similarities, we counted the base pairs in conserved regions between genogroups (Figure 2). The GI and GII genogroups, which predominantly infect humans, shared the least conserved regions among the comparisons. The GV genogroup, the murine norovirus, distinctively possesses ORF4, which encodes virulence factor 1 (VF1), a mitochondria-localized protein that acts as an innate immune antagonist and contributes to viral adaptation during ongoing murine norovirus infection51,52. In the figures, GV generally had low similarity with all other genogroups. Most notably, while the whole genome size is about 7.5 kb, genogroups GII, GVIII, and GIX shared conserved regions exceeding half of the genome size by a significant margin, as did groups GIV and GVI.
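Counting base pairs in conserved regions can be sketched with a sliding window over a pairwise alignment, using the 70% identity and 150 bp window stated in the Figure 2 legend as defaults. This is an assumption-laden illustration, not the mVISTA procedure: it marks every column covered by at least one window at or above the threshold and then counts the marked reference bases. The function name `conserved_base_pairs` and the toy sequences are hypothetical.

```python
def conserved_base_pairs(ref, other, window=150, min_identity=0.70):
    """Count reference bases covered by at least one window meeting the threshold.

    Assumptions: inputs are two rows of a global pairwise alignment, gap
    columns count as mismatches, and windows at exactly the threshold qualify.
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    n = len(ref)
    if n < window:
        return 0
    match = [1 if a == b and a != "-" else 0 for a, b in zip(ref, other)]
    covered = [False] * n
    running = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old.
            running += match[start + window - 1] - match[start - 1]
        if running / window >= min_identity:
            for i in range(start, start + window):
                covered[i] = True
    # Count only columns where the reference has a base rather than a gap.
    return sum(1 for i in range(n) if covered[i] and ref[i] != "-")


# Toy example: first half identical, second half entirely mismatched.
toy_ref = "ACGTACGTACGTACGTACGT"
toy_other = "ACGTACGTAC" + "N" * 10
toy_count = conserved_base_pairs(toy_ref, toy_other, window=5, min_identity=0.8)
```

In the toy case the count is 11 rather than 10, because the last qualifying window extends one column into the mismatched half. The same edge effect applies at the boundaries of real conserved regions.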

Figure 2

Alignment plots and total base pairs of representative genome sequences of the ten norovirus genogroups. In the plots, regions with over 70% identity in a 150 bp sliding window are marked in blue. The analysis used GenBank genome sequences—accession numbers MT031988, JQ622197, JX145650, KC894731, KC792553, MW662289, OL757872, AB985418, MN473468, and KJ790198—as references for genogroups GI-GX, respectively.

To clarify the sorting criteria among subtypes, including P-types and genotypes, we measured the pairwise distances of RdRP and VP1 nucleotide sequences of all types present in our dataset (Figure 3 and Tables S2 and S3). All sequences used in the subtype analysis were complete except for the GII.P38 RdRP sequence. In the P-type distance matrix (Figure 3A and Table S2), the minimum and maximum identity values were 55% and 95%, respectively. Intra-genogroup identities were 71–91% in GI, 71–92% in GII, 95% in GIII, 79% in GIV, 67% in GV, and 80% in GVI. The results indicated that the inter-genogroup identity range for P-types is 55–70%, and intra-genogroup identity exceeds 70%. Notably, intra-genogroup identity within GV, between GV.P1 and GV.P2, is relatively low. In the genotype distance matrix (Figure 3B and Table S3), where the percent identity ranges from a minimum of 47% to a maximum of 87%, the values were largely lower than those for the P-types. Intra-genogroup identities of VP1 were 67–75% in GI, 65–87% in GII, 73% in GIII, 68–74% in GIV, 68% in GV, and 65% in GVI. It could be inferred that the inter-genogroup identity is less than 65%. Ironically, GIX.1 showed 65% identity with some GII genotypes, equivalent to the intra-genogroup identity of GII and GVI, while it also had values greater than 62% with all GII types. Among the alignments, identity scores of 80% or higher were only evident in genotypes GII.22–GII.27, GII.NA1, and GII.NA2, which were identified recently.
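The intra-genogroup and inter-genogroup identity ranges reported above can be derived from a pairwise matrix with a short summary step. The sketch below assumes that the genogroup is the part of the type name before the first dot (for example `GII` in `GII.P4`). The identity values are invented for illustration and are not taken from Tables S2 or S3.

```python
def genogroup_of(type_name):
    """Genogroup label, assumed to be the text before the first dot."""
    return type_name.split(".")[0]


def identity_ranges(pairwise):
    """Summarise (min, max) identity within each genogroup and between genogroups."""
    intra, inter = {}, []
    for (a, b), identity in pairwise.items():
        if a == b:
            continue  # skip self-comparisons
        group_a, group_b = genogroup_of(a), genogroup_of(b)
        if group_a == group_b:
            intra.setdefault(group_a, []).append(identity)
        else:
            inter.append(identity)
    intra_ranges = {g: (min(v), max(v)) for g, v in intra.items()}
    inter_range = (min(inter), max(inter)) if inter else None
    return intra_ranges, inter_range


# Hypothetical percent identities between P-types.
toy_pairs = {
    ("GI.P1", "GI.P2"): 80.0,
    ("GII.P4", "GII.P16"): 85.0,
    ("GI.P1", "GII.P4"): 60.0,
    ("GI.P2", "GII.P4"): 58.0,
    ("GI.P1", "GII.P16"): 61.0,
    ("GI.P2", "GII.P16"): 59.0,
}
intra_ranges, inter_range = identity_ranges(toy_pairs)
```

An overlap between `inter_range` and any entry of `intra_ranges` would flag the kind of ambiguous boundary described for GIX.1 and the GII genotypes.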

Figure 3

Pairwise distance matrices of the P-types and genotypes of strains from the ten genogroups. Vertical and horizontal lines separate the types into ten genogroups. Percent sequence identity is indicated by the color-coded boxes. (A) The pairwise alignments between the RdRP sequences of 71 P-types are plotted. Only the GII.P38 sequence is partial. (B) The alignments from the VP1 sequences of 50 genotypes are represented.

Phylogenomic analysis

Since the dN/dS ratio of ORF1 implied its phylogenetic significance, we reconstructed two norovirus phylogenies, one based on this region and one on genomic sequences, using the ML method. The phylogenomic analysis, including the downloaded dataset and assembled genomes, was performed based on the complete or partial genome nucleotide sequences. This tree’s topology was identical to that of the ORF1-based tree, indicating that the phylogenetic relationships of most genogroups were well-supported by the genomic sequences (Figures S3A and S4A). Consistent with the pairwise distances, the trees showed that the GV genogroup had distant phylogenetic relationships with all other genogroups and that there was a notable genetic distance between groups GI and GII.

However, in this genome-based tree, GVIII and GIX, two genogroups (formerly GII) that had been reclassified through a highly variable VP1-based analysis17, were found to be part of the same clade as GII (Figures 4 and 5B). This result, along with the pairwise distance analysis, strongly indicates a high degree of genetic similarity among the genomes of the GII, GVIII, and GIX genogroups, as well as an ability to effectively distinguish between GII.4 variants and their recombinants (Figure 5A). Moreover, GII dual types with swine as hosts were conclusively categorized alongside strains that infect humans (Figure 5B and Table 2). We also reconstructed a tree solely for the GII clustering, which is the predominant genogroup associated with human diseases (Figures S3B and S4B). In the tree, strains can be divided into three major groups, named GII.A, GII.B, and GII.C. The GII.A clade, encompassing strains with P-types P4, P12, P16, P21, and P31, included prominent variants like GII.4 and GII.17, which collectively account for a significant proportion of infections. Strains with P-types P6, P7, and P8 were classified within the GII.B clade, while types recently reported to be in GII.C clustered together. Variant GVIII.1 [GII.P28] was affiliated with GII.A, and variant GIX.1 [GII.P15] was grouped within GII.B.

Figure 4

Midpoint-rooted phylogenomic trees of the ten norovirus genogroups. The phylogenetic trees were reconstructed by the maximum likelihood and Bayesian inference methods based on the 1417 downloaded sequences and our 10 assembled results. Branches were collapsed by genogroup. Bootstrap values and Bayesian inference posterior probabilities (PPBI) are depicted on branches; a dash (-) indicates PPBI < 50% or ML bootstrap < 60%.

Figure 5

Midpoint-rooted phylogenomic trees of norovirus strains. The phylogenetic trees were reconstructed using the ML method based on the 1417 downloaded sequences and our 10 assembled results. Branches were collapsed by dual type, and bootstrap values above 60% are depicted on branches. (A) Phylogenomic tree of GII strains. (B) Phylogenomic tree of the rest of the GII strains and the GVIII and GIX strains. (C) Phylogenomic tree of norovirus strains except for GII, GVIII, and GIX.

Table 2 Host for each dual-type of ten norovirus genogroups.

Furthermore, the branches of GIV and GVI were intermixed according to host specificity. Upon confirming their hosts, the GIV strains that infect animals were grouped together with the GVI genogroup, which infects only carnivores, whereas the human noroviruses GIV.1 [GIV.P1] and GIV.3 [GIV.P3] clustered into the same clade (Figure 5C and Table 2).

Discussion

Noroviruses are regarded as rapidly evolving viruses with a large host range and present an extensive diversity driven by the accumulation of point mutations and recombination. Presently, their classification is determined by VP1 (genogroups and genotypes) and RdRP (P-types)15,16. The number of genogroups has been expanded to ten (GI–GX), with some genotypes having been recently updated17. Research focusing on VP1 is essential for the prevention and treatment of norovirus infections. However, due to the rapid evolution of this protein and recombination events at three regions (the ORF1/2 and ORF2/3 junctions and within ORF2), gene-based analysis may inadequately reflect the phylogenomic history of the genus, as exemplified by GVIII and GIX. Since gene trees do not always align with the species tree topology, it is essential to incorporate genome sequence analysis to comprehend the evolutionary history of a species53,54,55,56. Moreover, since environmental samples can contain more than two norovirus types, relying solely on RdRP and VP1 typing is inadequate for accurately identifying norovirus strains within them. Therefore, in this study, we have detailed the criteria for genotypes and P-types and compared the phylogenetic relationships obtained from gene-based and whole-genome-based analyses to achieve a more precise evolutionary lineage of the genus Norovirus.

According to prior research, the hypervariable VP2 region may interact with its VP1 interaction domain, and VP2 could function in the stability of norovirus particles or in regulating the maturation of antigen-presenting cells and protective immunity induction in a virus-strain-specific manner57,58,59. Moreover, VP2 seems to undergo covariation with VP1 in the GII, GIV, and GVI genogroups36,49,60. Our genomic diversity analyses also indicated the conservation pattern of norovirus genomes and the variability and high ω (dN/dS) ratios in the two capsid proteins, supporting their coevolution. Furthermore, it was observed that ORF1 carries a significant phylogenetic signal, playing a crucial role in the evolutionary trajectory of noroviruses. We also measured the criteria for current subtypes and observed that some genotypes exhibit overlapping ranges of intra-genogroup and inter-genogroup similarity. Consequently, we inferred that gene-based classification alone could not represent the phylogenetic relationships of the genus Norovirus.

Since the mid-1990s, norovirus GII.4 variants have been responsible for 62 to 80% of norovirus outbreaks globally and contributed to at least six pandemics of acute gastroenteritis61. Additionally, intragenotype recombination within GII.4 has the potential to give rise to new GII.4 variants, further hastening the occurrence of pandemics62,63. Our phylogenomic tree can distinguish each dual type and even intragenotype recombinant strains of GII.4. This feature also enables the accurate type prediction of norovirus strains, even with short reads from environmental or stool samples. Additionally, the whole-genome-based tree showed that the GIV, GVI, GVIII, and GIX strains segregate independently of their corresponding capsid genogroups. GVIII and GIX, previously known as GII, were reclassified through an analysis based on the highly variable VP1 region. Despite being categorized into different genotypes based solely on VP1 sequences, our study confirmed that their genomes closely resemble those of GII strains, as demonstrated in the alignment plot (Figure 2), the phylogenomic tree (Figure 4), and the sequence similarity networks (Figure S5). Notably in Figure 2, the total conserved base pairs are noticeable, with the GII genogroup sharing over 4800 bp (64% of genome length) with GVIII and GIX, and GIV sharing 4200 bp with GVI. In the GII clade containing GVIII and GIX, the global human pathogen P-types GII.P4, GII.P7, GII.P12, GII.P16, GII.P21, and GII.P31 are exclusively found in GII.A and GII.B64. Currently, there are no available drugs or vaccines for treating or preventing norovirus disease in humans65. Targeting the GII.A and GII.B groups, which include the globally common P-types, can cover a broad spectrum of norovirus strains, and a heterologous cross-protection in prevention and treatment can be expected.

The GIV and GVI strains were subdivided into two clades based not on their capsid sequences but on their hosts. GIV.1 and GIV.3, which infect humans, possessed the RdRP and VP1 of GIV, whereas GIV.2 and GVI strains, which are the carnivore noroviruses, had the RdRP of GVI regardless of the capsid protein. Moreover, the predicted cleavage sites for the ORF1 polyproteins of GIV and GVI viruses demonstrated conservation in both location and amino acid sequence by host, rather than genogroup36. Furthermore, a structure analysis revealed that the VP1 of GIV.2 has a large loop insertion in the P-domain, a characteristic present in GVI but absent in GIV.1 and GIV.336. To explain this, two possibilities were considered. One suggests that in certain GVI strains, VP1 evolved to resemble GIV because of their high mutation rates. The other posits that recombination occurred between GIV and GVI, resulting in a strain carrying GIV’s capsid proteins and GVI’s ORF1, after which the VP1 changed to align with GVI’s RdRP, acquiring a loop structure. Due to the limited research data on GIV and GVI, the accuracy of these hypotheses remains uncertain. Taken together, the existence of GIV.2[GVI.P1] suggests three points: first, inter-genogroup recombination is indeed possible; second, RdRP may have a more significant impact on host specificity than VP1, which interacts directly with the host; and third, following recombination, other genes might undergo evolutionary changes to adapt to their respective hosts. These insights suggest the potential existence of a recombinant strain that possesses the GIV P-type and GVI genotype. Although this hypothetical strain would belong to the GVI genogroup, which typically infects animals, it may ultimately lead to the emergence of a strain capable of infecting humans. Our inference regarding the interactions between human and animal viruses leads us to suggest the potential for zoonotic transmission.

Conclusions

In conclusion, we conducted a comprehensive analysis to enhance the phylogenetic interpretation of norovirus evolution. As a result, we identified their genomic characteristics and the inter-genogroup and intra-genogroup identity thresholds in the current classification system. Thereafter, we reconstructed a phylogenomic tree of norovirus strains to compare the evolutionary relationships obtained from gene-based and whole-genome-based analyses. Genome-based classification can be used to detect norovirus dual types accurately from environmental samples and identify emerging recombinants. Overall, our study marks a significant initial step towards the phylogenomic classification of the genus Norovirus, valuable not only for interpreting the evolutionary relationships among norovirus strains but also for antiviral targeting.