Introduction

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the causative agent of COVID-19, was first identified in Wuhan City, China, prompting the World Health Organization (WHO) to declare a global pandemic on March 11, 2020. As of December 2022, according to the WHO (https://covid19.who.int/), there have been more than 647 million confirmed cases of COVID-19 worldwide, with more than 6.6 million deaths. The rapid emergence of novel SARS-CoV-2 variants has displaced earlier strains, underscoring the importance of continuous surveillance to monitor their evolution and spread. Genomic surveillance plays a crucial role in understanding the virus’s genetic diversity and informing public health responses aimed at mitigating transmission and reducing the clinical impact of COVID-191.

Whole genome sequencing (WGS) has proven to be an essential tool in monitoring the genetic evolution of SARS-CoV-2 and identifying emerging variants2,3. WGS allows for the detection of mutations across the entire viral genome, particularly in key structural proteins like spike (S), membrane (M), and envelope (E), which can affect viral transmission and immune escape. In addition to WGS, tools such as Pangolin (Phylogenetic Assignment of Named Global Outbreak Lineages) are widely used to classify SARS-CoV-2 lineages by analyzing mutation patterns in viral genomes4,5. GISAID6 and Nextstrain7 facilitate global data sharing and real-time variant tracking, enabling timely surveillance.

Sample pooling is a promising strategy for genomic surveillance, with the potential to significantly reduce both the time and costs associated with WGS by combining multiple individual specimens into a single sequencing library. While several studies have investigated pooling strategies for SARS-CoV-2 detection, most have focused on PCR-based methods8,9. However, the application of pooled WGS for SARS-CoV-2 variants surveillance remains underexplored.

In this study, we present a bioinformatics pipeline for SARS-CoV-2 genomic surveillance using pooled WGS. Our approach accurately estimates the abundance of SARS-CoV-2 variants in pooled samples, eliminating the need for individual sample barcoding. We first validated the bioinformatics pipeline using simulated NGS datasets derived from GISAID genome sequences. We then applied the complete workflow, encompassing both wet-lab experiments and bioinformatics analyses, to real NGS datasets from pooled samples containing multiple reference materials. Finally, we applied our method to pooled clinical samples to estimate the abundance of SARS-CoV-2 variant, comparing the results with national epidemiological data collected during the same period of clinical sample collection. This approach also enables the identification of emerging novel SARS-CoV-2 mutations circulating in communities, providing a valuable resource for public health authorities to monitor and address the pandemic effectively.

Materials and methods

Identification of mutations associated with SARS-CoV-2 variants

To identify mutations associated with SARS-CoV-2 variants, we analyzed COVID-19 genome sequences and corresponding metadata obtained from GISAID. Lineage information for each genome sequence was extracted from the metadata, focusing on lineages with at least 10 high-quality genome sequences available. High-quality genome sequences were defined as those longer than 29,000 bases and containing more than 99% non-ambiguous bases (i.e., A/C/G/T).

For each lineage, a lineage-specific FASTA file containing the relevant genome sequences was created. These FASTA files were compared against the reference genome (NC_045512.2) to identify mutations associated with each lineage. In cases where a lineage contained more than 500 genome sequences, we randomly subsampled up to 500 genome sequences. Minimap210 was used to align the lineage-specific FASTA files to the reference genome, followed by mutation identification using VarScan211. Detected mutations were annotated using the Variant Effect Predictor (VEP)12. Mutations with a Variant Allele Fraction (VAF) greater than 85% were considered lineage-associated mutations, as detailed in Supplementary Table 1.

Generation of simulated pooled NGS datasets

To benchmark the pipeline under controlled conditions, we generated simulated pooled datasets with predefined lineage compositions and known abundance ratios. All genome sequences used to create these simulated datasets were obtained from GISAID. The simulated pooled NGS datasets were generated with the following systematic procedure:

  1. 1.

    Selection of Genome Sequences: Ten genome sequences were randomly selected from a dataset corresponding to a specific GISAID submission date.

  2. 2.

    Assignment of Abundance Ratios: Abundance ratios for each genome sequence were randomly assigned within the range of 0.1 and 0.2, derived from a uniform distribution.

  3. 3.

    Normalization of Abundance Ratios: The assigned abundance ratios were normalized to ensure their sum equaled 1.

  4. 4.

    Read Generation using InSilicoSeq (ISS): Utilizing ISS v1.6.013, one million paired-end reads were simulated based on the selected genome sequences and their normalized abundance ratios.

This simulation focused on two critical periods: May to July 2021 (Delta variant spread) and November 2021 to January 2022 (Omicron variant emergence). From each selected submission date, 35 pooled NGS datasets were simulated, resulting in a total of 1,120 datasets across 32 submission dates.

Bioinformatic pipeline for estimating the abundance of SARS-CoV-2 lineages in pooled NGS data

To estimate the abundance of SARS-CoV-2 lineages in pooled NGS data, a dedicated bioinformatics pipeline was developed, consisting of the following key steps:

  1. 1.

    Reads Alignment: The pooled NGS data were aligned to the SARS-CoV-2 reference genome (NC_045512.2) using the bwa-mem algorithm14.

  2. 2.

    Post-Processing: Picard tools were employed to mark duplicated reads and sort the resulting BAM files.

  3. 3.

    Mutation Calling: Mutations were identified using VarScan2 and subsequently annotated with the Variant Effect Predictor (VEP).

  4. 4.

    Lineage Candidate Identification: To identify potential lineages, a lineage-specific marker coverage ratio was calculated by dividing the number of detected markers by the total markers associated with each lineage. Lineages with a coverage ratio exceeding a predefined threshold (default set to 0.7), and at least five detected markers were selected for further analysis.

  5. 5.

    Initial Lineage Grouping: Lineages sharing identical sets of detected markers were grouped. For each group, the lineage with the highest marker coverage ratio was chosen. In cases where more specific sub-lineages (e.g., B.1.1. versus B.1.1.411) shared the same marker coverage ratio, only the broader, more general lineage was retained. However, in certain cases, multiple lineages may still be retained within the group.

  6. 6.

    Clustering of Lineage Groups Based on Marker Similarity:

    1. A.

      Pairwise comparisons were performed to calculate the percentage of shared detected markers between lineage groups, based on the intersection-over-union (IoU) of detected markers.

    2. B.

      These relationships were represented in a graph, connecting nodes (representing lineage groups) that shared at least 90% of detected markers.

    3. C.

      Connected components in the graph were identified, with each representing a cluster of lineage groups sharing similar detected markers.

  7. 7.

    Creation of Cluster-Marker Association Matrix:

    1. A.

      Markers were aggregated for each cluster, representing all member lineage groups within the cluster.

    2. B.

      An association matrix was constructed with clusters as rows and selected markers as columns, indicating the presence1 or absence (0) of markers within each cluster.

  8. 8.

    Iterative Estimation of Cluster Fractions:

    1. A.

      Initial clusters were identified if they contained more than a threshold number of private markers (default set to five) associated exclusively with other clusters.

    2. B.

      An optimization algorithm (from the scipy.optimize package) was applied to estimate the fraction of each selected cluster in the sample. Detailed procedures are provided in the Supplementary Text.

    3. C.

      Clusters were iteratively added until the inclusion of an additional cluster no longer significantly improved the squared error sum, with the improvement ratio threshold set at 2%.

  9. 9.

    Lineage Annotation and Result Compilation: Each cluster identified through the above steps was annotated with the corresponding lineages, and the final results were compiled for presentation.

The results of this workflow are illustrated in Fig. 1.

Fig. 1
figure 1

Example of Result for COVID-19 Variant Abundance Estimation and Mutation Detection through Pooled NGS Data Analysis. This figure shows the outcomes of analyzing pooled NGS data. It provides the estimated abundance for each cluster of lineages and lists all detected mutations, each with its corresponding lineage associations.

Reference panels: pooled samples of reference materials

We prepared pooled samples (panels) of reference materials with a constant SARS-CoV-2 RNA load, using the AMPLIRUN® SARS-CoV-2 RNA Control (Vircell SA, Granada, Spain) as reference material. To construct panels, we utilized five RNA controls (P.1, B.1.351, B.1.1.7, B.1.617.2, and Omicron; Vircell SA, Granada, Spain), which were serially diluted to obtain 2,000 copies per 1 \(\:\mu\:L\). A total of six panels were created, each consisting of 10 spiked reference material samples with a concentration of 2,000 copies/\(\:\mu\:L\) (Supplementary Table 2). The absolute copies/\(\:\mu\:L\) of diluted samples were measured using digital droplet PCR (ddPCR) with open reading frame 1 (ORF1) gene primer/probe15.

Validation panels: pooled samples of clinical samples

Upper respiratory tract samples were collected from COVID-19 patients (n = 100) from December 2021 to January 2022, coinciding with the introduction of the Omicron variant. These samples were collected to reflect the epidemic trend and to evaluate the surveillance ability for new emerging variants in Korea. The clinical samples were stored at − 80 °C using a universal transport medium (Becton Dickinson, Sparks, MD, USA). Viral RNA was isolated from the clinical samples using the QIAamp viral RNA extraction kit (Qiagen, Hilden, Germany).

Ten validation panels were created by combining 10 individual samples from the collection of 100 patients (Supplementary Table 3). The SARS-CoV-2 concentration in each patient sample was adjusted to achieve a dilution of 2,000 copies per microliter (\(\:\mu\:L)\), based on quantitative calculations performed with droplet digital PCR (ddPCR).

Normalization of samples for pooling

Clinical samples with a Ct value of less than 20, the median for positive clinical samples, were expected to contain more than 10,000 viral copies15. Therefore, the samples were diluted 100-fold prior to performing ddPCR (Supplementary Table 3). For sample pooling, the SARS-CoV-2 copy number in each component sample was adjusted to approximately 2,000 copies per microliter. Additionally, we compared the results from real-time RT-PCR and ddPCR and applied a linear regression equation (Supplementary Fig. 1) to derive dilution factors. This enabled the direct normalization of samples for pooling based on Ct values from real-time RT-PCR.

Droplet digital PCR

Digital PCR reactions were performed with QX200 Droplet Digital PCR System and Custom ddPCR assay (Bio-Rad Laboratories, Hercules, CA, USA). We used specific primers and probes for ORF1 gene primer/probe15. The 20 \(\:\mu\:L\) PCR mix was composed of 5 \(\:\mu\:L\) Bio-Rad ddPCR Supermix, 2 \(\:\mu\:L\) of reverse transcriptase, 1 \(\:\mu\:L\) of 300mM DTT, 1 \(\:\mu\:L\) of each amplification primer/probe mix, 1 \(\:\mu\:L\) of nucleic acids templets and 10 \(\:\mu\:L\) of nuclease-free water. Droplets were then thermally cycled as follows: one cycle of 50 °C for 60 min and 95 °C for 10 min, followed by 40 cycles of 95 °C for 30 s, 60 °C for 1 min, followed by 98 °C for 10 min and 4 °C for 30 min. A temperature ramp of 1.5 °C was fixed on all PCR steps, and the lid was heated at 105 °C. Following PCR, the 96-well plate was read in the QX200 Droplet Reader (Bio-Rad). The absolute copy number of ORF1a gene of SARS-CoV-2 was calculated using the QuantaSoft analysis software (Bio-Rad).

Real-time reverse-transcription PCR

Extracted nucleic acid samples were tested for comparative analysis of SARS-CoV-2 by PowerChek™ SARS-CoV-2 Real-time PCR Kit (Kogene biotech, Seoul, Korea). The qRT-PCR was performed with a CFX 96 touch real-time PCR detection system (Bio-rad, Hercules, CA, USA), applying the conditions according to the manufacturer’s instructions. The median Ct value of ORF1gene of clinical sample for validation panels is approximately 19.9 (range:15.1 ~ 24.9) (Supplementary Table 3).

NGS assay

The panels of 10 pooled samples for SARS-CoV-2 variants surveillance were evaluated using the Ion AmpliSeq SARS-CoV-2 Research Panel (ThermoFisher Scientific) in accordance with the manufacturer’s instructions. Ion AmpliSeq SARS-CoV-2 Research Panel was designed for sequencing the whole genomes of SARS-CoV-2. cDNA was synthesized using the Ion Torrent NGS Reverse Transcription Kit-cDNA synthesis kit (ThermoFisher Scientific). Sequencing-ready libraries were created from the prepared cDNA using the Ion AmpliSeq Library Kit Plus (ThermoFisher Scientific). Sequencing was carried out on the ThermoFisher Ion S5 XL Sequencer with an Ion 530 Chip (ThermoFisher Scientific).

Results

Evaluation of performance at the PANGO lineage level using simulated NGS datasets

The comparison between true lineages and predicted lineages from the simulated pooled NGS dataset presents a significant challenge due to the extensive diversity and phylogeny of SARS-CoV-2 variants5,16. To assess sensitivity at the lineage level, we employed two strategies.

The first strategy involved assessing whether a true lineage was included among the inferred lineages from a pooled sample. This approach yielded a sensitivity of 69.7% (4,471 out of 6,417). Alternatively, we employed a second strategy where a predicted lineage was considered accurate if it shared considerable markers with the true lineage. For this alternative approach, we defined two lineages A and B as sharing considerable markers if the ratio between the number of intersection of markers of A and B and the number of union of markers of A and B is greater than 90%. This alternative approach resulted in a sensitivity of 82.8% (5,314 out of 6,417).

Furthermore, for the inclusion of a predicted lineage among the true lineages of a pooled sample, the PPV was 54.1% (4,471 out of 8,258). From the approximate match, the PPV was improved to 77.4% (6,390/8,258). The overall performance is summarized in Fig. 2.

Fig. 2
figure 2

Lineage-Level Performance in COVID-19 Variant Detection. This figure shows the detection performance for key PANGO lineages associated with WHO COVID-19 variants, indicating both exact (superscript a) and approximate (superscript b) match outcomes for Detected Number, Sensitivity, and PPV (Positive Predictive Value). An approximate match is identified by a shared marker ratio exceeding 90%. The figure highlights the efficiency of detecting lineages such as Alpha (B.1.1.7), Delta (AY.4), and Omicron (BA.1) variants, demonstrating the system’s adaptability to both precise and nearly matching genetic sequences.

Evaluation of performance at the WHO variant level using simulated NGS datasets

We then aggregated the true abundance of PANGO lineages into WHO Variant levels. For WHO-designated Variants of Concern (VOC) and Variants of Interest (VOI), the overall sensitivity was 99.1% (2,184 out of 2,204), and the PPV was 99.9% (2,184 out of 2,185). Additionally, we examined the correlation between the true abundance and estimated abundance for the major WHO Variants. The correlation exceeded 0.95 for all major WHO variants. The detailed performance and correlation are depicted in Fig. 3. Additionally, we explored the trends in the estimated abundance of pooled samples over the study period, as shown in Fig. 4. This analysis provided insights into the prevalence and dynamics of different WHO variants within the study population, reflecting the effectiveness of our surveillance strategy in tracking the evolution and spread of SARS-CoV-2 variants.

Fig. 3
figure 3

Performance Evaluation of WHO Variant Level COVID-19 Variation Detection. This figure presents a comprehensive analysis of the sensitivity and positive predictive value (PPV) for the detection of various COVID-19 variants, including Alpha, Beta, Delta, Epsilon, Eta, GH/490R, Gamma, Iota, Kappa, Lambda, Mu, Omicron, and Zeta. The assessment was based on a total of 1120 tests, with the count of True Positives (TP), False Negatives (FN), and False Positives (FP) summarized for each variant. Sensitivity and PPV are depicted for each variant, highlighting the high accuracy and reliability of the detection method across the spectrum of COVID-19 variants, with most variants showing near-perfect sensitivity and PPV rates.

Fig. 4
figure 4

Temporal Trends in WHO COVID-19 Variant Prevalence Through Simulated Pooled NGS Data. This figure shows estimated fractions of WHO-recognized COVID-19 variants from simulated pooled NGS, including Lambda, Mu, Omicron, GH/490R, Iota, Kappa, Epsilon, Eta, Gamma, Alpha, Beta, Delta, and Zeta. The simulation spans several months, illustrating the dynamic shifts in variant prevalence. Each variant’s estimated fraction is plotted over time, highlighting the changing landscape of COVID-19 variants.

Reference panel and clinical validation

We evaluated the whole workflow by using six reference panels mixed with different WHO variants at desired ratios, as shown in the left bar (REAL) of Fig. 5. The right bar (Estimated) represents the predicted abundance obtained by analyzing each pooled NGS dataset from the reference panel.

Fig. 5
figure 5

Reference Panel Evaluation for COVID-19 Variant Detection. This figure illustrates the performance of six reference panels in accurately estimating the fraction of WHO-designated COVID-19 variants, including Alpha, Beta, Delta, Gamma, and Omicron. For each panel, the figure compares real versus estimated values, showing the precision of detection across six different panels. It highlights the ability of the reference panels to closely estimate the actual fraction of variants.

To evaluate the monitoring capability of our method for detecting new variants and the domestic spread of SARS-CoV-2, we collected specimens from COVID-19 patients (n = 100) from December 2021 to January 2022, reflecting the recent trends of the epidemic. This period was selected as it coincided with the emergence of the Omicron variant in Korea. The left pane of Fig. 6 displays National prevalence in Korea for the Delta and Omicron variants during December 2021 to January 2022. The right pane of Fig. 3 illustrates the estimated prevalence of Delta and Omicron from the 100 clinical samples. For the former 4 weeks, we sequenced 20 samples with two pools for each week, for the latter 2 weeks, we sequenced 10 samples with one pool for each week.

Fig. 6
figure 6

Analysis of Clinical Samples for COVID-19 Variant Prevalence. This figure demonstrates the prevalence of COVID-19 variants Omicron and Delta over a series of weeks, contrasting pooled clinical samples with official South Korean national prevalence data. It visually represents the proportion of each variant detected, emphasizing the shifting dominance of variants over the observe period. The side-by-side comparison with national statistics highlights the dynamic nature of the pandemic’s evolution within South Korea, reinforcing the essential role of ongoing surveillance in guiding public health strategies and interventions.

Discussion

The global public health threat posed by emerging and re-emerging infectious diseases highlights the urgent need for advanced diagnostic and surveillance technologies that facilitate timely detection, monitoring, and outbreak control17,18,19,20,21.

In this study, we implemented a pooling strategy that allows up to ten individual samples to be combined into a single sequencing library. This approach facilitates the collection of unbiased samples and significantly reduces the labor, time, and resource requirements associated with processing numerous individual samples, which is particularly beneficial in large-scale surveillance settings. Furthermore, by enabling more efficient and cost-effective genomic surveillance, this approach can assist in identifying and monitoring new variants of the virus, thereby informing public health strategies to mitigate the transmission and clinical impact of COVID-19. Given the continuous evolution of SARS-CoV-2, regular updates to the lineage markers remain essential.

Compared to other pooled testing strategies, particularly those using PCR, our approach offers several distinct advantages. Primarily, by utilizing WGS, it enables the detection of a comprehensive range of mutations and provides more detailed genetic information. Because WGS captures the full mutation profile of pooled samples, it is more effective in identifying diverse and novel SARS-CoV-2 variants. This capability is especially valuable for epidemiological analyses and for guiding public health responses to newly emerging variants.

WGS-based surveillance has demonstrated substantial public health benefits in real-world applications. Notably, the U.S. CDC’s Traveler-based Genomic Surveillance program (TGS) at international airports effectively employs WGS to rapidly detect and monitor emerging SARS-CoV-2 variants, highlighting its practical value in facilitating timely public health interventions22. Large-scale genomic surveillance programs require highly efficient and streamlined workflows. Although several high-capacity multiplexed WGS-based approaches, such as COVseq23, have been developed, our pooling method eliminates the need for individual sample barcoding, thereby simplifying the workflow and further reducing resource requirements. Pooling strategies potentially offer an additional axis of optimization to enhance efficiency and scalability in surveillance programs.

Nevertheless, this study has several limitations. First, the reliance on simulated datasets might not fully represent the complexity and heterogeneity inherent in real-world clinical samples. Second, although pooling enhances surveillance efficiency, it may inadvertently obscure the detection of variants present at low frequencies. Third, the pipeline’s performance could vary depending on sequencing depth, viral load, and sample quality. Future research should assess the robustness of this method across diverse epidemiological contexts and laboratory conditions to confirm its broader applicability.