Background & Summary

Giardia duodenalis (synonyms Giardia lamblia and Giardia intestinalis) is a unicellular flagellated parasite that infects the gastrointestinal tract of a wide range of mammalian animals, including humans1. It is estimated that G. duodenalis causes infection in approximately 280 million people annually worldwide2. Investigations into DNA polymorphisms indicate that G. duodenalis, despite being morphologically indistinguishable, should be recognized as a species complex3. G. duodenalis can be classified into eight assemblages (A-H) commonly identified by nucleic acid polymorphisms of the beta-giardin (bg) gene, glutamate dehydrogenase (gdh) and triosephosphate isomerase (tpi)4.

The first sequenced genome of G. duodenalis (WB isolate) was reported in 2007, which yielded a genome of 11.7 Mbp in size and was composed of 306 contigs (92 scaffolds)5. So far, at least 38 G. duodenalis genomes have been sequenced and reported (assemblage A to E)6, which were represented by assemblage A (WB/C65, Be-27), assemblage B (GS8, BAH15c19), assemblage C (cyste1, cyste3, pool8)10, assemblage D (cyste2, cyste4, pool5)10, and assemblage E (P15), and some other isolates11,12. However, the majority of the Giardia genomes had been assembled only at the contig level6. The fragmented assembly of G. duodenalis genomes imposes great constraints on our understanding of its chromosomal structure and the evolutionary dynamics at the chromosomal scale. With the rapid development of genome sequencing technology, long-read sequencing using Oxford Nanopore Technologies (ONT) and Pacific Biosciences (Pacbio) have been used to generate genome for G. duodenalis13,14. Among them, the genomes of the G. duodenalis reference isolate WB-C6 and the Be2 isolate have been assembled to the chromosomal scale by means of sequencing millions of trophozoites or cysts7,14.

For most microorganisms, the sample quantity needed for genomic sequencing is generally acquired through in vitro axenic clones, which is both time-consuming and labor-intensive. However, for microorganisms that are either non-culturable or hard to cultivate in vitro, securing an adequate quantity for whole-genome sequencing poses a great challenge. In more recent years, a technology that combines multiple displacement amplification (MDA) with single-cell genomic sequencing could be a breakthrough for this problem, where only a small number of homogeneous cells are needed. Now, single-cell sequencing is widely used for human cells15,16, ruminant livestock17, and some microorganisms, such as Plasmodium vivax18, Eimeria tella19, Cryptosporidium20, etc. Furthermore, Kooyman et al. first isolated single and pooled cysts of G. duodenalis, and then performed genome sequencing using MDA technology and Illumina paired-end sequencing, which led to the generation of a draft genome11. Therefore, employing the MDA technology to amplify the genomic DNA of single microbial cells enables the acquisition of a substantial quantity of DNA, which is suitable for subsequent genome sequencing.

In this study, we performed genome assembly of G. duodenalis using ten trophozoites sequenced on both the ONT and PacBio platforms and assembled a contiguous, chromosome-scale genome of G. duodenalis (g12a2). The total length of the assembled G. duodenalis genome was 11.1 Mbp, with five contigs. The genome results indicated that the longest contig was 3,279,057 bp, and the N50 and N90 of the contigs were 3,062,379 bp and 1,457,149 bp, respectively. The GC% and BUSCO completeness score of the newly assembled genome were 49.23% and 21.6%, respectively. A high-resolution genetic map of G. duodenalis can enhance our understanding of the diversity and evolution of G. duodenalis, promote research in comparative genomics of G. duodenalis, and open up new avenues for drug and vaccine development.

Methods

Single trophozoite selection

The G. duodenalis trophozoites used in this study were preserved in the Henan International Joint Laboratory for Zoonoses (Zhengzhou, China). They were axenically cultured in TYI-S-33 medium, and these trophozoites belong to sub-assemblage AI (named isolate g12a2 in this study). The axenically cultured trophozoites were aliquoted into 2 mL centrifuge tubes, and 30 μL of the trophozoite suspension was placed as droplets into the culture medium, which was then sealed with paraffin oil. The selected trophozoites were washed with 0.1 mol/L phosphate-buffered saline (PBS, pH 7.0) and then isolated under an inverted microscope (OLYMPUS-BX53, Japan) using a micromanipulator (World Precision Instruments Inc., USA). The isolated 10 trophozoites of G. duodenalis were preserved together in one PCR tube containing 4 microliters of phosphate buffered saline (PBS) for subsequent DNA extraction (Fig. 1).

Fig. 1
figure 1

Workflow for the generation of Giardia duodenalis genome assembly in this study.

Genomic DNA preparation

Ten G. duodenalis trophozoites samples were lysed and subsequently subjected to whole-genome amplification using the REPLI-g single cell kit (QIAGEN, Germany), which was based on a MDA method. The whole-genome amplification (WGA) products were purified using Agencourt AMPure XP beads (BECKMAN, USA) at a ratio of 1:0.9. This process effectively removed dNTP, primers, primer dimers, salt ions, and other impurities from the amplified products, so as to achieve the selective retention of DNA fragments.

Fig. 2
figure 2

Composition of the Giardia duodenalis genome of isolate g12a2 in this study. The outermost layer represents chromosomes, with chr1, chr2, chr3, chr4, and chr5 represented by five different colors. Gene content information is represented by a histogram. Repeat sequences are represented by a heat map. The GC content is represented by a line graph.

Quality assessment of the total genomic DNA

The concentration of the WGA product of G. duodenalis was determined to be 1094.0 ng/μL using a NanoDrop One spectrophotometer (Thermo Fisher Scientific, USA). The ratios of A260/A280 and A260/A230 were 1.85 and 2.29, respectively. These values indicate that the DNA was highly pure and free from contamination by impurities, such as polysaccharides, phenol, and salts (Table 1).

Table 1 Comparison of the Giardia duodenalis genome.

Fifteen pairs of specific primer sequences were designed based on the reference genome sequence of G. duodenalis in the National Center for Biotechnology Information (NCBI) database (Table 2). Among the 15 pairs of designed primers, 14 successfully amplified the target amplicons through PCR amplification. Furthermore, nested PCR amplification of the bg, gdh, and tpi loci of G. duodenalis yielded target bands at the corresponding positions (Table 2). This result suggests that the extracted DNA has a relatively complete coverage. Quantitative polymerase chain reaction (q-PCR) primers were designed and used for detection. The upstream primer F was “CTCCCTCCTTTGTGCACCTTTCTACAGCTA”, and the downstream primer R was “CTCGCCCATGATTCTACGTCTTTCAGAGTG”. We then performed the q-PCR detection on the G. duodenalis genomic DNA to assess its purity. The obtained CT values ranged from 21.39~23.00, indicating that the DNA content was within the normal range.

Table 2 Primer sequences for Giardia duodenalis genome coverage detection.

Short-read sequencing

The qualified whole genome DNA of G. duodenalis was sent to a commercial genome sequencing company (Wuhan Nextomics Biosciences Co., Ltd., China) for genome sequencing. For MGI Tech Sequencing (MGI), 1 μg of this genomic DNA was used to prepare a second-generation sequencing library with the MGIEasy Universal DNA Library Prep Kit V1.0 (CAT#1000005250, MGI). The library was then sequenced on the MGISEQ-2000 platform (BGI, Shenzhen, China), generating second-generation sequencing data. After that, the reads were demultiplexed, and the files were exported in fastq format (Fig. 1). Approximately 2.2 Gb of 150 bp paired-end reads (clean data) with 198 × coverage were generated using the MGI sequencing platform.

Long-read sequencing

For Oxford Nanopore Technologies sequencing (ONT), approximately 8–10 µg of genomic DNA was repaired using the NEBNext FFPE DNA Repair Mix (Cat #M6630, NEB, USA). Subsequently, end-repair and dA-tailing were carried out with NEBNext Ultra II reagents (NEB, USA). The dA-tailed insert molecules were further ligated with an ONT adaptor via ligation kit SQK-LSK114 (Oxford Nanopore Technologies, UK). The concentrations of library fragments were quantified with the Qubit® 3.0 Fluorometer (Invitrogen, USA). Based on the quantified concentrations, an appropriate amount of the DNA library was selected. The DNA library was then loaded into the primed PromethION 48 sequencer (ONT, UK) flow cell and the ONT sequencing data were collected in fastq format.

For Pacific Biosciences (PacBio) sequencing, the quality control of at least 15 µg of genomic DNA was conducted using the Femto Pulse system (Agilent, USA), followed by shearing of the DNA with the Megaruptor 3 system (Diagenode). The dA-tailed insert fragments were further ligated and subjected to nuclease treatment using the SMRTbell® prep kit 3.0 (PacBio, USA). Library size selection and purification were performed using the PippinHT system (Sage Science, USA). Ultimately, the DNA library was constructed using the Revio enzyme binding kit (Pacific Biosciences, USA), and sequenced on the PacBio platform. The resulting PacBio sequencing data were collected in BAM format. Specifically, the ONT and PacBio sequencing generated approximately 11.5 Gb and 12.1 Gb of data, respectively, achieving 1029 × coverage for the former and 1088 × coverage for the latter against the G. duodenalis genome (estimated at ~12 Mb).

Pre-assembly data processing

The sequencing data for the g12a2 isolate from the three platforms of MGI, ONT, and PacBio were preprocessed. This involved the use of Fastp (v 0.23.4)21 to remove adapter contamination, filter out low-quality reads, and perform a thorough quality control analysis. Following, the BWA alignment tool (v 0.7.17)22, using the BWA-MEM algorithm, was employed to align the sequencing data from MGI, ONT, and PacBio with the reference genome Be2, which yielded alignment rates for each dataset. The calculated alignment rate for each sequencing dataset served as a metric to assess the completeness and integrity of the sequencing data.

Genome assembly

For the ONT sequencing data, the Canu (v 2.0)23 software was used to generate an assembly result based on the Overlap-Layout-Consensus (OLC) algorithm with default parameters. Meanwhile, the MECAT2 assembly tool24 produced a mecat assembly result by exploiting the overlap relationships among the reads, using default parameters. Both the Canu and MECAT2 assembly results were processed to remove contigs with abnormal GC content. Subsequently, the assemblies were aligned to the reference genome using NUCmer (v 3.1)25, and contigs with extremely low alignment rates were removed. To further improve the assembly results, MGI sequencing data were used for error correction. The MGI sequencing data were aligned to the assembly results using BWA (v 0.7.17), and the aligned sequences were sorted using Samtools (v 1.9)26 to generate index files. These BAM files were then polished in two rounds using Pilon (v 1.24)27 (with the parameters:–changes–vcf–diploid), which iteratively corrected the contigs in the original assembly results. Ultimately, the corrected Canu and MECAT2 assembly results were obtained.

To improve the quality of the genome assembly, we compared the overlapping regions and sequence features of the assembly results generated by Canu and MECAT2. By integrating the information from both assembly results, we employed the Quickmerge tool (v 0.3)28 (with the parameters -hco 5.0 -c 1.0 -l 10000 -ml 100) to merge the outputs of Canu and MECAT2. This process yielded a higher-quality genome assembly, namely the ONT assembly merge.

For the PacBio sequencing data, we performed the same processing as for the ONT sequencing data, which generated a PacBio merged assembly result, referred to as the PacBio assembly merge. We used the Quickmerge tool (v 0.3) with default parameters to integrate the ONT and PacBio merged assembly results, thereby obtaining an ONT-PacBio merged assembly result.

With the default parameters of the RagTag tool (v 2.1.0)29, we conducted error correction, scaffolding, and gap filling between scaffolds. We used the Oxford ONT-PacBio merged assembly result as the input file for a secondary assembly. Ultimately, we obtained the final assembly result for G. duodenalis (isolate g12a2), which was 11.1 Mbp in size and had a contig N50 of 3.1 Mbp (Fig. 2).

Data Records

The raw sequencing data for G. duodenalis (isolate g12a2), including MGI short reads (accession CRA021161)30, PacBio (accession CRA021160)31 and ONT long reads (accession CRA021157)32, and the whole-genome assembly (accession GWHFIGV00000000.1)33 of the G. duodenalis isolate can be accessed through National Genomics Data Center, China National Centre for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (PRJCA033502)34. The genome assembly for G. duodenalis generated in this study has been submitted to the NCBI GenBank database with the assembly accession GCA_049639855.135. Moreover, the genomic annotation results have been deposited in the Figshare database36.

Technical Validation

The ONT sequencing data for the G. duodenalis genome, which were processed by Fastp (v 0.23.4), were aligned to the assembled g12a2 genome, and each position in the assembly result of g12a2 was supported by at least 171 × reads. Subsequently, Python was used to plot and present the results (Fig. 3).

Fig. 3
figure 3

Mapping the Oxford Nanopore Technologies sequencing data after Fastp to the reads of the g12a2 assembly. The vertical axis represents the amount of Oxford Nanopore Technologies sequencing data mapped to this region. The horizontal axis represents the new genome.

NUCmer (v 3.1) with the parameters of “–maxgap = 50000” and “–mincluster = 100” was employed to align the final genome assembly result (isolate g12a2) with the two reference genomes, Be2 (GCA_026248805.1)37 (Fig. 4a) and WB-C6 (GCA_000002435.2)38 (Fig. 4b). Subsequently, the dot-plot function of MUMmerplot (v 3.5)39 was used to assess and visualize the generated delta files. Additionally, GenomeSyn (v 1.41)40 was used to identify the synteny between g12a2 and the two reference genomes, and the findings were displayed in a graphical format (Fig. 4c). The alignment analysis confirmed that these five contigs generated in this study (isolate g12a2) exhibited a high degree of congruence with reference genome of GCA_026248805.1 (Be2), and also demonstrated a strong similarity to the initial five contigs of reference GCA_000002435.2 (WB-C6), which correspond to the previously identified chromosomes.

Fig. 4
figure 4

The comparison between the g12a2 assembly and the previously published Be2, WB-C6 isolate assembly. (a) The g12a2 assembly shows a high concordance with the previously published Be2 isolate assembly. (b) The g12a2 assembly shows a high concordance with the first five contigs of the previously published WB-C6 isolate assembly. (c) Genome collinearity analysis of g12a2 with Be2 and WB-C6. The collinear gene blocks were determined by NGenomeSyn between genomes for three Giardia duodenalis isolates. Syntenic relationship in gene organization across the genomes of g12a2, Be2 and WB-C6. Syntenic blocks, which consist of regions containing orthologous genes, are connected by lines.

The N50 in genome assembly is an important indicator for evaluating the quality of genome assembly. For a genome, a larger N50 indicates less fragmentation and can reduce the disruption of gene structure caused by fragmentation41. In this study, the N50 of contigs for the novel assembly of isolate g12a2 was 3.1 Mbp, which was higher than the N50 of the reference genome of WB-C6 (2.8 Mbp) and that of the Be2 isolate (3.0 Mbp).

Benchmarking Universal Single-Copy Orthologs (BUSCO, v 5.2.2)42 was used to evaluate the integrity of genomes g12a2, Be2, and WB-C6 with the parameters “-m geno -l eukaryota_odb10 -offline -i” based on the eukaryotic database updated on January 8, 2024. It was reported that although BUSCO was not an optimal tool in diplomonads since many of the conserved proteins used in the BUSCO analyses have been lost43, the analysis was conducted for isolate g12a2 in this study. A BUSCO analysis was carried out on the genome assemblies of g12a2, WB-C6, and Be2, and the completeness percentages were 21.6%, 22.4%, and 22.4%, respectively. The BUSCO output scores were plotted and visually presented using the ggplot2 library available via the R Tidyverse package (Fig. 5).

Fig. 5
figure 5

Comparison of Benchmarking Universal Single-Copy Orthologs (BUSCO) evaluation results of g12a2 (this study) with reference results of Be2 and WB-C6.

In this study, a chromosome-scale genome assembly of G. duodenalis was obtained through long-read sequencing of ten trophozoites. The quality of novel genome sequencing and assembly was comprehensively evaluated using multiple methods, including genome coverage detection, synteny analysis with the reference genome, and BUSCO evaluation. This study provides alternative opportunities for high-quality genome sequencing of free-living microorganisms, as well as those that are non-culturable or hard to cultivate in vitro in the short term.