Background & Summary

Cryptosporidium spp. are parasitic apicomplexans that cause moderate-to-severe diarrhea in humans and animals1. The lack of widely efficacious medications and the absence of a vaccine necessitate heavy reliance on infection prevention for the management of cryptosporidiosis, thereby highlighting the urgent requirement for innovative interventions2,3. Cryptosporidium species have been detected in 155 mammalian species, including primates4,5. Currently, at least 44 species of Cryptosporidium have been identified6. Several species, including Cryptosporidium parvum, Cryptosporidium ubiquitum, and Cryptosporidium muris, exhibit wide host ranges, leading to zoonotic infections in conjunction with other Cryptosporidium spp7. Whole-genome sequencing (WGS) and comparative genomic analysis have been employed to elucidate the genetic underpinnings responsible for variations in host range among different species of Cryptosporidium, as well as the process of host adaptation within each species8,9,10. The use of WGS analysis has become more prevalent in the characterization of Cryptosporidium owing to the emergence of next-generation sequencing (NGS) technologies. A total of 15 species have been subjected to genome sequencing, encompassing C. parvum, Cryptosporidium hominis, C. ubiquitum, Cryptosporidium meleagridis, and others. The majority of the available genomic sequence data (19 sequences) pertain to the zoonotic C. parvum, yet only two of these sequences have been annotated11. The initial comprehensive genome assembly for C. parvum Iowa II was made accessible in 2004 using a random shotgun sequencing technique. This approach yielded a total of 9.1 Mb of DNA sequences distributed across all eight chromosomes12. According to previous studies, the genetic divergence between C. parvum and C. hominis was estimated to be approximately 3%-5% at the DNA level13.

One of the primary challenges in genomics research on Cryptosporidium spp. is the limited availability of adequately purified oocysts in sufficient quantities for NGS analysis, primarily because no in vitro culture system is capable of propagating the parasites. Previous WGS analyses of Cryptosporidium have been conducted using oocysts purified from infected laboratory animals12,14,15. Troell et al.16 sequenced the genome of single Cryptosporidium oocysts and then performed a comprehensive whole-genome analysis by comparison with a de novo assembly of the reference population genome. That work was a significant milestone because it established the feasibility of acquiring high-quality genomic data, with both extensive coverage and precise information, from single-celled eukaryotes16. However, previous research on Cryptosporidium involved only single-oocyst NGS of the genome, without assembling it at the chromosomal level.

Here, we aimed to address this limitation by generating a reference genome for C. parvum using long-read sequencing data from the Oxford Nanopore Technologies (ONT) and PacBio high-fidelity (HiFi) sequencing platforms, along with error correction using short-read data. The assembled genome of C. parvum was 9.13 Mb in length and showed high completeness, with 98.2% complete single-copy BUSCO genes. A total of 3,915 protein-coding genes were predicted, of which 3,666 genes (93.6%) were functionally annotated. This study is an attempt to produce a high-quality chromosome-level genome assembly of a Cryptosporidium species by amplifying DNA from 10 oocysts and applying long-read sequencing, a strategy that might also be effective for genome sequencing projects of other difficult-to-collect or uncultivable pathogens.

Methods

Sample collection and genome sequencing

The Cryptosporidium strain was isolated from a calf with pre-weaning diarrhea in Henan, China, and identified as C. parvum using the SSU rRNA gene17. It was then subtyped by sequence analysis of the 60 kDa glycoprotein gene18 and identified as the IIdA19G1 subtype. Oocysts of the identified Cryptosporidium species were purified using a three-step procedure (Fig. 1) comprising filtration of raw feces through an 80-mesh iron sieve, sucrose gradient centrifugation, and cesium chloride gradient centrifugation19,20. Purified Cryptosporidium oocyst fluid (6 μL) was drawn up with a 10 μL pipette and dropped onto a glass petri dish. Under an inverted Olympus microscope at 60 × (OLYMPUS-BX53, Japan), individual C. parvum oocysts were isolated using a three-axis hydraulic micromanipulator (World Precision Instruments Inc., USA). In total, 10 oocysts were selected and pooled into a PCR tube containing 4 μL PBS buffer (Fig. 1).

Fig. 1

The purification and collection process of oocysts. (Yellow arrow: C. parvum oocyst).

The 10-oocyst sample was then lysed and whole-genome amplified using the REPLI-g Single Cell Kit (based on the multiple displacement amplification method; QIAGEN, Germany). The resulting whole-genome amplification (WGA) products were purified using Agencourt AMPure XP beads (BECKMAN, USA) to remove dNTPs, primers, primer dimers, salt ions, and other impurities. Measured with a NanoDrop One (Thermo Fisher Scientific, USA), the concentration of the C. parvum WGA product was 762 ng/μL. Measured with a Qubit 3.0 (Invitrogen, USA), the total quantity of the WGA product was 30 μg, and the Nc/Qc (NanoDrop/Qubit) value was 1.2.

The high-quality amplified DNA was used to construct the genomic library, and the library was size-selected using BluePippin (Sage Science, USA). The purified and size-selected library was then sequenced on the Pacific Biosciences Sequel II platform (HiFi) in continuous long-read mode (Pacific Biosciences, USA) and on the PromethION 48 sequencer (ONT, UK), in each case following the manufacturer’s instructions. A total of 3.5 Gb (386 × coverage) of PacBio HiFi reads and 8.8 Gb (967 × coverage) of ONT long reads were obtained after removing adaptors and chimeric reads (Table 1). For short-read sequencing, library preparation was performed with 50 ng of fragmented DNA using the MGIEasy Universal DNA Library Prep Kit (MGI, Shenzhen, China), and the library was then sequenced on the MGISEQ-2000 platform (BGI, Shenzhen, China). About 1.6 Gb (173 × coverage) of 150-bp paired-end reads (clean data) were generated using the MGI sequencing platform (Table 1).
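The fold-coverage figures above follow from dividing the sequencing yield by the genome size. The short sketch below illustrates that arithmetic with the rounded values quoted in the text. The function name and the use of the 9.13 Mb assembly length as the denominator are assumptions made for illustration, so the results only approximate the reported values, which were presumably computed from unrounded yields.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Return the average fold coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Rounded values taken from the text; the denominator is an assumption.
GENOME_SIZE_BP = 9.13e6
YIELDS_BP = {
    "PacBio HiFi": 3.5e9,
    "ONT": 8.8e9,
    "MGI short reads": 1.6e9,
}

for platform, bases in YIELDS_BP.items():
    depth = sequencing_depth(bases, GENOME_SIZE_BP)
    print(f"{platform}: ~{depth:.0f}x")
```

Because the inputs are rounded to two significant figures, differences of a few fold from the values in Table 1 are expected.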

Table 1 Sequencing data used for the genome assembly of C. parvum.

De novo assembly

We first used SACRA v.2.021 to split chimeric long reads derived from multiple displacement amplification and fastp v.0.20.122 to trim adapters and low-quality bases in short reads. SACRA v.2.0 identified and split 486,818 chimera-containing reads in the PacBio data and 1,394,568 in the ONT data. The clean long reads from the ONT and PacBio platforms were independently assembled using NextDenovo v.2.5.2 (https://github.com/Nextomics) and Canu v.2.2.223 with default parameters (Fig. 2). To improve assembly contiguity, the outputs for each platform were merged using Quickmerge v.0.3 with default parameters (https://github.com/mahulchak/quickmerge). The merged assembly was then polished for two rounds with Pilon v.1.24 (https://github.com/broadinstitute/pilon) using clean short reads24 (Fig. 2). For this, short reads were first mapped to the assembly using BWA v.0.7.1025 with default parameters, and only reads with a mapping quality of at least 30 were used for polishing (--minmq 30). The polished assemblies from the two sequencing platforms were further merged using Quickmerge v.0.3. Finally, we obtained a total genome length of 9.13 Mb across eight assembled contigs, six of which were capped by telomeric repetitive sequences (TTTAGG)n at one or both ends (Table 2).
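One simple way to check whether a contig is capped by the (TTTAGG)n repeat is to search a window at each end for tandem copies of the unit on either strand. The sketch below is illustrative only: the text does not describe how telomeric ends were detected, and the function names, the 200-bp window, and the three-copy threshold are assumptions.

```python
import re

TELOMERE_UNIT = "TTTAGG"  # repeat unit reported in the text


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_telomeric_repeat(end_seq: str, unit: str = TELOMERE_UNIT,
                         min_copies: int = 3) -> bool:
    """True if end_seq holds >= min_copies tandem copies of unit on either strand."""
    end_seq = end_seq.upper()
    for motif in (unit, reverse_complement(unit)):
        # Builds e.g. "(?:TTTAGG){3,}" for the forward-strand motif.
        if re.search(f"(?:{motif}){{{min_copies},}}", end_seq):
            return True
    return False


def telomere_status(contig: str, window: int = 200):
    """Return (start_capped, end_capped) for the two ends of a contig."""
    return (has_telomeric_repeat(contig[:window]),
            has_telomeric_repeat(contig[-window:]))


# Toy contig (not real data): repeat on the reverse strand at the start only.
toy = "CCTAAA" * 4 + "ACGT" * 100
print(telomere_status(toy))
```

Searching both strands matters because the same repeat reads as TTTAGG at one end of a contig and as its reverse complement, CCTAAA, at the other.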

Fig. 2

Framework of genome assembly.

Table 2 Comparison between the assembled and published C. parvum reference genomes.

The genome assembly statistics, including contig length, N50, and GC content, were comparable to those of the published C. parvum reference genome. Benchmarking Universal Single-Copy Orthologs (BUSCO) v.5.4.626 was used to evaluate the completeness of the C. parvum genome assembly against the Coccidia_odb10 database.
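For readers unfamiliar with these statistics, the sketch below shows how N50 and GC content are conventionally computed from a set of contigs. It is a minimal illustration, not the tool used to produce Table 2; the function names and toy inputs are assumptions.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


def gc_content(sequences):
    """Fraction of G and C among unambiguous (A/C/G/T) bases."""
    gc = acgt = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0


# Toy values (not from the assembly).
print(n50([10, 8, 6, 4, 2]))
print(gc_content(["GGCC", "AATT"]))
```

Ambiguous bases such as N are excluded from the GC denominator here; some tools include them, which can shift the reported percentage slightly.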

Gene prediction and annotation

Protein-coding genes were predicted through the integration of ab initio methods, homology alignment data, and transcriptomic data as described previously27. Briefly, the transcriptomic data28 for gene model training and protein data29 for homology alignment of C. parvum were downloaded from CryptoDB (https://cryptodb.org). For ab initio methods, PASA v.2.4.030 was applied to produce candidate gene structures, which were used as a training set for SNAP (v.2013-11-29)31, Augustus v.3.3.332 (--genemodel=complete), GenomeThreader v.1.6.133, and GlimmerHMM v.3.0.434 using default parameters. Subsequently, Augustus v.3.3.332 and GlimmerHMM v.3.0.434 were used to predict gene structures using the trained gene models. Gene models derived from the ab initio and homologous alignment approaches were finally integrated into a non-redundant gene set using EvidenceModeler v.1.1.135, and 3,915 protein-coding genes were predicted (Table 2).

The predicted protein sequences were functionally annotated by searching against 18 databases using InterProScan v.5.4536, including CDD37, Coils38, Gene Ontology39, Gene3D40, Hamap41, MobiDBLite42, PANTHER43, Pfam44, Phobius45, PIR46, PRINTS47, ProSite48, SFLD49, SignalP50, SMART51, SUPERFAMILY52, TIGRFAM53, and TMHMM54 (Table 3). Finally, 3,666 genes (93.6% of the total) were successfully annotated.

Table 3 Gene function annotation statistics of the assembled C. parvum genome.

Noncoding RNA annotation

Non-coding RNAs are usually divided into several groups, including rRNA, tRNA, miRNA, and snRNA. The rRNA genes were identified with Barrnap v.0.955 using default parameters. tRNAscan-SE v.2.0.1256 was used to predict tRNAs with eukaryote parameters. The miRNA genes were identified by searching the miRBase v.21 database57 using default parameters. The snRNA genes were predicted using INFERNAL v.1.158 based on the Rfam v.12.0 database59 using default parameters. In total, 14 rRNAs, 45 tRNAs, no miRNAs, and 8 snRNAs were predicted (Table 4).

Table 4 Noncoding RNA of the assembled genome.

Data Records

The raw sequencing data, including MGI short reads (accession CRA01331560), PacBio HiFi (accession CRA01331661) and ONT long reads (accession CRA01332062), and the whole-genome assembly (accession GWHEQBI0000000063) of the C. parvum IIdA19G1 strain can be accessed through the National Genomics Data Center, China National Centre for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (PRJCA02054064). The genome assembly65 has also been submitted to the NCBI database under the BioProject accession number PRJNA1045063. Moreover, the genomic annotation results have been deposited in the Figshare database66.

Technical Validation

We evaluated the assembly using two criteria: the mapping of short and long sequencing reads and BUSCO assessment. The reads from the short-insert library were re-mapped onto the assembly using BWA v.0.7.1025, while PacBio HiFi and ONT long reads were aligned using minimap2 v.2.2467 with default parameters. The assembly completeness was evaluated using BUSCO v.5.4.626 with the Coccidia dataset in genome mode (-l coccidia_odb10 -m geno). The mapping rate for short reads was 99.4%, while the mapping rates for HiFi and ONT long reads were 99.6% and 97.7%, respectively (Table 5). Moreover, 98.2% of the BUSCO genes were recovered as complete single copies in the assembled genome (Table 2). Overall, these assessments independently confirmed the accuracy and completeness of the genome assembly.

Table 5 Results of long and short sequencing reads mapped to the assembled C. parvum genome.