Background & Summary

Cryptosporidium is an apicomplexan protozoan parasite of global medical, scientific, and veterinary significance that can cause moderate-to-severe diarrhea in humans and animals1. It is the leading cause of waterborne disease outbreaks in the United States2,3. Though cryptosporidiosis occurs in both immunocompromised and immunocompetent individuals, it is especially severe in immunocompromised and elderly populations as well as in children, resulting in persistent infection, malnutrition, and, in some cases, death3,4,5. In 2019, the Global Burden of Disease study found 133,422 global deaths and an annual loss of 8.2 million disability-adjusted life years (DALYs) due to Cryptosporidium6. C. meleagridis is an avian and mammalian-infecting Cryptosporidium species that was first described in turkeys7,8. Human infections with Cryptosporidium are caused predominantly by C. parvum and C. hominis, but species such as C. meleagridis can also infect humans. In fact, C. meleagridis is the third most common human-infecting Cryptosporidium species following C. parvum and C. hominis9. Though generally less common, C. meleagridis infection has been reported to be as common as C. parvum in some parts of the world and can lead to death in rare cases10,11.

Currently, 17 of the >30 reported Cryptosporidium species have assembled genome sequences. Twelve species have annotated genome sequences including C. andersoni, C. bovis, C. canis, C. felis, C. hominis, C. meleagridis, C. muris, C. parvum, C. ryanae, C. tyzzeri, C. ubiquitum and C. sp. Chipmunk genotype12,13. Cryptosporidium spp. have eight chromosomes and genome sizes of ~9 Mb. The only C. meleagridis genome sequence, strain UKMEL1 (CmUKMEL1), contains gaps and is assembled into 57 contigs. Historically, it has been challenging to sequence the genome of Cryptosporidium parasites. Sustainable in vitro culture and cloning are not possible. Thus, sequencing a bulk population of parasites, when enough can be isolated, has been the preferred approach. Recently, a new method was implemented to generate genome sequences for Cryptosporidium using multiple displacement amplification, a whole genome amplification (WGA) approach. It was tested on 10 ng of genomic DNA from C. meleagridis strain TU1867 (CmTU1867) which provided sufficient DNA for library construction and generation of a high-quality genome sequence using Oxford Nanopore Technologies (ONT) long-read sequencing14. Here we share a chromosome-level assembly and reannotation of the C. meleagridis genome.

Whole genome sequencing and assembly

The newly generated CmTU1867 genome assembly contains an additional 201,275 base pairs (bp) of sequence relative to CmUKMEL1. The largest contig in the new assembly is 632,735 bp longer than the largest contig in CmUKMEL1. We also note a larger N50 value in the new assembly (Table 1). For comparison, a recent telomere-to-telomere (T2T) genome assembly for the closely related and highly syntenic species C. parvum IOWA II (CpBGF) is provided15. This high-quality C. meleagridis genome assembly results from a new experimental approach designed to help generate long-read genome sequences from limiting quantities of genomic DNA. It is an important resource that will facilitate our understanding of Cryptosporidium evolution and host specificity.
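For readers less familiar with the N50 statistic mentioned above, the following minimal Python sketch shows how it is commonly computed from a list of contig lengths. The contig lengths used here are toy values chosen for illustration only; they are not taken from Table 1.

```python
def n50(lengths):
    """Return the N50 of an assembly.

    The N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths): the same 1,000 bases split into
# four contigs versus two contigs.
toy_fragmented = [400, 300, 200, 100]
toy_contiguous = [700, 300]
print(n50(toy_fragmented), n50(toy_contiguous))
```

A more contiguous assembly of the same total size yields a larger N50, which is why the statistic is used alongside contig count and largest-contig length when comparing assemblies.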

Table 1 Statistics of the C. meleagridis and CpBGF genome assemblies.

The initial genome assembly contained 8 chromosomes and 5 contigs ranging in size from 681–30,300 bp, 2 of which were later identified as contamination and removed. Two additional contigs were manually created (“contig_10” and “contig_11”) from the beginnings of chromosome 2 and chromosome 6 due to detection of assembly artifacts in these chromosomes. The final assembly contains 8 chromosomes and contigs 9–13 (Table 1). Contig_9 and contig_13 have regions of sequence identical to parts of chromosomes 1 and 3, respectively, but assembled separately from the chromosomes. The chromosomes of C. meleagridis are numbered and oriented according to their homology with the highly syntenic C. parvum. The new assembly is highly syntenic to the previous CmUKMEL1 assembly at the nucleotide level (Fig. 1).

Fig. 1

DNA synteny plot mapping the contigs of CmUKMEL1 to the eight chromosome-level contigs of CmTU1867. JupiterPlot comparing the previous CmUKMEL1 genome sequence with the new CmTU1867 genome sequence. Ribbons are colored with respect to the reference CmTU1867 chromosome.

In comparison to CmUKMEL1, the new CmTU1867 assembly lacks telomeres, except for chromosome 5, which has one assembled telomere. A search of the ONT long reads revealed several reads with telomere sequences that did not assemble. Although these reads did not assemble, regions of the reads that did not contain the telomere pattern matched unique sequences in the assembled chromosomes. By mapping these reads back to the genome assembly, we identified three additional telomeres that could be placed manually at the 5′ and 3′ ends of chromosome 3 and the 5′ end of chromosome 4. At least 4 telomere-containing long reads mapped to each of these regions, including at least 1 long (>1 kb) read that extended into unique regions of the chromosome. However, due to the low number of reads in support of these telomeres, we did not extend the ends of chromosomes in the assembly with these telomere-containing reads.

Genome annotation

The new CmTU1867 genome assembly was annotated using the previous CmUKMEL1 annotation, a recent CpBGF annotation, orthology analysis, and de novo gene prediction. Gene expression data for CmTU1867 are not available to assist with the annotation, so UTRs are not predicted. Annotation of CmTU1867 reveals 166 additional protein-encoding genes and numerous additional ribosomal RNAs (Table 2).

Table 2 Annotated genes and RNAs in CmBEI, CpBGF, and CmUKMEL1.

A comparison of the synteny of the protein-encoding genes and rRNAs between CmTU1867 and CpBGF revealed highly syntenic chromosomes (Fig. 2). The new CmTU1867 genome sequence has 16 additional ribosomal RNA genes compared to CmUKMEL1 (Table 2). The 5 small, 5.8S rRNA units are found on chromosomes 1, 2, 7, and 8 (Fig. 2). The six 5S rRNAs in CmTU1867 are in 2 clusters of 3, on chromosomes 3 and 7 (Fig. 2). In CpBGF, the cluster of 5S rRNAs on chromosome 3 contains 2 rRNAs, whereas in CpIOWA-ATCC and CmTU1867, the cluster of 5S rRNAs on chromosome 3 contains 3 rRNAs. These patterns may arise because of variation in the copy number of the 5S rRNA within a population of parasites or among different species of Cryptosporidium, or because of compressions during genome assembly. When CmTU1867 reads were mapped to the assembly at regions where there are 5S rRNA clusters in chromosomes 3 and 7, we saw relatively even coverage throughout the region. However, CpBGF shows 2–3× read compression at these loci on chromosomes 3 and 7, suggestive of population variation. One of the unassembled contigs, contig_9, has an additional 18S/28S cluster. However, because we could not find a chromosomal location for it despite long-read sequencing, and because our sample is not clonal, we do not have sufficient evidence to determine its status.

Fig. 2

Protein synteny analysis of the eight chromosome-level contigs of CmTU1867 and Cryptosporidium parvum, CpBGF. Circos plot rings, moving from the center to the exterior illustrate shared ortholog clusters between CmTU1867 and CpBGF, number of base pairs in 50,000 bp increments, GC content histogram, and gene density. Locations of rRNA genes are as indicated.
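A GC content track such as the one in this figure is typically derived by computing the G+C fraction in consecutive windows along each chromosome. The Python sketch below illustrates the idea only; the plot itself was produced with TBTools, and the window size used here is an assumption, not a parameter reported in the text.

```python
def gc_windows(seq, window=50_000):
    """Yield (start, end, gc_fraction) for consecutive, non-overlapping windows.

    The window size is a hypothetical choice for illustration. Ambiguous
    bases such as N are counted in the window length, so they lower the
    reported GC fraction.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        yield start, start + len(chunk), gc / len(chunk)


# Toy usage on a 10-base sequence with 4-base windows.
for start, end, gc in gc_windows("GGCCATATAT", window=4):
    print(start, end, gc)
```

The final window may be shorter than the others when the sequence length is not a multiple of the window size; its GC fraction is computed over the bases it actually contains.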

While annotating, we noticed several genes that encoded a single long protein in CmUKMEL1 but were annotated as two distinct genes in CpBGF. Upon investigation, we discovered that these gene annotations vary in size in several Cryptosporidium species. In CmTU1867, we have kept the long protein annotation when it is observed. There are 20 cases where the single long protein in C. meleagridis does not appear to exist as a single open reading frame in CpBGF (Table 3). A lack of RNAseq evidence for C. meleagridis makes it challenging to validate the existence of these long open reading frames, whereas C. parvum has large quantities of expression data available. We made a note in the submitted CmTU1867 annotation if the gene is annotated as two or three distinct genes in other species. Two of the 20 CmTU1867 proteins are annotated as 3 distinct proteins in CpBGF (Table 3).

Table 3 Single large, annotated genes in CmTU1867 that are annotated as two or three distinct sequential genes in CpBGF and/or other Cryptosporidium spp.

Interestingly, each of the 5 annotated 18S and 28S rRNAs has a putative protein-encoding gene within it (Table 4). Our submitted annotation does not contain any of these putative ORFs because their presence would be so unusual that they cannot be accepted by NCBI GenBank. However, we note that they may exist. The 18S rRNA genes encode a putative intron-encoded homing endonuclease. While we detect the presence of this putative protein, we do not detect an intron in the 18S rRNA. The six putative homing endonuclease protein sequences in the 18S rRNAs are not identical due to a guanine deletion at position 1061 in two of the five 18S rRNAs (Chr1 and Chr7). This results in a premature stop codon in three of the putative homing endonuclease sequences (Fig. 3). This indel is likely due to an ONT homopolymer sequencing artifact. BLASTp searches of other Cryptosporidium species revealed annotations of this gene in C. ubiquitum and C. felis. We note that annotation in other species does not make these genes real; only proteomics can confirm them, so they are not included in our submitted annotation.

Table 4 Putative ORFs (not submitted in the CmTU1867 GenBank record) encoded within the 18S and 28S rRNAs in CmBEI and their coordinates.

Fig. 3

Portion of the 18S rRNA gene sequence and the putative ORF contained within it. Multiple sequence alignment of 18S rRNAs representing the guanine SNV and the putative ORFs contained in these sequences.
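To illustrate why a single-base deletion in a homopolymer run can introduce a premature stop codon, the following Python sketch translates a short toy ORF before and after removing one guanine. The sequence is invented for illustration and is not the 18S rRNA or the putative homing endonuclease sequence; the translation uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_until_stop(dna):
    """Translate an uppercase A/C/G/T string in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def delete_base(dna, index):
    """Return dna with the base at the given 0-based index removed."""
    return dna[:index] + dna[index + 1:]


# Hypothetical toy ORF containing a GGG homopolymer (indices 6-8).
intact = "ATGGCTGGGTTAACCAAATAA"
mutant = delete_base(intact, 8)  # drop one G from the homopolymer

print(translate_until_stop(intact))
print(translate_until_stop(mutant))
```

In this toy case the deletion shifts the reading frame so that a TAA codon is read immediately after the homopolymer, truncating the predicted protein. Homopolymer-length errors are a known error mode of ONT reads, so a truncation of this kind does not by itself show that the underlying gene is disrupted.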

The six putative senescence associated proteins encoded in the 28S rRNAs are identical. This protein is found in BLASTp searches in C. hominis TU502, C. canis, C. ubiquitum, and C. muris. This protein has an ART2/RRT15 domain according to InterPro. As was the case with the putative intron homing endonuclease in the 18S genes above, given the location, we have not included these proteins in the submitted annotation due to a lack of evidence for their existence.

Comparison with previous assemblies

The annotation was assessed by comparing the CmTU1867, CmUKMEL1 and CpBGF protein-coding sequence gene content using orthology-based algorithms. Several putative species-specific single-copy genes were identified (Table 5). We identified 23 species-specific genes in CmTU1867, 11 in CmUKMEL1, and 39 in CpBGF (Table 5). This finding is expected because CpBGF, which is a T2T assembly, is the most complete of the three assemblies and CmTU1867 is a more complete genome assembly than CmUKMEL1. To assess whether species-specific genes were located in sub-telomeric regions, the first and last 25 genes of each chromosome were assessed for the presence of species-specific genes. We observe that the putative species-specific genes are not enriched in sub-telomeric regions (bolded gene names in Table 5); rather, they are scattered throughout the genome. If real, the evolutionary origin of these genes is intriguing. However, these results are derived from as-yet incomplete genome assemblies for C. meleagridis and require further validation.
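The sub-telomeric check described above (the first and last 25 genes of each chromosome) can be sketched in Python as follows. This is an illustration of the logic only, not the procedure actually used; the function name, the input layout, and the gene IDs in the usage example are all hypothetical.

```python
def subtelomeric_hits(genes_by_chrom, species_specific, n_terminal=25):
    """Find species-specific genes among the terminal genes of each chromosome.

    genes_by_chrom: dict mapping chromosome name -> list of gene IDs sorted
        by genomic coordinate (assumed input layout).
    species_specific: set of gene IDs flagged as species-specific.
    n_terminal: number of genes inspected at each chromosome end
        (25 in the text).
    """
    if n_terminal < 1:
        raise ValueError("n_terminal must be at least 1")
    hits = {}
    for chrom, genes in genes_by_chrom.items():
        terminal = set(genes[:n_terminal]) | set(genes[-n_terminal:])
        found = [g for g in genes if g in terminal and g in species_specific]
        if found:
            hits[chrom] = found
    return hits


# Toy usage with hypothetical gene IDs and a 2-gene terminal window.
toy_genes = {"chr1": ["g%d" % i for i in range(1, 11)]}
print(subtelomeric_hits(toy_genes, {"g2", "g5", "g10"}, n_terminal=2))
```

Comparing the number of hits with the number expected from the fraction of all genes that fall within the terminal windows gives a simple indication of whether species-specific genes are enriched near chromosome ends.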

Table 5 Putative species-specific genes identified in CmTU1867, CmUKMEL1 and CpBGF.

All orthogroups (groups of multiple shared derived genes, orthologs or paralogs, as opposed to the single-copy genes in Table 5) that were not shared by all three genome assemblies were investigated. Some of the orthogroups fall at the ends of chromosomes in C. parvum that extended beyond the ends of the CmTU1867 and CmUKMEL1 chromosomes. In other cases, they were unannotated in one species or the other but present in the syntenic genome sequence region(s). When we found unannotated proteins that were not initially detected in CmTU1867, we manually added these genes to the submitted annotation. Ultimately, we found very few orthogroups that were unique to a subset of species (Fig. 4). The manual validation of the orthogroups is presented in Supplementary Table 1.

Fig. 4

Venn diagram of orthogroup search results following manual validation. Orthogroup comparison among the new CmTU1867, the previous CmUKMEL1, and the newly released reference genome, CpBGF. See Fig. 6 for the pre-validation results. Arrows link the list of gene IDs found in the smaller orthogroups that are unique to a species or shared by two species.

Methods

Whole genome sequencing and assembly

C. meleagridis isolate TU1867 genomic DNA was obtained from BEI Resources (cat. number NR-2521; ATCC, Manassas, VA). A total of 10 ng of C. meleagridis DNA was amplified through whole genome amplification using multiple displacement amplification (MDA), followed by T7 endonuclease debranching, yielding 400 ng of debranched DNA14. Following sequence generation and assembly, polishing and annotation proceeded as shown in Fig. 5. The sequence was polished with existing reads from the NCBI GenBank Sequence Read Archive accession SRR79356116.

Fig. 5

Experimental workflow for genome sequencing, assembly, annotation, and validation. Bioinformatics workflow for assembly and annotation of the DNA derived from CmTU1867 WGA. Green boxes represent initial steps as well as new data used for parts of the pipeline and blue boxes represent subsequent downstream analyses of the data generated. The illustration was generated in BioRender.

ONT library preparation used the SQK-RBK004 Rapid Barcoding Sequencing Kit (Oxford Nanopore Technologies, Oxford, UK) as per the manufacturer’s instructions. Sequencing was performed on an ONT MinION device with R9.4.1 flow cells and bases were called by guppy v.6.4.2 using the high-accuracy base call model. The long-read fastq reads were assembled using Flye v.2.8.217 with the --nano-raw option and -g 9m. The draft long-read genome assembly was polished with PolyPolish v.0.5.018 using default parameters to increase the accuracy of the base calls by using C. meleagridis strain TU1867 Illumina sequences (NCBI accession SRX253214) generated elsewhere. Intermediate files needed for PolyPolish were generated using BWA v.0.7.1719. The resulting contigs were ordered and oriented to match the CpIOWA-BGF T2T genome assembly15 (GCA_035232765.1), called CpBGF in this manuscript, using the AGAT20 v.1.1.0 Perl script agat_sq_reverse_complement.pl, GenomeTools21, and the progressive Mauve alignment v.1.1.322 in Geneious Prime v.2023.2.123. Contamination was detected by searching the NCBI nr database using BLAST24 (BLASTx default parameters) and FCS-GX25 (Fig. 5). Contaminant contigs were removed from further analysis. Telomeres were identified, as in CpBGF15, using the telomere-locating Python script FindTelomeres to find the Cryptosporidium telomere repeat 5′-CCTAAA-3′ and its complement at the ends of assembled contigs (https://github.com/JanaSperschneider/FindTelomeres). The unassembled ONT long reads were also searched for this telomere repeat with FindTelomeres and reads with telomeres were mapped back to the genome assembly. Read-mapping to the whole genome assembly was performed using minimap2 v.2.26 with the option --secondary=no to prevent multi-mapping. Genome statistics were generated using the GenomeTools v.1.6.221 programs gt stat and gt seqstat. AGAT v.1.1.020 Perl scripts agat_sq_stat_basic.pl and agat_sp_statistics.pl were used to generate statistical information with default parameters.
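The telomere search described above looks for tandem copies of the repeat unit CCTAAA, or its reverse complement, near the ends of a contig or read. The Python sketch below illustrates that idea in a simplified form; it is not the FindTelomeres implementation, and the minimum copy number and terminal window size are assumptions made for this example.

```python
import re

TELOMERE_REPEAT = "CCTAAA"  # repeat unit given in the text


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T/N sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def telomere_ends(seq, min_copies=3, window=50):
    """Report which ends of seq carry a run of telomere repeats.

    min_copies and window are hypothetical thresholds: an end is flagged
    when at least min_copies tandem repeat units (either strand) occur
    within the first or last `window` bases.
    """
    seq = seq.upper()
    units = (TELOMERE_REPEAT, reverse_complement(TELOMERE_REPEAT))
    pattern = re.compile(
        "|".join("(?:%s){%d,}" % (unit, min_copies) for unit in units)
    )
    return {
        "start": bool(pattern.search(seq[:window])),
        "end": bool(pattern.search(seq[-window:])),
    }


# Toy read: four repeat copies followed by non-telomeric sequence.
toy_read = TELOMERE_REPEAT * 4 + "ACGTGCATGCATTGCAGT" * 5
print(telomere_ends(toy_read))
```

Reads flagged in this way would still need to be mapped back to the assembly, as described in the text, so that the non-telomeric part of each read can anchor the telomere to a specific chromosome end.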

Genome annotation

Tracks for manual annotation were generated using a local Apollo2 server26 using two approaches: (1) an orthology-based annotation transfer using the tool Liftoff27 and (2) an ab initio gene prediction using AUGUSTUS28 trained with C. parvum IOWA-ATCC12 (GCA_015245375.1) and CmUKMEL1 (GCA_001593445.1) protein sequences from CryptoDB v.5029. Annotation Liftoff tracks were created from the current CmUKMEL1, CpBGF, and CpIOWA-ATCC annotated genes with the -copies flag to look for extra gene copies. In situations where AUGUSTUS and Liftoff gene structures disagreed, the conflicting gene models were searched using BLASTp in CryptoDB to check for the gene structure that was most abundant in existing annotations. As there are no available RNA-seq data for C. meleagridis, there is no evidence to confirm gene predictions and annotate UTRs. Tracks for prediction and manual annotation of rRNAs were created using barrnap30 with the parameters --kingdom euk --outseq Cmel_barrnap.fasta --evalue 1e-06 --lencutoff 0.8 Cmel_genome.fasta. tRNAscan-SE 2.031 was used to predict tRNAs using default parameters. Functional annotation was generated with Blast2GO32 (using BLASTp, the nr database, word size 5, and e-value 1e-5) and compared with results from the reference T2T CpBGF genome functional annotation. Edits to the CmTU1867 gff file gene names were performed with basic bash and awk commands. InterPro33 was used for classification of protein families investigated in Table 3 and for the genes encoded within rRNAs.

Comparative genomics

A comparison of orthologous genes between the new C. meleagridis assembly and the previous C. meleagridis assembly34 was completed using OrthoFinder v2.5.5, which was run in a conda environment using the latest Anaconda release35 (2024.02-1) with default parameters (latest Diamond algorithm36) and visualized using OrthoVenn337. Figure 4 represents the orthology results following extensive manual validation (Fig. 6 and Supplementary Table 1) of each orthogroup difference. Manual analyses utilized BLASTp searches of both NCBI and CryptoDB29. Orthology, genome, and rRNA comparisons were created using Circos38. The configuration file for Circos was created following the Circos documentation and run using the command: circos -conf config_file.conf. Additionally, TBTools39 was used to visualize the Circos plot. In TBTools, the “Advanced Circos” feature was selected. The ChrLen File and the Links File were generated manually following the Circos format. The rRNA features were added to the plot in the “Set Input Genome Feature List” option in TBTools. Gene density and GC content were created in TBTools following the TBTools documentation with default parameters. The genome comparisons between CmUKMEL1 and CmTU1867 were created using JupiterPlot40. The JupiterPlot was made using the command: jupiter name=$prefix ref=$reference fa=$scaffolds, where the ref variable is set to the reference genome in FASTA format and the fa variable is set to the set of scaffolds in FASTA format. The general parameters were set to the default (t=4) and karyotype options were slightly modified (m=10000, ng=0, i=0, g=1, gScaff=100000, labels=ref). Link options followed the default options (maxGap=100000, minBundleSize=50000, MAPQ=50, linkAlpha=5). CmTU1867 long reads were mapped back to contig regions containing 5S rRNA clusters using minimap2 with --secondary=no to account for multi-mapping.

The raw orthogroups were analyzed and validated extensively to create the final Venn diagram (Fig. 4). Any protein that was syntenic to proteins in the other two genome sequences was moved into an orthogroup with those proteins. Proteins that were found in the sequence of the other two genome sequences but were not annotated in one or more genome(s) were also moved into an orthogroup and added to the Venn diagram.
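The Venn diagram regions correspond to counts of orthogroups grouped by which assemblies contribute at least one gene. A minimal Python sketch of that tally is shown below. The input layout (a dictionary of orthogroups) and the gene IDs are hypothetical and do not reproduce the OrthoFinder output format or the OrthoVenn3 implementation.

```python
from collections import Counter


def venn_counts(orthogroups, assemblies):
    """Count orthogroups by the set of assemblies that have members in them.

    orthogroups: dict mapping orthogroup ID -> dict mapping assembly name ->
        list of gene IDs (assumed layout; an absent key or an empty list
        means the assembly has no gene in that orthogroup).
    Returns a Counter keyed by frozenset of assembly names.
    """
    counts = Counter()
    for members in orthogroups.values():
        present = frozenset(a for a in assemblies if members.get(a))
        if present:
            counts[present] += 1
    return counts


# Toy usage with invented orthogroups and gene IDs.
assemblies = ["CmTU1867", "CmUKMEL1", "CpBGF"]
toy_orthogroups = {
    "OG1": {"CmTU1867": ["t1"], "CmUKMEL1": ["u1"], "CpBGF": ["p1"]},
    "OG2": {"CmTU1867": ["t2", "t3"], "CmUKMEL1": [], "CpBGF": []},
    "OG3": {"CmTU1867": ["t4"], "CmUKMEL1": ["u2"]},
}
for region, n in venn_counts(toy_orthogroups, assemblies).items():
    print(sorted(region), n)
```

Manual validation as described above amounts to editing the membership lists (for example, adding a previously unannotated gene to an orthogroup) and recomputing the tally.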

Fig. 6

Ortholog search results shown in a Venn diagram. Orthogroup comparison among the new CmTU1867, the previous CmUKMEL1, and CpBGF prior to validation and correction. Arrows link the list of gene IDs found in the smaller orthogroups that are unique to a species or shared by two species.

Data Records

The genomic sequence, reads SRR2728254241, and metadata for the Cryptosporidium meleagridis TU1867 strain have been deposited in the National Center for Biotechnology Information (NCBI) GenBank under BioProject accession number PRJNA102204742. This whole genome shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JBCHVM00000000043. The version described in this paper contains WGS scaffolds JBCHVM010000001-JBCHVM010000013.

Technical Validation

CmTU1867 assembly completeness was evaluated using the Benchmarking Universal Single-Copy Orthologs (BUSCO) software v.5.5.044 to search against the apicomplexan database (apicomplexa_odb10), which contains 446 orthologous single-copy genes in total. The results showed an overall completeness score of 96.6% (n = 446): 430 (96.4%) genes were retrieved as complete single-copy BUSCOs and 1 (0.224%) as a complete duplicated BUSCO. These results indicate high completeness of the genome assembly.
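The percentages above follow directly from the reported counts (430 single-copy and 1 duplicated BUSCO out of 446). The short Python check below reproduces the arithmetic; the function name is hypothetical, and the fragmented and missing categories are omitted because they are not reported in the text.

```python
def busco_percentages(single, duplicated, total):
    """Return complete, single-copy and duplicated BUSCO percentages.

    Complete = single-copy + duplicated, each expressed as a percentage of
    the total number of BUSCO groups searched, rounded to one decimal place.
    """
    complete = single + duplicated

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "complete": pct(complete),
        "single": pct(single),
        "duplicated": pct(duplicated),
    }


# Counts reported in the text for CmTU1867 against apicomplexa_odb10.
print(busco_percentages(single=430, duplicated=1, total=446))
```

With these counts, the complete fraction is 431/446, which rounds to the 96.6% overall completeness score, and the single-copy fraction is 430/446, which rounds to 96.4%.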

Further analysis of the assembly and annotated protein-encoding regions utilized an orthology comparison of CmTU1867, CmUKMEL1, and CpBGF with the OrthoFinder algorithm, and the results were visualized in OrthoVenn3 as described in the Methods (Fig. 5). Orthogroups belonging to CmTU1867 only, CpBGF only, CmUKMEL1 and CpBGF only, and CmUKMEL1 and CmTU1867 only were extensively analyzed (Supplementary Table 1). Several genes found only in CpBGF were shown to be subtelomeric in both CmUKMEL1 and CmTU1867 and thus likely missing from the incomplete chromosome ends of CmTU1867. Several genes encoding short (<100 amino acid) proteins found in both CmTU1867 and CmUKMEL1 exist in CpBGF but are unannotated. Following these analyses, a new Venn diagram (Fig. 4) was created that represents the revised, validated findings.

We additionally found one putative open-reading frame (ORF) predicted by AUGUSTUS in CmTU1867 Chr4 region 828518–828700 bp that was not in CmUKMEL1, CpBGF, or any other species according to BLASTp and BLASTn searches. We removed this putative gene from our annotation since we could not validate it with RNAseq data or orthology, but it may be a gene unique to C. meleagridis detected by the improved assembly.