Background & Summary

Prototheca is a genus of achlorophyllous green microalgae classified within the phylum Chlorophyta, class Trebouxiophyceae, and family Chlorellaceae. It is widely distributed across diverse environments worldwide1. Prototheca species are pathogens capable of infecting humans (primarily Prototheca wickerhamii) and cattle and dogs (mainly Prototheca bovis and Prototheca ciferrii); infections caused by Prototheca cutis, Prototheca blaschkeae and Prototheca miyajii have also been reported1,2,3. The infection, known as protothecosis, has been documented in several hundred cases4. Over the past two decades, the prevalence of Prototheca as an emerging pathogen has increased significantly5. However, several knowledge gaps persist regarding its biological characteristics, particularly its pathogenicity. In particular, the limited genomic and molecular research on Prototheca impedes the development of robust diagnostic tools and a comprehensive understanding of its pathogenesis. To enhance public awareness of protothecosis and improve its identification and diagnosis, we have established the Protothecosis Science Popularization and Monitoring Consortium (PSPMC) and the China Prototheca Working Group (CPWG) in collaboration with multiple organizations.

Owing to the rapid development and falling cost of next-generation long-read sequencing technologies, genome sequencing has become more cost-effective6,7. For the genus Prototheca, several genomes have been assembled, including P. wickerhamii strain ATCC 16529 (ref. 4), P. wickerhamii strains S1 and S931 (ref. 8), P. wickerhamii strain InPu-22_FZ (ref. 9) and two strains of Prototheca zopfii, Pz20 and Pz23 (ref. 10). In addition to genome reference resources, multi-omics approaches such as transcriptomics, proteomics, and metabolomics have been employed to further elucidate the pathogenic mechanisms of Prototheca11. Prototheca infection is generally diagnosed by culture of blood, body fluid, or tissue12. However, diagnosis remains a significant challenge, particularly when protothecosis is not suspected clinically, and isolates cannot be assigned to a specific Prototheca species without DNA sequencing. Genomic data from different Prototheca species therefore not only enhance diagnostic precision but also provide valuable insights for foundational research. A total of eighteen Prototheca species have been reported: fifteen were documented in 2019 (ref. 13), and three newly identified species, Prototheca fontanea, Prototheca lentecrescens, and Prototheca vistulensis, were characterized later14. However, all the samples used in that study were obtained from Poland and were typed by cytochrome b (CYTB)-based PCR-restriction fragment length polymorphism (RFLP). P. bovis and P. ciferrii, previously classified as P. zopfii genotype 2 and genotype 1, respectively5,15, are closely associated with dairy herd environments and remain under investigation. In this study, samples of the two species were collected from China. P. bovis has been identified as the most pathogenic species in dairy cattle16, while P. ciferrii has been reported to cause infections in dogs with a more aggressive disease course17,18.
A high-quality genome serves as the genetic foundation for molecular research and gene diagnostics. Recently, high-throughput long-read sequencing, particularly PacBio HiFi sequencing, has enabled the generation of the highest-quality genomes to date19. Genome assembly algorithms have also advanced the field by significantly improving computational efficiency and reducing costs19. The two high-quality genome resources presented here, for P. bovis SH08 and P. ciferrii SH13, hold significant value for taxonomic identification, evolutionary studies, a more comprehensive understanding of Prototheca biology, and the diagnosis of protothecosis.

Methods

The process of isolating and obtaining the Prototheca strains was approved by the Medical Ethics Committee of Shanghai East Hospital, Tongji University (Approval No. 2024YS-274). An overview of the experimental and bioinformatic workflow used in this study is depicted in Fig. 1. Briefly, the process encompassed sample collection, DNA extraction, multi-platform sequencing, genome assembly, annotation, and comprehensive quality assessment.

Fig. 1

Workflow of genome sequencing, assembly, and annotation for Prototheca bovis SH08 and Prototheca ciferrii SH13.

Sample collection and extraction

Prototheca bovis strain SH08 was obtained from a fresh cow milk sample in Shanghai, China. Prototheca ciferrii strain SH13 was obtained from skin tissue samples of human patients in Shanghai, China. The two strains were classified as P. bovis and P. ciferrii following a previously described protocol based on analysis of the CYTB gene14. Samples of P. bovis SH08 and P. ciferrii SH13 are available from the biobank of Shanghai East Hospital, Tongji University. The two strains were cultivated on Sabouraud dextrose agar medium in a bacteriological incubator at 35 °C for 4 days and then harvested for DNA extraction. High molecular weight (HMW) genomic DNA of P. bovis SH08 and P. ciferrii SH13 was extracted using the CTAB method, following a previously reported protocol20.

Short-read library construction, sequencing, and genome survey

A short-read library was generated for P. bovis SH08. Genomic DNA was fragmented on a Covaris E220 system, and a library with an insert size of 300–500 bp was prepared using the MGIEasy Universal DNA Library Prep Set (MGI-Tech). Paired-end 150 bp (PE150) reads were sequenced on the MGISEQ-2000 platform, generating approximately 5.5 gigabases (Gb) of sequencing data (Table 1) after filtering with the same parameters as in our previous study10. The 17-mer and 19-mer frequencies were calculated for P. bovis SH08 using Jellyfish21. The estimated genome sizes are approximately 29.06 Mb and 30.1 Mb for the 17-mer and 19-mer analyses, with totals of 4,621,095,510 and 4,551,930,609 k-mers and peak depths of 159 and 151, respectively, as calculated by KmerFreq v5.0 (ref. 21).
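The genome survey figures can be checked with the commonly used relation between k-mer counts and genome size (genome size ≈ total k-mers / peak k-mer depth). The sketch below is only an illustration of that arithmetic using the numbers reported in the text; the function name is hypothetical, and whether KmerFreq applies further corrections (for example for heterozygosity or sequencing errors) is not stated here.

```python
def estimate_genome_size(total_kmers: int, peak_depth: int) -> float:
    """Estimate genome size in bp as total k-mer count / peak k-mer depth.

    This is the simple textbook relation; it is an assumption that it
    matches what the survey tool reports.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Values reported in the text for the 17-mer and 19-mer analyses of SH08.
size_17 = estimate_genome_size(4_621_095_510, 159)
size_19 = estimate_genome_size(4_551_930_609, 151)

print(f"17-mer estimate: {size_17 / 1e6:.2f} Mb")
print(f"19-mer estimate: {size_19 / 1e6:.2f} Mb")
```

Both estimates agree with the reported values of about 29.06 Mb and 30.1 Mb, which suggests that the published sizes follow directly from this ratio.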

Table 1 The sequencing data statistics for the P. bovis SH08 and P. ciferrii SH13 genomes.

PacBio library construction, sequencing, and genome assembly

For PacBio HiFi sequencing, two libraries with an approximate insert size of 20 kb were generated for P. bovis SH08 and P. ciferrii SH13 using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA). Approximately 1.99 Gb (119,083 reads) and 4.17 Gb (242,272 reads) of clean data were generated for P. bovis SH08 and P. ciferrii SH13, respectively, using PacBio Sequel II SMRT cells in HiFi mode (Table 1). The average length of CCS reads was 16,720 bp and 17,214 bp for P. bovis SH08 and P. ciferrii SH13, respectively (Table 1). The nuclear genomes of P. bovis SH08 and P. ciferrii SH13 were assembled from the HiFi long reads using hifiasm v0.7 with default parameters22. The assembled genome of P. bovis SH08 is 30.6 Mb, with a contig N50 of 1.19 Mb and a total of 53 contigs. Similarly, the assembled genome of P. ciferrii SH13 is 32.7 Mb, with a contig N50 of 1.78 Mb and a total of 96 contigs (Fig. 2 and Table 2). The longest contigs in the P. bovis SH08 and P. ciferrii SH13 assemblies are 2,899,485 bp and 2,975,726 bp, respectively (Table 1). The GC content of P. bovis SH08 and P. ciferrii SH13 is 73.6% and 67.8%, respectively (Table 1).
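The assembly statistics quoted above (contig N50 and GC content) follow standard definitions. The sketch below shows one way to compute them from contig lengths and sequences. The function names and the toy contig lengths are hypothetical, and this is not the authors' statistics script.

```python
def n50(lengths):
    """Return the contig N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def gc_content(seq):
    """Return the GC fraction of a sequence, ignoring N and other
    non-ACGT characters."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


# Hypothetical contig lengths (bp), for illustration only.
toy_contigs = [50, 40, 30, 20, 10]
print("N50:", n50(toy_contigs))
print("GC fraction:", gc_content("GGCCAT"))
```

In practice the lengths and sequences would be read from the assembly FASTA file; the toy values here only demonstrate the definitions.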

Fig. 2

Genome characteristics of P. bovis SH08 and P. ciferrii SH13.

Table 2 The genome assembly and annotation statistics for P. bovis SH08 and P. ciferrii SH13.

Genome annotation

Tandem repeats were identified using TRF (4.9) with default parameters23. Homologous repeat sequences in the P. bovis SH08 and P. ciferrii SH13 genomes were then identified using RepeatProteinMask (v4.0.7)24 and RepeatMasker (open-4.0.9)25 based on the Repbase library (http://www.girinst.org/repbase)26. De novo repeat libraries for the two strains were generated using RepeatModeler open-1.0.11 (ref. 40) and LTR_FINDER_parallel 1.0.7 (ref. 41), and repeats were then identified ab initio with RepeatMasker (open-4.0.9)25. Finally, 14.79% and 14.24% of the assembled P. bovis SH08 and P. ciferrii SH13 genomes, respectively, were classified as repetitive sequences (Table 2 and Table 3). After repeat identification, three approaches, namely homology-based annotation, ab initio prediction and RNA-Seq-based annotation, were employed for gene prediction. For homology-based annotation, protein sequences from 10 published genomes (Coccomyxa subellipsoidea, Chlorella variabilis, Chlorella desiccata, Chlorella sorokiniana, Auxenochlorella protothecoides, three strains of Prototheca wickerhamii (ATCC 16529, S1 and S931), and two strains of Prototheca zopfii (Pz20 and Pz23)) were integrated with the MAKER2 pipeline27. For ab initio prediction, the repeat-masked P. bovis SH08 and P. ciferrii SH13 genome sequences were used to identify coding regions of genes with AUGUSTUS v3.2.3 (ref. 28) and SNAP29. For RNA-based prediction, the related transcriptome of P. zopfii (7.5 Gb) from a published study30 was used. The RNA reads were mapped to the P. bovis SH08 and P. ciferrii SH13 genomes using HISAT2 v2.2.1 (ref. 31), followed by transcript assembly using StringTie v2.1.6 (ref. 32). The final gene sets were obtained by integrating the three strategies using MAKER2 (v2.31.10)27. In total, 5,141 and 4,986 protein-coding genes were obtained for the P. bovis SH08 and P. ciferrii SH13 genomes, respectively (Table 2). The number of gene models with an Annotation Edit Distance (AED) score ≤ 0.5 reached 4,957 (96.4%) for P. bovis SH08 and 4,805 (96.4%) for P. ciferrii SH13, indicating robust support from the available evidence.
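To make the AED criterion concrete, the sketch below computes a simplified, nucleotide-level AED for one gene model and the share of models at or below a threshold. It assumes the usual definition AED = 1 − (SN + SP)/2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases covered by evidence. The function names and the toy coordinates are hypothetical, and MAKER2's internal calculation may differ in detail.

```python
def aed(annotation_bases: set, evidence_bases: set) -> float:
    """Simplified nucleotide-level Annotation Edit Distance.

    0 means the annotation matches the evidence exactly; 1 means no overlap.
    """
    if not annotation_bases or not evidence_bases:
        return 1.0
    overlap = len(annotation_bases & evidence_bases)
    sensitivity = overlap / len(evidence_bases)    # evidence recovered
    specificity = overlap / len(annotation_bases)  # annotation supported
    return 1.0 - (sensitivity + specificity) / 2


def fraction_supported(aed_scores, threshold=0.5):
    """Fraction of gene models with AED <= threshold."""
    scores = list(aed_scores)
    return sum(s <= threshold for s in scores) / len(scores)


# Hypothetical gene model and evidence alignment (genomic positions).
gene = set(range(100, 200))
evidence = set(range(150, 250))
print("AED:", aed(gene, evidence))

# Percentages implied by the counts reported in the text.
print(f"SH08: {100 * 4957 / 5141:.1f}%")
print(f"SH13: {100 * 4805 / 4986:.1f}%")
```

The last two lines only recompute the reported proportions from the gene counts given above; they do not re-derive the AED scores themselves.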

Table 3 The repetitive sequence statistics in the P. bovis SH08 and P. ciferrii SH13.

Functional annotation

The gene sets of the P. bovis SH08 and P. ciferrii SH13 genomes were functionally annotated against seven databases: NR version 2023-04-01 (NCBI nonredundant protein), KEGG version 105.0, 2023-01-01 (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/), KOG version 2023-03-01 (ref. 47), TrEMBL version 2023-03-01 (http://www.uniprot.org), Swiss-Prot version 2023-03-01 (http://www.gpmaw.com/html/swiss-prot.html), InterPro 93.0 and Gene Ontology (GO) version 2023-04-01. Functionally annotated genes accounted for 97.28% and 99.16% of the annotated genes in the P. bovis SH08 and P. ciferrii SH13 genomes, respectively (Table 4). A total of 4,344 (84.50%) and 4,175 (83.73%) genes were functionally annotated in the KEGG database (Fig. 3). The most represented metabolic category in both strains is carbohydrate metabolism (387 and 365 genes; Fig. 3). A total of 3,577 and 3,529 genes were annotated in all seven databases for P. bovis SH08 and P. ciferrii SH13, respectively (Fig. 4).
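Counts such as "genes annotated in all databases" and "genes annotated in at least one database" are set intersections and unions over per-database hit lists. The sketch below illustrates this with hypothetical gene identifiers and a hypothetical `hits_by_db` mapping; it also recomputes the KEGG percentages from the counts given in the text.

```python
def annotated_in_all(hits_by_db):
    """Gene IDs that have a hit in every database."""
    if not hits_by_db:
        return set()
    return set.intersection(*hits_by_db.values())


def annotated_in_any(hits_by_db):
    """Gene IDs that have a hit in at least one database."""
    return set().union(*hits_by_db.values())


def percent(count, total):
    """Percentage of `total` represented by `count`."""
    return 100 * count / total


# Hypothetical per-database hits for a toy gene set of four genes.
hits_by_db = {
    "NR": {"g1", "g2", "g3"},
    "KEGG": {"g1", "g2"},
    "Swiss-Prot": {"g1", "g3"},
}
print("all databases:", sorted(annotated_in_all(hits_by_db)))
print("any database:", sorted(annotated_in_any(hits_by_db)))

# KEGG annotation rates implied by the reported gene counts.
print(f"SH08 KEGG: {percent(4344, 5141):.2f}%")
print(f"SH13 KEGG: {percent(4175, 4986):.2f}%")
```

The shared-gene numbers shown in a Venn diagram correspond to `annotated_in_all` applied to the databases included in that diagram.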

Table 4 The functional annotation statistics of the P. bovis SH08 and P. ciferrii SH13.
Fig. 3

The KEGG annotation of predicted coding genes in P. bovis SH08 and P. ciferrii SH13.

Fig. 4

Venn diagram of functional annotation in five databases (NR, InterPro, KEGG, KOG and Swiss-Prot).

Ethics statement

The Ethics Committee of Shanghai East Hospital, Tongji University approved the study protocol and the sharing of de-identified genomic data. Informed consent was obtained from all participants. The ethical approval for this study covers the isolation of Prototheca from cow milk and from patient samples, and the consent of the dairy farm manager was obtained when the milk samples were collected.

Technical Validation

Evaluation of the genome assembly

The quality of the two assemblies was assessed by aligning the HiFi reads to the assembled genomes using Minimap2, resulting in a mapping rate and coverage of 96.91% and 98.57% for P. bovis SH08 and of 96.35% and 96.71% for P. ciferrii SH13, respectively. The mean mapping depth is 60.15× and 114.37× for P. bovis SH08 and P. ciferrii SH13, respectively. The distributions of sequencing depth and accumulative coverage, calculated in 10 kb windows, indicate that the genome assemblies are of high quality (Fig. 5). The accuracy of the final assemblies was assessed using Merqury with 19-mers, yielding a quality value (QV) of 56.4816 (accuracy 99.99977%) for SH08 and a QV of 57.2924 (accuracy 99.99981%) for SH13. The genome sizes of P. bovis SH08 and P. ciferrii SH13 are comparable to those of P. zopfii Pz20 and Pz23 (~31 Mb), and the contig N50 values of both assemblies exceed 1 Mb, indicating highly contiguous, high-quality genome assemblies. The GC content of the P. bovis SH08 and P. ciferrii SH13 genomes is 73.6% and 67.8%, respectively (Fig. 6), with no indication of contamination by exogenous species in either assembly. To screen for potential contamination, the HiFi assembled genomes were aligned against the GenBank nucleotide (nt) database, and the available Prototheca organelle RefSeq genomes were retrieved from NCBI. The contigs exhibited no bacterial contamination, and only a limited number of organelle sequences were identified and subsequently excluded from the nuclear genomes. The completeness of the P. bovis SH08 and P. ciferrii SH13 genome sequences was evaluated using BUSCO with the “chlorophyta_odb10” database. The analysis revealed that 76.7% and 78.2% of the 1,519 conserved chlorophyta genes were identified as complete (Table 5), which is comparable to other Prototheca species.
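The QV scores reported above are Phred-scaled, so each maps to a per-base error rate of 10^(−QV/10). The short sketch below converts the two reported QV values to accuracy percentages; the function name is hypothetical.

```python
def qv_to_accuracy_percent(qv: float) -> float:
    """Convert a Phred-scaled consensus quality value to percent accuracy."""
    error_rate = 10 ** (-qv / 10)
    return 100 * (1 - error_rate)


# QV values reported in the text for the two assemblies.
for strain, qv in [("SH08", 56.4816), ("SH13", 57.2924)]:
    print(f"{strain}: QV {qv} -> {qv_to_accuracy_percent(qv):.5f}% accuracy")
```

Both conversions agree with the accuracies quoted in the text to within the last reported digit, corresponding to roughly two errors per million bases.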

Fig. 5

The plot of sequencing depth and accumulative coverage.

Fig. 6

The distribution of GC content, sequencing depth and accumulative coverage.

Table 5 The BUSCO statistics of Prototheca genomes.

Evaluation of the gene set

Based on the comparison of gene features, the distributions of gene (mRNA) length and coding sequence (CDS) length were remarkably similar among the eight green algae (Fig. 7a, c). P. bovis SH08 and P. ciferrii SH13 showed a pattern similar to that of P. zopfii Pz20 and Pz23 in terms of exon length, intron length, and exon number, which is consistent with their comparable genome sizes and relatedness (Fig. 7b, d, e). These gene features indicate that the gene annotation is of high quality when compared with published data. The completeness of the P. bovis SH08 and P. ciferrii SH13 gene sets was evaluated using BUSCO with the “chlorophyta_odb10” database. The analysis revealed that 76.7% and 77.7% of the 1,519 conserved chlorophyta genes were identified as complete (Fig. 8), indicating high-quality gene annotation comparable to that of other Prototheca species (77.3% and 77.6% in P. zopfii Pz20 and Pz23)10.

Fig. 7

The gene features in P. bovis SH08, P. ciferrii SH13 and other related green algal species. (a) Distribution of gene (mRNA) length. (b) Distribution of exon length. (c) Distribution of CDS length. (d) Distribution of intron length. (e) Distribution of exon number.

Fig. 8

The BUSCO assessment analysis for gene sets in P. bovis SH08 and P. ciferrii SH13.

Genome sequences comparison

To confirm the genome quality, the published P. bovis SAG 2021 and P. zopfii Pz20 genomes were downloaded. The published P. bovis SAG 2021 genome assembly comprises 24,744,895 bp in 4,555 contigs15. This fragmented assembly was scaffolded and patched using the RagTag software33, with our newly assembled P. bovis SH08 genome as the reference. A total of 3,815 contigs were placed, with a total length of 21,558,679 bp (87.1% of the assembled genome size). The two genomes were then compared using MUMmer4 in nucmer mode (-c 2000 -l 400). The comparison of the P. bovis genomes revealed clear collinearity and high similarity (>97%), suggesting that the genome assembly is accurate (Fig. 9a). The newly assembled P. ciferrii SH13 genome was compared with the previously published P. zopfii Pz20 genome, also using MUMmer4 in nucmer mode (-c 2000 -l 400). This comparison revealed a nearly one-to-one correspondence between contigs and high similarity (>96%), suggesting that the genome assembly is correct (Fig. 9b).

Fig. 9

Comparison of genome sequences. (a) The sequence comparison between the P. bovis SH08 and P. bovis SAG 2021 genomes. (b) The sequence comparison between the P. ciferrii SH13 and P. zopfii Pz20 genomes.

Data Records

The dataset has been deposited under PRJNA1184973 in the Sequence Read Archive (SRA)34. DNBSeq short-read data of P. bovis SH08 were deposited in the SRA at SRR31320872. HiFi long-read data of P. bovis SH08 and P. ciferrii SH13 were deposited in the SRA at SRR31320874 and SRR31320873, respectively. The genome assemblies of P. bovis SH08 and P. ciferrii SH13 have been deposited at GenBank under accessions JBJGBU000000000 (ref. 35) and JBJGBT000000000 (ref. 36), respectively. The P. bovis SH08 and P. ciferrii SH13 genome sequences and annotations were deposited in Figshare37.