Background & Summary

Commensal bacteria, fungi, viruses and protists are common within the mammalian intestine, interacting with each other and contributing to gut homeostasis1. Protozoa are increasingly being recognised as essential to the gut microenvironment2,3 for their ability to reshape the constituent bacteriome4,5 and mucosal immune responses6,7,8 with minimal signs of disease. Despite the biological significance of these commensal protists, annotated reference genome assemblies that would facilitate genetic diversity and comparative genomic studies are often lacking.

Tritrichomonas musculus, also known as Tmu, is an extracellular anaerobic commensal protist that frequently colonizes the mouse large intestine. The trophozoites are ovoid, 10 to 16 µm long, and possess anterior and posterior flagella and one undulating membrane2 (Fig. 1A). Mice colonized with T. musculus do not present with signs of disease such as diarrhea or weight loss, but do exhibit mild goblet cell hyperplasia, host epithelial cell inflammasome activation, including the release of IL-1β and IL-18, and a profound shift in the 16S bacterial community structure5. This reprogramming of the constituent microbiome and host immune potential is sufficient to protect mice from a lethal challenge with the pathogenic bacterium Salmonella typhimurium6. However, the change in immune status increases the risk of colitis and colorectal cancer6,7.

Fig. 1

(A) Representative T. musculus trophozoite colonizing mouse ceca surrounded by bacteria. In situ cecum scanning electron microscopy image (methods available in Tuzlak & Alves-Ferreira, 2023). (B) Schematic workflow for T. musculus purification and whole genome sequencing. Created with BioRender.com.

Here, we describe a high-quality, annotated T. musculus genome assembly. We purified T. musculus from the cecal content of laboratory mice monocolonized with the EAF2021 isolate. The protists were purified by Percoll gradient centrifugation followed by flow cytometry prior to DNA and RNA extraction. Sequencing libraries were prepared from the purified DNA or RNA, and Illumina short-read, PacBio and Oxford Nanopore (ONT) long-read sequencing were performed to facilitate genome assembly and annotation (Fig. 1B). The 193.5 Mb genome assembled into 756 contigs with an N50 scaffold length of 3.5 Mb. A total of 46,131 protein-coding genes were identified. This annotated genome will be a useful resource to develop a genetic model system to study the biology, evolution and diversity of T. musculus and to discover the metabolic pathways that are essential for colonization and survival within the host.

Methods

Sample collection, whole genome DNA and RNA sequencing

Tritrichomonas musculus trophozoites of the EAF2021 isolate were purified from the cecal content of laboratory mice using flow cytometry as described previously2. High molecular weight (HMW) genomic DNA (gDNA) necessary for long-read sequencing was extracted using the Blood & Cell Culture DNA Mini Kit-G20 (Qiagen), and 700 ng and 1 µg of DNA were used for PacBio and Nanopore library preparation, respectively (Fig. 1B). HMW PacBio libraries were generated following the Pacific Biosciences protocol “Preparing HiFi SMRTbell® Libraries using SMRTbell Express Template Prep Kit 2.0”. The libraries were run on an 8 M SMRTCell using sequencing reagents version 2.0. Sequencing was performed on a Sequel II sequencer (Pacific Biosciences) with control software version 9.0.0.9223 and a movie collection time of 15 hours per SMRTCell. ONT libraries were generated using the Ligation Sequencing SQK-LSK109 kit protocol per manufacturer’s instructions (Nanopore-MinION), and sequencing was performed on a FLO-MIN106 flow cell using the operation software MinKNOW (version 21.06.9) for 60 hours. In total, 947,266 and 2,631,769 raw reads were generated from PacBio and ONT sequencing, respectively (Table 1). PacBio reads were used for genome assembly and ONT reads larger than 10 kb were used for scaffolding. Total RNA was extracted using the RNeasy Micro Kit (Qiagen) and submitted to CD Genomics (New York, USA) for library preparation and sequencing. RNA libraries were prepared using the SMART-Seq v4 Ultra Low Input RNA Kit (Takara) and sequencing was performed on a Hi-Seq X Ten (Illumina) by CD Genomics. A total of 55,879,424 cleaned reads were used for genome annotation (Table 1).
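The length cut-off applied to the ONT reads before scaffolding can be illustrated with a minimal Python sketch. It is not the tool used in this study; the function names (`parse_fastq`, `filter_by_length`) are hypothetical, and the input is assumed to be well-formed four-line FASTQ.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def filter_by_length(records: Iterable[Record], min_len: int = 10_000) -> Iterator[Record]:
    """Keep only reads strictly longer than min_len bases (10 kb by default)."""
    return (rec for rec in records if len(rec[1]) > min_len)


if __name__ == "__main__":
    # Toy reads; a small threshold stands in for the 10 kb cut-off.
    toy = ["@short", "ACGTA", "+", "IIIII", "@long", "ACGTACGTACGT", "+", "I" * 12]
    for header, seq, _ in filter_by_length(parse_fastq(toy), min_len=10):
        print(header, len(seq))
```

In practice the same filter would be applied to the base-called FASTQ file, with `min_len` left at its default of 10,000.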

Table 1 Sequencing data statistics.

Genome assembly

PacBio HiFi reads were assembled using canu (v2.1)9. MinION reads >10 kb were used to correct and scaffold the contigs using LongStitch (v1.0.1)10 with the parameters k = 32 and w = 250 and the gap fill option selected. Next, the assembly was polished with the PacBio reads using racon (v1.4.13)11. Genomic summary statistics were determined using Galaxy12. Forty contigs were removed from the assembly because they were <1 kb in length or had either zero or only one MinION read mapping to the PacBio assembly. Only contigs that had similar coverage from both long-read sequencing platforms were included. The genome had a GC content of 29.47% (Table 2). In summary, the 193.49 Mb genome was assembled in 756 contigs with an N50 of ~3.5 Mb; the shortest contig was 7,256 bp and the longest was ~11 Mb (Table 2).
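For readers unfamiliar with the summary statistics reported here, the following Python sketch shows how N50 and GC content are conventionally defined. It is an illustration under that standard definition, not the Galaxy workflow used in this study, and the function names are hypothetical.

```python
from typing import Iterable, Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths supplied")


def gc_percent(sequences: Iterable[str]) -> float:
    """Return GC content as a percentage of unambiguous (A/C/G/T) bases."""
    gc = 0
    acgt = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("no unambiguous bases supplied")
    return 100.0 * gc / acgt


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))          # toy contig lengths
    print(gc_percent(["GGCC", "AATT"]))  # toy sequences
```

Ambiguous bases such as N are excluded from the GC denominator in this sketch; other tools may count them differently, which can shift the reported percentage slightly.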

Table 2 T. musculus genome assembly statistics.

Gene and functional predictions

Gene annotation was done using funannotate (v1.8.11)13, which removed 298 duplicate contigs from the assembly. Repeat masking was done using redmask (v0.0.2)13, which masked 42.67% of the genome, indicating that the genome is highly repetitive; this likely explains its larger size compared to other trichomonads (Table 3). NCBI’s Foreign Contamination Screen14 identified 18 contigs as contaminants, which were removed from the assembly, and an additional 4 contigs as chimeric, which were separated into multiple contigs with the intervening sequence removed. RNA-Seq data used by funannotate were assembled with Trinity (v2.12.0)15. Funannotate ran the following programs: PASA (v2.5.2)16, GeneMark (v4)17, Augustus (v3.3.3)18, eggnog-mapper (v2.1.2)19, signalp (v6)20, and interproscan (v5.52–86.0)21. Ribosomal RNA genes were identified by RNAmmer (v1.2)22. Relatively few genes possessed introns (5.66%), and the gene function annotation predicted that 36.6% of the genome was covered by 46,131 genes. BUSCO statistical analysis showed that the genome assembly captured 100% of the expected trichomonad genes, indicating that the genome assembly was complete, but only 53% of available BUSCO genes within the eukaryota_odb10 database (Table 4). Accordingly, 44,152 genes were identified as protein-coding genes and 39,744 as hypothetical genes. Of the hypothetical genes, 24,105 have an InterPro number, 13,990 have a GO (Gene Ontology) term, 4,671 have an EC (Enzyme Commission) number and 24,215 have InterPro, GO and EC numbers (Table 3).

Table 3 T. musculus genome annotation statistics.
Table 4 BUSCO statistics.

Ploidy analysis

Ploidy analysis was performed using a combination of GenomeScope 2.0 and Smudgeplot23 on T. musculus Illumina Hi-Seq reads (150 bp, two biological replicates). GenomeScope was run to fit a mixed model of negative binomial distributions to the k-mer spectrum of the sequencing data as measured by FASTK (Fig. 2). The k-mer spectrum was also visualized using Smudgeplot to estimate organism ploidy. The ploidy analysis by GenomeScope was consistent with T. musculus being haploid, with a homozygosity rate of 99.7% and a heterozygosity rate of 0.345% (Fig. 2). However, the ploidy analysis using Smudgeplot (Sup Fig. 1), a software suite trained to recognize polyploid states, instead supported a diploid organism with highly homozygous sister chromosomes.
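The k-mer spectrum underlying this analysis can be sketched in a few lines of Python. This is a deliberately naive illustration: it counts forward-strand k-mers only (no canonical or reverse-complement handling), and the genome size estimate simply divides the total k-mer count by the multiplicity of the main peak, rather than fitting the mixture model that GenomeScope uses. The function names and the `min_multiplicity` error cut-off are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def kmer_spectrum(reads: Iterable[str], k: int = 21) -> Counter:
    """Return the k-mer spectrum: multiplicity -> number of distinct k-mers.

    Forward-strand k-mers only; k-mers containing N are skipped.
    """
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return Counter(counts.values())


def naive_genome_size(spectrum: Dict[int, int], min_multiplicity: int = 2) -> int:
    """Estimate genome size as (total k-mers) / (multiplicity of the main peak).

    k-mers seen fewer than min_multiplicity times are treated as likely
    sequencing errors and ignored (an assumption of this sketch).
    """
    usable = {m: n for m, n in spectrum.items() if m >= min_multiplicity}
    if not usable:
        raise ValueError("no k-mers above the error cut-off")
    peak = max(usable, key=lambda m: usable[m])
    total_kmers = sum(m * n for m, n in usable.items())
    return total_kmers // peak


if __name__ == "__main__":
    toy_reads = ["ACGTAC"] * 4  # four identical toy reads
    spectrum = kmer_spectrum(toy_reads, k=3)
    print(dict(spectrum), naive_genome_size(spectrum))
```

A single dominant peak in such a spectrum is what a haploid or highly homozygous genome is expected to produce; a heterozygous diploid would typically show a second peak at about half the main coverage.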

Fig. 2

GenomeScope visualization of the T. musculus genome showed a single k-mer peak, indicating a haploid genome. K-mer analysis (k = 21) estimated a genome size of 334 Mb with coverage of 20.3X, 50.5% unique sequence, a homozygosity rate of 99.7% and a heterozygosity rate of 0.345%.

Comparative functional genomic analysis

A gene functional comparison of T. musculus was performed by comparing predicted Pfam domains against other representative genomes, including protozoa that colonize the mammalian gut mucosa and are described as pathogenic or opportunistic, as well as Saccharomyces cerevisiae. Predicted T. musculus proteins were submitted to InterProScan v.5.30–69.021 for functional annotation (minimum E-value 1e-05). Pfam domains were retrieved from available annotated genomes for Trichomonas vaginalis G3 2022, Tritrichomonas foetus strain K, Giardia Assemblage A isolate WB 2019 (GiardiaDB release 62), Histomonas meleagridis 2922-C6/04-10x (TrichDB release 62) and Entamoeba dispar SAW760 (AmoebaDB release 62). For Blastocystis sp. ST1 (UP000078348) and Saccharomyces cerevisiae (UP000002311), Pfam annotations were retrieved from proteomes in UniProt release 2023_01 (Fig. 3A). Overall frequencies of domain-containing proteins were tabulated and ranked for each organism.
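The tabulation and ranking step can be illustrated with a short Python sketch. It assumes the InterProScan output has already been reduced to a mapping from protein identifier to Pfam accessions; the protein identifiers below are hypothetical, and ties are broken alphabetically by accession, which is a choice made for this example rather than a documented detail of the analysis.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def domain_frequencies(protein_domains: Mapping[str, Iterable[str]]) -> Counter:
    """Count, for each Pfam accession, how many proteins contain it.

    A domain repeated within one protein is counted once for that protein.
    """
    counts: Counter = Counter()
    for domains in protein_domains.values():
        counts.update(set(domains))
    return counts


def rank_domains(counts: Mapping[str, int]) -> Dict[str, int]:
    """Rank accessions from most abundant (rank 1) to least abundant.

    Ties are broken alphabetically by accession (an assumption of this sketch).
    """
    ordered = sorted(counts, key=lambda acc: (-counts[acc], acc))
    return {acc: rank for rank, acc in enumerate(ordered, start=1)}


if __name__ == "__main__":
    # Hypothetical protein IDs with Pfam accessions mentioned in the text.
    toy = {
        "prot_1": ["PF00069", "PF13306"],
        "prot_2": ["PF00069"],
        "prot_3": ["PF00069", "PF00069"],
    }
    counts = domain_frequencies(toy)
    print(counts.most_common())
    print(rank_domains(counts))
```

Running the same two functions on each organism's annotation gives per-species rank tables that can be compared side by side for any chosen set of domains.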

Fig. 3

Functional protein domain comparison between T. musculus and other protists colonizing the mammalian gut mucosa. (A) Total number of Pfam domains identified for each annotated protist genome analyzed. The Saccharomyces cerevisiae genome was included as a model organism. (B) Top 20 most abundant Pfam domains detected across all annotated T. musculus protein sequences, ranked by their frequency of occurrence compared to other protists and yeast (left panel). Each unique Pfam domain was ranked by the frequency of its occurrence. The number in each box represents the rank order within each species from most abundant to least abundant (right panel). Shading was dependent on rank order, with darker shades indicating a lower rank.

We examined the frequency of the 20 most abundant Pfam domains identified in the T. musculus proteome against other infectious protists and S. cerevisiae (Fig. 3B; Suppl. Table 1). The most prevalent domain in T. musculus, the protein kinase domain (PF00069), with 3,411 copies in the genome, was consistent with other organisms. As expected, the transcription-initiator DNA-binding domain IBD (PF10416), first reported to bind a DNA element unique to T. vaginalis, was present only in the trichomonads24. Of particular interest was the BspA-type leucine-rich repeat region (LRR_5; PF13306), found in 2,217 T. musculus proteins. BspA-like proteins are known to mediate interactions with the host extracellular matrix and were previously reported in large numbers in T. vaginalis and Entamoeba spp.25,26. We also identified an expansion of the multi-antimicrobial extrusion protein domain (MatE; PF01554) in trichomonads (87 genes in T. musculus, where it is the 36th most abundant domain; 48 and 44 genes in T. vaginalis and T. foetus, respectively; 14 genes in H. meleagridis), whereas the domain was detected in three or fewer genes in each of the remaining protists and yeast.

Data Records

The T. musculus genome assembly (isolate EAF2021) and raw reads have been deposited in the NCBI database under the BioProject accession number PRJNA841657, SRA accession numbers SRR25145092, SRR25067786, SRR2506778727, GenBank accession JAPFFF000000000.128 and GenBank assembly GCA_039105265.129.

Technical Validation

Benchmarking Universal Single-Copy Orthologs (BUSCO) is a comprehensive tool to evaluate the quality of genome assemblies and transcriptomes for eukaryotes and prokaryotes30,31. To estimate the completeness of the T. musculus assembly, we conducted BUSCO (version 5.4.2) analyses using the euk_genome_met mode with metaeuk as the gene predictor (eukaryota_odb10). BUSCO reported 53% total complete BUSCO genes, representing 132 out of 225 BUSCOs, and 104 complete and single-copy BUSCOs (41.6%) (Table 4). BUSCO scores lower than 50% have been observed in published protist genomes32,33,34, where they are accepted as satisfactory and as indicating good completeness. The low number detected is likely due to the lack of complete or partial BUSCO datasets for protists, especially for species related to T. musculus.