Background & Summary

Over the past decade, algal blooms have become increasingly prevalent worldwide owing to intensified anthropogenic activity1. They have emerged as one of the most severe environmental issues affecting inland waters2. Accumulations of harmful algae, including cyanobacteria, profoundly impact water quality and disrupt aquatic ecosystems by increasing turbidity, depleting oxygen, and competing with other organisms3,4. Additionally, it is well documented that certain phytoplankton species can generate toxic secondary metabolites that are harmful to aquatic animals5. Studies have shown that silver carp (Hypophthalmichthys molitrix) and bighead carp (Hypophthalmichthys nobilis) ingest large amounts of toxic algae during periods of rapid growth6,7, suggesting that these two species have developed specialized mechanisms to counteract the adverse effects of algal toxins8. Thus, these two filter-feeding fish are used to control phytoplankton and improve the ecological quality of water bodies9,10.

Mud carp (Cirrhina molitorella) is a freshwater cyprinid species distributed in southern China, Vietnam, the Philippines, and Thailand11. This species primarily inhabits midwater to bottom depths in large and medium-sized rivers, often venturing into flooded forests during the rainy season. It predominantly consumes algae, benthic organisms, and organic detritus by scraping the sediment surface12. C. molitorella is one of the four major carp species cultivated in southern China, contributing approximately one-third of the total commercial landings in the Pearl River13. Recently, high-quality genome assemblies of H. molitrix and H. nobilis have been constructed to study the genetic basis of the filter-feeding habits of these two closely related carp species14,15. A genome sequence of C. molitorella facilitates the investigation of filter-feeding habits through comparative genomic analysis of filter-feeding cyprinid species from different genera.

Here, we constructed a chromosome-level genome assembly of C. molitorella by integrating PacBio, Illumina, and Hi-C sequencing strategies. The assembled sequences were anchored to 25 pseudo-chromosomes with a scaffold N50 of 39.38 Mb. BUSCO (v4.0.5) evaluation showed that the final assembly achieved 97.4% completeness. The high-quality genome assembly of C. molitorella serves as a valuable genomic resource for exploring the digestive mechanisms of filter-feeding fish and provides genetic resources for developing molecular breeding programs for this important aquaculture species.

Methods

Sample preparation and genome sequencing

All animal experiments were approved by the Institutional Animal Care and Use Committee of Sun Yat-Sen University. All efforts were made to minimize animal suffering. No wild C. molitorella individuals and no endangered or protected species were used in this study. One female C. molitorella individual, collected from a farm in Guangzhou, Guangdong Province, China, was used for genome sequencing. High-quality DNA was extracted from liver cells of C. molitorella using the CTAB method, followed by purification with a QIAGEN Genomic kit (QIAGEN, Germany). The sequencing libraries were constructed and purified using AMPure PB beads (Pacific Biosciences, USA). Sequencing was performed on a PacBio Sequel II instrument (Pacific Biosciences, USA). For Illumina sequencing, short-insert paired-end (PE) (150 bp) DNA libraries of C. molitorella were constructed in accordance with the manufacturer’s instructions. Sequencing of the PE libraries was performed on the Illumina NovaSeq 6000 platform (Illumina, USA). A total of 200.26 Gb of PacBio reads and 100.79 Gb of Illumina reads were generated (Supplementary Tables 1 and 2). Genomic DNA for the Hi-C library was extracted from liver tissue, and the Hi-C library was constructed based on a previously published procedure and sequenced (2 × 150 bp) on the Illumina NovaSeq 6000 platform (Illumina, USA)16. A total of 142.43 Gb of Hi-C reads were generated (Supplementary Table 3).

Eye, brain, gill, heart, stomach, intestine, kidney, liver, and spleen samples were collected from the C. molitorella specimen to construct sequencing libraries for RNA-sequencing (RNA-seq). Total RNA was extracted with TRIzol reagent (Invitrogen, USA). RNA-seq libraries were constructed using a VAHTS™ mRNA-seq V2 Library Prep Kit for Illumina (Vazyme, China) and sequenced (2 × 150 bp) on the Illumina NovaSeq 6000 platform (Illumina, USA).

Genome size estimation and genome assembly

The genome size of C. molitorella was estimated from high-quality Illumina reads based on the k-mer frequency distribution, using the Kmer_freq_hash module in GCE (v1.0.0) (https://github.com/fanagislab/GCE) with the k-mer size set to 17. Based on the k-mer distribution of the Illumina reads, the genome size of C. molitorella was estimated to be 1.03 Gb (Supplementary Figure 1).
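The principle behind k-mer-based size estimation can be illustrated with a minimal sketch: the genome size is approximately the total number of k-mer occurrences divided by the depth at the main peak of the k-mer histogram. This is a simplified illustration and not the model implemented in GCE; the function name, the `min_depth` error cutoff, and the toy histogram are assumptions made for the example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    min_depth is an assumed cutoff that discards low-depth k-mers, which
    mostly arise from sequencing errors.
    """
    trusted = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not trusted:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak approximates the average k-mer coverage depth.
    peak_depth = max(trusted, key=trusted.get)
    # Total k-mer occurrences = sum of (depth * distinct k-mers at that depth).
    total_kmers = sum(depth * count for depth, count in trusted.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical values, not data from this study).
toy_histogram = {1: 9000, 2: 1500, 18: 200, 19: 400, 20: 600, 21: 400, 22: 200, 40: 50}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth: {peak}, estimated size: {size:.0f} bp")
```

In this toy case the k-mers at depth 40 represent two-copy repeats; counting their occurrences rather than their distinct number is what lets repeated sequence contribute to the estimate.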

Three draft genome assemblies were generated from filtered and corrected PacBio reads using WTDBG2 (v2.5)17, Flye (v2.7)18, and NextDenovo (v1.0)19. The contigs of the WTDBG2- and Flye-generated draft assemblies were error-corrected using high-quality Illumina reads with Pilon (v1.23)20. The NextDenovo-generated contigs were error-corrected using high-quality Illumina reads with Nextpolish (v1.2.4)21. The resulting contigs were merged into longer sequences using quickmerge (v0.3)22 and corrected using high-quality Illumina reads with Pilon (v1.23)20. Hi-C reads were used to correct misjoins, order and orient contigs, and merge overlaps. Low-quality Hi-C reads were filtered using Fastp (v0.21.0)23. Filtered Hi-C reads were aligned to the assembled contigs using Juicer (v1.5.7)24. Scaffolding was accomplished using the 3D-DNA pipeline (v180419)25. Juicebox (v1.9.9)26 was used to adjust the order and orientation of certain scaffolds in the Hi-C contact map and to help determine chromosome boundaries. Gaps in the assembled scaffolds were closed using filtered PacBio and Illumina reads with TGS-GapCloser (v1.0.1)27. The final genome assembly of C. molitorella was composed of 229 scaffolds (contig N50: 24.13 Mb, scaffold N50: 39.38 Mb) assembled into 25 pseudochromosomes, resulting in a total assembly size of 1.05 Gb (Fig. 1; Table 1; Supplementary Figure 2; Supplementary Table 4). The resulting pseudochromosomes were aligned to the zebrafish genome assembly using NGenomeSyn (v1.0.1)28 (Supplementary Figure 3) and were subsequently named according to the alignment results.
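For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows how it is conventionally computed: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions, not values from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that brings the running sum to half the
        # total assembly size defines N50.
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Toy scaffold lengths in Mb (hypothetical).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

The same function applies to contig N50 and scaffold N50; only the input lengths differ.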

Fig. 1

Genome assembly of C. molitorella. Concentric circles show structural, functional, and evolutionary aspects of the C. molitorella genome: (a) chromosome number, (b) gene density, (c) GC content, and (d) collinear regions detected within the genome.

Table 1 Genome assembly statistics of C. molitorella.

The completeness of the assembled genome was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO)29. BUSCO (v4.0.5) analysis indicated that 97.4% of conserved single-copy ray-finned fish (Actinopterygii) genes (odb10) were captured in the C. molitorella genome (Supplementary Table 5). Additionally, the consensus quality value (QV) and k-mer completeness of the C. molitorella assembly, evaluated with Merqury, were 30.35 and 92.16%, respectively (Supplementary Table 6)30. Finally, RNA-seq reads from different tissues were aligned to the assembly. The average mapping rate of RNA-seq reads from ten tissues to the C. molitorella genome assembly was 92.27% (Supplementary Table 7). These results suggest that the C. molitorella assembly is of high quality and completeness.
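To make the consensus QV easier to interpret, the sketch below converts it to a per-base error probability, assuming the standard Phred scaling QV = -10 log10(E). The function names are illustrative and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability back to a Phred-scaled QV."""
    return -10 * math.log10(error_rate)


reported_qv = 30.35  # consensus QV reported for the assembly
error_rate = qv_to_error_rate(reported_qv)
print(f"QV {reported_qv} -> about one error per {1 / error_rate:,.0f} bases")
```

Under this scaling, a QV slightly above 30 corresponds to roughly one consensus error per thousand bases.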

Repeat annotation

Repetitive elements in the C. molitorella assembly were identified by combining de novo prediction with homology searches. RepeatModeler (v2.0.1)31 was used to build a de novo repeat library, and sequences from the assembly were aligned to this library using RepeatMasker (v4.1.0) (https://www.repeatmasker.org/). Additionally, repetitive elements in the assembly were identified by homology searches against known repeat databases using RepeatMasker (v4.1.0). Repetitive DNA represented 529.51 Mb (50.46%) of the C. molitorella genome assembly (Supplementary Table 8). DNA transposons were the largest class of annotated transposable elements (TEs), representing 344.03 Mb (32.79%) of the genome. Retrotransposons accounted for 7.22% of the genome assembly, among which long terminal repeats (LTRs, 4.13%) and long interspersed nuclear elements (LINEs, 2.85%) were the two major classes. Additionally, a large proportion of unclassified interspersed repeats (7.83%) was identified in the genome.
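As a quick consistency check on the figures above, each pair of repeat span and percentage implies a total assembly size, which should agree with the reported 1.05 Gb. The short sketch below performs that arithmetic; the function name is an illustrative assumption.

```python
def implied_total_mb(span_mb, percent):
    """Total size (Mb) implied by a span that covers `percent` of the assembly."""
    return span_mb / (percent / 100)


# Values quoted in the text for all repeats and for DNA transposons.
all_repeats_total = implied_total_mb(529.51, 50.46)
dna_transposon_total = implied_total_mb(344.03, 32.79)

for label, total in [("all repeats", all_repeats_total),
                     ("DNA transposons", dna_transposon_total)]:
    print(f"{label}: implied assembly size {total / 1000:.2f} Gb")
```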

Gene prediction and functional annotation

Protein-coding genes in the C. molitorella genome were predicted with three approaches: homology-based prediction, ab initio prediction, and RNA-seq-based prediction. For homology-based prediction, protein-coding sequences of Danio rerio, Gasterosteus aculeatus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Hypophthalmichthys molitrix, Hypophthalmichthys nobilis, Ctenopharyngodon idella, and Onychostoma macrolepis were downloaded from NCBI and aligned to the C. molitorella assembly using tblastn. GenomeThreader (v1.7.0)32 was employed to predict gene models based on the alignment results with an E-value cutoff of 10⁻⁵. For ab initio gene prediction, gene models were predicted based on the alignments of short-read RNA-seq reads using BRAKER2 (v2.1.5)33. For RNA-seq-based prediction, the short-read RNA-seq reads were first aligned to the C. molitorella reference sequences using HISAT2 (v2.1.0)34. Gene models were predicted from the HISAT2 alignments using StringTie (v2.1.4)35, and coding regions were identified using TransDecoder (v5.5.0)36. Second, short-read RNA-seq reads of C. molitorella were assembled using Trinity (v2.8.5)37. Finally, the Program to Assemble Spliced Alignments (PASA) (v2.5.0)38 was used to predict gene models based on the Trinity assembly, with the StringTie-predicted gene models as a reference. Gene models of C. molitorella predicted by BRAKER2, GenomeThreader, and PASA were integrated into a nonredundant consensus gene set using EVidenceModeler (v1.1.1)38. Genes that were supported by transcriptional evidence or had functional annotation were retained. In total, 36,478 protein-coding genes were identified in the C. molitorella genome (Supplementary Table 9). In the predicted gene models of C. molitorella, BUSCO (v4.0.5) analysis identified 3,284 (90.2%) complete conserved single-copy ray-finned fish (Actinopterygii) genes (odb10) (Supplementary Table 10).

To assign functions to the predicted proteins, we aligned the C. molitorella protein models against the NCBI nonredundant (NR) amino acid sequence database and the Swiss-Prot database using BLASTP with an E-value cutoff of 10⁻⁵. Protein models were also aligned against the eggNOG database using eggNOG-Mapper39,40. Additionally, Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation of the protein models was performed using BlastKOALA41. In total, 36,124 (99.03%) gene models in the C. molitorella genome were annotated in at least one database (NCBI NR, KEGG, GO, and Swiss-Prot) (Supplementary Table 11). Non-coding RNAs (ncRNAs) in the C. molitorella genome assembly were identified by homology searches against the Rfam database using Infernal (v1.1.4)42 (Supplementary Table 12). The tRNA and UnaL2 LINE 3’ element were the most abundant ncRNAs.

Data Records

Raw sequencing reads used for the genome assembly are accessible in NCBI under BioProject number PRJNA978961 (SRR25058277 and SRR25031768)43,44. The final assembled C. molitorella genome has been deposited under accession number GCA_040955965.1 in NCBI GenBank45. The genome assembly, related annotation files, and source files can be accessed through Figshare46 at https://doi.org/10.6084/m9.figshare.24355237.

Technical Validation

First, BUSCO (v4.0.5)29 evaluation identified 3,544 (97.4%) complete conserved Actinopterygii genes (odb10) in the C. molitorella assembly, suggesting high completeness of the assembly. Second, RNA-seq reads from ten tissues (brain, eye, gill, heart, stomach, intestine, kidney, liver, ovary, and spleen) were aligned to the assembly using HISAT2 (v2.1.0)34. The average mapping rate of RNA-seq reads from these tissues to the C. molitorella genome assembly was 92.27%. Third, Merqury (v1.3)30 was used to assess the completeness and quality of the C. molitorella assembly. The consensus quality value (QV) and k-mer completeness of the assembly evaluated by Merqury were 30.35 and 92.16%, respectively. Lastly, the quality of the genome annotation was evaluated using BUSCO (v4.0.5). This assessment revealed that the final genome annotation encompassed 90.2% of the actinopterygii_odb10 genes, demonstrating a high level of completeness in the gene predictions.
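The two BUSCO percentages quoted above can be reproduced from the reported counts of complete genes. The sketch below assumes that the actinopterygii_odb10 lineage set contains 3,640 BUSCO groups; this total is not stated in the text and is supplied here as an assumption, as is the function name.

```python
# Assumed number of BUSCO groups in actinopterygii_odb10 (not stated in the text).
TOTAL_BUSCOS = 3640


def completeness(complete_count, total=TOTAL_BUSCOS):
    """Percentage of complete BUSCOs, rounded to one decimal place."""
    return round(100 * complete_count / total, 1)


assembly_pct = completeness(3544)    # complete BUSCOs found in the assembly
annotation_pct = completeness(3284)  # complete BUSCOs found in the gene models
print(assembly_pct, annotation_pct)
```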

To evaluate the reliability of the genome assembly and annotation of C. molitorella, a phylogenetic tree was constructed for C. molitorella and seven other fishes of Cyprinidae. Protein sequences of the eight species (C. molitorella, Cyprinus carpio, Carassius auratus, Ctenopharyngodon idella, D. rerio, H. molitrix, H. nobilis, and Puntius tetrazona) were downloaded for phylogenetic analysis. OrthoFinder (v2.5.5)47 was applied to determine orthologous relationships among proteins from subgenome A and subgenome B of C. carpio and C. auratus as well as proteins of P. tetrazona, D. rerio, C. idella, H. molitrix, H. nobilis, and C. molitorella. Gene clusters with >100 gene copies in one or more species were removed. Single-copy orthologs in each gene cluster were aligned using MAFFT (v7.487)48. Alignments were trimmed using the Gblocks module of PhyloSuite (v1.2.2)49. The phylogenetic tree was constructed from the trimmed alignments using the maximum-likelihood method implemented in IQ-TREE2 (v2.1.2)50 with D. rerio as the outgroup. The best-fit substitution model was selected using the ModelFinder algorithm51. Branch supports were assessed using the ultrafast bootstrap (UFBoot) approach with 1,000 replicates52. The resulting tree placed C. molitorella as sister to subgenome B of both C. carpio and C. auratus as well as to P. tetrazona (Fig. 2), supporting the view that C. molitorella is more closely related to subgenome B of both C. carpio and C. auratus than to subgenome A of these allotetraploid cyprinids53,54.
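The two filtering steps described above (removing gene clusters with more than 100 copies in any species, then keeping single-copy orthologs) can be sketched as follows. This is an illustration under assumed data structures, not the authors' scripts: the nested-dictionary layout, function names, species labels, and gene identifiers are all hypothetical, and OrthoFinder's actual output files are not parsed here.

```python
def drop_large_clusters(orthogroups, max_copies=100):
    """Remove clusters in which any species has more than max_copies genes."""
    return {
        name: members
        for name, members in orthogroups.items()
        if all(len(genes) <= max_copies for genes in members.values())
    }


def single_copy_clusters(orthogroups, species):
    """Keep clusters with exactly one gene in every listed species."""
    return {
        name: members
        for name, members in orthogroups.items()
        if all(len(members.get(sp, [])) == 1 for sp in species)
    }


# Toy input: orthogroup -> species -> gene IDs (all hypothetical).
species = ["mud_carp", "zebrafish", "silver_carp"]
toy = {
    "OG1": {"mud_carp": ["m1"], "zebrafish": ["z1"], "silver_carp": ["s1"]},
    "OG2": {"mud_carp": ["m2", "m3"], "zebrafish": ["z2"], "silver_carp": ["s2"]},
    "OG3": {"mud_carp": [f"m{i}" for i in range(100, 201)],  # 101 copies
            "zebrafish": ["z3"], "silver_carp": ["s3"]},
    "OG4": {"mud_carp": ["m4"], "zebrafish": ["z4"]},  # missing in silver_carp
}

kept = drop_large_clusters(toy)
markers = single_copy_clusters(kept, species)
print(sorted(kept), sorted(markers))
```

In an analysis like the one described, the two subgenomes of each allotetraploid species would be listed as separate entries in `species`, so that each subgenome contributes its own copy to every marker.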

Fig. 2

Evolutionary relationships of C. molitorella. The maximum-likelihood phylogenetic tree is based on 365 single-copy orthologs. C. molitorella is labeled in red. Bootstrap values are shown in red next to each node.