Background & Summary

The genus Carassius plays a crucial role in aquaculture. Its members are among the rare vertebrates that exhibit various ploidy levels and reproductive modes, including the sexual amphidiploid (allotetraploid, AABB) crucian carp Carassius auratus (C. auratus) and the unisexual amphitriploid (AAABBB) gibel carp Carassius gibelio (C. gibelio)1,2,3,4. Qihe gibel carp (C. gibelio var. Qihe) is a geographic population of C. gibelio that is cultured in the middle and lower reaches of the Qihe River in Henan Province, China; it is also known as the “double-backed crucian carp” because of its wide, thick back5,6. Qihe gibel carp is a unisexual amphitriploid (AAABBB) freshwater fish capable of gynogenesis7. Because pairing and equally segregating three sets of chromosomes during meiosis and gametogenesis face numerous obstacles, amphitriploid organisms are very rare in nature, especially among vertebrates8. Qihe gibel carp overcomes these reproductive barriers through gynogenesis, which makes it valuable for studies of species evolution and fish genetic breeding. It also has high nutritional and economic value and is a protected germplasm resource in China9. However, the lack of a high-quality reference genome for Qihe gibel carp has hindered in-depth studies of gene function and the breeding of new varieties derived from this germplasm resource.

In recent years, genome sequencing technology has developed rapidly. Long-read sequencing technologies (PacBio HiFi and Oxford Nanopore Technologies, ONT) have been widely applied to the chromosome-level genome assembly of plants and animals, including fish species. Compared with second-generation sequencing technologies, long-read sequencing makes chromosome-level assembly of fish genomes feasible and contributes to the continued development of structural and functional genomics research10. As a result, studies of aquaculture species have progressed from gene-level to whole-genome-level research.

Using genome sequencing technology, researchers have analyzed the genetic evolution of different fish species and revealed evolutionary mechanisms of polyploid fish. For example, Li et al. (2021) assembled and annotated the genomes of the allotetraploid common carp and goldfish and sequenced three common carp strains11. Their results identified candidate genes related to growth and survival rate, revealed the parallel subgenome structure and divergent expression evolution of common carp and goldfish, and elucidated the geographic genome structure and domestication of common carp. Wang et al. (2022) assembled the genomes of one amphitriploid (AAABBB) gibel carp (C. gibelio) and one amphidiploid crucian carp/goldfish (C. auratus)12. By combining comparative genomic analysis with cytological observations, they reported evidence of genomic variation that promotes gynogenesis in C. gibelio and provided new insights into the evolutionary mechanism of successful reproduction in unisexual polyploid vertebrates. In another study, Xu et al. (2023) assembled high-quality genomes of 21 cyprinids and revealed the origin and subsequent subgenome evolution patterns following three independent allopolyploidy events13.

Qihe gibel carp has important breeding value as well as research value for understanding genome evolution. However, its genome has not been reported to date. Although Wang et al. (2022) assembled the genome of gibel carp12, Qihe gibel carp is an important and genetically isolated geographic population, and some scientists believe that the doubling event of crucian carp occurred independently; it is therefore necessary to sequence and assemble de novo a chromosome-level genome of Qihe gibel carp. To this end, we used PacBio HiFi and high-throughput chromosome conformation capture (Hi-C) technologies to sequence and assemble a high-quality chromosome-level genome of Qihe gibel carp. The resulting assembly consisted of 350 contigs with a total length of 1.607 Gb and a contig N50 of 28.97 Mb; 96.21% (1.515 Gb) of the assembled genome was anchored to 50 chromosomes, with a scaffold N50 of 29.84 Mb. Benchmarking Universal Single-Copy Orthologue (BUSCO) analysis demonstrated that the genome assembly achieved high completeness, with a score of 97.66%. Genomic annotation was performed by combining de novo, homology-based, and transcriptome-based gene prediction. Repeated sequences accounting for 43.72% (732.494 Mb) of the total were also identified, and gene prediction revealed 46,131 protein-coding genes with an annotation ratio of 96.48%. This study lays the foundation for future construction of genetic maps, evaluation of germplasm resources, molecular breeding, and genetic classification studies of Qihe gibel carp.

Methods

Sample collection and sequencing

The Qihe gibel carp used in this experiment were collected from the Qihe gibel carp breeding base of Anyang Institute of Technology (Anyang, Henan, China). All experiments complied with institutional animal care guidelines and were approved by the Animal Care Committee of Anyang Institute of Technology. DNA was extracted from the muscle tissue of female Qihe gibel carp using the FastPure DNA Isolation Mini Kit (Vazyme, Nanjing, China) following the manufacturer’s instructions. The purity and concentration of the extracted DNA were assessed using a NanoDrop 2000 system (Thermo Scientific, Waltham, MA, USA) and gel electrophoresis.

The short-read library was constructed and sequenced on a DNBSEQ-T7 platform at BGI Co., Ltd. A total of 196.64 Gb of clean data was obtained. We used the SMRT Bell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA) to construct a PacBio SMRT Bell library, which was sequenced on the PacBio platform at BGI Co., Ltd. The raw data were converted into high-quality HiFi reads using the CCS workflow13 version 4.0.0. In total, 228.01 Gb of HiFi reads were generated.

The Hi-C library was constructed from the same tissue and sequenced on the DNBSEQ-T7 platform. A total of 406.46 Gb of clean data was generated. For ISO-seq, RNA was extracted from the muscle tissue of female Qihe gibel carp using a modified CTAB extraction method. The PacBio ISO-seq library was constructed and sequenced on the PacBio Sequel II platform at BGI Co., Ltd. In total, 49.78 Gb of clean data were obtained (Table 1).

Table 1 Statistics of sequencing data of the Qihe gibel carp genome.

Genome survey and assembly

To survey the genome, we assessed several key characteristics, including repeat content, genome size, and heterozygosity. Using the MGI-DNBSEQ short reads (https://en.mgi-tech.com/products/), k-mer (k = 21) frequencies were analyzed with GCE version 1.0.014. The resulting k-mer frequency distribution was used to estimate genome heterozygosity, repeat content, and genome size with GenomeScope 2.0. The estimated haploid genome size of Qihe gibel carp was 1.52 Gb, with a repeat content of 40.08% and a heterozygosity of 2.20% (Fig. 1). Next, the circular consensus sequencing (HiFi) reads were assembled using Hifiasm version 0.19.6 (https://github.com/chhylp123/hifiasm) with default parameters, yielding 350 contigs with a total length of 1.607 Gb and a contig N50 of 28.97 Mb. The Hi-C clean reads were mapped to the contig assembly using BWA15 (version 0.7.12-r1039) with default parameters to obtain uniquely aligned read pairs, from which invalid read pairs were identified and removed. A total of 2,778,181,521 base pair sequences were used for scaffold correction, and contigs were clustered, ordered, and oriented with Juicer16 (version 1.6) with default parameters. The assembled contigs were then anchored to chromosomes using 3D-DNA17 (version 180922) with the -r 0 parameter (https://github.com/theaidenlab/3d-dna). The interaction map was visualized in JuiceBox18 (version 1.11.08) with default parameters for manual correction. Ultimately, 316 scaffolds totaling 1.515 Gb, with a scaffold N50 of 29.84 Mb, were anchored to 50 chromosomes, representing 96.21% of the total contig length.
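The contig and scaffold N50 values reported above follow the usual definition: sort the sequences from longest to shortest and report the length of the sequence at which the running total first reaches half of the assembly length. The following Python sketch illustrates that calculation only; the function name `n50` and the toy lengths are hypothetical and are not part of the pipeline used in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy contig lengths (hypothetical, not taken from the assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

In practice the lengths would be read from the assembly FASTA index; the same function applies to contigs and to scaffolds.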

Fig. 1 GenomeScope profiles of k-mer (k = 21) analysis in Qihe gibel carp.
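GCE and GenomeScope 2.0 fit statistical models to the k-mer frequency distribution. The Python sketch below shows only the basic idea behind such a survey: count k-mers, build the depth spectrum, and divide the total number of k-mers by the depth of the main peak. It is a simplification under stated assumptions: all function names are hypothetical, reverse-complement (canonical) k-mers and sequencing-error modeling are ignored, and for a highly heterozygous polyploid genome the choice of peak is not as straightforward as taking the maximum.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in an iterable of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map depth -> number of distinct k-mers observed at that depth."""
    return Counter(counts.values())


def estimate_genome_size(spectrum, min_depth=2):
    """Naive estimate: total k-mers / main-peak depth.

    K-mers below min_depth are treated as likely sequencing errors
    and excluded (an assumption of this sketch).
    """
    usable = {d: n for d, n in spectrum.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy example with k = 4 (the survey in the text used k = 21).
counts = count_kmers(["ACGTACGT"], 4)
print(dict(kmer_spectrum(counts)))
```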

Repeat sequence prediction

Repeat sequences play a key role in the gene regulatory network. They comprise tandem repeats and interspersed repeats, the latter consisting mainly of transposable elements (TEs), and they play an indispensable role in the evolution, inheritance, and variation of life and in gene expression and transcriptional regulation. At present, repeat sequences are annotated mainly by de novo prediction and by homology-based comparison, the latter relying on the RepBase library (http://www.girinst.org/repbase)19. RepeatMasker (version open-4.0.9) with default parameters20 and RepeatProteinMask (version open-4.0.9) with the -nolow -no_is -norna parameters20 were used to identify sequences similar to known repeat sequences. De novo prediction was carried out mainly with RepeatModeler (version open-1.0.11) with default parameters21 and LTR_Finder (version 1.0.7) with default parameters22, which first constructed a de novo repeat sequence library that RepeatMasker then used to predict repeat sequences. Tandem Repeats Finder (version 4.09) with default parameters23 was also used to find tandem repeat sequences. In total, 731.56 Mb of repeat sequences were annotated, accounting for 46.46% of the genome assembly, which is similar to the 728.98 Mb (45.85%) of repeat sequences in C. gibelio12. Additionally, 691.57 Mb of TEs were identified, accounting for 43.92% of the genome assembly (Tables 2, 3, Fig. 2).
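Because several tools annotate the same genome, their predictions overlap, and a total repeat length is normally a non-redundant figure in which each base is counted once. The Python sketch below illustrates that merging step for intervals on a single sequence. It assumes half-open, 0-based `(start, end)` coordinates; the function name and the example intervals are hypothetical, and this is not necessarily how the totals in this study were computed.

```python
def merged_length(intervals):
    """Total bases covered by half-open [start, end) intervals,
    counting overlapping or adjacent regions only once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the current block: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


# Hypothetical hits from two tools on one 1,000-bp sequence.
hits = [(0, 100), (50, 150), (200, 250)]
repeat_bp = merged_length(hits)
print(repeat_bp, f"{100 * repeat_bp / 1000:.2f}%")
```

For a whole genome, the same function would be applied per chromosome and the results summed before dividing by the assembly length.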

Table 2 Statistics of repeated sequence annotation results.
Table 3 Statistics of transposable elements classification results.
Fig. 2 Circos plot of the Qihe gibel carp genome. (a) GC content, (b) gene density, (c) repeat density, (d) long terminal repeat (LTR) density, (e) long interspersed nuclear element (LINE) density, (f) DNA-TE density.

Protein-coding gene prediction and functional annotation

Gene prediction of the Qihe gibel carp genome was performed using a combination of de novo-based, transcriptome-based, and homolog-based gene prediction methods. The de novo-based prediction was performed using AUGUSTUS24 (version 3.3.2) and Genscan25 software. Protein sequences from Carassius auratus (PRJNA546444), Carassius gibelio (PRJNA546443), Danio rerio (GCF_000002035.6), and Poropuntius huangchuchieni26 were compared with the Qihe gibel carp genome to establish homology-based prediction using Exonerate27 version 2.2.0 and Liftoff28 (version 1.6.3) with default parameters.

For transcriptome-based prediction, the ISO-seq reads were aligned to the genome sequence using HISAT2 version 2.1.0 with default parameters29. Transcripts were predicted using StringTie (version 1.3.5) with default parameters30, and the intron sequences of all samples were extracted for subsequent auxiliary annotation with TransDecoder (version 5.5.0) (https://github.com/TransDecoder/TransDecoder). The integrated gene set was then constructed using MAKER2 (version 2.31.10) with default parameters31 and HiFAP with default parameters. During this integration, alternative splicing isoforms were removed and transcriptome-specific genes were added. Finally, the annotation was refined by filtering out gene models with only a single coding sequence and the corresponding genes without functional annotations.

In total, 46,131 protein-coding genes were identified from the Qihe gibel carp genome (Table 4). Subsequently, the protein-coding genes were functionally annotated against the Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), Non-redundant (NR), EuKaryotic Orthologous Groups (KOG), InterPro, Swiss-Prot, and TrEMBL databases using Diamond (version 2.0.14) with the --evalue 1e-05 parameter32. A total of 44,509 protein-coding genes were annotated, accounting for 96.48% of the 46,131 identified protein-coding genes (Table 4).
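The annotation ratio above is the share of predicted genes with a hit in at least one database, that is, the size of the union of the per-database hit sets divided by the total gene count. The Python sketch below illustrates this with hypothetical gene identifiers and a hypothetical helper, `annotation_ratio`; the last line simply re-derives the reported percentage from the two gene counts given in the text.

```python
def annotation_ratio(all_genes, hits_by_db):
    """Return (annotated gene count, percentage of all genes).

    all_genes  : iterable of predicted gene IDs
    hits_by_db : dict mapping database name -> set of gene IDs with a hit
    """
    genes = set(all_genes)
    # A gene is annotated if it has a hit in any database.
    annotated = set().union(*hits_by_db.values()) & genes
    return len(annotated), 100 * len(annotated) / len(genes)


# Hypothetical toy input; "gX" is not a predicted gene and is ignored.
genes = ["g1", "g2", "g3", "g4"]
hits = {"KEGG": {"g1"}, "GO": {"g1", "g2"}, "NR": {"g3", "gX"}}
print(annotation_ratio(genes, hits))

# Consistency check on the counts reported in the text.
print(round(100 * 44509 / 46131, 2))
```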

Table 4 Statistics of functional annotation result.

A total of 49,299 non-coding RNAs were identified in the Qihe gibel carp genome (Table 5). Non-coding RNAs mainly include transfer RNA, ribosomal RNA, microRNA, and small nuclear RNA. Transfer RNAs were identified based on their structural characteristics using tRNAscan-SE (version 1.3.1) with default parameters33. Ribosomal RNAs were identified using BLASTN. MicroRNAs and small nuclear RNAs were annotated against Rfam (version 14.8) using cmscan with the --rfam --nohmmonly parameters34.

Table 5 Statistics of ncRNA annotation result.

Genomic synteny analysis

Genomic synteny analysis between Qihe gibel carp and C. auratus or C. gibelio was performed using WGDI (version 0.5.6) with the parameters score = 100, evalue = 1.0E-5, and ks_col = ks_NG8635. First, all-versus-all alignment of the protein sequences from both species was performed using BLASTP version 2.11.0+ (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Second, gene position and chromosome length information were extracted. Finally, a graphical presentation of the synteny analysis was produced using JCVI36 (version 1.1.22). The results showed no fission event between Qihe gibel carp and C. auratus or C. gibelio, which implies high concordance between the genome of Qihe gibel carp and those of the other two fishes and indirectly supports the credibility of the Qihe gibel carp genome assembly (Fig. 3).

Fig. 3 Chromosome sequence synteny comparisons between Qihe gibel carp and C. auratus or C. gibelio. Each line connects a pair of homologous sequences between the fish.

Data Records

The ISO-seq data (PacBio Sequel II), genomic survey data (DNBSEQ-T7), genomic PacBio HiFi data, and Hi-C data (DNBSEQ-T7) from Qihe gibel carp were deposited in the NCBI SRA database (accession number: SRP533627)37. The assembled chromosome-level genome of Qihe gibel carp was submitted to NCBI (accession number: GCA_043790415.1)38. The annotation files for the Qihe gibel carp genome were deposited at figshare39.

Technical Validation

Benchmarking Universal Single-Copy Orthologue (BUSCO) is a tool that searches for single-copy orthologs to evaluate the completeness and redundancy of a newly assembled genome. In this study, the Qihe gibel carp genome assembly was evaluated using BUSCO40 (version 5.3.1) with the -m genome --limit 10 parameters. The complete BUSCO score was 97.4% for the contig-level assembly and 97.7% for the chromosome-level assembly, similar to the 98.16% reported for C. gibelio12; these values indicate that a high-quality Qihe gibel carp genome was obtained in this study. In addition, 97.3% of all protein-coding genes and 97.1% of functionally annotated protein-coding genes were identified as complete (Table 6).
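BUSCO classifies each searched ortholog as complete single-copy, complete duplicated, fragmented, or missing, and the completeness score is the complete share (single-copy plus duplicated) of all orthologs searched. The Python sketch below shows that arithmetic only. The function name and the counts are hypothetical and chosen for easy checking; they are not the counts behind Table 6.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages.

    'complete' is the sum of single-copy and duplicated orthologs.
    """
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one ortholog is required")

    def pct(count):
        return 100 * count / total

    return {
        "complete": pct(single + duplicated),
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "n": total,
    }


# Hypothetical counts for illustration.
print(busco_percentages(single=1500, duplicated=400, fragmented=60, missing=40))
```

In a polyploid genome a large duplicated share is expected, so the complete score rather than the single-copy score is the figure usually compared between assemblies.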

Table 6 Completeness of the assembled genomes and sets of protein-coding genes evaluated by BUSCO analysis.

The chromosome-level assembly was divided into equal-length bins of 100 kb. The number of Hi-C read pairs linking any two bins was counted as the signal intensity of the interaction between those bins, and these counts were used to create a heat map. Within each chromosome group, the interaction intensity along the diagonal was higher than that away from the diagonal, indicating strong interactions between adjacent sequences and weak interactions between non-adjacent sequences. These results are consistent with the principle of Hi-C-assisted genome assembly and indicate that the genome assembly is reliable (Fig. 4).
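The binning described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used to draw Fig. 4: the function name `contact_matrix` is hypothetical, positions are taken to be 0-based coordinates of read-pair ends on a single chromosome, and no normalization is applied.

```python
def contact_matrix(pairs, chrom_length, bin_size=100_000):
    """Build a symmetric matrix of Hi-C contact counts.

    pairs        : iterable of (pos1, pos2) for read pairs on one chromosome
    chrom_length : chromosome length in bp
    bin_size     : bin width in bp (100 kb, as in the text)
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Hypothetical read pairs on a 300-kb toy chromosome (three bins).
toy_pairs = [(10, 50_000), (120_000, 130_000), (90_000, 110_000), (5_000, 250_000)]
for row in contact_matrix(toy_pairs, chrom_length=300_000):
    print(row)
```

With real data, most read pairs fall in the same or neighboring bins, which produces the strong diagonal described above.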

Fig. 4 Chromosomal Hi-C interaction heatmap of the Qihe gibel carp genome assembly.