Background & Summary

The rock carp, Procypris rabaudi (Tchang, 1930) (Fig. 1A), belongs to the family Cyprinidae and is an economically important species endemic to the middle and upper reaches of the Yangtze River, including the Jinsha, Jialing, Minjiang, Chishui and Tuojiang Rivers1. It is a benthic fish that lives mainly in deep waters with rocky substrate and produces adhesive eggs that attach to the gravel bottom during the spawning season2. Since the 1970s, rock carp populations have declined sharply owing to anthropogenic pressures such as pollution and overfishing3. The rock carp is currently listed as a second-class protected aquatic animal in the National List of Key Protected Wild Animals of China. In recent decades, artificial propagation and release have been used to restore rock carp populations4, and considerable effort has gone into protecting and recovering wild stocks. Molecular genetics can provide a critical reference for phylogeographic research and breeding, but it depends on sequencing effort and genome assembly.

Fig. 1

(A) Genomic landscape of the rock carp. The rings, from outermost to innermost, represent the chromosomes of the P. rabaudi genome (a), gene density (b), GC density (c), short-read density (d), long-read density (e), SNP density, with homozygous SNPs in the outer ring and heterozygous SNPs in the inner ring (f), and InDel density, with homozygous InDels in the outer ring and heterozygous InDels in the inner ring (g). The density of completely mapped BUSCO genes is also shown, with single-copy BUSCOs in blue and duplicated BUSCOs in red (a-g). The analysis was conducted using 50-kb genomic windows. (B) Chromosomal Hi-C heatmap of the P. rabaudi genome assembly.

With the development of sequencing technologies, sequencing costs have fallen dramatically5, which has promoted the use of high-quality genomes in basic biology. A study of the origin and subsequent subgenome evolution of cyprinids published twenty-one cyprinid genomes, including that of the rock carp6,7. Although a rock carp genome assembly is therefore available, there is still room to improve both the assembly and its annotation, and in-depth research on the species remains constrained. Advances in sequencing and assembly technology have now made telomere-to-telomere (T2T) gap-free genome assembly possible. In bony fish, T2T assemblies of several important species have been reported and have provided new insights: the zig-zag eel T2T genome revealed the origin and evolution of its sex chromosome8, and the Chinese sea bass T2T genome provided a reference for further analysis of genome structure and for mining disease-resistance genes for breeding9. The rock carp genome is relatively complex because the species is tetraploid, so a higher-quality genome is needed to analyze its structure. Such a genome will also support genetic breeding, disease resistance research and resource conservation of the rock carp. Herein, we aim to assemble the genome of P. rabaudi to the T2T level to provide a high-quality reference for in-depth study of this species.

In this study, we generated Pacific Biosciences (PacBio) high-fidelity (HiFi) long reads, ultra-long Oxford Nanopore (ONT) reads and Hi-C sequencing reads for the P. rabaudi T2T genome assembly. Our assembly improves on the previous chromosome-level rock carp genome and provides a significant genomic resource for evolutionary and breeding research.

Methods

Sample collection and sequencing procedures

Adult female specimens of Procypris rabaudi were sourced from a fish farm located in Chongqing, China (coordinates: 122.212 E, 29.979 N). The research protocol was approved by the Animal Care and Use Committee at the Fishery College of Southwest University. Muscle tissue was used for DNA extraction and sequencing, and blood was used for ultra-long ONT extraction and sequencing. Various tissues, including muscle, brain, blood, skin, liver, gonad and skull, were harvested and stored for subsequent RNA isolation for genome annotation. A PacBio HiFi read library, with insert sizes ranging from 10 to 40 kb, was prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) and sequenced on the PacBio Sequel II platform. For ONT ultra-long reads, an SDS-based extraction method was employed. Additionally, a Hi-C short-read library was constructed from purified DNA following the protocol established by Belton et al., and sequencing was conducted on the Illumina NovaSeq 6000 platform10. The sequencing yielded 53.20 Gb of PacBio HiFi long reads with an N50 of 19.11 kb, 20.21 Gb of ONT ultra-long reads with an N50 of 100 kb, and 142.59 Gb of clean Hi-C reads for genome assembly (Table 1).
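The read N50 values quoted above follow the usual definition: the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal Python sketch of that calculation is given here for orientation; the function name and the toy read lengths are illustrative assumptions and are not taken from this study's pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of lengths (0 for empty input)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical toy read lengths in bp, not data from this study.
toy_reads = [2, 3, 4, 5, 6]
print(n50(toy_reads))
```

The same function applies unchanged to contig or scaffold lengths, which is how assembly N50 values are normally derived.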

Table 1 Statistics of sequencing reads data.

For genome annotation, RNA was extracted from all collected tissues. Total RNA was isolated using TRIzol reagent (Invitrogen, MA, USA) and processed into an RNA-seq library with the NEBNext® Ultra™ RNA Library Prep Kit (NEB, USA). Sequencing was performed on the Illumina NovaSeq 6000 platform.

Genome assembly and gap filling

The initial genome assembly was conducted using HiFiasm (v0.16.0), integrating HiFi data, ONT ultra-long reads, and Hi-C sequencing data11. This resulted in a draft genome of 1.65 Gb with a contig N50 of 29.64 Mb (see Table 2). Chromosome-level assembly was achieved using Hi-C data. Clean Hi-C reads were aligned to the contig-level genome using Bowtie2 (v2.3.4.3), producing 227.07 million uniquely mapped paired-end reads, with 49.22% being valid pairs (Tables S1, S2)12. The 3D-DNA pipeline (v180922) and JuiceBox (v1.11.08) were utilized to calculate chromosomal interaction frequencies and correct scaffolding errors, respectively13,14. This process yielded 50 pseudo-chromosomes covering 99.83% of the genome, with a contig N50 of 28.65 Mb (Fig. 1B).

Table 2 Statistics for the P. rabaudi preliminary genome assembly.

To achieve a near-T2T genome, ONT ultra-long reads were mapped to the pseudo-chromosomes using minimap215. TGS-GapCloser was used for gap filling, and three rounds of genome correction were performed with Pilon, resulting in a 1.64 Gb genome with a contig N50 of 32.36 Mb and 43 fully assembled chromosomes (Fig. 2, Table S3)16,17.
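One simple way to check which chromosomes are free of gaps after gap filling is to count runs of N in each scaffold sequence, assuming (as is conventional, although not stated in this paper) that remaining gaps are written as runs of N. The sketch below illustrates the idea in Python; the function names and toy sequences are hypothetical.

```python
import re


def count_gaps(sequence, min_run=1):
    """Count runs of N/n that are at least `min_run` bases long."""
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    return sum(1 for _ in pattern.finditer(sequence))


def gap_free(sequences):
    """Return the names of sequences that contain no N runs."""
    return [name for name, seq in sequences.items() if count_gaps(seq) == 0]


# Hypothetical toy scaffolds, not sequences from the P. rabaudi assembly.
toy = {"chr_a": "ACGTACGT", "chr_b": "ACGTNNNNACGTNNAC"}
print(count_gaps(toy["chr_b"]))
print(gap_free(toy))
```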

Fig. 2

The contigs in the chromosomes of the P. rabaudi genome. The blue sections at the ends of the chromosomes and the black sections inside each chromosome represent the identified telomeres and centromeres, respectively.

Telomere and centromeric regions analysis

The quarTeT (v1.1.4) software was used for telomere and centromere analysis18. Telomeres were identified by scanning the genome for the TTAGGG/CCCTAA motif, with at least four repetitions required for recognition. Based on the genome annotation and the quarTeT results, regions with a dense distribution of Satellite and TandemRepeat elements were considered candidate centromere regions. The estimated telomere and centromere regions are shown in Fig. 2 and Table S3.
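The motif rule described above can be illustrated with a short regular-expression scan. The Python sketch below is not quarTeT's implementation; it only shows what "TTAGGG/CCCTAA repeated at least four times" means in practice, under the assumptions that the repeats are consecutive and that only a terminal window of each chromosome is inspected. The `window` parameter and function name are hypothetical.

```python
import re

# Four or more consecutive copies of the motif on either strand.
TELOMERE = re.compile(r"(?:TTAGGG){4,}|(?:CCCTAA){4,}")


def telomere_hits(sequence, window=50_000):
    """Report telomeric repeat arrays within `window` bp of each sequence end.

    Coordinates are 0-based, end-exclusive. If the sequence is shorter than
    `window`, the two terminal windows overlap and a hit can be listed twice.
    """
    seq = sequence.upper()
    hits = {"start": [], "end": []}
    for m in TELOMERE.finditer(seq[:window]):
        hits["start"].append((m.start(), m.end()))
    offset = max(len(seq) - window, 0)
    for m in TELOMERE.finditer(seq[offset:]):
        hits["end"].append((m.start() + offset, m.end() + offset))
    return hits


# Hypothetical toy chromosome: 5 CCCTAA copies, filler, then 4 TTAGGG copies.
toy_chr = "CCCTAA" * 5 + "ACGT" * 10 + "TTAGGG" * 4
print(telomere_hits(toy_chr, window=30))
```

A chromosome with hits in both terminal windows would be a candidate for having both telomeres assembled; three or fewer consecutive copies are ignored by construction.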

Repetitive sequence analysis

Repetitive sequences, including tandem and interspersed repeats, were identified using de novo prediction and homology-based methods. A repeat library was constructed using RepeatModeler (open-1.0.11) and LTR-FINDER_parallel (v1.0.7)19,20. Tandem repeats were detected with TRF (v4.09), while RepeatMasker (open-4.0.9) and RepeatProteinMask were used for homology-based predictions21,22. In total, 0.79 Gb of repetitive sequences, accounting for 48.34% of the genome, were identified, including 28.73% DNA elements, 6.09% LINEs, and 6.04% LTRs (Table 3). Detailed information on the repetitive sequences is provided in Table S4.
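A genome-wide repeat percentage such as the one above is a ratio of masked bases to total bases. As a rough illustration, the Python sketch below computes that ratio under the assumption that repeats are soft-masked (written in lowercase); the paper does not state which masking mode was used, and the function name and toy sequences are hypothetical.

```python
def masked_fraction(sequences):
    """Fraction of bases that are lowercase (soft-masked) across all sequences."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    masked = sum(1 for s in sequences for base in s if base.islower())
    return masked / total


# Hypothetical toy sequences: 6 of 14 bases are soft-masked.
toy = ["ACGTacgt", "AAAAcc"]
print(f"{masked_fraction(toy):.2%}")
```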

Table 3 Transposable elements statistics for the P. rabaudi genome.

Protein-coding gene annotation

The repeat-masked genome was subjected to ab initio gene prediction using AUGUSTUS (v3.3.2), Genscan (v1.0), and GlimmerHMM (v3.0.4)23,24,25. GeneWise (v2.4.1) was used for precise protein mapping and splice site identification26. RNA-seq reads were aligned to the genome using HISAT2 (v2.2.1), and transcripts were assembled with StringTie (v2.2.0) and PASA (v2.3.2)27,28,29. MAKER2 (v2.31.10) and HiFAP integrated these predictions, resulting in 44,402 protein-coding genes (Table 4)30. Comparative genomics with related species (Carassius carassius, Carassius gibelio, Cirrhinus molitorella, and Cyprinus carpio) was performed using TBLASTN (e-value ≤ 1e-5) to identify protein-coding regions31. Gene structures were compared with homologous species (Fig. 3).
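The e-value cutoff applied to the TBLASTN search can be sketched as a filter over tabular BLAST output. The example below assumes the default 12-column tabular format (`-outfmt 6`), in which the e-value is the 11th column; the paper does not say which output format was used, and the sequence identifiers in the toy rows are invented for illustration.

```python
def filter_hits(lines, max_evalue=1e-5):
    """Keep tabular BLAST rows whose e-value (column 11) is <= max_evalue."""
    kept = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            kept.append(fields)
    return kept


# Hypothetical rows: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
toy_rows = [
    "protA\tchr1\t95.0\t200\t10\t0\t1\t200\t1000\t1600\t1e-30\t350",
    "protB\tchr2\t60.0\t80\t32\t1\t5\t84\t500\t740\t1e-5\t52.0",
    "protC\tchr3\t40.0\t50\t30\t2\t1\t50\t90\t240\t0.002\t30.1",
]
for hit in filter_hits(toy_rows):
    print(hit[0], hit[1], hit[10])
```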

Table 4 Statistics on transposable elements in the P. rabaudi genome.
Fig. 3

Distribution of genes in different species.

Functional annotation of protein-coding genes was performed using InterPro, GO, KEGG, SwissProt, TrEMBL, TF, Pfam, NR, and KOG databases32,33,34,35,36. InterProScan (v5.61–93.0) was employed to annotate conserved domains and motifs, with 98.34% (43,663) of genes functionally annotated (Table 5)37.

Table 5 Putative protein-coding gene functional annotations of the P. rabaudi genome.

Non-coding gene identification

The tRNA sequences were predicted using tRNAscan-SE (v1.3.1), while rRNA genes were identified via BLASTN. miRNA and snRNA sequences were predicted using INFERNAL based on the Rfam database (v14.8)38,39. Results are summarized in Table 6.

Table 6 Statistics of the noncoding RNA in the P. rabaudi genome.

Data Records

All sequencing data from the three sequencing platforms and the assembled genome have been deposited in the NCBI SRA database and can be accessed under BioProject accession PRJNA1175827 (refs 40,41,42,43,44,45). The genome annotation files are available on figshare: https://doi.org/10.6084/m9.figshare.28588382 (ref. 46).

Technical Validation

DNA and RNA quality assessment

Prior to sequencing, DNA and RNA quality (OD260/280 and OD260/230 ratios) and concentration were measured using the NanoDrop 2000 Spectrophotometer and Qubit 3.0 Fluorometer. Sample integrity was confirmed via agarose gel electrophoresis and the Agilent 2100 Bioanalyzer.

Genome assembly quality evaluation

Short reads were aligned to the assembled genome using BWA (v0.7.17-r1188), while HiFi and ONT ultra-long reads were mapped with minimap2 (v2.24_x64-linux)15,47. Mapping rates were 99.88% for short reads and 99.99% for HiFi and ONT reads (Tables S5, S6). BUSCO (v5.4.3) analysis based on the actinopterygii_odb10 database indicated that 98.1% of 3571 single-copy orthologs were complete (Table 7)48.
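BUSCO completeness is usually read from the one-line summary that BUSCO writes in its short summary file, of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`. The Python sketch below parses such a line, assuming that layout; the values in the toy string are invented and are not the P. rabaudi results, which are reported in Table 7.

```python
import re

SUMMARY = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Parse a BUSCO one-line summary into percentages plus the total count n."""
    m = SUMMARY.search(line)
    if m is None:
        raise ValueError("no BUSCO summary found")
    result = {k: float(v) for k, v in m.groupdict().items() if k != "n"}
    result["n"] = int(m.group("n"))
    return result


# Hypothetical summary line with made-up values.
toy_line = "C:90.0%[S:60.0%,D:30.0%],F:4.0%,M:6.0%,n:200"
stats = parse_busco(toy_line)
print(stats["C"], stats["S"], stats["D"], stats["n"])
```

In a tetraploid genome a large duplicated (D) share is expected, so completeness (C) rather than the single-copy share (S) is the more informative figure.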

Table 7 Statistics of BUSCO analysis of the P. rabaudi genome.