Background & Summary

White bream Parabramis pekinensis belongs to family Cyprinidae and is widespread in all major freshwaters of Asia, mainly in China1. It is also an important freshwater aquaculture economic fish in China. However, bream was always regarded as a generic term such as Megalobrama terminalis, Megalobrama amblycephala and P. pekinensis2. They have no significant morphological differences and their chromosome number are all 2n = 48. Above breeds could be crossed with each other and maintain the progeny of both sexes. Thus, they are good materials for exploring the fertility of hybrids3.

The amino acid content in P. pekinensis was about 167.26 ± 24.07 mg/g, which was significantly higher than Ctenopharyngodon idella (134.11 ± 18.98 mg/g) and M. amblycephala (149.62 ± 13.6 mg/g)4. As a herbivorous fish, P. pekinensis showed more delicious meat and nutritive value. People living in China were favourite of them and the selling price in market was several times higher than other freshwater herbivorous fishes. The domestic bream market capacity had expanded with a growth rate of 10% since 2008. At present, the demand of domestic bream consumption market is more than 350,000 tons on average5. Our previous studies mainly focused on the developmental and reproductive physiology of P. pekinensis. E.g. we investigated the growth rules of P. pekinensis and its potential regulatory mechanisms6, clarified the characteristics of gonadal development7,8, and developed artificial breeding techniques for P. pekinensis9. However, research on this fish was lack of genomics and genetic breeding. Bream were valuable aquatic species in biological research, but only the genome of M. amblycephala has been published so far.

Recently, High-through chromosome conformation capture (Hi-C)10 technology has been used in whole genome sequencing (WGS), which is based on Chromosome Conformation Capture (3 C Technology)11 and high-throughput sequencing. Hi-C could identify chromatin interactions in the whole genome exactly and obtain chromosome level genomes data completely, which also could be used to discover the potential genetic structure of species, explore species diversity, construct haplotype maps and conduct genome-wide association analysis. At present, WGS, transcriptome sequencing, whole-genome re-sequencing, simplified genome has been widely used in nonpattern organism to candidate sex determination genes or interlocking molecular marker by combining genome-wide association analysis (GWAS), quantitative trait locus (QTL), expression analysis and functional verification, such as in Channa argus12, Ctenopharyngodon idellus13, Larimichthys crocea14 and Salmo salar15.

In the present study, we constructed a high-quality chromosome-level reference genome using long-reads generated by a PacBio platform, short reads generated by an Illumina platform, and the Hi-C analysis technology. In addition, phylogenetic analyses based on single-copy genes were performed to understand the relationship between P. pekinensis and other species. It is the first genome assembly of P. pekinensis, which will be helpful to understand the gene structure, function and arrangement of this species, providing a basis for subsequent studies on genetic breeding, evolutionary analysis and germplasm resource conservation.

Methods

Sample collection, sequencing

We extracted genomic DNA from the muscle samples of a P. pekinensis (~1.5 kg, estimated age 2 years, female), provided by Jingjiang Binjiang aquaculture seed farm, Taizhou City, Jiangsu Province, China. Genomic DNA of each sample was extracted from muscle tissue using the QIAamp DNA Mini Kit (Qiagen, Valencia, CA, USA) following the manufacturer’s instructions. The quality and quantity of the extracted DNA were assessed via agarose gel electrophoresis and an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). An Illumina DNA library of P. pekinensis with an insert size of 350 bp was constructed using the Genomic DNA Sample Prep kit (Illumina, San Diego, CA, USA) following the manufacturer’s instructions, and sequenced on an Illumina HiSeq X Ten platform using a 150-bp paired-end strategy. A total of 58.28 Gb short reads was generated for P. pekinensis and used for genome size estimation.

For PacBio HiFi long-read sequencing, about 10 µg gDNA was used to construct long-read libraries by using a SMRTbell Express Template Prep Kit 2.0 based on PacBio’s standard protocol (Pacific Biosciences, USA), which were sequenced through a PacBio Sequel II System. A total of 77.96 Gb HiFi reads with N50 sizes of 22,713 bp was obtained using the CCS v6.0.016 (Circular Consensus Sequencing) software with the parameter -min-passes 1, which covered approximately 88 × genome (Table 1).

Table 1 Statistics of Hi-C assisted assembly.

For the high-throughput chromosome conformation capture (Hi-C) sequencing, muscle tissue (about 1 g) from a female individual was collected, and Hi-C library was constructed by using GrandOmics Hi-C kit (the applied restriction enzyme is DpnII, GrandOmics, China) according to the manufacturer’s protocol. The Hi-C libraries were then sequenced using DNBSEQ T7 platform (MGI, China) with the paired-end module. In total, 103.47 Gb of raw reads were generated for P. pekinensis. Subsequently, fastp v0.19.517 was applied to filter the adaptors and low-quality reads. Finally, a total of 101.42 Gb high-quality reads were retained for construction of pseudo-chromosomes.

RNA extraction and transcriptome sequencing

The total RNA samples were extracted from muscle, liver, ovary, kidney and gill tissues using a standard Trizol protocol (Invitrogen, Carlsbad, California, USA), and purified using a Qiagen RNeasy mini kit (Qiagen, USA). RNA with equal amounts from each tissue was mixed for creating Illumina cDNA library followed the manufacture’s guideline, which was then sequenced on a HiSeq X Ten platform (Illumina, USA) with paired end mode and PacBio platform (Pacific Biosciences, USA). Around 36 Gb (HiSeq: 34.04 Gb and HiFi: 1.96 Gb) transcriptome data were generated for assistance to gene annotations.

Genome size assessment and preliminary assembly

To evaluate the genome size of P. pekinensis, we analyzed the K-mer (K = 17) frequency distribution of the filtered 54.13 Gb Illumina reads and obtained the K-mer depth distribution curve (Fig. 1b). The depth of K-mer peak was 34, and the number of k-mer was 31,941,945,635. Therefore, the genome size of P. pekinensis estimated about 939.47 Mb Fig. 2. The genome was assembled using the hifiasm18 software with default parameters, and obtained the genome of preliminary assembly. These contigs were then polished with Nextpolish (v1.10)19 using the Illumina short reads to fix possible base errors. The primary genome assembly was 1.03 Gb in length and N50 size was 31.22 Mb, consistent with the estimated genome size. Statistics were showed in Table 2.

Fig. 1
figure 1

Parabramis pekinensis (a). Frequency distribution of 17-mer depth in the second-generation sequencing data of P. pekinensis (b). Hi-C chromatin interaction map of the P. pekinensis assembly (c).

Fig. 2
figure 2

The Venn figure of function annotations from different databases.

Table 2 Statistics of sequencing data of P. pekinensis.

Chromosome construction and genome assembly

Firstly, the clean paired-end reads were mapped to the draft assembled sequence using bowtie2 (v2.3.2) (-end-to-end --very-sensitive -L 30)20 to get the unique mapped paired-end reads. Subsequently, HiC-Pro (v2.8.1)21 pipeline was applied to detect valid ligation products and only valid contact paired reads were retained for further analysis. Finally, the scaffolds were further clustered, ordered, and oriented scaffolds onto chromosomes using LACHESIS22 software with optimized parameters (CLUSTER_MIN_RE_SITES = 100, CLUSTER NONINFORMATIVE RATIO = 1.4, CLUSTER_MAX_LINK_DENSITY = 2.5, ORDER MIN N RES IN SHREDS = 60, ORDER MIN N RES IN TRUNK = 60). After this, Juicebox v1.11.0823 was employed to visually inspect and rectify any errors in the order and orientation of the contigs or assembly errors within the contigs. We hence obtained the final genome assembly with a size of 1.03 Gb, of which 98.31% are anchored into 24 chromosomes (Table 3). The scaffold and contig N50 values of the overall chromosome-level genome assembly are 40.04 Mb and 29.69 Mb, respectively (Table 1).

Table 3 Scaffold clustering 24 chromosome length statistics.

Finally, we used BUSCO (Benchmarking Universal Single-Copy Orthologs)24 and CEGMA25 to assess the completeness of the genome and core gene. BUSCO against actinopterygii_odb10 database was employed to assess the completeness of this genome assembly, showing that the assembly contains 98.35% of complete BUSCO genes including 96.73% single-copies and 1.62% duplicates, which indicated major conserved genes were assembled completely and results were highly credible. Moreover, we obtained 243 core genes (97.98%), which including 183 complete core genes (73.79%), and it indicated that assembled genome has complete core genes.

Repeats and gene annotation

A combined strategy based on de novo searches and homologue alignments was used to annotate whole-genome repeat elements. GMATA26 was used to search for simple repeats (SSRs) with short repeat units, and the genome was subjected to SSR softmask. Tandem Repeats were re-searched with Tandem Repeats Finder (TRF)27, and softmask was performed on tandem repeats. MITE-hunter was used to construct the TE library, and RepeatModeler28 was used to search the repeat sequences de novo to construct the RepMod library. The combination of TE, RepMod and Repbase library revealed that 572.45 Mb repeats were predicted which accounting for 55.52%. Tandem repeats were predicted 9.50 Mb, which accounted for 0.92%. Short interspersed nuclear elements (SINEs) and long terminal repeats (LTRs) accounted for 1.17% and 14.21% of the whole genome, respectively, and long interspersed nuclear elements (LINEs) accounted for 8.46%.

For protein-coding gene prediction, a comprehensive strategy combining homology-based prediction, transcript-based prediction and de novo prediction was employed. For de novo predictions, the gene structures were identified with AUGUSTUS v3.2.129. For homology-based predictions, protein sequences from Oreochromis niloticus, Megalobrama amblycephala, Larimichthys crocea, Ictalurus punctatus, Cyprinus carpio, Oryzias latipes, Lates calcarifer, Danio rerio, and Ctenopharyngodon idell were downloaded from NCBI and GeMoMa v1.6.430 was adopted with default parameters. For transcript-based prediction, the short reads generated from the Illumina platform were assembled using Trinity v2.5.231, the gene structures were identified using PASA v2.3.332. Finally, gene sets were integrated by the Evidence Modeler (EVM) pipeline v1.0, which based on PASA prediction >  = GeMoMa prediction >  = ab initio prediction to count results. A total of 26,542 protein-coding genes were predicted and average gene length was 19,708.88 bp (Table 4). The BUSCO completeness values were 96.04% of the predicted proteins. Next, we compared the number of genes, gene length, coding DNA sequence (CDS) length, exon length, and intron length distributions with those of other fish species.

Table 4 Protein coding gene annotation statistics.

We used Blastp and InterproScan to compare P. pekinensis protein genes codes with public protein databases, which involved NCBI non-redundan (NR), Kyoto Encyclopedia of Genes and Genomes (KEGG), Clusters of orthologous groups for eukaryotic complete genomes (KOG), Gene Ontology (GO), Swissport. The results of these comparisons were merged and integrated to predict the functional information of the obtained genes. In this study, 26,542 proteins were predicted and 24,581 genes encoding proteins were compared to functional databases, accounted for 92.61% of total genes.

Compared to the KEGG database, the results of functional pathway clustering were shown that 16,860 protein-coding genes were enriched in 35 biochemical pathways, forming five major categories: Cellular Processes, Environmental Information Processing, Genetic Information Processing, Metabolism and Organismal Systems. Compared to the GO database, the results of functional clustering showed that 16,810 genes were annotated into 60 secondary functions from three aspects, including Cellular component, biological process and Molecular function.

Phylogenetic tree

A species tree was constructed using those single-copy orthologs from the whole genomes of P. pekinensis and other 10 representative ray-finned species, including Lepisosteus oculatus, O. nilotica, D. rerio, O. latipes, L. calcarifer, I. punetaus, C. carpio, C. idella, M. amblycephala and L. crocea were downloaded from the NCBI. The protein-coding genes of 11 representative species were selected for gene family analysis using OrthoMCL v1.133 with default parameters (Li et al.33). A total of 1,224 homologous single-copy genes were obtained. These homologous single-copy genes were compared using the “-align” parameter of Muscle v3.734 (Edgar34). Gblock v0.19b35 was employed to extract conserved sequences in comparison results with the parameter “-b4 = 5 -t = d -e = 0.2”. The phylogenetic tree was subsequently constructed by PhyM v3.036 (Guindon et al.36). Based on the phylogenetic topology, MCMCTREE in the PAML v4.9e37 package was employed to estimate divergence times among P. pekinensis and other examined species, with assistance of fossil records from the TimeTree v5.038. Here, our phylogenetic tree showed that P. pekinensis was evolutionarily closer to M. amblycephala, which also belongs to Culterinae, with a divergence time of 7.6 million years ago (Mya). In addition, the two species have a common ancestor with C. idella, which belong to the same family Cyprinidae, and the divergence time of the two clades was about18.4 Mya (Fig. 3).

Fig. 3
figure 3

The phylogenetic analysis of P. pekinensis and other 10 species. The numbers in blue near the nodes indicate the divergence time (million years ago, MYA), and the blue numbers inside the brackets are bootstrap values.

Based on the protein coding sequences and gene structure, JCVI v19021339 was used to perform chromosomal collinearity analysis among P. pekinensis, M. amblycephala and C. idella. Through collinearity analysis, it can be seen that these four species have a good collinearity relationship, and the chromosomes show a one-to-one correspondence relationship, which also indicates that the assembled genome is complete and high-quality (Fig. 4).

Fig. 4
figure 4

A synteny analysis of the chromosomes among P. pekinensis, M. amblycephala and C. idella Chromosomes are ordered from the longest to the shortest.

Data Records

The raw reads of the MGI, HiFi, Hi-C and transcriptome sequencing for P. pekinensis were deposited at NCBI under the accession number PRJNA1082057. Raw reads are available in the Sequence Reads Archive (SRA) with the accession number SRP50938040. The genome assemblies were deposited at GenBank with the accession number GCA_041684115.141. The annotation files have been submitted to appropriate public Figshare database42.

Technical Validation

The final genome assembly of P. pekinensis was 1.03 Gb in size with an N50 of 31.22 Mb, consisting of 24 chromosomes. Also, the Hi-C chromosome interaction intensity signal demonstrated the high quality of genome assembly. We also performed BUSCO analysis with the actinopterygii_odb10 database to assess the completeness of the genome assembly and protein-coding gene for P. pekinensis. More than 98.35% and 96.04% of complete BUSCOs were identified in genome assembly and protein-coding genes, respectively, showing the genome assembly and annotation of P. pekinensis are complete and high-quality. Furthermore, we compared the conservation synteny among P. pekinensis, M. amblycephala and C. idella to validate the chromosome assembly. We observed highly conserved synteny and strict correspondence of chromosome assignment (Fig. 4).