Background & Summary

The black carp (Mylopharyngodon piceus, NCBI Taxonomy ID: 75356) belongs to the genus Mylopharyngodon within the family Xenocyprididae and order Cypriniformes [1]. This demersal fish is native to East and Southeast Asia and is predominantly distributed in the Amur River basin, as well as in various rivers and lakes across southern China and Vietnam [1]. It has also been introduced to numerous countries in North America, Europe, Africa, and the Middle East [2].

Recognized as one of the largest cyprinids in the world, the black carp can reach a maximum body length of 1.5 meters and a body weight of 70 kilograms [3]. Renowned for its tasty meat, high nutritional value, simple aquaculture process, and fast growth rate, black carp is one of the most economically important traditional farmed fish in China, where it has been cultivated and consumed since the Tsin Dynasty (A.D. 265–420) [4,5]. Black carp remains one of the dominant freshwater farmed fish in China and is commonly counted among the culturally significant “four famous domestic fishes”, along with the grass carp (Ctenopharyngodon idella), silver carp (Hypophthalmichthys molitrix), and bighead carp (Hypophthalmichthys nobilis) [5]. According to the China Fishery Statistical Yearbook, black carp farming production in China reached 748,026 tons in 2022, an increase of more than 50% from a decade earlier [4,6].

Beyond its culinary appeal, black carp also plays a role in biological control within aquaculture. As a carnivorous fish that feeds primarily on snails, it has been introduced to many countries, including the United States, to manage snail populations that can otherwise harm aquaculture operations [3,7].

The large-scale application of sequencing technologies in the 21st century has made whole-genome sequences available for around 1,400 fish species in NCBI databases [8]. This wealth of genomic information has been instrumental in reconstructing the evolutionary history of fish and in advancing basic research in areas such as disease resistance, immune responses, and morphogenesis [9,10,11]. These genomic insights also support conservation and aquaculture practices. The first draft genome of black carp, derived from short-read sequencing, was reported in 2022 [12] and was improved in 2023 [13], with the number of contigs reduced to 3,436 and the contig N50 increased to 2.9 Mb. However, both of these assemblies remain at the scaffold level, which constrains the understanding and utilization of valuable genomic information for this species. Recently, a chromosome-level genome of black carp assembled from PacBio HiFi reads was reported [14], with a genome size of 893.89 Mb and a scaffold N50 of 36.19 Mb.

In this study, we present a high-quality chromosome-anchored genome assembly of black carp from a different source. We combined our Nanopore long reads with published Illumina short reads [12,15] to generate a genome assembly of black carp, followed by Hi-C sequencing to scaffold the assembled sequences onto chromosomes. The resulting genome spanned 848.70 Mb, with a contig N50 of 3.37 Mb and a scaffold N50 of 34.13 Mb, and was anchored onto 24 chromosomes. Compared to the previously reported chromosome-level genome of black carp, our assembly differs in size by over 45 Mb, which may reflect genetic diversity between populations. Furthermore, our assembly showed high completeness, with a BUSCO score of 97.6%. A total of 23.98% (203.54 Mb) of the genome was identified as repeat sequences, and 37,418 protein-coding genes were predicted. Our chromosome-level genome offers critical genomic data for black carp, generated using a distinct methodology and representing a different genetic population. It provides an alternative but equally valuable reference genome to advance breeding programs and biological studies for this species. Additionally, our assembly enriches the genomic resource pool of black carp, enabling further investigations into genome evolution, local adaptation, and genetic diversity through comparative genomics.

Methods

Sample collection and genome sequencing

One healthy black carp with a body length of 23.2 cm (sex not determined), collected from Jingyang Hatchery in Guangdong Province, China, was sampled for Nanopore and Hi-C sequencing. After sampling, the muscle tissue of this fish was first frozen in liquid nitrogen and subsequently transferred to a -80 °C freezer for storage until sequencing began.

For Nanopore sequencing, genomic DNA was extracted from muscle tissue using the Monarch Genomic DNA Purification kit (NEB, #T3010), following the standard protocol. The quantity, purity, and integrity of the genomic DNA were assessed using a NanoDrop spectrophotometer (Thermo Fisher Scientific) and 1.5% agarose gel electrophoresis. The Nanopore libraries were constructed from 4 µg of high-quality genomic DNA using the LSK-110 library preparation kit (ONT, Oxford, United Kingdom). Long-read sequencing was performed on an in-house MinION (ONT) platform using an R9.4.1 flow cell (FLO-MIN106, ONT). The resulting fast5 files were base-called with Guppy v6.0.1 (https://community.nanoporetech.com/docs/prepare/library_prep_protocols/Guppy-protocol/v/gpb_2003_v1_revax_14dec2018) using the super-accuracy (“sup”) model. After the removal of adapters, 35.83 Gb of Nanopore sequencing reads were obtained, with an average read length of 3,293 bp and an N50 read length of 7,485 bp (Table 1). This equates to approximately 42x coverage of the genome, based on the size of our final genome assembly (848.70 Mb).
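The coverage figure quoted above follows from dividing the total sequenced bases by the assembly size. The snippet below is only an illustrative check of that arithmetic; the function name `sequencing_depth` is hypothetical and not part of any pipeline used in this study.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return the mean fold coverage for a given data yield and genome size.

    total_bases_gb: sequenced bases in gigabases (Gb)
    genome_size_mb: genome (assembly) size in megabases (Mb)
    """
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    # Convert Gb to Mb so that both quantities share the same unit.
    return (total_bases_gb * 1_000) / genome_size_mb


# Values reported in the text: 35.83 Gb of Nanopore reads, 848.70 Mb assembly.
nanopore_depth = sequencing_depth(35.83, 848.70)
print(f"Nanopore coverage: {nanopore_depth:.1f}x")
```

The same helper could be applied to the Hi-C yield in Table 1, although Hi-C read depth is usually interpreted in terms of contact density rather than base-level coverage.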

Table 1 Sequencing data for the black carp genome assembly.

To achieve a chromosome-level genome assembly of black carp, Hi-C sequencing was performed using DNA from the muscle tissue. All steps, from sample preparation to DNA extraction, were performed in-house following the restriction enzyme Pore-C (RE-Pore-C) protocol [16] provided by Oxford Nanopore Technologies. The prepared DNA libraries were then sequenced on an Illumina platform (Illumina Inc., San Diego, CA, USA) at Novogene Co., Ltd. (Hong Kong, China) in 150 bp paired-end (PE) mode, which yielded 53.25 Gb of raw sequencing data (Table 1).

Genomic feature survey

The genome size and heterozygosity of black carp were analysed using the available Illumina sequencing data (NCBI SRA ID SRR14181237) [12,15]. The Illumina short reads served as input for Jellyfish v2.3.0 [17], from which a k-mer (k = 21) frequency distribution was obtained, as depicted in Fig. 1. After discarding the k-mers with abnormal depth, the genome size was estimated using the formula: genome size = total number of k-mers / average k-mer depth. Consequently, the estimated genome size of black carp was 765.00 Mb, with a heterozygosity rate of 0.355%.
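The formula above can be illustrated with a minimal sketch operating on a k-mer histogram. It assumes a two-column text input (depth, number of distinct k-mers at that depth), which is the layout produced by `jellyfish histo`. Two further choices are assumptions of this sketch rather than details given in the text: the cutoff `min_depth` used to discard low-depth (likely erroneous) k-mers, and the use of the main peak depth as the "average depth". The histogram values are toy numbers, not data from this study.

```python
def parse_histogram(lines):
    """Parse 'depth count' lines into a dict {depth: number of distinct k-mers}."""
    histogram = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        depth, count = int(fields[0]), int(fields[1])
        histogram[depth] = count
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (total k-mers) / (peak k-mer depth).

    min_depth is a hypothetical cutoff: k-mers below it are treated as
    sequencing errors and discarded before the estimate is made.
    """
    kept = {d: c for d, c in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers left after filtering")
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * c for d, c in kept.items())
    # The depth with the most distinct k-mers stands in for the average depth.
    peak_depth = max(kept, key=kept.get)
    return total_kmers / peak_depth, peak_depth


# Toy histogram: an error tail at depths 1-2 and a main peak at depth 10.
toy_lines = ["1 1000", "2 200", "8 10", "9 40", "10 100", "11 40", "12 10"]
size, peak = estimate_genome_size(parse_histogram(toy_lines))
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

Dedicated tools fit a statistical model to the whole distribution and also report heterozygosity; this sketch covers only the size formula.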

Fig. 1

k-mer (k = 21) frequency distribution generated using Illumina sequencing data of black carp. The x-axis and y-axis represent the k-mer depth and frequency, respectively.

De novo genome assembly and evaluation

The de novo genome assembly of black carp was performed by combining the Nanopore and Illumina data. Briefly, the Nanopore sequencing reads were assembled using NextDenovo v2.5.0 (https://github.com/Nextomics/NextDenovo) with default parameters. Then, to enhance the accuracy of the assembly, NextPolish v1.4.1 [18] was employed for polishing with the Nanopore long reads and the published Illumina short reads [12,15]. This step generated an assembly of 848.42 Mb, containing 721 contigs, with a contig N50 of 3.37 Mb.
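For readers unfamiliar with the N50 statistic reported here, the following sketch shows one common way to compute it: sort sequence lengths in descending order and return the length at which the running total first reaches half of the total assembly length. The function name `n50` and the toy lengths are illustrative only; the statistics in this paper were not necessarily produced with this code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional arithmetic.
        if running * 2 >= total:
            return length
    raise ValueError("cannot compute N50 of an empty collection")


# Toy example: total = 30, half = 15; 10 + 8 = 18 >= 15, so N50 = 8.
toy_contigs = [10, 8, 6, 4, 2]
print(n50(toy_contigs))
```

The same function applies to contigs, scaffolds, or reads; only the input lengths differ.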

For the Hi-C reads, low-quality reads were first removed using Trimmomatic v0.39 [19] with the parameters “ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36”. Subsequently, Juicer v1.6 [20] was used to find the potential linkage between contigs and scaffolds within the assembly. 3D-DNA (phasing branch 201008) [21] was used to anchor the assembled sequences onto chromosomes, generating a contact map that was visualized using Juicebox v1.11.08 [22]. Finally, we obtained a chromosome-level genome assembly of 848.70 Mb, scaffolded using Hi-C data onto 24 pseudo-chromosomes with sizes ranging from 20.78 Mb to 48.76 Mb (Table 2 and Fig. 2). The assembly comprises 172 scaffolds in total, with a scaffold N50 of 34.13 Mb.

Table 2 Statistics of the assembled black carp genome.
Fig. 2

Characteristics of the black carp genome. (a) Hi-C contact map of the black carp genome assembly. (b) Circos plot of the black carp genome. From the inner to the outer layers: gene density, transposable element (TE) density, GC content, and chromosome ideograms.

To evaluate the completeness of the assembled genome, we employed BUSCO (Benchmarking Universal Single-Copy Orthologs) v5.1.2 [23], using 3,640 orthologs from the Actinopterygii_odb10 database as reference. The BUSCO analysis identified 3,552 complete orthologs within our assembly, indicating a completeness of 97.6% (Table 2). We also mapped the published Illumina short reads of black carp to the assembled genome using BWA v0.7.18 [24] to assess its consistency. Analysis of the mapping results using SAMtools v1.20 [25] confirmed that 97.97% of the Illumina reads were successfully mapped to the genome, with a coverage of 99.53%.

Repetitive sequence annotation

To predict repeats in our black carp genome assembly, we integrated homology-based and de novo approaches. First, RepeatModeler v2.0.3 [26] was employed for the de novo discovery of repetitive sequences, creating a custom library tailored to our genome. Subsequently, a comprehensive library was constructed by merging sequences from Dfam v3.6 [27] and Repbase [28] with those identified by RepeatModeler. With this enriched library as a reference, we used RepeatMasker v4.1.3 [29] to detect the repetitive sequences throughout the genome. This approach identified approximately 203.54 Mb of repeats, which constitute 23.98% of the total genome assembly (Table 3).

Table 3 Statistics of the repetitive elements in the black carp genome assembly.

Protein-coding gene annotation

Three complementary pipelines, based on ab initio prediction, transcript-assisted strategies, and protein alignment, were used to predict protein-coding genes. Seventeen RNA-Seq datasets from various tissues, including hindgut, foregut, eye, fin, gill, head, kidney, liver, muscle, bladder, brain, skin, and spleen, were retrieved from the NCBI repository [30-46] for the prediction. First, BRAKER v3.0.8 [47] was employed for ab initio gene structure prediction, using transcript evidence derived from aligning the RNA-seq reads to the repeat-masked genome assembly with HISAT2 v2.2.1 [48]. The second pipeline used both de novo and genome-guided transcriptome assembly with Trinity v2.15.1 [49]; the two transcriptome assemblies were then combined, and gene structures were predicted using the PASA pipeline v2.5.3 [50]. For protein alignment-based prediction, protein sequences from UniProt (https://www.uniprot.org) were input to miniprot v0.13-r262-dirty [51] with default parameters to generate the protein-based evidence file. Finally, EVidenceModeler v2.0.0 [52] was used to produce the final gene models by merging the outputs of PASA, BRAKER, and miniprot with the weights listed in Table 4. Overall, this combined approach predicted a total of 37,418 protein-coding genes within our genome assembly. These genes had an average of 7.78 exons and an average exon length of 160 base pairs (Table 5). The BUSCO score for the predicted genes is 91.2%, indicating a high level of completeness in the predictions. For gene functional annotation, we used eggNOG-mapper v2.1.12 [53] and InterProScan v5.69-101.0 [54] for ortholog- and protein domain-based annotations, which yielded 24,734 (66.1%) and 29,667 (79.3%) hits, respectively. By integrating the results from both annotation methods, a total of 30,278 (80.9%) genes were successfully assigned functional annotations.
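The annotation counts above can be cross-checked with a short calculation. The sketch below assumes that the integrated total (30,278) is the union of the two annotated gene sets, which is our reading of "integrating the results" rather than something stated explicitly. Under that assumption, inclusion-exclusion gives the number of genes annotated by both tools. All function and variable names are illustrative.

```python
TOTAL_GENES = 37_418      # predicted protein-coding genes
EGGNOG_HITS = 24_734      # genes annotated by eggNOG-mapper
INTERPRO_HITS = 29_667    # genes annotated by InterProScan
UNION_HITS = 30_278       # genes annotated by at least one tool (assumed union)


def shared_hits(a: int, b: int, union: int) -> int:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return a + b - union


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


both = shared_hits(EGGNOG_HITS, INTERPRO_HITS, UNION_HITS)
eggnog_only = EGGNOG_HITS - both
interpro_only = INTERPRO_HITS - both

print(f"annotated by both tools: {both}")
print(f"eggNOG-mapper only: {eggnog_only}, InterProScan only: {interpro_only}")
print(f"eggNOG-mapper: {percent(EGGNOG_HITS, TOTAL_GENES)}%")
print(f"InterProScan: {percent(INTERPRO_HITS, TOTAL_GENES)}%")
print(f"either tool: {percent(UNION_HITS, TOTAL_GENES)}%")
```

If the union assumption holds, most annotated genes are supported by both tools, and InterProScan contributes the majority of the tool-specific annotations.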

Table 4 The weights used in EVidenceModeler.
Table 5 Statistics of predicted protein-coding genes in the black carp genome assembly.

Data Records

The sequencing data and genome assembly have been submitted to public repositories. The Hi-C and Nanopore sequencing data have been deposited in the NCBI Sequence Read Archive under the SRA accession numbers SRR28762164 [55] and SRR28762165 [56], and the BioProject accession number PRJNA1102922. The genome assembly has been deposited at NCBI GenBank under the accession number JBCHWC000000000 [57].

Technical Validation

To evaluate the quality of our assembly, we used BUSCO v5.1.2 [23] to assess its completeness. The BUSCO results confirmed that 3,552 (97.6%) of the 3,640 conserved single-copy genes in Actinopterygii are present in our assembled genome, implying a high degree of completeness. Additionally, to evaluate the accuracy of our assembly, we mapped the Illumina short reads to the genome using BWA v0.7.18 [24]. Based on the statistical analysis performed with SAMtools v1.20 [25], 97.97% of the Illumina short reads were mapped to the genome and the coverage was 99.53%, indicating a high level of accuracy. Collectively, these results demonstrate that our assembly is of high quality.