Background & Summary

The barbel chub Squaliobarbus curriculus (Cypriniformes: Xenocyprididae)1 is a fish endemic to East Asia, found in China, North Korea, South Korea, eastern Russia, and Vietnam. In China, this species is commonly known as “red-eye rod” or “wild grass carp” because of the red spots around its eyes and its body shape, which resembles that of the grass carp Ctenopharyngodon idella. Owing to its high adaptability to various environmental conditions, S. curriculus is widely distributed across rivers and lakes, except in the Qinghai-Tibet Plateau and the Hexi Corridor2. The species has an average age at maturity of three years. Similar to the Four Major Chinese Carps (i.e., the black carp (Mylopharyngodon piceus), the grass carp (C. idella), the silver carp (Hypophthalmichthys molitrix), and the bighead carp (Hypophthalmichthys nobilis)), S. curriculus migrates from rivers to lakes to complete its reproduction during the spawning season, which lasts from April to September. Its eggs are pelagic and require long river stretches for drifting and hatching.

S. curriculus is an economically important freshwater fish species in China due to its high nutritional value. The meat contains 18 amino acids, of which 46.12% are essential, and its essential amino acid content is significantly higher than that of other economically important fish species in China, such as the Four Major Chinese Carps2. In Meizhou, a city in eastern Guangdong Province in southern China, S. curriculus is particularly popular as the main ingredient in “Meizhou Yusheng”, a traditional Hakka raw fish salad that originated in the Qin dynasty (221–206 BCE) and flourished during the Tang dynasty (618–907 CE). Consequently, the demand for S. curriculus is high in the Pearl River Delta region, especially in areas with large Hakka populations.

As the seventh most harvested fish species, S. curriculus is an important commercial fishery species in the Pearl River, particularly in the western Pearl River Estuary3. However, its fisheries production has declined sharply, and the population is now dominated by small-sized individuals as a result of dam construction, overfishing, and environmental pollution4. To address this, the National Aquatic Germplasm Resources Protection Area for S. curriculus has been established in the Xijiang River (the main stream of the Pearl River)5, a site where this species shows high genetic diversity and which deserves further monitoring and exploration6. Efforts to recover its natural populations include stock enhancement and artificial breeding. Currently, artificial breeding techniques are well developed, and several fish farms for this species operate in Guangdong Province. Additionally, measures to control fishing intensity have been implemented, such as optimizing spawning biomass per recruit and recommending an optimal fishing age and body length based on previous studies.

Developing aquaculture of S. curriculus is a promising strategy for balancing fisheries supply and consumption demand, thanks to the success of artificial breeding. Nevertheless, the lack of selected populations or strains with fast growth rates is hindering the expansion of S. curriculus aquaculture. Studies have shown that the growth rate of S. curriculus varies among populations from different water systems7,8, as well as between populations in the upper and lower reaches of the same river3. However, the underlying molecular basis remains unknown, and the lack of genomic resources is a key bottleneck in addressing this question. Generating a high-quality reference genome is the first step toward advancing this field. Genomic resources for S. curriculus will make it possible to investigate genomic markers and regions associated with important phenotypic traits, such as body size, body weight and growth rate. Moreover, these resources will provide opportunities to explore additional aspects, including the mechanism of sex determination and the high environmental adaptability of S. curriculus9,10,11, which will also be helpful in subsequent genetic breeding efforts.

In this study, a chromosome-level genome assembly of S. curriculus was generated de novo using a combination of HiFi sequencing, Hi-C sequencing, Iso-seq and short-read RNA-seq. The assembly was 910.27 Mb in size, with a contig N50 of 34.70 Mb and 24 chromosomes supported by the Hi-C contact map. BUSCO assessment showed that 3,626 (99.61%) BUSCOs were complete. We believe this high-quality S. curriculus reference genome will serve as a valuable genomic resource for future research on genetic breeding, population genomics, and the identification of sex-related markers.

Methods

Ethics statement

The handling of fish in this study complied with China's animal welfare laws, guidelines and policies. The protocols were approved by the Laboratory Animal Ethics Committee of Jiaying University (permit reference number No. 2022ZDJS086). Fish were collected for experimental purposes in accordance with the conservation laws applicable to this species. The sampled fish was euthanized with a lethal dose of the anesthetic MS-222 (Sigma).

Sample collection and DNA extraction

One adult male individual of S. curriculus was collected from a fish farm in Meizhou City, Guangdong Province, China. A piece of muscle (~2 g) was collected along the dorsal fin of the fish, and the whole tissue was quickly frozen in liquid nitrogen for 30 minutes. High-molecular-weight genomic DNA was extracted using the QIAGEN Genomic DNA extraction kit according to the manufacturer’s instructions. The quality of the extracted DNA was evaluated by 1% agarose gel electrophoresis and with a Qubit 3.0 Fluorometer (Invitrogen, USA).

Library construction and DNA sequencing

Two types of libraries were used for the assembly. For PacBio HiFi sequencing, a 20 kb long-read sequencing library (SMRTbell library) was constructed according to PacBio’s standard protocol (Pacific Biosciences, Menlo Park, CA, USA). After passing the quality assessment, the library was sequenced on a PacBio Revio System. All circular consensus sequencing (CCS) reads were produced using the CCS module in SMRT Link v9.012. In total, approximately 31.14 Gb of PacBio HiFi reads with an N50 of 20.47 kb were generated, corresponding to a sequencing depth of 34.21× (Table 1).

Table 1 Sequencing data for Squaliobarbus curriculus genome assembly.

For Hi-C sequencing, libraries were constructed using the GrandOmics Hi-C kit with the DpnII enzyme (GrandOmics, China) following the manufacturer’s standard protocol. These Hi-C libraries were sequenced on an MGISEQ-2000 platform (MGI, BGI Shenzhen, China). A total of 97.98 Gb of raw Hi-C paired-end reads were generated and passed to fastp v0.19.513 to filter out low-quality reads. After filtering, a total of 94.98 Gb (104.34×) of clean reads with a mean length of 149 bp were obtained and subsequently used for chromosome-level scaffolding.
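The reported sequencing depths are consistent with dividing the total number of sequenced bases by the assembled genome size (910.27 Mb). The short Python sketch below reproduces this arithmetic; the function and variable names are illustrative, and the use of the final assembly size as the denominator is an assumption inferred from the reported values.

```python
# Sequencing depth = total sequenced bases / genome size.
# Assumption: the denominator is the assembled genome size reported in this study.
GENOME_SIZE_BP = 910.27e6


def depth(total_bases_bp: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Return the average fold coverage of the genome."""
    return total_bases_bp / genome_size_bp


hifi_depth = depth(31.14e9)  # PacBio HiFi reads
hic_depth = depth(94.98e9)   # clean Hi-C reads

print(f"HiFi depth: {hifi_depth:.2f}x")
print(f"Hi-C depth: {hic_depth:.2f}x")
```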

RNA extraction and sequencing

To assist gene structure annotation, both Iso-seq and short-read RNA-seq were employed. Multiple tissues (heart, liver, gill, muscle, skin, fin and gonad) were mixed in equal proportions, and total RNA was extracted using a TRIzol kit (Invitrogen, Carlsbad, CA, USA) following the manufacturer’s instructions. RNA integrity and quality were checked with a Nanodrop 2000 spectrophotometer and the Agilent 2100 Bioanalyzer System (Agilent Technologies, Santa Clara, CA, USA). RNA samples with an RNA integrity number (RIN) ≥ 7.0 were selected for library construction. For Iso-seq, the procedures described in a previous study14 were followed. Briefly, the extracted RNA was used for cDNA synthesis followed by a large-scale PCR amplification step. PCR products were purified and used to construct SMRTbell template libraries, which were then sequenced on a PacBio Revio platform. For short-read RNA-seq, cDNA libraries with insert sizes of ~350 bp were constructed and sequenced on an MGISEQ-2000 platform (MGI, BGI Shenzhen, China). In total, 146.21 Gb and 19.34 Gb of raw data were generated from Iso-seq and short-read RNA-seq, respectively (Table 1).

Genome assembly

For the initial contig-level assembly, raw HiFi reads were assembled using hifiasm v0.19.5-r58715 with default parameters. This primary assembly was about 910.27 Mb in size and consisted of 67 contigs, with a contig N50 of 34.70 Mb. To scaffold these contigs, Hi-C reads were mapped onto the primary assembly using BWA v0.7.816 (-5SP). The output SAM file was piped to samtools v1.19.217 (view -S -h -b -F 3340) to generate a BAM file, which was then processed with the HapHiC v1.0.518 pipeline to generate a scaffold-level assembly and a Hi-C contact map. Briefly, the BAM file was filtered with a Python script (filter_bam.py input.bam 1 --NM 3). The filtered BAM file was used as input for the HapHiC pipeline, with the chromosome number set to 24 according to the diploid chromosome number (2n = 48)19, to produce a chromosome-level assembly. The Hi-C contact map was visualized using the haphic plot module. The final assembly was 910.27 Mb in size (including gap regions) and comprised 41 sequences with an N50 of 35.62 Mb (Fig. 1). Twenty-four of these sequences were chromosome-scale and supported by strong Hi-C signals (Fig. 2). Their lengths ranged from 28.10 Mb to 69.93 Mb, together accounting for 99.50% of the total genome size. The chromosome number inferred from the Hi-C heat map also agreed with a published karyotype study of S. curriculus19.
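The samtools option -F 3340 excludes every alignment that has at least one of the bits in the mask 3340 set in its SAM FLAG field. According to the bit definitions in the SAM specification, this mask corresponds to unmapped reads, reads with an unmapped mate, secondary alignments, duplicates and supplementary alignments. The following Python sketch decodes the mask and mimics the exclusion test; the helper names decode_flag and is_excluded are illustrative and are not part of samtools or HapHiC.

```python
# Bit values of the SAM FLAG field, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value: int) -> list:
    """List the names of all FLAG bits that are set in `value`."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


def is_excluded(read_flag: int, exclude_mask: int = 3340) -> bool:
    """Mimic `samtools view -F`: drop a read if any masked bit is set."""
    return (read_flag & exclude_mask) != 0


print(decode_flag(3340))
```

For example, a properly paired, primary first-in-pair alignment with FLAG 99 passes the filter, whereas the same alignment flagged as supplementary (FLAG 2147) is removed.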

Fig. 1

Circos plot of Squaliobarbus curriculus genome. (a) chromosome sizes, (b) gene density, (c) GC density, (d) repeat elements abundance, (e) DNA transposons, (f) LTRs, and (g) ncRNAs.

Fig. 2

Chromosome heatmaps of Hi-C data of Squaliobarbus curriculus genome.
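The contig and scaffold N50 values reported for the assembly follow the usual definition: the sequence length at which sequences of that length or longer together cover at least half of the total assembly. A minimal Python sketch of this calculation is given below, using a toy list of lengths rather than the actual assembly; the function name n50 is illustrative.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    cover at least half of the summed length of all sequences."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example (not the real assembly): total = 30, half = 15.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```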

Repeat elements annotation

We used two methods (homology-based and de novo prediction) to annotate repeat elements in the S. curriculus genome. For de novo prediction, a novel library was generated using RepeatMasker v4.1.2-p120 based on Repbase TE v21.0121. Repetitive sequences were then detected and classified by RepeatModeler v2.0.322 and LTR-FINDER v1.0.623. For homology-based prediction, repeat sequences were searched using RepeatProteinMask v4.1.2-p120 and RepeatMasker v4.1.2-p120 with default parameters. In total, 445.23 Mb (48.91%) of the genome was identified as repetitive sequence (Table 2), of which DNA transposons accounted for 25.74% (234.27 Mb), LTRs for 3.89% (35.37 Mb), LINEs for 2.55% (23.24 Mb) and SINEs for 0.19% (1.68 Mb). The masked genome was subsequently used as input for ab initio gene structure prediction.

Table 2 Statistics of repetitive sequences.

Gene structure prediction and functional annotation

Gene structure was predicted using three approaches: (1) Ab initio prediction: AUGUSTUS v3.5.024 was run with the parameters --species=zebrafish --gff3=on --softmasking=True --stopCodonExcludedFromCDS=False; (2) Homology-based prediction: genome and GFF files of five representative species (C. idella, Danio rerio, Megalobrama amblycephala, Oreochromis niloticus, Xiphophorus maculatus) were downloaded from the NCBI database. Using these data as references, gene structures in the S. curriculus genome were predicted using GeMoMa v1.925 (tblastn=false); (3) Transcriptome-based prediction: we integrated two kinds of RNA-seq data, Iso-seq and short-read RNA-seq. For short-read RNA-seq, raw reads were filtered using fastp13 (-a auto --adapter_sequence_r2 auto --dedup --dup_calc_accuracy 3). After filtering, 17.88 Gb of clean reads were mapped onto the S. curriculus genome using HISAT2 v2.2.126, and a GTF file was generated using stringtie v2.2.127. For Iso-seq, the BAM file was converted to FASTQ format using the isoseq pipeline28. For the short reads, stringtie v2.2.127 was called to output the GTF file. These two GTF files were combined using TACO29 (--filter-min-expr 0.0). For the latter two approaches, the unmasked genome was used as input. Finally, the gene structures predicted by the three approaches were integrated using EVidenceModeler v1.1.130. Genes shorter than 150 bp were removed from the final dataset. The final output comprised consistent and non-overlapping gene models, which were recorded in the GFF file of the S. curriculus genome.
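The length filter applied after integration can be illustrated with a short Python sketch. It is not the script used in this study; it assumes GFF3 input with nine tab-separated columns, defines gene length as end minus start plus one for features of type gene, and assumes that parent features appear before their children so that transcripts and exons of a removed gene can be dropped as well. The function names are hypothetical.

```python
def parse_attributes(field):
    """Turn a GFF3 attribute string such as 'ID=g1;Name=x' into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def drop_short_genes(gff_lines, min_len=150):
    """Remove genes shorter than `min_len` bp together with their child features.

    Assumes parents precede children in the input (illustrative assumption).
    """
    removed = set()  # IDs of removed genes and of their removed descendants
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)  # keep header and blank lines unchanged
            continue
        cols = line.rstrip("\n").split("\t")
        feature, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        parents = attrs.get("Parent", "").split(",")
        too_short = feature == "gene" and (end - start + 1) < min_len
        if too_short or any(parent in removed for parent in parents):
            if "ID" in attrs:
                removed.add(attrs["ID"])  # so that children are dropped too
            continue
        kept.append(line)
    return kept
```

Under these assumptions, a gene spanning positions 200 to 349 (exactly 150 bp) is retained, while a gene spanning positions 1 to 100 is removed along with its mRNA and exon records.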

To annotate the functions of the predicted genes, protein sequences were extracted from the S. curriculus genome based on the GFF file and searched against six commonly used protein databases (NR, Swiss-Prot, KEGG, KOG, GO, Pfam) using BLASTP v2.2.2631 with an E-value cutoff of 1e−5.
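As an illustration of how an E-value cutoff of 1e−5 can be applied to BLASTP results, the Python sketch below filters hits in the standard 12-column tabular format (BLAST+ output format 6, in which the E-value is the eleventh column) and keeps the hit with the lowest E-value for each query. The output format, the choice to keep a single best hit per query, and the function name best_hits are assumptions made for this example and are not stated in this study.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Keep, for each query, the lowest-E-value hit that passes the cutoff.

    Expects BLAST tabular output with columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > max_evalue:
            continue  # hit does not pass the E-value cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

The returned dictionary maps each query identifier to a tuple of subject identifier and E-value; queries without any hit below the cutoff are absent, which corresponds to genes without an annotation in that database.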

Non-coding RNAs (ncRNAs, i.e., tRNAs, rRNAs, miRNAs and snRNAs) in the S. curriculus genome were also annotated. We first used tRNAscan-SE v1.3.132 to predict tRNAs in the assembly. For the rRNA genes, RNAmmer v1.233 was used (-S euk -m lsu,ssu,tsu -gff). miRNAs and snRNAs were searched with cmscan v1.1.234 against the Rfam v14.10 database35 (--cut_ga --rfam --nohmmonly --tblout --fmt 2). In total, 2,041 miRNAs, 16,426 tRNAs, 5,488 rRNAs and 1,536 snRNAs were annotated in the S. curriculus genome (Table 3).

Table 3 Statistics of non-coding RNAs.

Ab initio prediction using AUGUSTUS v3.5.024 found 26,240 genes in the S. curriculus genome. Homology-based prediction yielded 25,475 to 30,335 genes, depending on the reference genome used. With RNA-seq data as evidence, 33,108 genes were predicted from short-read RNA-seq, while TACO identified 29,567 gene structures based on the combination of Iso-seq and short-read RNA-seq data (Table 4). After integration by EVidenceModeler v1.1.130, 28,329 protein-coding genes were annotated. Functional annotation against the six public databases yielded between 14,239 and 27,137 hits per database for the 28,329 protein sequences. A total of 27,207 genes (96.04%) were annotated in at least one database (Table 5).

Table 4 Statistics of gene prediction.
Table 5 Statistics of gene functional annotation.

Data Records

Raw reads sequenced in this study have been submitted to the National Genomics Data Center (https://ngdc.cncb.ac.cn/, BioProject number: PRJCA029958, GSA: CRA01886436, Run IDs: CRR1288665-CRR1288668). The genome sequences and annotation files were deposited at figshare (https://doi.org/10.6084/m9.figshare.26968774)37 and NCBI (accession number: JBJUSD00000000038).

Technical Validation

To validate the quality of our genome assembly, we mapped the HiFi reads onto the reference genome using Minimap2 v2.22-r110139. The mapping rate was 100%, suggesting high accuracy of the assembly. The chromosome number of the assembly was confirmed by the Hi-C heat map (Fig. 2). The completeness of the assembly was assessed using compleasm v0.2.640 with the actinopterygii_odb10 database (3,640 BUSCOs). In total, 3,626 (99.61%) BUSCOs were identified as complete, of which 3,612 (99.23%) were single-copy and 14 (0.38%) were duplicated. Completeness assessment of the protein sequences showed that 3,401 (93.5%) BUSCOs were complete, of which 3,347 (92.0%) were single-copy and 54 (1.5%) were duplicated (Fig. 3). Together, these results indicate the high quality of the genome assembly and annotation of S. curriculus.

Fig. 3

BUSCO assessment results of protein sequences of Squaliobarbus curriculus genome.