Background & Summary

Xenocypris davidi Bleeker, 1871 is a benthic fish widely distributed in the water systems south of the Yellow River in China, where it is common in rivers, lakes, and reservoirs. It feeds primarily on bottom-dwelling algae and detritus. The fish typically inhabit the mid-to-upper reaches of rivers and migrate upstream to shallow rapids to spawn during the breeding season1. As a benthic scraper, it helps to improve aquatic ecological environments and water quality, which makes it a good candidate for artificial propagation and release as well as for pond polyculture. A study of the diet of X. davidi in a reservoir showed that adult fish consume mainly detritus, supplemented by small amounts of sessile diatoms and oscillatorian cyanobacteria. This diet does not compete with that of other economically important fish in the reservoir, such as silver carp and bighead carp, indicating that X. davidi is well suited to integrated aquaculture2. Tang et al. demonstrated that X. davidi has significant potential for controlling the overgrowth of filamentous algae3. Valued for its delicate texture and savory taste, X. davidi has become a specialty fish favored by consumers. Recent research has also examined the dietary nutritional requirements of X. davidi broodstock and the effects of dietary bee pollen and glutamine on the growth of this species4,5,6. In toxicology, X. davidi has been found to have a low tolerance to copper ions. A hepatic transcriptome database for X. davidi has been constructed to analyze gene expression profiles, providing molecular insights into the species' response to environmental pollutants7.

X. davidi belongs to the order Cypriniformes, family Cyprinidae, subfamily Xenocyprininae, and genus Xenocypris, and evolutionary studies of the species have relied mainly on the complete mitochondrial genome8,9. Population genetic assessments based on microsatellite markers have evaluated the impact of stocking on wild populations of X. davidi in the Qiantang River and indicated that these activities have not posed genetic risks to the wild populations10. Genetic diversity has also been shown to vary among aquaculture populations of X. davidi11. In an analysis of wild populations from different water systems, significant genetic divergence was observed between the populations in Qiandao Lake and the Yangtze River12. Nonetheless, scarce genomic resources limit our understanding of the evolutionary adaptation of X. davidi and of the molecular mechanisms underlying its traits, and thereby restrict its full appreciation and exploitation.

In this study, we constructed a chromosome-level genome assembly for X. davidi, the first for this species to jointly use PacBio long reads, Illumina short reads, and Hi-C sequencing data. The assembly has a total length of 1.05 Gb and a contig N50 of 33.99 Mb, and 95.12% of the genome sequence is anchored to 24 pseudochromosomes. By combining de novo gene prediction, RNA-seq data, and homologous protein evidence, we identified 27,360 protein-coding genes, of which 26,672 were functionally annotated. The genome sequence is a valuable resource for understanding the molecular basis of agronomic traits in X. davidi and for facilitating its genetic improvement.

Methods

Ethics statement

The Experimental Animal Care and Ethics Committee of Xinyang Agriculture and Forestry University approved all fish sampling and experimental procedures.

Sampling and genome survey

A healthy female specimen of X. davidi, weighing 659.36 g, was collected from the Pohe Reservoir in Guangshan County, Xinyang City, Henan Province, China. Following euthanasia with eugenol at an anesthetic concentration of 16 mg L^-313, the fish was rapidly rinsed three times with sterile physiological saline and dabbed dry with sterile cotton. Muscle, heart, spleen, liver, and kidney tissues were promptly dissected, immediately immersed in liquid nitrogen, and then stored at -80 °C until DNA extraction. High-quality genomic DNA was extracted from the muscle with the PureLink™ Genomic DNA Mini Kit (K182001, Thermo Fisher Scientific, USA), and RNA was isolated from the different tissues with TRIzol™ Reagent (15596026CN, Thermo Fisher Scientific, USA). The quality and concentration of the DNA were evaluated by 1% agarose gel electrophoresis and with a Qubit 2.0 Fluorometer (Invitrogen, Thermo Fisher Scientific, USA). The purity and integrity of the RNA were assessed using a NanoDrop™ One spectrophotometer (Thermo Fisher Scientific, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, Inc., USA).

For the genome survey, 10 μg of DNA was used to construct a 350 bp library, which was sequenced with a paired-end 150 bp (PE150) strategy on the Illumina NovaSeq 6000 platform. A total of 50.51 Gb of raw data, comprising 168,353,928 read pairs, was generated. Jellyfish14 (version 2.2.7) was used to build the 17-mer depth-frequency distribution and estimate the genome size, and a survey assembly was produced with SOAPdenovo215. The 17-mer analysis estimated the genome size at 1,020.95 Mbp, which was refined to 1,004.32 Mbp, with a heterozygosity rate of 0.52% and a repetitive sequence content of 53.91% (Table 1). Assembly with a k-mer size of 41 yielded a contig N50 of 2,166 bp (total length 876,892,788 bp) and a scaffold N50 of 3,222 bp (total length 899,312,128 bp) (Table 2).

Table 1 Statistics of genomic characteristics from the 17-mer analysis.
Table 2 Statistical summary of the genome survey assembly results for X. davidi.
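The k-mer based size estimate described above rests on a simple ratio: the total number of k-mers divided by the depth of the main peak of the k-mer frequency distribution. The following is a minimal sketch of that idea only, not the procedure implemented by Jellyfish or the survey pipeline. The function name, the error cutoff `min_depth`, and the toy histogram values are all assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing-error k-mers
    (an assumption of this sketch) and ignored.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    # Depth with the most distinct k-mers: taken here as the main peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram with invented counts, peaking at depth 40.
toy = {1: 5000, 2: 800, 38: 900, 39: 1500, 40: 2000, 41: 1400, 42: 850, 80: 300}
size, peak = estimate_genome_size(toy)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

In practice the input would be the depth histogram exported by the k-mer counter, and the refinement step mentioned in the text (correcting for error k-mers and heterozygosity) would adjust this raw ratio.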

PacBio and Hi-C based whole-genome sequencing

For PacBio sequencing, high-quality DNA (main band larger than 30 kb) was sheared into fragments of 15 to 18 kb with a Covaris ultrasonic disruptor. The large DNA fragments were then enriched and purified with magnetic beads and subjected to damage repair and end repair. Hairpin sequencing adapters were ligated to both ends of the DNA fragments, and exonucleases were used to remove fragments that had not been ligated successfully. The library was sequenced on the PacBio Sequel IIe platform in CCS mode. After the polymerase reads were filtered, raw subreads were obtained and processed with SMRT Link (version 11.0, parameters: filter_min_qv = 20) to generate HiFi reads. In total, 27.86 Gb of HiFi reads were obtained, comprising 1,908,298 reads with a mean length of 14,601 bp (Table 3 and Table 4).

Table 3 Statistics of the different types of sequencing reads.
Table 4 Statistics of the HiFi sequencing data.
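The HiFi figures quoted above are linked by simple arithmetic: total bases equal the number of reads multiplied by the mean read length. The sketch below, with a hypothetical helper `read_summary`, shows how such a summary could be computed from a list of read lengths, and then checks that the three reported values agree with one another. The toy read lengths are invented.

```python
def read_summary(lengths):
    """Summarise a collection of read lengths (in bp)."""
    count = len(lengths)
    total = sum(lengths)
    return {"reads": count, "bases": total, "mean_length": total / count}


# Toy read lengths, invented for illustration.
toy_summary = read_summary([12_000, 15_000, 18_000])
print(toy_summary)

# Consistency check on the values reported in the text.
reported_reads = 1_908_298
reported_mean_bp = 14_601
implied_gb = reported_reads * reported_mean_bp / 1e9
print(f"implied yield: {implied_gb:.2f} Gb")
```

The implied yield matches the reported 27.86 Gb to two decimal places.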

For Hi-C sequencing, muscle tissue was cross-linked with paraformaldehyde to fix the intracellular DNA conformation. After cell lysis, the cross-linked DNA was digested with the restriction enzyme MboI to create sticky ends. The DNA termini were then biotinylated, and DNA ligase was used to join the DNA fragments. Proteases were subsequently applied to reverse the cross-linking. The purified DNA was randomly sheared into fragments of 300 to 500 bp and used to construct a Hi-C library16,17, which was sequenced on the Illumina NovaSeq 6000 platform with a paired-end 150 bp (PE150) strategy. The raw sequence data were processed with HiCUP18 (v 0.8.3), which includes hicup_truncater for identifying and mapping chimeric sequences, hicup_filter for filtering mapped reads, and hicup_deduplicator for removing duplicate contacts. After this quality control, 5,564,221 total read pairs were obtained, of which 2,754,693 (49.51%) were mapped as pairs, 2,334,774 were valid pairs, and 2,237,596 were unique di-tags (Table 5).

Table 5 Statistical analysis of sequencing data from Hi-C.
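The Hi-C quality control forms a funnel in which each stage keeps a subset of the pairs from the stage before. The short sketch below recomputes the reported mapping percentage from the counts in the text and also derives the retention at the later stages. Those two later ratios are computed here for illustration and are not values reported by the authors.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts as reported in the text (Table 5).
hic = {
    "total_pairs": 5_564_221,
    "mapped_pairs": 2_754_693,
    "valid_pairs": 2_334_774,
    "unique_ditags": 2_237_596,
}

mapped_pct = percent(hic["mapped_pairs"], hic["total_pairs"])
# Derived ratios (not reported in the text): retention at each later stage.
valid_of_mapped = percent(hic["valid_pairs"], hic["mapped_pairs"])
unique_of_valid = percent(hic["unique_ditags"], hic["valid_pairs"])

print(f"mapped pairs:   {mapped_pct:.2f}% of total pairs")
print(f"valid pairs:    {valid_of_mapped:.2f}% of mapped pairs")
print(f"unique di-tags: {unique_of_valid:.2f}% of valid pairs")
```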

Genome assembly

The X. davidi genome was assembled with Hifiasm19 (v 0.16.1-r375) using default settings. Hifiasm starts from uncollapsed genomic data and thereby retains as much haplotype information as possible. Given the HiFi long reads, it produced a monoploid assembly and a pair of haplotype-resolved contig graphs. The assembly comprised 134 contigs with a combined length of 1.05 Gb, slightly larger than the size estimated in the genome survey. The average contig length, maximum contig length, and contig N50 were 7.8, 57.52, and 33.99 Mb, respectively (Table 6).

Table 6 Statistics for assembly at the contig level.
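N50, the contiguity statistic used throughout this section, is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch of the calculation follows; the function name and the toy contig lengths are illustrative and do not come from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one length")


# Toy contig lengths (arbitrary units), invented for illustration.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

For the toy input the total is 54, so the running sum reaches the halfway mark of 27 at the third-longest contig, whose length is 8.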

Hi-C assembly and chromosome anchoring

Hi-C data were used to assign contigs or scaffolds to chromosomes, based on the higher probability of intra-chromosomal than inter-chromosomal interactions. They were also used to order and orient contigs or scaffolds within each chromosome, because the interaction probability decreases with increasing distance along the chromosome. After the Hi-C corrected contigs were passed through the ALLHiC pipeline20 for pruning, partitioning, rescue, optimization, and building, 95.12% of the assembled sequences were anchored to 24 pseudochromosomes21 (Fig. 1), with chromosome lengths ranging from 31.29 Mb to 60.28 Mb (Table 7). This result agrees with the karyotype derived from cytological observations22 and is consistent with the chromosome number of 2n = 48 observed in several Xenocyprididae fish species, such as Plagiognathops microlepis23, Chanodichthys erythropterus24, Megalobrama amblycephala25, and Ctenopharyngodon idella26. We also manually refined the Hi-C scaffolding based on the chromatin contact matrix in Juicebox27 (Fig. 2). The 24 pseudochromosomes can be clearly distinguished on the heatmap, and the strong signal intensity along the diagonal indicates a high-quality assembly. After Hi-C correction, the final assembled genome had a total span of 1.05 Gb and a scaffold N50 of 40.13 Mb (Table 8).

Fig. 1

A Circos plot of the 24 chromosomes of X. davidi. The tracks from outside to inside are: the 24 chromosomes, the distribution of transposable elements, and a bar plot of gene density.

Table 7 Statistics of the 24 anchored chromosomes of X. davidi genome.
Fig. 2

Hi-C chromatin interaction heatmap of the X. davidi assembly.

Table 8 Assembly statistics for Hi-C.
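The principle behind Hi-C based chromosome assignment, that contigs from the same chromosome share far more contacts than contigs from different chromosomes, can be illustrated with a toy grouping step. The sketch below is not the ALLHiC algorithm. It is a deliberately simplified stand-in that normalises link counts by the product of the contig lengths and joins contigs whose link density exceeds a threshold. The function name, the threshold, and all toy values are assumptions.

```python
from collections import defaultdict


def group_contigs(lengths, links, min_density):
    """Group contigs connected by dense Hi-C links.

    lengths: {contig: length}; links: {(contig_a, contig_b): read_pairs}.
    Link density is read_pairs / (length_a * length_b).
    """
    parent = {contig: contig for contig in lengths}

    def find(contig):
        # Union-find lookup with path halving.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    for (a, b), pairs in links.items():
        density = pairs / (lengths[a] * lengths[b])
        if density >= min_density:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for contig in lengths:
        groups[find(contig)].add(contig)
    return sorted(sorted(group) for group in groups.values())


# Toy data (lengths in Mb, invented link counts).
toy_lengths = {"c1": 10, "c2": 8, "c3": 12, "c4": 6}
toy_links = {("c1", "c2"): 4000, ("c3", "c4"): 3600,
             ("c1", "c3"): 60, ("c2", "c4"): 24}
print(group_contigs(toy_lengths, toy_links, min_density=5))
```

With these toy values the dense links (density 50) join c1 with c2 and c3 with c4, while the sparse links (density 0.5) are ignored, giving two groups. Ordering and orienting contigs within a group would additionally use the decay of contact frequency with distance, which this sketch does not attempt.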

Genome annotation

Repetitive sequences in the X. davidi genome were annotated with two complementary approaches: homology-based alignment28 and de novo prediction29. The homology-based approach relied on the RepBase database30 and used RepeatMasker31 (version 4.1.0, parameters: -a -nolow -no_is -norna -parallel 4) and RepeatProteinMask (version 4.1.0, parameters: -noLowSimple -pvalue 0.0001 -engine ncbi) to identify sequences similar to known repetitive elements. For de novo prediction, a repeat library was first built with LTR_FINDER32 (version 1.06, parameters: -C -w 2), RepeatScout (version 1.0.5), and RepeatModeler33 (version 2.0.1, parameters: -engine ncbi -pa 15), and RepeatMasker was then run against this library. In addition, TRF34 (version 4.09, parameters: 2 7 7 80 10 50 2000 -d -h -ngs) was used to detect tandem repeats in the X. davidi genome. Finally, all predictions were merged and redundancy was removed, which identified 533.34 Mb of repetitive sequences, or 50.88% of the assembled genome. The predominant class was long terminal repeats (LTR), which accounted for 45.59% (477.87 Mb) of the assembled genome (Table 9), a marked departure from the genome of the fine-scaled gudgeon, in which DNA transposons were the most abundant class at 31.55%23. Long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), and DNA elements constituted 1.81%, 0.02%, and 4.56% of the genome, respectively (Table 9).

Table 9 Summary of the transposable elements in X. davidi genome.
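When repeat predictions from several tools are combined, overlapping intervals must be merged so that no base is counted twice in the final repeat content. The sketch below shows one common way to do this for a single sequence. It is an illustration of the consolidation step, not the authors' script, and the function name, the half-open coordinate convention, and the toy intervals are assumptions.

```python
def merged_length(intervals):
    """Total length covered by half-open intervals [start, end) after merging overlaps."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Toy predictions from two hypothetical tools on a 1,000 bp sequence.
tool_a = [(0, 100), (300, 400)]
tool_b = [(50, 150), (390, 420)]
covered = merged_length(tool_a + tool_b)
print(f"{covered} bp repetitive, {100 * covered / 1000:.1f}% of the sequence")
```

Simply summing the four toy intervals would give 330 bp, whereas merging gives 270 bp, which is why the consolidated total is smaller than the sum of the per-tool results.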

Gene structures in the X. davidi genome were predicted with three approaches: de novo prediction, homology-based prediction, and transcriptome-based annotation35. De novo predictions were generated with Augustus36 (version 3.2.3, parameters: --species=pasa1 --uniqueGeneId=TRUE --noInFrameStop=TRUE --GFF3=on --genemodel=complete --strand=both), GlimmerHMM37 (version 3.0.4, parameters: -d pasa1 -f -g), SNAP (version 2013.11.29, parameters: -gff pasa1.hmm), Geneid (version 1.4, parameters: -P homo_sapiens.param -v -G -p geneid), and Genscan (version 1.0, parameters: HumanIso.smat). For homology-based prediction, protein sequences from Carassius auratus38 (GenBank: GCA_003368295.1), Cyprinus carpio39 (GenBank: GCA_000951615.2), Ctenopharyngodon idellus26 (GenBank: GCA_019924925.1), Danio rerio40 (GenBank: GCA_000002035.4), Onychostoma macrolepis41 (GenBank: GCA_012432095.1), M. amblycephala25 (GenBank: GCA_018812025.1), and Opsariichthys bidens42 (GenBank: GCA_037194245.1) were downloaded from the NCBI database and aligned to the genome with Blastall43 (version 2.2.26, parameters: -e 1e-05 -F T -m 8), Solar (version 0.9.6, parameters: -a prot2genome2 -z -f m8), and Genewise44 (version 2.4.1, parameters: -tfor -genesf -gff -sum) to predict gene structures (Fig. 3). For transcriptome-based annotation, high-quality RNA from muscle, heart, spleen, liver, and kidney was used to construct RNA-seq libraries, which were sequenced on the Illumina NovaSeq 6000 platform to produce 150 bp paired-end reads. Sequencing generated 37.73 Gb of raw data, which was filtered to 34.03 Gb of clean data (Table 3). De novo assembly was then performed with Trinity (version 2.1.1, parameters: --normalize_reads --full_cleanup --min_glue 2 --min_kmer_cov 2 --KMER_SIZE 25), read alignment with HISAT2 (version 2.0.4), and transcript assembly with StringTie (version 1.3.3). The gene sets predicted by the three approaches were combined into a non-redundant gene set with EVidenceModeler (EVM)45 (version 1.1.1, parameters: --segmentSize 200000 --overlapSize 20000 --min_intron_length 20). Finally, PASA (http://pasa.sourceforge.net/) was used together with the transcriptome assemblies to refine the EVM annotation by adding UTR and alternative splicing information, yielding the final gene set. In total, 27,360 genes were predicted for X. davidi, with an average transcript length of 10,053.32 bp, an average coding region length of 1,121.54 bp, an average of 6.28 exons per gene, an average exon length of 178.57 bp, and an average intron length of 1,691.46 bp (Table 10, Fig. 4A).
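Gene structure statistics of the kind summarised above can be derived directly from the exon coordinates of each gene model. The following sketch shows the calculation on two invented gene models. The function name and the 1-based inclusive coordinate convention are assumptions, and the sketch does not reproduce the authors' annotation scripts.

```python
def gene_structure_stats(genes):
    """Compute exon and intron statistics.

    genes: {gene_id: [(start, end), ...]} with 1-based inclusive exon coordinates.
    """
    exon_lengths, intron_lengths = [], []
    for exons in genes.values():
        exons = sorted(exons)
        exon_lengths += [end - start + 1 for start, end in exons]
        # An intron spans the gap between two consecutive exons.
        intron_lengths += [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]
    mean_intron = sum(intron_lengths) / len(intron_lengths) if intron_lengths else 0.0
    return {
        "exons_per_gene": len(exon_lengths) / len(genes),
        "mean_exon_length": sum(exon_lengths) / len(exon_lengths),
        "mean_intron_length": mean_intron,
    }


# Two toy gene models with invented coordinates.
toy_genes = {
    "g1": [(1, 100), (301, 400), (901, 1000)],
    "g2": [(1, 200), (1201, 1400)],
}
stats = gene_structure_stats(toy_genes)
print(stats)
```

In a real analysis the exon coordinates would be read from the annotation file (for example a GFF3 file) rather than typed in by hand.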

Fig. 3

Comparisons of the genomic elements of closely related species.

Table 10 Statistics of the gene structure annotation of the X. davidi genome.
Fig. 4

Gene prediction and functional annotation of the X. davidi genome. (A) Venn diagram of the gene set prediction. (B) Venn diagram of functional annotation based on different databases.

For functional annotation, the protein-coding genes were aligned against the SwissProt48, NCBI Non-redundant protein (NR) (https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/), KEGG49, InterPro50, Gene Ontology (GO)51, and Pfam52 protein databases using Blastp46 (version 2.2.26, parameters: -max_target_seqs 1 -evalue 1e-4) and Diamond47 (version 0.8.22, parameters: --more-sensitive -k 10 -e 1e-5 -f 6 qseqid qlen qstart qend sseqid slen sstart send pident ppos qcovhsp bitscore evalue --salltitles --threads 10). Protein domains and motifs were identified with InterProScan53 (version 5.35-74.0, parameters: -cpu 20 -format tsv -appl ProDom, SMART, ProSiteProfiles, PRINTS, Pfam, Panther -iprlookup -dp -goterms). Ultimately, 26,672 (97.50%) of the 27,360 predicted genes were annotated in at least one of the databases (Table 11). Among the functionally annotated proteins, 20,599 (75.29%) were supported by annotations from all four databases (Fig. 4B).

Table 11 Non-coding RNA statistical results of X. davidi.
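The two summary figures for functional annotation, genes annotated in at least one database and genes annotated in all databases, correspond to the union and the intersection of the per-database hit sets, which is also what a Venn diagram displays. The sketch below illustrates this with invented gene identifiers. The function name is hypothetical, and the database names are used only as labels for the toy data.

```python
def annotation_summary(hits, all_genes):
    """Summarise annotation coverage.

    hits: {database: set of annotated gene ids}; all_genes: set of all gene ids.
    """
    annotated_any = set().union(*hits.values())      # in at least one database
    annotated_all = set.intersection(*hits.values())  # in every database
    total = len(all_genes)
    return {
        "any": len(annotated_any),
        "any_pct": 100.0 * len(annotated_any) / total,
        "all": len(annotated_all),
        "all_pct": 100.0 * len(annotated_all) / total,
    }


# Toy data with invented gene ids.
toy_all = {"g1", "g2", "g3", "g4", "g5"}
toy_hits = {
    "SwissProt": {"g1", "g2", "g3"},
    "KEGG": {"g1", "g3"},
    "Pfam": {"g1", "g3", "g4"},
}
summary = annotation_summary(toy_hits, toy_all)
print(summary)
```

In the toy example four of the five genes have a hit in at least one database and two have hits in all three.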

Non-coding RNA annotation covered tRNA, rRNA, miRNA, and snRNA. tRNA sequences were identified from their structural characteristics with tRNAscan-SE54 (version 1.4). Because rRNA is highly conserved, rRNA sequences from closely related species were used as references, and rRNA in the genome was identified by BLAST alignment (v 2.2.26, parameters: -e 1e-10 -v 10000 -b 10000). MicroRNA (miRNA) and small nuclear RNA (snRNA) sequences were predicted in the X. davidi genome with INFERNAL, which is integrated in the Rfam55 (v 14.1) suite, using the Rfam covariance models. In total, four types of non-coding RNA were identified in the X. davidi genome: 2,009 miRNAs, 11,325 tRNAs, 13,213 rRNAs, and 2,412 snRNAs (Table 12).

Table 12 Statistical analysis of the functional gene annotations of the X. davidi genome.

Data Records

The raw RNA-seq, Hi-C, PacBio, and Illumina short-read data reported in this paper have been deposited in the Genome Sequence Archive in National Genomics Data Center, China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA020814, CRA020817, CRA020818, CRA020819, CRA020820, CRA020821, CRA020822, CRA020823)56,57,58,59,60,61,62,63. The whole genome sequence data reported in this paper have been deposited in GenBank (JBLRZY000000000)64 and the figshare database65. The genome annotation files have been uploaded and shared publicly in the figshare database66.

Technical Validation

Evaluation of the genome assembly and annotation

The quality of the genome assembly and annotation was evaluated in four ways. Assembly completeness was assessed with BUSCO67 (version 5.4.3) using the actinopterygii_odb10 database, which gave a complete BUSCO score of 98.3% (Table 11), indicating a highly complete assembly. Assembly consistency was assessed by aligning the Illumina short reads to the assembled genome with BWA68 (version 0.7.17), which gave a mapping rate of 99.36% and a coverage of 99.96% (Table 13). The consensus quality value (QV), a measure of per-base consensus accuracy, was calculated with Merqury69 (version 1.4.1) to be 50.458. Finally, the length distributions of genes, coding sequences (CDSs), introns, and exons were compared with those in the genomes of C. idellus, O. bidens, O. macrolepis, M. amblycephala, C. auratus, C. carpio, and D. rerio and were found to be similar (Fig. 3), which supports the reliability of the genome annotation. Together, these four analyses demonstrate the high accuracy and completeness of the final genome assembly.

Table 13 Completeness and accuracy evaluation of the genome.
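The consensus QV reported above is on the Phred scale, so it can be converted to an estimated per-base error probability with QV = -10 log10(p). The small sketch below performs the conversion in both directions. The function names are illustrative and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


error_rate = qv_to_error_rate(50.458)
print(f"per-base error probability: {error_rate:.2e}")
print(f"about one error per {1 / error_rate:,.0f} bases")
```

A QV of 50.458 corresponds to an error probability of roughly 9 × 10^-6, that is, on the order of one consensus error per hundred thousand bases.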