Background & Summary

The subject of this study, Spinibarbus caldwelli, belongs to the family Cyprinidae and inhabits several river basins across China, including the Yangtze, Qiantang, Min, Pearl and Jiulong rivers1. Initially described from specimens collected in Fujian Province2, the species has been the subject of taxonomic debate. A recent investigation employing both morphological and molecular techniques affirmed the status of S. caldwelli as a distinct species3, rather than a junior synonym of Spinibarbus hollandi as previously thought4,5. S. caldwelli (Fig. 1) grows fast, has a mixed diet, suffers from few diseases, and has delicate, flavorful flesh, making it an economically important freshwater fish. However, due to uncontrolled fishing and river pollution, the wild resources of this fish in China are now significantly reduced compared with the early 1970s6. To safeguard the dwindling S. caldwelli fishery, the Chinese government has carried out stock enhancement activities since 2000, following advances in artificial culture and breeding technologies7. These release programs have nevertheless raised concerns about the genetic risks that released individuals may pose to wild populations. Accurate assessment of that risk requires high-quality genomic data8. Several high-quality chromosome-level reference genomes have been assembled for Cyprinidae, including those of Cyprinus carpio9, Danio rerio10, and Ctenopharyngodon idellus11. However, no genome has yet been reported for the genus Spinibarbus. A chromosome-level genome assembly for S. caldwelli would therefore fill this gap.

Fig. 1 The map of Spinibarbus caldwelli.

Third-generation sequencing has markedly improved the contiguity and accuracy of genome assemblies. Combining single-molecule real-time (SMRT) sequencing with chromatin conformation capture (Hi-C) technology has made high-fidelity chromosome-level genomes attainable12,13,14. In this study, we used PacBio circular consensus sequencing (CCS) long-read data and Hi-C technology to obtain a high-quality, chromosome-level assembly of the S. caldwelli genome. This reference genome will help to elucidate genome structure and function, laying a foundation for the management and conservation of this important species.

Methods

Ethics statement

All experiments were authorised by Xiamen University College of Ocean and Earth Sciences and the University’s Animal Welfare Ethical Review Body, under ethics approval permit XMULAC20220222.

Sample collection, library construction and sequencing

Cultured individuals of S. caldwelli from a farm in Guangzhou, Guangdong Province, China, were selected for genome sequencing. High-molecular-weight DNA was isolated from fresh muscle tissue using a standard SDS extraction method. For PacBio sequencing, high-quality DNA was used to construct SMRTbell libraries according to PacBio’s standard protocol (Pacific Biosciences, CA, USA) using 20 kb preparation solutions. SMRTbell library construction involved DNA shearing, end repair, and ligation of the DNA fragments with hairpin adapters to create circular templates for SMRT sequencing. The constructed 20-kb libraries were then sequenced on the PacBio Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA). In total, 35.80 Gb of sequence data were generated with an N50 read length of 17,459 bp (Table 1).

Table 1 Statistics of different types of sequencing reads.

For Hi-C library preparation, chromatin was fixed in the nucleus with formaldehyde. Fixed chromatin was digested with the DpnII restriction endonuclease, the 5′ overhangs were repaired with biotinylated nucleotides, and the free blunt ends were ligated. Following ligation, cross-links were reversed and the DNA was purified from proteins. The purified DNA was then treated to remove any biotin that was not internal to the ligated fragments. After that, the DNA was sheared to an insert size of approximately 350 bp, and a paired-end sequencing library was constructed following standard Hi-C library preparation protocols15. The library was sequenced on the DNBSEQ-T7 platform to capture spatial interactions between chromosomal regions. In total, 184.79 Gb of Hi-C read data were obtained and used for genome assembly, with an average sequencing coverage of 104.22× (Table 1). The quality of the Hi-C sequencing was assessed using HiCUP16. The results indicated an effective rate of 29.33% (Unique di-Tags/Total Paired (mapped) = 2,288,078/7,801,445), with 88.9% of read pairs deemed valid (Valid pairs/Total Reads Processed = 2,563,406/2,882,894) (Table 2).

Table 2 Statistics of Hi-C sequencing.
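The Hi-C quality figures quoted above are simple ratios, so they can be re-derived from the reported counts. The following is a minimal sketch rather than part of the original pipeline; the function name `ratio_percent` is hypothetical, the denominator printed as "2,882.894" is read as 2,882,894, and the assembly size is the rounded 1.77 Gb, so the depth only approximates the reported 104.22×.

```python
def ratio_percent(numerator: int, denominator: int) -> float:
    """Return numerator / denominator expressed as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * numerator / denominator


# Counts quoted in the text (HiCUP summary, Table 2).
unique_di_tags = 2_288_078
total_paired_mapped = 7_801_445
valid_pairs = 2_563_406
total_reads_processed = 2_882_894  # assumption: "2,882.894" is a typo

effective_rate = ratio_percent(unique_di_tags, total_paired_mapped)
valid_rate = ratio_percent(valid_pairs, total_reads_processed)

# Sequencing depth = total sequenced bases / assembly size.
hic_data_gb = 184.79
assembly_size_gb = 1.77  # rounded value from the text
depth = hic_data_gb / assembly_size_gb

print(f"effective rate: {effective_rate:.2f}%")
print(f"valid pairs:    {valid_rate:.1f}%")
print(f"Hi-C depth:     {depth:.1f}x (approximate, rounded assembly size)")
```

The small difference between the recomputed depth and the reported 104.22× is expected, because the assembly size is given to only three significant figures.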

RNA was collected from seven tissues of S. caldwelli: brain, liver, muscle, spleen, heart, gill, and kidney. Total RNA was extracted using TRIzol extraction reagent (Invitrogen, USA) according to the manufacturer’s protocol. Library construction and transcriptome sequencing were performed on the DNBSEQ-T7 platform in accordance with the manufacturer’s protocols. A total of 76.30 Gb of data (about 10.9 Gb per tissue) were generated for transcript and genome annotation.

De novo assembly

PacBio single-molecule long reads from SMRT sequencing underwent quality control using the ccs software (https://github.com/PacificBiosciences/ccs) with the parameter min-rq = 0.99. The resulting HiFi reads were used for genome assembly with hifiasm v0.16.1-r37517. The assembled contigs were then combined with the Hi-C data for chromosome clustering, orientation, and ordering with ALLHiC18, with parameters set as enz = DpnII and CLUSTER = n, to achieve near-chromosome-level resolution. Subsequently, Juicebox software (version 1.11.0819) was employed for manual correction based on chromosome interaction strength, ultimately resulting in a chromosome-level genome. The final assembly comprised a total of 330 scaffolds (Table 3). This first chromosome-level genome assembly of S. caldwelli is about 1.77 Gb, with scaffold and contig N50 sizes of 33.91 Mb and 11.83 Mb, respectively (Table 3). Of the contig sequences (682 contigs in total), 1.72 Gb (97.01%) were anchored onto fifty chromosomes (Table 4). Moreover, the pseudo-chromosome construction was evaluated using the Hi-C interaction heatmap. The 50 scaffolds are clearly distinguishable in the heatmap, and the interaction signal around the diagonal is strong (Fig. 2), indicating a high-quality pseudo-chromosome assembly.

Table 3 Assembly statistics of Spinibarbus caldwelli.
Table 4 Summary of assembled 50 chromosomes of Spinibarbus caldwelli.
Fig. 2 Genome-wide Hi-C heatmap of Spinibarbus caldwelli.
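The scaffold and contig N50 values reported for the assembly follow the usual definition: the length L such that sequences of length at least L together contain at least half of all assembled bases. A minimal sketch of that calculation is given below; the function name `n50` and the toy lengths are illustrative assumptions, not the S. caldwelli data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches half")


# Toy example (hypothetical lengths, total = 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

Applied to the lengths in a real assembly (for example, read from a FASTA index), the same function would give the contig or scaffold N50 depending on which sequences are supplied.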

Repetitive sequence annotation

A combined strategy based on homology alignment and de novo search was used to identify repeats across the whole genome (Fig. 3). For homology-based prediction, the Repbase database20 was searched using RepeatMasker (open-4.1.4)21 and its in-house script RepeatProteinMask with default parameters to extract repeat regions. For ab initio prediction, a de novo repetitive element database was built with RepeatModeler version open-2.0.422 using default parameters; all repeat sequences longer than 100 bp and containing less than 5% gap ‘N’ bases constituted the raw transposable element (TE) library. A custom library, combining Repbase and the de novo TE library and processed with UCLUST23 to remove redundancy, was then supplied to RepeatMasker for DNA-level repeat identification. In total, about 876.02 Mb of repeat sequences were identified, accounting for 49.41% of the S. caldwelli genome (Table 5).

Fig. 3 Diagrammatic sketch of the annotation pipeline.

Table 5 Classification of repeat elements in Spinibarbus caldwelli.
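The filter applied to the de novo repeat sequences (length greater than 100 bp, gap ‘N’ content below 5%) can be expressed in a few lines. This is a sketch under stated assumptions: the function name `keep_repeat` and the toy sequences are hypothetical, and the actual pipeline scripts are not described in the text.

```python
def keep_repeat(seq: str, min_len: int = 100, max_n_frac: float = 0.05) -> bool:
    """True if seq is longer than min_len and its 'N' fraction is below max_n_frac."""
    if len(seq) <= min_len:
        return False
    n_frac = seq.upper().count("N") / len(seq)
    return n_frac < max_n_frac


# Hypothetical candidate repeats, for illustration only.
candidates = {
    "short": "ACGT" * 20,             # 80 bp: fails the length threshold
    "gappy": "ACGT" * 25 + "N" * 20,  # 120 bp, about 16.7% N: fails the gap threshold
    "clean": "ACGT" * 30,             # 120 bp, no N: passes both thresholds
}

raw_te_library = {name: seq for name, seq in candidates.items() if keep_repeat(seq)}
print(sorted(raw_te_library))
```

Both thresholds are strict inequalities here, matching the wording "lengths >100 bp" and "less than 5%"; a sequence of exactly 100 bp is therefore excluded.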

Annotation of gene structure

Gene models were annotated by combining ab initio prediction, homology-based prediction, and RNA-Seq-assisted prediction. Sequences of homologous proteins were downloaded from Ensembl/NCBI/others. Protein sequences were aligned to the genome assembly using TblastN v2.2.2624 (E-value ≤ 1e−5), and the matching proteins were then aligned to the homologous genome sequences from C. carpio, D. rerio and C. idellus with GeneWise v2.4.125 to obtain accurate spliced alignments and predict the gene structure contained in each protein region. For ab initio gene prediction, Augustus v3.526 and SNAP v2013.11.2927 were used in our automated gene prediction pipeline. Transcriptome assemblies were generated with Trinity v2.8.528 for genome annotation. To optimize the annotation, RNA-Seq reads from the different tissues were aligned to the genome FASTA using HISAT2 v2.2.129 with default parameters to identify exon regions and splice positions. The alignments were then used as input for StringTie v2.2.130 with default parameters for genome-guided transcript assembly. The non-redundant reference gene set was generated by merging the genes predicted by the three methods with EVidenceModeler (EVM v1.1.131), using PASA v2.3.3631 (Program to Assemble Spliced Alignments) terminal exon support and including masked transposable elements as input to gene prediction. To obtain information on UTRs and alternative splicing, we used PASA to update the gene models31. The final gene set of the S. caldwelli genome comprises 49,377 protein-coding genes, with an average gene length of 15,627.88 bp and an average CDS length of 1,574.95 bp (Table 6). Each gene has on average 9.18 exons, with an average exon length of 171.58 bp and an average intron length of 1,718.14 bp. The statistics of gene models, including CDS, intron, and exon, in S. caldwelli were comparable to those of other species (Table 6 and Fig. 4).

Table 6 The statistics of gene models of protein-coding genes annotated in Spinibarbus caldwelli.
Fig. 4 The composition of gene elements in the Spinibarbus caldwelli genome compared with other species. (a) CDS length distribution and comparison with other species. (b) Exon length distribution and comparison with other species. (c) Exon number distribution and comparison with other species. (d) Gene length distribution and comparison with other species. (e) Intron length distribution and comparison with other species.
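The gene-model averages reported in the text can be cross-checked against one another. The sketch below assumes that the annotated exons are essentially coding (so exons per gene times mean exon length approximates the mean CDS length) and that a gene with k exons has k − 1 introns; the variable names are hypothetical, and the check is a sanity test rather than part of the annotation pipeline.

```python
# Averages quoted in the text (Table 6).
mean_gene_len = 15_627.88
mean_cds_len = 1_574.95
mean_exons = 9.18
mean_exon_len = 171.58
mean_intron_len = 1_718.14

# Coding length per gene implied by exon count and exon length.
cds_from_exons = mean_exons * mean_exon_len

# Gene span implied by exons plus (exons - 1) introns.
gene_from_parts = cds_from_exons + (mean_exons - 1) * mean_intron_len

print(f"CDS implied by exons:  {cds_from_exons:.2f} bp (reported {mean_cds_len})")
print(f"gene implied by parts: {gene_from_parts:.2f} bp (reported {mean_gene_len})")
```

Both implied values fall within a few base pairs of the reported averages, which is the agreement expected when the inputs are rounded to two decimal places.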

We also predicted the gene structures of tRNAs, rRNAs and other non-coding RNAs (Table 7). A total of 10,804 tRNAs were predicted using tRNAscan-SE v1.432. Because rRNAs are highly conserved, we used rRNA sequences from related species as references and predicted 3,858 rRNA genes using BLAST33 with default parameters. Other ncRNAs, including miRNAs and snRNAs, were identified by searching against the Rfam34 database with default parameters using the Infernal v1.1.235 software.

Table 7 Classification of ncRNAs in Spinibarbus caldwelli genome.

Functional annotations

Gene functions in the S. caldwelli genome were assigned by comparison with public databases, including InterPro36, Swiss-Prot37, the NCBI non-redundant protein database (NR), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database38. Motifs and domains were annotated using InterProScan v5.3939 by searching against the InterPro database. Gene Ontology (GO40) IDs for each gene were assigned according to the corresponding InterPro entry. The Swiss-Prot database and KEGG pathways were mainly used to map the constructed gene set and identify the best match for each gene. As a result, 47,724 genes were functionally annotated, accounting for 96.7% of all predicted genes (Table 8 and Fig. 5).

Table 8 The number of genes with homology or functional classification for Spinibarbus caldwelli.
Fig. 5 Venn diagram of the number of genes with functional annotation using multiple public databases.
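The overall annotation rate counts a gene as functionally annotated if it has a hit in at least one database, which is the union of the sets shown in the Venn diagram. A minimal sketch follows; the function name `annotated_fraction`, the gene identifiers, and the toy hit sets are hypothetical, and only the final two counts come from the text.

```python
def annotated_fraction(hits_by_db: dict[str, set[str]], all_genes: set[str]) -> float:
    """Fraction of genes with a hit in at least one database."""
    if not all_genes:
        raise ValueError("all_genes must not be empty")
    annotated = set().union(*hits_by_db.values()) & all_genes
    return len(annotated) / len(all_genes)


# Toy example with made-up gene IDs.
toy_genes = {"g1", "g2", "g3", "g4", "g5"}
toy_hits = {
    "InterPro": {"g1", "g2"},
    "Swiss-Prot": {"g2", "g3"},
    "KEGG": {"g3"},
    "NR": {"g1", "g4"},
}
print(annotated_fraction(toy_hits, toy_genes))  # g5 has no hit

# Counts reported in the text.
reported_rate = 100 * 47_724 / 49_377
print(f"{reported_rate:.1f}%")
```

Because the rate is based on the union, it can exceed the rate of any single database while remaining below the sum of the individual rates.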

Data Records

The raw sequence data of S. caldwelli, including the PacBio long-read data, Hi-C data and RNA-seq data, have been deposited in the Genome Sequence Archive (GSA41) at the National Genomics Data Center42 under the accession CRA01577743. The raw data have also been deposited in SRA at NCBI with the accession number SRP50063544. The assembled genome sequences of S. caldwelli have been deposited in the NCBI GenBank with the accession number GCA_039654775.145. The genome annotation has been deposited at Figshare46.

Technical Validation

DNA sample quality

DNA quality was assessed by 1% agarose gel electrophoresis.

RNA sample quality

The quality of the purified RNA was checked with a NanoDrop ND-1000 spectrophotometer (LabTech, USA).

Evaluation of the quality of the genome assembly

The quality of the assembled genome and of the predicted gene set was evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.7.047, based on a benchmark of 3354 conserved Actinopterygii genes (Table 9). In genome mode, 99.3% of all BUSCOs were assembled, comprising 99.0% complete and 0.3% partial BUSCOs, suggesting a high level of completeness for the de novo assembly. In protein mode, based on all predicted genes, 98.5% of all BUSCOs were recovered, including 1.6% that were only partially predicted. These data support a high-quality genome assembly of S. caldwelli that can be used for further investigation.

Table 9 Genome quality assessment statistics of the S. caldwelli genome.
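The BUSCO percentages quoted above determine the categories that are not stated explicitly. The following sketch derives them, assuming that the "assembled" figure is the sum of complete and partial (fragmented) BUSCOs and that the remainder is missing; the function name `split_busco` is hypothetical.

```python
def split_busco(assembled_pct: float, fragmented_pct: float) -> dict[str, float]:
    """Split an 'assembled' BUSCO percentage into complete, fragmented and missing."""
    if not 0.0 <= fragmented_pct <= assembled_pct <= 100.0:
        raise ValueError("expected 0 <= fragmented <= assembled <= 100")
    return {
        "complete": round(assembled_pct - fragmented_pct, 1),
        "fragmented": fragmented_pct,
        "missing": round(100.0 - assembled_pct, 1),
    }


# Percentages quoted in the text (Table 9).
genome = split_busco(assembled_pct=99.3, fragmented_pct=0.3)
protein = split_busco(assembled_pct=98.5, fragmented_pct=1.6)

print("genome mode: ", genome)
print("protein mode:", protein)
```

Under these assumptions, the protein-mode result corresponds to 96.9% complete BUSCOs, which is the figure directly comparable to the 99.0% complete in genome mode.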