Background & Summary

Spinibarbus caldwelli, which belongs to the family Cyprinidae, subfamily Barbinae, is native to the middle and lower reaches of the Yangtze River and its associated river system in China, Laos and Vietnam. This species specifically inhabits clear-water, gravel-bottomed river segments1,2, making it an indicator species for assessing environmental health in freshwater ecosystems. As an economic freshwater fish, S. caldwelli is regarded as ideal for pond and large-surface water aquaculture. With the spread of domestication and breeding techniques, S. caldwelli has become a key species in aquaculture in eastern and southern China3. However, since the 1980s, overfishing, environmental pollution, and water conservancy projects have led to a sharp decline in the wild population size of S. caldwelli3.

Yuan et al. investigated the genetic diversity and population structure of S. caldwelli using mitochondrial DNA genes as molecular markers4. Tang et al. found significant genetic differentiation among wild populations, which they attributed mainly to geographical isolation or human activities5. Ai et al. sequenced and assembled the complete mitochondrial genome of S. caldwelli, whose gene composition, arrangement, and transcriptional direction are identical to those of most vertebrates6. Tang et al. studied the vasa homologs in S. caldwelli and provided insights into the molecular mechanism of germ cell development and differentiation7. Previous studies of S. caldwelli thus relied mainly on mitochondrial genes or the cloning of individual genes. A high-quality chromosome-level genome assembly has remained unavailable, which limits the sustainable development and utilization of the species.

Here, we combined Illumina, PacBio, and Hi-C sequencing to generate data for the S. caldwelli genome assembly. The assembled genome was 1.77 Gb in size, with contig and scaffold N50 values of 24.27 Mb and 35.29 Mb, respectively, demonstrating excellent genome integrity and sequence continuity. The genome contains 49.41% repetitive sequences and 51,505 predicted genes, 90.83% of which were functionally annotated. Synteny analysis showed that S. caldwelli and S. sinensis share a high degree of genome collinearity, indicating only minor differences in chromosome karyotype. These genomic resources for S. caldwelli support efforts to elucidate the genetic basis of important traits and facilitate ecological conservation through the establishment of germplasm resource banks, promoting the sustainable development of the species.

Methods

Sample collection

A female S. caldwelli (Fig. 1A) was collected from the Taojiang River (25.631087°N, 115.014197°E) in Ganzhou City, Jiangxi Province, China. Genomic DNA was extracted from muscle tissue and used for Illumina short-read, PacBio long-read, and Hi-C sequencing. RNA was extracted from various tissues, including scale, skin, fin, muscle, gill, liver, intestine, gonad, heart, bladder, head kidney, eye, brain, intermuscular bone, spleen, and embryo. The extracted RNA was then pooled into four samples for RNA sequencing. All tissue samples were immediately frozen in liquid nitrogen and stored at −80 °C.

Fig. 1

Illustration and karyotype of S. caldwelli. (A) An illustration of S. caldwelli. Scale bar equals 1 cm. (B) Karyotype of S. caldwelli. Scale bar equals 10 μm.

Karyological analysis

Chromosome preparations from the head kidney were made using established methods8,9. Live fish were treated with colchicine, and head kidney cells were then incubated in hypotonic KCl before fixation in ethanol-acetic acid. Chromosomes were stained with Giemsa, analyzed with KaryoType software10, and categorized according to Levan et al.11. The karyotype of S. caldwelli shows a chromosome complement of 2n = 100, consistent with the 50 chromosome-scale scaffolds of the genome assembly (Fig. 1B). It comprises 9 pairs of metacentric (m), 16 pairs of submetacentric (sm), 15 pairs of sub-telocentric (st), and 10 pairs of telocentric (t) chromosomes. No distinct sex chromosomes were observed (Fig. 1B).
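The classification step can be illustrated with a minimal sketch. It assumes the arm-ratio cut-offs commonly cited for the Levan et al. nomenclature (m ≤ 1.7, sm ≤ 3.0, st ≤ 7.0, t above that); the exact thresholds applied in this study are not stated, and the function name `classify_chromosome` is hypothetical. The pair counts are those reported above.

```python
def classify_chromosome(long_arm: float, short_arm: float) -> str:
    """Classify a chromosome by arm ratio r = long arm / short arm.

    Cut-offs are the commonly cited Levan-style values (an assumption here,
    not taken from the text): m <= 1.7 < sm <= 3.0 < st <= 7.0 < t.
    """
    if short_arm <= 0:
        # No measurable short arm: treat as telocentric.
        return "t"
    ratio = long_arm / short_arm
    if ratio <= 1.7:
        return "m"
    if ratio <= 3.0:
        return "sm"
    if ratio <= 7.0:
        return "st"
    return "t"


# Pair counts reported for S. caldwelli in the text.
pairs = {"m": 9, "sm": 16, "st": 15, "t": 10}
haploid_n = sum(pairs.values())  # number of chromosome pairs
diploid_2n = 2 * haploid_n       # diploid chromosome number
```

Summing the reported pair counts gives 50 pairs, which matches the stated complement of 2n = 100.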

Library preparation and genome survey

For the genome survey, 1–1.5 μg of genomic DNA from S. caldwelli was randomly fragmented using a Covaris system. The resulting fragments were purified and size-selected to an average of 200–400 bp using the Agencourt AMPure XP-Medium kit (Beckman Coulter, Inc., CA, USA). DNA fragments were processed through end-repair, 3′ adenylation, adapter ligation, and PCR amplification, followed by purification using the AxyPrep Mag PCR Clean-up Kit (Axygen, Hangzhou, China). The double-stranded PCR products were heat-denatured and circularized using a splint oligo sequence to generate single-stranded circular DNA, which constituted the final library and was verified by quality control. The prepared library was sequenced on the Illumina NovaSeq 6000 platform. After acquiring 56.96 Gb of raw sequence data, paired-end raw reads were quality-filtered using fastp (v 0.20.0)12 to remove low-quality reads, adapters, and poly-N sequences. Contamination was checked by aligning 100,000 randomly selected reads to the NT database.

PacBio and Hi-C based whole-genome sequencing

SMRTbell libraries of the target size were constructed for sequencing according to the standard PacBio protocol (Pacific Biosciences, CA, USA) using 15 kb preparation solutions. Sheared gDNA underwent DNA damage repair, blunt-end ligation with hairpin adapters, exonuclease treatment, and size selection. Libraries were purified and validated using an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA). Sequencing was performed on the PacBio Sequel II platform, and raw polymerase reads were processed with SMRT Link v8.0 for adapter trimming and quality filtering.

Hi-C libraries were constructed from genomic DNA of S. caldwelli. Optimized formaldehyde crosslinking preserved chromatin conformation, followed by biphasic lysis and DpnII digestion with NEB buffer and BSA. Biotin-14-dATP facilitated fill-in labeling for capture. DNA was purified, sheared to ~400 bp, and prepared using the NEBNext Ultra II Kit13 (New England Biolabs, MA, USA). Hi-C sequencing on the Illumina NovaSeq 6000 platform produced 176.53 Gb of raw data, comprising 370 million paired-end reads.

Finally, we obtained high-quality sequencing data comprising 63.07 Gb of PacBio HiFi reads (~35.20×), 56.96 Gb of Illumina reads (32.01×), and 176.53 Gb of Hi-C data (42.03×), providing robust genomic coverage for downstream analysis (Table 1).

Table 1 Statistics of the sequencing data generated for Spinibarbus caldwelli.

Genome size estimation and de novo genome assembly

To characterize the genomic features of S. caldwelli, k-mer analysis was performed on the Illumina sequencing data prior to assembly. The 17-mer frequencies of the quality-filtered reads from the 350-bp library were counted with KMC14, and the genome size was estimated from the 17-mer depth distribution with gce15 and FindGSE16 as G = K-num/K-depth, where K-num is the total number of k-mers and K-depth is the depth of the main k-mer peak. The estimated genome size was 1,609,818,105 bp, with a heterozygosity of 0.80%.
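The relation G = K-num/K-depth can be sketched on a toy k-mer histogram. This is only an illustration of the formula, not a reimplementation of gce or FindGSE, which fit statistical models that also account for heterozygosity and repeats. The function name, the `min_depth` error cut-off, and the histogram values are all hypothetical.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps a k-mer depth to the number of distinct k-mers seen at
    that depth. K-mers below min_depth are treated as sequencing errors
    (an assumed, user-chosen cut-off).
    Returns (peak_depth, genome_size) with genome_size = K-num / K-depth.
    """
    kept = {d: c for d, c in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # K-depth: depth at which the most distinct k-mers occur (main peak).
    peak_depth = max(kept, key=lambda d: kept[d])
    # K-num: total number of k-mer occurrences after error filtering.
    k_num = sum(d * c for d, c in kept.items())
    return peak_depth, k_num / peak_depth


# Toy histogram (made-up numbers): low-depth error k-mers plus one main peak.
toy_hist = {1: 900, 2: 100, 18: 50, 19: 120, 20: 200, 21: 110, 22: 40}
peak, size = estimate_genome_size(toy_hist)
```

With a heterozygosity near 0.80%, a real histogram may show a secondary peak at about half the main depth, which is one reason model-based tools are preferred over a plain peak search.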

After obtaining the PacBio subreads, the genome was de novo assembled into contigs using the overlap-layout-consensus (OLC) algorithm implemented in Falcon17. All PacBio SMRT reads were then aligned to the assembled contigs using BLASR18, and Quiver19 was employed to correct sequencing errors based on the alignments, using default parameters. To further improve base-level accuracy, the contigs were polished with Nextpolish2 (v 0.2.1)20, incorporating Illumina short reads under default settings. Finally, to remove potentially redundant contigs and generate a non-redundant primary assembly, similarity-based filtering was performed with thresholds of identity ≥0.8 and overlap ≥0.8. This process resulted in a preliminary monoploid assembly of 1.77 Gb, with a contig N50 length of 24.27 Mb, representing a high-quality draft genome.
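The redundancy filter can be pictured with a small sketch. The text gives only the two thresholds (identity ≥0.8 and overlap ≥0.8), so the decision rule below is an assumption: a contig is flagged when it is the shorter member of an aligned pair and both thresholds are met. The `Hit` record, its fields, and the contig names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str       # contig tested for redundancy
    target: str      # contig it aligns to
    identity: float  # fraction of identical bases in the aligned region (0-1)
    overlap: float   # fraction of the query length covered by the alignment (0-1)


def redundant_contigs(hits, lengths, min_identity=0.8, min_overlap=0.8):
    """Return the set of contigs flagged as redundant.

    Assumed rule: flag the query if it is the shorter contig of the pair
    (ties broken by name) and the alignment meets both thresholds.
    """
    redundant = set()
    for h in hits:
        if h.query == h.target:
            continue  # ignore self-alignments
        is_shorter = (lengths[h.query], h.query) < (lengths[h.target], h.target)
        if is_shorter and h.identity >= min_identity and h.overlap >= min_overlap:
            redundant.add(h.query)
    return redundant


# Toy example with made-up contigs.
lengths = {"ctgA": 1000, "ctgB": 400, "ctgC": 300}
hits = [
    Hit("ctgB", "ctgA", identity=0.95, overlap=0.90),  # flagged
    Hit("ctgC", "ctgA", identity=0.95, overlap=0.50),  # overlap too low
    Hit("ctgA", "ctgB", identity=0.95, overlap=0.36),  # longer contig is kept
]
flagged = redundant_contigs(hits, lengths)
```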

To achieve chromosome-level assembly, Hi-C reads were first aligned to the draft genome of S. caldwelli using Bowtie2 (v 2.3.2)21. Valid interaction pairs were identified and retained by HiC-Pro (v 2.8.1)22 from uniquely mapped paired-end reads, while invalid pairs, such as dangling ends, self-cycles, re-ligations, and dumped products, were filtered out. The processed reads were then analyzed with LACHESIS23 to cluster, order, and orient the contigs, thereby facilitating the construction of a chromosome-level assembly. The diploid chromosome number of S. caldwelli (2n = 100) informed the scaffolding assembly process (Fig. 2A). Detailed chromosomal statistics and their alignment to the assembled genome are provided in Table 2. After Hi-C-assisted scaffolding, the final assembly yielded 50 chromosome-scale scaffolds, corresponding to the 50 haploid chromosomes of S. caldwelli, and a total of 136 scaffolds overall. The final genome size of S. caldwelli was 1.76 Gb, with a scaffold N50 of 35.29 Mb and a contig N50 of 23.94 Mb (Table 3).
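For readers unfamiliar with the N50 statistic quoted for contigs and scaffolds, the following sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The input lengths are toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to half the total
        # defines the N50.
        if running >= half_total:
            return length


# Toy lengths: total 290, half 145; 80 + 70 = 150 reaches it, so N50 is 70.
toy_n50 = n50([80, 70, 50, 40, 30, 20])
```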

Fig. 2

Features of the S. caldwelli genome assembly. (A) The genomic landscape of S. caldwelli. The tracks of the Circos plot, from outer to inner, show the chromosomes (chr01A-chr25B) (a), gene density (b), guanine-cytosine (GC) content (c), tandem repeat abundance (d), repeat element abundance (e), and ncRNA abundance (f). (B) Hi-C interaction heatmap of the S. caldwelli genome assembly.

Table 2 Chromosome status of Spinibarbus caldwelli.
Table 3 Summary of Spinibarbus caldwelli genome assembly.

Genome annotation

A species-specific de novo repeat library for S. caldwelli was constructed using RepeatModeler (v 1.0.11)24. LTR sequences were predicted and deduplicated using LTR_FINDER_parallel25 and LTR_retriever (v 2.9.0)26. The two libraries were merged to create a TE library file (TE.lib). The genome was masked once, replacing repetitive sequences with “N”, and a subsequent de novo search for repetitive sequences was conducted using RepeatModeler to generate a de novo library file (RepMod.lib).

RepeatMasker (v 1.331)27 was used to identify and mask repeat elements in the S. caldwelli genome using a custom library. Analysis revealed that approximately 42.75% of the genome consists of repetitive elements, including LTR elements (6.81%), DNA transposons (23.78%), LINE elements (6.44%), and simple repeats (0.38%). The distribution of these elements, including simple repeats and transposable elements (TEs), was mapped across each chromosome (Fig. 3A).

Fig. 3

The repeat elements distribution and identified protein-coding genes in the S. caldwelli genome. (A) Distribution of divergence rate for transposable elements in the S. caldwelli genome. (B) Venn diagram showing the number of shared and unique genes annotated with different databases in the S. caldwelli genome.

Non-coding RNAs (ncRNAs) in S. caldwelli were identified using Infernal (v 1.1.2)28, and transfer RNAs (tRNAs) using tRNAscan-SE (v 2.0)29. The analysis detected a total of 39,459 ncRNAs in the S. caldwelli genome, including 4,186 rRNAs, 5,419 small RNAs, 13,502 regulatory RNAs, and 16,352 tRNAs.

Gene prediction was performed on the repeat-masked genome using three independent approaches: reference-guided transcriptome assembly, ab initio prediction, and homology-based annotation.

For RNA-seq–based prediction, RNA-seq datasets from four libraries were first quality-checked using FastQC30, aligned to the genome using STAR (v 2.7.3a)31, and assembled into transcripts with StringTie (v 1.3.4d)32. Open reading frames (ORFs) within the assembled transcripts were predicted using PASA (v 2.3.3)33.

For ab initio prediction, RNA-seq reads were also de novo assembled using StringTie and analyzed with PASA to generate a gene structure training set. This training set was used to guide Augustus (v 3.3.1)34 under default parameters. GeneMark-S35 was used to support unsupervised training of gene models based on transcript evidence.

For homology-based annotation, protein sequences were obtained from the NCBI and NGDC databases for Carassius auratus (GCF_003368295.1), Danio rerio (GCF_000002035.6), Megalobrama amblycephala (GCF_018812025.1), Sinocyclocheilus anshuiensis (GCF_001515605.1), Sinocyclocheilus grahami (GCF_001515645.1), Sinocyclocheilus rhinocerous (GCF_001515625.1), Ctenopharyngodon idella (GCF_019924925.1), and Spinibarbus sinensis (CRA008955)36. These sequences were then mapped to the S. caldwelli genome using GeMoMa37 (Table 4).

Table 4 Comparison of gene structure details between Spinibarbus caldwelli and other species.

All gene models derived from the three approaches were integrated using EVidenceModeler (EVM) with default parameters. Confidence weights were assigned in the order: PASA > GeMoMa > ab initio. Genes containing transposable elements (TEs) were identified and removed using TransposonPSI38. Additionally, miscoded genes, such as those with frameshift mutations or internal stop codons, were filtered out. Untranslated regions (UTRs) and alternative splicing isoforms were annotated using PASA, and only the longest isoform per locus was retained as the representative transcript.
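The final selection step, keeping only the longest isoform per locus, can be sketched as follows. The text does not say whether "longest" refers to transcript or CDS length, so the length used here is an assumption, and the function name and record layout are hypothetical.

```python
def longest_isoforms(transcripts):
    """Pick one representative transcript per locus.

    transcripts: iterable of (locus_id, transcript_id, length) tuples, where
    length is whichever measure defines "longest" (assumed, not specified).
    Returns a dict mapping locus_id to the chosen transcript_id. On ties,
    the first transcript encountered is kept.
    """
    best = {}
    for locus, transcript_id, length in transcripts:
        current = best.get(locus)
        if current is None or length > current[1]:
            best[locus] = (transcript_id, length)
    return {locus: tid for locus, (tid, _) in best.items()}


# Toy records with made-up identifiers and lengths.
records = [
    ("gene1", "gene1.t1", 900),
    ("gene1", "gene1.t2", 1200),
    ("gene2", "gene2.t1", 600),
]
representatives = longest_isoforms(records)
```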

Functional annotation of protein-coding genes was performed by aligning the predicted protein sequences against the NR39, SwissProt40, KEGG41, and KOG42 databases using Blastp (v 2.7.1)43. Gene motifs and functional domains were identified with InterProScan (v 5.32–71.0)44, and GO45 terms were assigned based on matches to InterPro or UniProt entries. The average gene and CDS lengths were 18.33 kb and 1.60 kb, respectively, and the average number of exons per gene was 9.76. A total of 51,505 protein-coding genes were predicted, of which 46,782 (90.83%) were functionally annotated through these databases and tools (Table 5 & Fig. 3B).

Table 5 Gene functional annotation statistics for Spinibarbus caldwelli.

Data Records

The raw sequencing data reported in this study have been deposited in the NCBI Sequence Read Archive (SRA) database under the accession numbers SRR31204162, SRR31204163, SRR31078194, SRR31078195, SRR31078196, SRR31078198, SRR31078199, and SRR33014853 (refs. 46–53). The assembled genome of S. caldwelli is available in GenBank under the accession number GCA_044721935.1 (ref. 54). Two whole-genome sequencing datasets, one long-read (SRR31204162) and one short-read (SRR31078198), were generated using the PacBio Sequel II and Illumina HiSeq X Ten platforms, respectively. Two Hi-C datasets (SRR31204163 and SRR31078199) were generated on the Illumina NovaSeq 6000 platform. Four RNA-seq datasets (SRR31078194, SRR31078195, SRR31078196, and SRR33014853) were generated on the Illumina HiSeq X Ten platform. The genome annotation results have been deposited in the figshare database (https://doi.org/10.6084/m9.figshare.26494309)55.

Technical Validation

Quality evaluation of the genome assembly and annotation

The completeness of the genome assembly of S. caldwelli was evaluated using BUSCO56 and CEGMA57. BUSCO analysis was performed using the vertebrata_odb10 dataset, revealing that 99.14% (3,325 out of 3,354) of the expected orthologous genes were present as complete, of which 22.15% were single-copy and 76.98% were duplicated. Only 0.27% were fragmented, and 0.60% were missing (Table 6), indicating a highly complete assembly. CEGMA identified 240 of the 248 core eukaryotic genes (96.77%), including 208 complete genes (83.87%), indicating a high level of completeness in conserved genomic elements.
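The completeness percentages follow directly from the reported counts, as this short check shows. It uses only the counts given above; the helper name `percent` is hypothetical.

```python
def percent(part: int, total: int, digits: int = 2) -> float:
    """Return part/total as a percentage rounded to the given digits."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * part / total, digits)


# Counts reported in the text.
busco_complete = percent(3325, 3354)  # complete BUSCOs in the assembly
cegma_found = percent(240, 248)       # core eukaryotic genes identified
cegma_complete = percent(208, 248)    # core eukaryotic genes found complete
```

The three values reproduce the reported 99.14%, 96.77%, and 83.87%.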

Table 6 Completeness evaluation of Spinibarbus caldwelli genome assembly.

To assess the base-level accuracy of the genome assembly, Illumina paired-end reads were aligned to the assembled genome using BWA58, and alignment statistics were analyzed with SAMtools31 and BCFtools59. The sequencing reads showed a high mapping rate of 99.95% and covered 99.49% of the genome at a depth of at least 1×.
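The coverage figure is a breadth-of-coverage measure: the share of genome positions covered by at least one read. A minimal sketch is given below, assuming per-position depths are already available (for example, parsed from a per-base depth table such as the output of `samtools depth -a`). The function name and the toy depths are hypothetical, and this is not necessarily how the reported value was computed.

```python
def breadth_of_coverage(depths, min_depth: int = 1) -> float:
    """Fraction of positions whose read depth is at least min_depth."""
    depths = list(depths)
    if not depths:
        raise ValueError("no positions given")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy per-position depths: three of four positions are covered at >= 1x.
toy_breadth = breadth_of_coverage([0, 3, 5, 1])
```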

For chromosome quality evaluation, strong interactive signals were observed along the diagonals of Hi-C heatmaps, with no significant noise detected in other areas (Fig. 2B), supporting the accuracy of the chromosome assembly.

The quality of the gene prediction was also evaluated using BUSCO, which showed that 98.27% (3,296 out of 3,354) of conserved genes were present in the predicted gene set of S. caldwelli, suggesting high gene annotation completeness. These results indicate a comprehensive and functionally informative annotation set for S. caldwelli.

Chromosomal synteny analysis

A genomic synteny analysis was performed between S. caldwelli and S. sinensis to compare their structural characteristics and validate the accuracy of our genome assembly. Similar gene pairs and syntenic blocks were identified and visualized using MUMmer (v 4)60 and R (v 4.4.1). The analysis revealed a high degree of collinearity between the two assemblies, with most chromosomes in S. caldwelli retaining the same structure as their counterparts in S. sinensis (Fig. 4). This consistency supports the quality of the assembled and annotated genome.

Fig. 4

Genomic synteny analysis between S. caldwelli and S. sinensis. Sc01-Sc50 represent chromosomes 1–50 of S. caldwelli. Ss01-Ss50 represent chromosomes 1–50 of S. sinensis.