Abstract
Spinibarbus caldwelli is an economically important freshwater fish in China. Owing to uncontrolled fishing, wild resources of S. caldwelli have decreased rapidly and may be on the verge of extinction. In this study, using single-molecule real-time (SMRT) sequencing and chromatin interaction mapping (Hi-C) technologies, we assembled the first chromosome-scale genome of S. caldwelli, about 1.77 Gb in size, with a contig N50 of 11.83 Mb and a scaffold N50 of 33.91 Mb. In total, 1.72 Gb (97.01%) of the contig sequences were anchored onto fifty chromosomes, the longest scaffold being 56.20 Mb. Approximately 49.41% of the genome was composed of repetitive elements. In total, 49,377 protein-coding genes were predicted, of which 47,724 (96.65%) were functionally annotated. The high-quality chromosome-level reference genome and annotation are vital for supporting basic genetic studies and will contribute to research on genetic structure, functional elucidation, evolutionary inquiry, and germplasm conservation for S. caldwelli.
Background & Summary
S. caldwelli, a member of the family Cyprinidae, inhabits several river basins across China, including the Yangtze, Qiantang, Min, Pearl and Jiulong rivers [1]. The species was initially described from specimens collected in Fujian Province [2], and its taxonomic classification has been debated. A recent investigation employing both morphological and molecular techniques affirmed the status of S. caldwelli as a distinct species [3], rather than as a junior synonym of Spinibarbus hollandi as previously thought [4,5]. S. caldwelli (Fig. 1) grows fast, has a mixed diet, suffers few diseases, and has delicate, flavorful flesh, making it an economically important freshwater fish. However, due to uncontrolled fishing and river pollution, the wild resources of this fish in China are now significantly reduced compared to the early 1970s [6]. To safeguard the dwindling S. caldwelli fishery, the Chinese government has been carrying out stock enhancement activities since 2000, following advancements in artificial culture and breeding technologies [7]. However, these release programs have raised concerns because they may carry genetic risks for wild populations. Accurate assessment of the genetic risk that released individuals pose to wild populations requires high-quality genomic data [8]. Several high-quality chromosome-level reference genomes have been assembled for Cyprinidae, including those of Cyprinus carpio [9], Danio rerio [10], and Ctenopharyngodon idellus [11]. However, no genome has yet been reported for the genus Spinibarbus, so a chromosome-level genome for S. caldwelli fills an important gap.
Third-generation sequencing technology has greatly improved the accuracy and contiguity of genome assemblies. By combining single-molecule real-time (SMRT) sequencing with chromatin conformation capture (Hi-C), researchers have obtained chromosome-level genomes with high fidelity [12,13,14]. In this study, we employed PacBio long-read circular consensus sequencing (CCS) data and Hi-C technology to obtain a high-quality, chromosome-level assembly of the S. caldwelli genome. The availability of a reference genome for S. caldwelli will offer the opportunity to understand genome structure and function, thereby laying a solid foundation for further management and conservation of this important species.
Methods
Ethics statement
All experiments were authorised by Xiamen University College of Ocean and Earth Sciences and the University’s Animal Welfare Ethical Review Body, under ethics approval permit XMULAC20220222.
Sample collection, library construction and sequencing
Cultured individuals of S. caldwelli from a farm in Guangzhou, Guangdong Province, China, were selected for genome sequencing. High-molecular-weight DNA was isolated from fresh muscle tissue of S. caldwelli using a standard SDS extraction method. For PacBio sequencing, high-quality DNA was used to construct SMRTbell libraries according to PacBio’s standard protocol (Pacific Biosciences, CA, USA) using 20 kb preparation solutions. SMRTbell library construction involved DNA shearing, end repair, and the ligation of DNA fragments with hairpin adapters to create circular templates for SMRT sequencing. The constructed 20-kb libraries were then sequenced on the PacBio Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA). In total, 35.80 Gb of sequence data were generated with an N50 read length of 17,459 bp (Table 1).
For Hi-C library preparation, chromatin was fixed in the nucleus with formaldehyde. Fixed chromatin was digested with the DpnII restriction endonuclease, the 5′ overhangs were repaired with biotinylated nucleotides, and the free blunt ends were ligated. Following ligation, cross-links were reversed and the DNA was purified from proteins. The purified DNA was then treated to remove any biotin that was not internal to the ligated fragments. The DNA was subsequently sheared to an insert size of approximately 350 bp, and a paired-end sequencing library was constructed following standard Hi-C library preparation protocols [15]. The library was sequenced on the DNBSEQ-T7 platform to capture spatial interactions between chromosomal regions. As a result, 184.79 Gb of Hi-C read data were obtained and used for genome assembly, with an average sequencing coverage of 104.22× (Table 1). The quality of the Hi-C sequencing was assessed using HiCUP [16]. The results indicated an effective rate of 29.33% (Unique di-Tags/Total Paired (mapped) = 2,288,078/7,801,445), with 88.9% of read pairs deemed valid (Valid pairs/Total Reads Processed = 2,563,406/2,882,894) (Table 2).
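The two Hi-C quality ratios quoted above can be reproduced from the read counts given in the text. The following is a minimal sketch, not part of the authors' pipeline: the helper name `ratio_percent` is hypothetical, and the denominator of the valid-pair ratio is read as 2,882,894. The depth estimate divides the Hi-C yield by the rounded 1.77 Gb assembly size, so it only approximates the reported 104.22×, presumably because the reported figure was computed from an unrounded genome size.

```python
def ratio_percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, rounded to two decimals."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100.0 * numerator / denominator, 2)


# Read counts as quoted in the text for the HiCUP assessment.
unique_ditags = 2_288_078
total_paired_mapped = 7_801_445
valid_pairs = 2_563_406
total_reads_processed = 2_882_894  # assumed reading of the quoted denominator

effective_rate = ratio_percent(unique_ditags, total_paired_mapped)
valid_rate = ratio_percent(valid_pairs, total_reads_processed)

# Rough depth: Hi-C yield (Gb) over the rounded assembly size (Gb).
approx_depth = round(184.79 / 1.77, 1)

print(f"effective rate: {effective_rate}%")
print(f"valid pairs:    {valid_rate}%")
print(f"approx. depth:  {approx_depth}x")
```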
RNA was collected from seven tissues of S. caldwelli: brain, liver, muscle, spleen, heart, gill, and kidney. Total RNA was extracted using TRIzol extraction reagent (Invitrogen, USA) according to the manufacturer’s protocol. Library construction and transcriptome sequencing were performed on the DNBSEQ-T7 platform in accordance with the manufacturer’s protocols. A total of 76.30 Gb of data (about 10.9 Gb per tissue) were generated for transcript assembly and genome annotation.
De novo assembly
PacBio single-molecule long reads from SMRT sequencing underwent quality control using the ccs software (https://github.com/PacificBiosciences/ccs) with the parameter min-rq = 0.99. The resulting HiFi reads were assembled using hifiasm v0.16.1-r375 [17]. The contig-level assembly was then combined with the Hi-C data for chromosome clustering, ordering, and orientation with ALLHiC [18], with parameters set as enz = DpnII and CLUSTER = n, to achieve near-chromosome-level resolution. Subsequently, Juicebox (version 1.11.08) [19] was used for manual correction based on chromosome interaction strength, ultimately resulting in a chromosome-level genome. The final assembly comprised 330 scaffolds (Table 3). This first chromosome-level genome assembly of S. caldwelli is about 1.77 Gb, with scaffold and contig N50 sizes of 33.91 Mb and 11.83 Mb, respectively (Table 2). In total, 1.72 Gb (97.01%) of the contig sequences (682 contigs in total) were anchored onto fifty chromosomes (Table 4). Moreover, the Hi-C result was evaluated based on the pseudo-chromosome construction. The 50 scaffolds are clearly distinguishable in the heatmap, and the interaction signal around the diagonal is evident (Fig. 2), indicating the high quality of the pseudochromosome assembly.
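Contig and scaffold N50 are the main contiguity metrics reported for this assembly. As a reminder of what the metric measures, here is a minimal sketch of the standard N50 definition; the function name `n50` and the toy lengths are illustrative only and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example (hypothetical lengths in Mb, not real scaffolds).
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))
```

Applied to the real scaffold lengths, the same calculation would be expected to give the scaffold N50 of 33.91 Mb reported above.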
Repetitive sequence annotation
Our repeat annotation pipeline combined homology alignment and de novo searches to scan the whole genome for repeats (Fig. 3). For homology-based prediction, the Repbase database [20] was searched using RepeatMasker (open-4.1.4) [21] and its in-house script (RepeatProteinMask) with default parameters to extract repeat regions. For ab initio prediction, a de novo repetitive element database was built with RepeatModeler (open-2.0.4) [22] with default parameters; all repeat sequences longer than 100 bp with less than 5% gap ‘N’ content constituted the raw transposable element (TE) library. A custom library (the combination of Repbase and the de novo TE library, processed with UCLUST [23] to yield a non-redundant library) was supplied to RepeatMasker for DNA-level repeat identification. These analyses revealed about 876.02 Mb of repeat sequences, accounting for 49.41% of the S. caldwelli genome (Table 5).
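The filter that defines the raw TE library (length greater than 100 bp, gap ‘N’ content below 5%) can be illustrated as follows. This is a sketch under stated assumptions, not the authors' script: the function names `keep_repeat` and `filter_library` are hypothetical, sequences are assumed to be held in a name-to-sequence dictionary, and the thresholds are treated as strict inequalities.

```python
def keep_repeat(seq: str, min_len: int = 100, max_n_frac: float = 0.05) -> bool:
    """Return True if a repeat sequence passes the length and gap filters."""
    if len(seq) <= min_len:
        # Too short (this also covers the empty string).
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction < max_n_frac


def filter_library(records: dict) -> dict:
    """Keep only the name -> sequence entries that pass keep_repeat."""
    return {name: seq for name, seq in records.items() if keep_repeat(seq)}


# Toy records (hypothetical sequences, not real repeat families).
toy = {
    "short": "A" * 100,               # exactly 100 bp: not > 100, dropped
    "clean": "ACGT" * 30,             # 120 bp, no gaps: kept
    "gappy": "A" * 95 + "N" * 10,     # 105 bp, ~9.5% N: dropped
    "ok_gap": "A" * 196 + "N" * 4,    # 200 bp, 2% N: kept
}
print(sorted(filter_library(toy)))
```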
Annotation of gene structure
Gene models were annotated by combining ab initio prediction, homology-based prediction, and RNA-Seq-assisted prediction. Sequences of homologous proteins were downloaded from Ensembl/NCBI/others. Protein sequences were aligned to the genome assembly using TblastN v2.2.26 [24] (E-value ≤ 1e−5), and the matching proteins were then aligned to the homologous genome sequences from C. carpio, D. rerio and C. idellus for accurate spliced alignment with GeneWise v2.4.1 [25], which was used to predict the gene structure contained in each protein region. For ab initio gene prediction, Augustus v3.5 [26] and SNAP v2013.11.29 [27] were used in our automated gene prediction pipeline. Transcriptome assemblies were generated with Trinity v2.8.5 [28] for genome annotation. To optimize the annotation, RNA-Seq reads from the different tissues were aligned to the genome FASTA using HISAT v2.2.1 [29] with default parameters to identify exon regions and splice positions. The alignment results were then used as input for StringTie v2.2.1 [30] with default parameters for genome-based transcript assembly. The non-redundant reference gene set was generated by merging the genes predicted by the three methods with EVidenceModeler (EVM v1.1.1) [31] using PASA v2.3.3 [31] (Program to Assemble Spliced Alignments) terminal exon support, with masked transposable elements included as input to gene prediction. To obtain information on UTRs and alternative splicing variation, we used PASA to update the gene models [31]. Finally, we generated reference gene structures for the S. caldwelli genome comprising 49,377 protein-coding genes, with an average gene length of 15,627.88 bp and an average CDS length of 1,574.95 bp (Table 6). On average, each gene has 9.18 exons, with an average exon length of 171.58 bp and an average intron length of 1,718.14 bp.
The statistics of gene models, including CDS, intron, and exon in S. caldwelli were comparable to those of other species (Table 6 and Fig. 4).
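The reported gene-model averages can be cross-checked against one another with simple arithmetic. The sketch below assumes that the exon statistics refer to coding exons and that gene length is approximately CDS plus introns (that is, UTRs are ignored); both are assumptions made for illustration rather than statements from the annotation pipeline, and the function names are hypothetical.

```python
def implied_cds_length(avg_exons: float, avg_exon_len: float) -> float:
    """Average CDS length implied by exon count times exon length."""
    return avg_exons * avg_exon_len


def implied_intron_length(avg_gene_len: float, avg_cds_len: float,
                          avg_exons: float) -> float:
    """Average intron length implied by the non-coding span per gene
    divided by the average number of introns (exons minus one)."""
    if avg_exons <= 1:
        raise ValueError("need more than one exon per gene on average")
    return (avg_gene_len - avg_cds_len) / (avg_exons - 1)


# Averages as quoted in the text for the 49,377 predicted genes.
avg_gene_len = 15_627.88
avg_cds_len = 1_574.95
avg_exons = 9.18
avg_exon_len = 171.58
avg_intron_len = 1_718.14

cds_check = implied_cds_length(avg_exons, avg_exon_len)
intron_check = implied_intron_length(avg_gene_len, avg_cds_len, avg_exons)

print(f"implied CDS length:    {cds_check:.2f} bp (reported {avg_cds_len})")
print(f"implied intron length: {intron_check:.2f} bp (reported {avg_intron_len})")
```

Under these assumptions both implied values fall within about 1 bp of the reported averages, which suggests the summary statistics are internally consistent.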
Fig. 4 Comparison of gene element composition between the Spinibarbus caldwelli genome and other species. (a) CDS length distribution. (b) Exon length distribution. (c) Exon number distribution. (d) Gene length distribution. (e) Intron length distribution.
We also predicted tRNAs, rRNAs and other non-coding RNAs (Table 7). A total of 10,804 tRNAs were predicted using tRNAscan-SE v1.4 [32]. Because rRNAs are highly conserved, we used rRNA sequences from related species as references and predicted 3,858 rRNA genes using BLAST [33] with default parameters. Other ncRNAs, including miRNAs and snRNAs, were identified by searching against the Rfam [34] database with default parameters using Infernal v1.1.2 [35].
Functional annotations
Gene functions in the S. caldwelli genome were assigned by comparison with public databases, including InterPro [36], Swiss-Prot [37], the NCBI non-redundant protein database (NR), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database [38]. Motifs and domains were annotated using InterProScan v5.39 [39] by searching against the InterPro database. Gene Ontology (GO) [40] IDs for each gene were assigned according to the corresponding InterPro entry. The Swiss-Prot database and KEGG pathways were mainly used to map the constructed gene set and identify the best match for each gene. As a result, 47,724 genes were functionally annotated, accounting for 96.65% of all predicted genes (Table 8 and Fig. 5).
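A gene is typically counted as functionally annotated when it has a hit in at least one of the databases, so the overall figure is the size of the union of the per-database hit sets. The sketch below illustrates that bookkeeping under this assumption; the function name `annotated_summary`, the gene IDs and the toy hit sets are all hypothetical, and only the final percentage check uses numbers from the text.

```python
def annotated_summary(all_genes, hits_by_db):
    """Return (count, percent) of genes with a hit in at least one database."""
    gene_set = set(all_genes)
    if not gene_set:
        raise ValueError("all_genes must not be empty")
    # Union of per-database hits, restricted to genes in the gene set.
    annotated = set().union(*hits_by_db.values()) & gene_set
    return len(annotated), round(100.0 * len(annotated) / len(gene_set), 2)


# Toy example with hypothetical gene IDs.
genes = ["g1", "g2", "g3", "g4"]
hits = {
    "InterPro": {"g1", "g2"},
    "Swiss-Prot": {"g2", "g3"},
    "KEGG": {"g1"},
}
print(annotated_summary(genes, hits))

# Percentage implied by the counts reported in the text.
reported_percent = round(100.0 * 47_724 / 49_377, 2)
print(reported_percent)
```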
Data Records
The raw sequence data of S. caldwelli, including the PacBio long-read data, Hi-C data and RNA-seq data, have been deposited in the Genome Sequence Archive (GSA) [41] at the National Genomics Data Center [42] under the accession CRA015777 [43]. The raw data have also been deposited in the SRA at NCBI under the accession number SRP500635 [44]. The assembled genome sequence of S. caldwelli has been deposited in NCBI GenBank under the accession number GCA_039654775.1 [45]. The genome annotation has been deposited at Figshare [46].
Technical Validation
DNA sample quality
DNA quality was assessed using 1% agarose gel.
RNA sample quality
The quality of the purified RNA was checked using a NanoDrop ND-1000 spectrophotometer (LabTech, USA).
Evaluation of the quality of the genome assembly
The quality of the assembled genome and of the predicted gene set was evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.7.0 [47], based on a benchmark of 3354 conserved Actinopterygii genes (Table 9). In genome mode, 99.3% of all BUSCOs were assembled, with 99.0% completely and 0.3% partially assembled, suggesting a high level of completeness for the de novo assembly. In protein mode, based on all predicted genes, 98.5% of all BUSCOs were detected, including 1.6% that were only partially predicted. These data support a high-quality genome assembly of S. caldwelli, which can be used for further investigation.
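The BUSCO percentages above partition the benchmark set into complete, fragmented (partial) and missing genes. The following sketch simply derives the categories that are not stated explicitly from the ones that are; the function name `busco_breakdown` is hypothetical and the calculation is plain arithmetic on the quoted percentages, not output from BUSCO itself.

```python
def busco_breakdown(detected_pct: float, fragmented_pct: float) -> dict:
    """Split a 'detected' BUSCO percentage into complete/fragmented/missing."""
    if not 0 <= fragmented_pct <= detected_pct <= 100:
        raise ValueError("expected 0 <= fragmented <= detected <= 100")
    return {
        "complete": round(detected_pct - fragmented_pct, 1),
        "fragmented": fragmented_pct,
        "missing": round(100 - detected_pct, 1),
    }


# Percentages as quoted in the text.
genome_mode = busco_breakdown(detected_pct=99.3, fragmented_pct=0.3)
protein_mode = busco_breakdown(detected_pct=98.5, fragmented_pct=1.6)

print("genome mode: ", genome_mode)
print("protein mode:", protein_mode)
```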
Code availability
No custom code was used in this study. All commands and pipelines used in data processing were run according to the manuals and protocols of the respective bioinformatics software. The software versions and their corresponding parameters are described in the Methods section.
References
Ai, W., Peng, X., Huang, X., Xiang, D. & Chen, X. Complete mitochondrial genome of Spinibarbus caldwelli (Cypriniformes, Cyprinidae). Mitochondrial DNA 26, 131–132, https://doi.org/10.3109/19401736.2013.815171 (2015).
Nichols, J. T. Some Chinese freshwater fishes. 11. Certain apparently undescribed carps from Fukien. Am Mus Novit 185, 1–7 (1925).
Tang, Q., Liu, H., Yang, X. & Nakajima, T. Molecular and morphological data suggest that Spinibarbus caldwelli (Nichols)(Teleostei: Cyprinidae) is a valid species. Ichthyological Research 52, 77–82, https://doi.org/10.1007/s10228-004-0259-x (2005).
Oshima, M. Contributions to the study of the fresh water fishes of the island of Formosa. Ann Carnegie Mus 12, 169–328 (1919).
Yang, J. & Chen, Y. Systematic revision of Spinibarbus fishes (Cypriniformes: Cyprinidae). Zoological Research 15, 1–10 (1994).
Yuan, X., Yang, X., Ge, H. & Li, H. Genetic Structure of Spinibarbus caldwelli Based on mtDNA D-Loop. Agricultural Sciences 10, 173, https://doi.org/10.4236/as.2019.102015 (2019).
Guo, S. et al. Investigation on fish resources of Spinibarbus caldwelli National Aquatic Germplasm Resources Reserve in Huyangxi River, Yongchun County, Fujian Province in winter. Journal of Fisheries Research 46, 279, https://doi.org/10.14012/j.jfr.2023120 (2024).
Breed, M. F. et al. The potential of genomics for restoring ecosystems and biodiversity. Nature Reviews Genetics 20, 615–628, https://doi.org/10.1038/s41576-019-0152-0 (2019).
Xu, P. et al. The allotetraploid origin and asymmetrical genome evolution of the common carp Cyprinus carpio. Nature communications 10, 4625, https://doi.org/10.1038/s41467-019-12644-1 (2019).
Broughton, R. E., Milam, J. E. & Roe, B. A. The complete sequence of the zebrafish (Danio rerio) mitochondrial genome and evolutionary patterns in vertebrate mitochondrial DNA. Genome research 11, 1958–1967, https://doi.org/10.1101/gr.156801 (2001).
Wu, C.-S. et al. Chromosome-level genome assembly of grass carp (Ctenopharyngodon idella) provides insights into its genome evolution. BMC genomics 23, 271, https://doi.org/10.1186/s12864-022-08503-x (2022).
Zhang, W. et al. Chromosome-level genome assembly and annotation of the yellow grouper, Epinephelus awoara. Scientific Data 11, 151, https://doi.org/10.1038/s41597-024-02989-8 (2024).
Wang, Y., Zhang, H., Xian, W. & Iwasaki, W. Chromosome genome assembly and annotation of the spiny red gurnard (Chelidonichthys spinosus). Scientific Data 10, 443, https://doi.org/10.1038/s41597-023-02357-y (2023).
Wang, F. et al. Chromosome-level assembly of Gymnocypris eckloni genome. Scientific Data 9, 464, https://doi.org/10.1038/s41597-022-01595-w (2022).
Gong, G. et al. Chromosomal-level assembly of yellow catfish genome using third-generation DNA sequencing and Hi-C analysis. GigaScience 7, giy120, https://doi.org/10.1093/gigascience/giy120 (2018).
Wingett, S. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Research 4, https://doi.org/10.12688/f1000research.7334.1 (2015).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nature plants 5, 833–845, https://doi.org/10.1038/s41477-019-0487-8 (2019).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell systems 3, 99–101, https://doi.org/10.1016/j.cels.2015.07.012 (2016).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile Dna 6, 1–6 (2015).
Bedell, J. A., Korf, I. & Gish, W. MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16, 1040–1041, https://doi.org/10.1093/bioinformatics/16.11.1040 (2000).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461, https://doi.org/10.1093/bioinformatics/btq461 (2010).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25, 3389–3402, https://doi.org/10.1093/nar/25.17.3389 (1997).
Birney, E., Clamp, M. & Durbin, R. GeneWise and genomewise. Genome research 14, 988–995, http://www.genome.org/cgi/doi/10.1101/gr.1865504 (2004).
Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic acids research 33, W465–W467, https://doi.org/10.1093/nar/gki458 (2005).
Korf, I. Gene finding in novel genomes. BMC bioinformatics 5, 1–9, https://doi.org/10.1186/1471-2105-5-59 (2004).
Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature protocols 8, 1494–1512, https://doi.org/10.1038/nprot.2013.084 (2013).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nature methods 12, 357–360, https://doi.org/10.1038/nmeth.3317 (2015).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome biology 20, 1–13, https://doi.org/10.1186/s13059-019-1910-1 (2019).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology 9, 1–22, https://doi.org/10.1186/gb-2008-9-1-r7 (2008).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research 25, 955–964, https://doi.org/10.1093/nar/25.5.955 (1997).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215, 403–410, https://doi.org/10.1016/S0022-2836(05)80360-2 (1990).
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic acids research 33, D121–D124, https://doi.org/10.1093/nar/gki081 (2005).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935, https://doi.org/10.1093/bioinformatics/btt509 (2013).
Finn, R. D. et al. InterPro in 2017—beyond protein family and domain annotations. Nucleic acids research 45, D190–D199, https://doi.org/10.1093/nar/gkw1107 (2017).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic acids research 27, 49–54, https://doi.org/10.1093/nar/27.1.49 (1999).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 28, 27–30, https://doi.org/10.1093/nar/28.1.27 (2000).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240, https://doi.org/10.1093/bioinformatics/btu031 (2014).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature genetics 25, 25–29, https://doi.org/10.1038/75556 (2000).
Chen, T. et al. The genome sequence archive family: toward explosive data growth and diverse data types. Genomics, Proteomics and Bioinformatics 19, 578–583, https://doi.org/10.1016/j.gpb.2021.08.001 (2021).
CNCB-NGDC Members and Partners. Database resources of the national genomics data center, China national center for bioinformation in 2023. Nucleic acids research 51, D18–D28, https://doi.org/10.1093/nar/gkac1073 (2023).
NGDC Genome Sequence Archive https://bigd.big.ac.cn/gsa/browse/CRA015777 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP500635 (2024).
NCBI GenBank, https://identifiers.org/ncbi/insdc.gca:GCA_039654775.1 (2024).
Ding, S. & Wu, L. pasa2.longest.filter.gff3. figshare https://doi.org/10.6084/m9.figshare.25824793 (2024).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Molecular biology and evolution 38, 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).
Acknowledgements
We would like to thank the referees and editor for their valuable comments and suggestions, as well as the careful corrections of our manuscript. We thank Qingmin Zeng for providing fish samples. We also thank Xiaoying Cao and Weiwei Zhang for their helpful suggestions about this study. This study was supported by the Open Innovation Fund for undergraduate students of Xiamen University (KFJJ-202214).
Author information
Contributions
L.W. and S.D. conceived and designed the study. L.W. and S.Gu. conducted the genome assembly and bioinformatics analysis. L.W. drafted the manuscript. P.W. and S.Guo. provided samples. L.W. and L.L. provided suggestions for manuscript improvement. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wu, L., Gu, S., Wen, P. et al. Chromosome-level genome assembly and annotation of the Spinibarbus caldwelli. Sci Data 11, 933 (2024). https://doi.org/10.1038/s41597-024-03796-x
DOI: https://doi.org/10.1038/s41597-024-03796-x







