A chromosome-level genome assembly of critically endangered Ochetobius elongatus

Li, Lekang; Liang, Xuanguang; Zhang, Jiatong; Xu, Qun; Wang, Li; Gao, Xiaoping; Song, Xiangchun; Zhang, Bao; Huang, Dan; Wang, Hong; Wang, Xianyong; Luo, Zhen; Kong, Chiping; Lu, Jianguo

doi:10.1038/s41597-024-04223-x

Data Descriptor
Open access
Published: 19 December 2024

Lekang Li¹^na1, Xuanguang Liang²^na1, Jiatong Zhang², Qun Xu¹, Li Wang³, Xiaoping Gao¹, Xiangchun Song⁴, Bao Zhang¹, Dan Huang⁴, Hong Wang⁴, Xianyong Wang¹, Zhen Luo¹, Chiping Kong¹ & Jianguo Lu² (ORCID: orcid.org/0000-0002-3966-8812)

Scientific Data volume 11, Article number: 1399 (2024)

Abstract

Ochetobius elongatus is a critically endangered fish species found in the Yangtze River. In this study, we used PacBio and Hi-C data to assemble its genome at the chromosome scale. The assembly comprises 24 pseudo-chromosomes, with a genome size of 883.1 Mb and a scaffold N50 of 35.1 Mb, indicating high contiguity. BUSCO assessment estimated genome completeness at 98.3%. Annotation identified 28,674 putative protein-coding genes, and 44.63% of the assembled genome was annotated as repetitive sequences. Collinearity analysis between O. elongatus and two other species from the family Xenocyprididae revealed high collinearity, indicating good assembly quality. The O. elongatus genome assembly provides a resource for the conservation of this species, enabling deeper insights into its genetic diversity and facilitating the development of targeted preservation strategies.

Background & Summary

The Yangtze River Basin, ranked as the world’s third-largest river basin, spans an extensive area of 1,800,000 square kilometers and is home to over 400 species of fish. However, due to intensive development, there has been a notable decline in biodiversity¹. Surveys conducted throughout the basin between 2017 and 2018 identified 332 species of fish, yet 140 historically recorded species were not found, with most considered to be critically endangered². For example, the Yangtze River dolphin (Lipotes vexillifer) saw its population plummet from approximately 400 individuals in the 1980s to fewer than 50 by the end of the 20th century, with no sightings reported after 2006^1,3, and it is now considered functionally extinct⁴. To restore the ecosystem, a ten-year fishing ban was implemented in the Yangtze River. Preliminary results indicate an improvement in fishery resources, with 11 native species that had long disappeared reappearing in the Chishui River, a tributary of the Yangtze⁵.

O. elongatus was once an economically important freshwater fish in China, widely distributed in the Yangtze River⁶ (Fig. 1). However, environmental degradation and human activities have caused its population to decline rapidly. Under criterion A2acd of the IUCN Red List Categories and Criteria, the species was assessed as having declined by at least 80% over the past decade, based on direct observation, a decline in extent of occurrence, area of occupancy, or habitat quality, and actual or potential levels of exploitation. Consequently, the species was listed as Critically Endangered (CR) in the Red List of China’s Vertebrates in 2016^7,8,9. Studies in 2018 indicated that the wild population of O. elongatus was declining⁷. In December 2020, following the implementation of the ten-year fishing ban in the Yangtze River, seven individuals of O. elongatus were discovered in the Gong’an section of the river. This was the third recorded sighting of the species, after those in June 2017 and November 2020, and the first in recent years in which multiple individuals were recorded (https://www.cafs.ac.cn/info/1024/36845.htm). The reappearance of O. elongatus provides an opportunity for conservation. Conservation genetics and genomics have already been applied to several threatened species in the Yangtze River Basin: genomes have been sequenced and assembled for the two subspecies of Neophocaena asiaeorientalis (sunameri¹⁰ and asiaeorientalis¹¹), the endangered Gobiocypris rarus¹², and the vulnerable Leptobotia elongata¹³. The critically endangered O. elongatus, which also inhabits the basin, has not yet had its genome assembled.
A genome assembly is crucial for understanding genetic diversity, identifying unique adaptations, and developing effective conservation strategies for endangered species. The lack of one for O. elongatus is a significant obstacle to its conservation.

In this study, we assembled the chromosome-scale genome sequence of O. elongatus using a combination of PacBio long reads and Hi-C technology (Table 1). The initial assembly spanned 898.4 Mb across 144 contigs with a contig N50 of 24.8 Mb (Table 2). Following redundancy removal and Hi-C scaffolding, 99.83% of the assembly, totaling 881.6 Mb, was anchored to 24 pseudo-chromosomes, yielding 40 scaffolds with a scaffold N50 of 35.1 Mb (Fig. 2, Tables 2, 3). BUSCO analysis with the actinopterygii_odb10 database identified 3,581 (98.3%) complete BUSCOs, comprising 3,536 (97.1%) single-copy and 45 (1.2%) duplicated BUSCOs, along with 21 (0.6%) fragmented and 38 (1.1%) missing BUSCOs (Table 4). Annotation revealed 394.1 Mb of repetitive sequences (Fig. 3, Table 5), mainly composed of DNA transposons (221.10 Mb, 25.04% of the genome). A total of 28,674 protein-coding genes were predicted using a combination of homology-based, RNA-Seq-assisted, and ab initio methods, of which 28,637 (99.87%) were functionally annotated (Table 6). Collinearity analysis between O. elongatus and two other species from the family Xenocyprididae revealed high collinearity, indicating good assembly quality (Fig. 4). The O. elongatus genome assembly is a step forward for conservation efforts: it enables a deeper understanding of the species’ genetic diversity and provides a foundation for formulating precise and effective preservation strategies.

Table 1 Sequencing data for the O. elongatus genome assembly.

Table 2 Assembly statistics of the O. elongatus genome.

Table 3 Statistical results of the 24 pseudochromosomes of O. elongatus genome.

Table 4 Genome quality assessment statistics of the O. elongatus genome.

Table 5 Statistics of repetitive sequences in the O. elongatus genome.

Table 6 Summary of the functional gene annotation of the O. elongatus genome.

Methods

Sample collection and sequencing

A healthy individual of O. elongatus was collected from the Jiujiang section of the Yangtze River, Jiangxi Province, China, on March 21, 2024. DNA was extracted from its muscle tissue for long-read Single Molecule Real-Time (SMRT) sequencing and Hi-C sequencing. For RNA sequencing, total RNA was prepared by pooling RNA extracted from ten distinct tissues (muscle, gill, liver, brain, heart, caudal fin, gut, gonad, skin). All samples were rapidly frozen in liquid nitrogen and subsequently stored at −80 °C to preserve their integrity.

The processes for DNA extraction, library preparation, and sequencing were executed by Nextomics Biosciences (Wuhan, China), strictly adhering to the protocols provided by the manufacturers. Genomic DNA of high molecular weight was obtained from muscle samples, and only DNA of high quality was employed for the construction of sequencing libraries and subsequent high-throughput sequencing.

For PacBio sequencing, SMRTbell libraries were constructed using a 20-kilobase preparation kit according to the standard protocols outlined by Pacific Biosciences (CA, USA)¹⁴. Library preparation included DNA shearing, end-repair, and ligation with hairpin adapters to generate circular templates compatible with SMRT sequencing technology. Sequencing was conducted on the PacBio Revio platform in Circular Consensus Sequencing (CCS) mode, in line with the manufacturer’s guidelines. Quality control of the raw data was performed using the ccs software (https://github.com/PacificBiosciences/ccs), with parameters set to “--min-passes 1 --min-rq 0.99 --min-length 100”. Sequencing a single SMRT cell yielded 105.0 Gb of PacBio CCS reads, providing a coverage depth of 118.9×, which was used for subsequent genome assembly. The average length of the subreads was 18,447 base pairs (Table 1).

For Hi-C sequencing, freshly collected muscle tissues were stabilized through fixation with 2% formaldehyde to facilitate DNA-protein crosslinking. The preparation of Hi-C libraries involved steps including the digestion of cross-linked DNA, biotin labeling, proximity-based ligation, and DNA purification¹⁵. Subsequently, the Hi-C libraries were sequenced on the MGISEQ-2000 platform using 150 bp paired-end reads to capture the spatial interactions between chromosomal regions. This resulted in generating 81.4 Gb of Hi-C data with an average sequencing depth of 92.2× (Table 1).
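The coverage depths quoted for the PacBio and Hi-C data can be reproduced with simple arithmetic. The sketch below assumes that depth was computed as total sequenced bases divided by the final assembly size of 883.1 Mb; the text does not state the denominator explicitly, so that choice is an assumption, and the function name is hypothetical.

```python
def coverage_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return mean fold coverage: total sequenced bases / genome size."""
    # 1 Gb = 1000 Mb, so convert the yield to Mb before dividing.
    return (total_bases_gb * 1000.0) / genome_size_mb


# Assumption: the final assembly length is used as the genome size.
GENOME_SIZE_MB = 883.1

pacbio_depth = coverage_depth(105.0, GENOME_SIZE_MB)  # PacBio CCS yield
hic_depth = coverage_depth(81.4, GENOME_SIZE_MB)      # Hi-C yield

print(f"PacBio CCS depth: {pacbio_depth:.1f}x")
print(f"Hi-C depth: {hic_depth:.1f}x")
```

Under this assumption the computed values round to the 118.9× and 92.2× reported in the text.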

Total RNA was extracted from ground tissue using TRIzol reagent (Tiangen, Germany) under cryogenic conditions, following the manufacturer’s instructions. RNA from muscles, liver, intestine, head kidney, heart, spleen, kidney, gills, brain, and gonads was combined for RNA-Seq on the MGISEQ-2000 platform. This process generated 18.6 Gb of RNA-seq data, which was utilized for comprehensive genome-wide prediction of protein-coding genes (Table 1).

Genome survey and assembly

The 105.0 Gb PacBio long-read dataset (Table 1) was assembled de novo using Hifiasm v0.16.1¹⁶, yielding an assembly of 898.4 Mb comprising 144 contigs with a contig N50 of 24.8 Mb (Table 2).

The initial assembly was then refined using purge_dups v1.2.5¹⁷ in conjunction with minimap2 v2.22¹⁸. Minimap2 was used to align the sequencing reads to the assembled contigs, providing read coverage across different segments, and to perform self-alignments of the assembly to identify duplicated structures. Purge_dups then used this alignment information to detect and classify duplicated sequences, differentiating primary assembly components from potential alternative haplotypes so that redundant sequence could be removed. The final deduplicated genome size was 883.1 Mb, consisting of 90 contigs with a contig N50 of 24.8 Mb (Table 2).
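For readers unfamiliar with the N50 statistic used above, the following minimal sketch shows how it is computed from a list of sequence lengths. The contig lengths in the example are hypothetical toy values, not the lengths of the O. elongatus contigs, which are not listed in the text.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths in Mb (illustration only).
toy_contigs_mb = [40, 30, 20, 5, 3, 2]
print(n50(toy_contigs_mb))
```

In the toy set the total is 100 Mb, and the two longest contigs (40 + 30) are the first to reach half of it, so the N50 is the length of the second contig.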

Subsequently, Hi-C data were used to anchor and orient the draft genome contigs into chromosome-scale assemblies. The deduplicated genome assembly was indexed using bwa v0.7.17¹⁹ and samtools v1.7²⁰. Hi-C reads were aligned to the genome using bwa mem, and the resulting alignment files were processed and sorted with samtools sort. PCR duplicates were removed from the alignments using bammarkduplicates2 in biobambam2 v2.0.87²¹. The assembly was then scaffolded using yahs v1.1²², which used the Hi-C data to order and orient the contigs, yielding an updated assembly. Gap and telomere analyses of this assembly were conducted using PretextMap v0.1.9 (https://github.com/sanger-tol/PretextMap), followed by manual curation with PretextView v0.2.5 (https://github.com/sanger-tol/PretextView), in which the positions of sequence fragments are adjusted based on the visualized genomic interaction information. Scaffolds exhibiting strong interaction signals were clustered together, enabling demarcation of chromosomal boundaries.

Ultimately, 99.83% of the assembled sequences were anchored onto 24 pseudo-chromosomes, with sizes ranging from 27.61 to 54.33 Mb, resulting in a total genome assembly length of 883.1 Mb, comprising 40 scaffolds with a scaffold N50 of 35.1 Mb (Fig. 2, Table 3). To evaluate the completeness of the genome, BUSCO v5.4.7²³ was run with the parameters ‘-l actinopterygii_odb10 -g genome’ against the Actinopterygii reference dataset. Of the 3,640 benchmarking genes, 3,581 (98.3%) were identified as complete, indicative of a high-quality genome assembly. A further 21 (0.6%) BUSCOs were fragmented and 38 (1.1%) were missing. The assembly was also highly contiguous, with inter-sequence gaps accounting for merely 0.001% of its length (Table 4).
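The reported BUSCO counts and the anchoring rate can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only figures stated in this article (including the 881.6 Mb anchored length given in the Background & Summary); the dictionary keys and function name are hypothetical.

```python
# BUSCO counts as reported in the article.
busco = {
    "complete": 3581,
    "single_copy": 3536,
    "duplicated": 45,
    "fragmented": 21,
    "missing": 38,
    "total": 3640,
}


def busco_counts_consistent(counts):
    """Check that the BUSCO categories add up as expected."""
    complete_ok = counts["single_copy"] + counts["duplicated"] == counts["complete"]
    total_ok = (
        counts["complete"] + counts["fragmented"] + counts["missing"]
        == counts["total"]
    )
    return complete_ok and total_ok


# Share of the 883.1 Mb assembly anchored to the 24 pseudo-chromosomes.
anchored_pct = 100 * 881.6 / 883.1

print(busco_counts_consistent(busco))
print(f"Anchored: {anchored_pct:.2f}%")
```

The check compares counts rather than percentages because the tool's reported percentages are rounded per category.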

Repetitive sequence annotation

For the annotation of repetitive sequences, we used RepeatMasker v4.1.6²⁴ with the Dfam database, which uses Hidden Markov Models (HMMs) to represent known repeats, and with RepBase for more complete family representation. To detect species-specific repeats not catalogued in public databases, RepeatModeler v2.0.5²⁵ was used to construct de novo repeat libraries through iterative clustering and refinement of sequence data. The annotations derived from Dfam, RepBase, and RepeatModeler were then consolidated into a unified dataset, with overlapping annotations merged and redundant entries removed. The results of repetitive sequence annotation are summarized in Table 5: DNA transposons constitute 25.04% of the genome, making them the most prevalent type of repetitive sequence (Fig. 3).
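The consolidation step, in which overlapping annotations are merged and redundant entries removed, can be illustrated with a generic interval-merging routine. The article does not describe how the merging was implemented, so the sketch below is an assumption about one common approach, with hypothetical function names and toy coordinates on a single sequence.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def masked_fraction(intervals, sequence_length):
    """Fraction of a sequence covered by the merged intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / sequence_length


# Toy repeat annotations from several sources, including an exact duplicate.
toy_repeats = [(0, 100), (50, 150), (150, 200), (300, 400), (300, 400)]
print(merge_intervals(toy_repeats))
print(masked_fraction(toy_repeats, 1000))
```

Merging before summing avoids counting a base twice when several libraries annotate the same region, which matters when reporting a genome-wide repeat percentage.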

Protein-coding gene prediction and annotation

Gene predictions were conducted through a combination of homology-based, transcriptome-based, and de novo prediction methods.

For homology-based gene prediction, full-genome protein sequences from three species of the family Xenocyprididae were used. The genomes of grass carp (Ctenopharyngodon idella, GCF_019924925.1), Wuchang bream (Megalobrama amblycephala, GCF_018812025.1), and predatory carp (Chanodichthys erythropterus, GCF_024489055.1) were downloaded from GenBank as sources of homologous proteins. To align the multi-species homologous protein sequences against the target genome, MMseqs2 v15-6f452²⁶ was employed with a filter criterion of “identity > 0.1, evalue < 1e-3”. Overlapping High Scoring Segment Pairs (HSPs) resulting from alternative splicing were merged, and a stricter filter was applied with “identity > 0.2, evalue < 1e-9, query coverage > 0.3”. To refine the alignments, genewise v2.4.1²⁷, gth v1.7.3²⁸, and exonerate v2.2.0²⁹ were used to perform precise spliced alignments of the matched proteins to the corresponding genomic regions, aiding in the prediction of gene structures for each protein region.

For transcriptome-based prediction, RNA-Seq datasets from ten tissues were subjected to quality control using Trimmomatic v0.39³⁰, and the trimmed reads were aligned to the reference genome sequence using HISAT2 v2.1.0³¹. Open Reading Frames (ORFs) were predicted from the assembled transcripts with TransDecoder v5.5.0³².

Additionally, Augustus v3.5.0³³ was used for de novo prediction of gene structures. The gene predictions from the three approaches were combined to create a non-redundant reference gene set, resulting in a total of 28,674 protein-coding genes (Table 6).
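The two filter criteria quoted for the homology-based step can be expressed as simple predicates. The sketch below is illustrative only: the `Hit` record and its field names are hypothetical, identity and coverage are assumed to be fractions between 0 and 1, and the HSP-merging step that the authors perform between the two filters is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical protein-to-genome alignment record."""
    query: str
    identity: float        # fraction in [0, 1]
    evalue: float
    query_coverage: float  # fraction in [0, 1]


def passes_initial(hit: Hit) -> bool:
    """First-pass filter: identity > 0.1, evalue < 1e-3."""
    return hit.identity > 0.1 and hit.evalue < 1e-3


def passes_strict(hit: Hit) -> bool:
    """Second-pass filter: identity > 0.2, evalue < 1e-9, query coverage > 0.3."""
    return hit.identity > 0.2 and hit.evalue < 1e-9 and hit.query_coverage > 0.3


toy_hits = [
    Hit("protA", 0.85, 1e-50, 0.90),  # passes both filters
    Hit("protB", 0.15, 1e-5, 0.80),   # passes the initial filter only
    Hit("protC", 0.60, 1e-12, 0.10),  # fails the coverage threshold
    Hit("protD", 0.05, 1e-20, 0.95),  # fails the initial identity threshold
]

initial = [h for h in toy_hits if passes_initial(h)]
strict = [h for h in initial if passes_strict(h)]
print([h.query for h in initial])
print([h.query for h in strict])
```

Applying a permissive filter first and a stricter one after merging keeps fragmented matches alive long enough to be combined into a single, better-supported alignment.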

Protein-coding genes were annotated by aligning genomic sequences against the NT database using Blast v2.13.0³⁴ with an e-value threshold of 1e-10. The predicted protein sequences were further compared against the NR, Uniprot³⁵, GO³⁶, KEGG³⁷, COG³⁸, and Pfam³⁹ databases using diamond v2.1.8⁴⁰. Ultimately, 28,637 genes (99.87%) received successful annotations (Table 6).
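The overall annotation rate can be understood as the share of predicted genes with a hit in at least one database. The sketch below assumes that definition, which the article does not state explicitly; the gene identifiers and per-database hit sets are hypothetical toy data, and only the final two totals come from the text.

```python
def annotation_summary(total_genes, hits_by_database):
    """Return (annotated gene count, percent annotated), counting a gene
    as annotated if it appears in at least one database's hit set."""
    annotated = set().union(*hits_by_database.values())
    return len(annotated), 100 * len(annotated) / total_genes


# Hypothetical hits for a toy set of five genes (g1..g5).
toy_hits = {
    "NR": {"g1", "g2", "g3"},
    "Pfam": {"g2", "g4"},
    "KEGG": {"g1"},
}
toy_count, toy_pct = annotation_summary(5, toy_hits)
print(toy_count, toy_pct)

# Totals reported in the article.
reported_pct = 100 * 28637 / 28674
print(f"{reported_pct:.2f}%")
```

Taking the union rather than summing per-database counts avoids counting a gene more than once when it is annotated by several databases.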

Collinearity analysis

Genome collinearity analysis and visualizations were performed using the MCScanX⁴¹ tool from TBtools v2.096⁴² with parameters set as follows: e-value was set to 1e-10, and the Number of BlastHits was set to 5. Collinearity analysis between O. elongatus and two other species from the family Xenocyprididae revealed high collinearity, indicating good assembly quality of O. elongatus (Fig. 4).
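The two parameters mentioned above, an e-value cutoff of 1e-10 and a limit of 5 BLAST hits, amount to a per-query filter on the protein alignments that feed collinearity detection. The sketch below is not the TBtools or MCScanX implementation; it assumes standard 12-column tabular BLAST output, ranks hits by bit score, and treats the cutoff as inclusive, all of which are assumptions. The helper names and toy records are hypothetical.

```python
from collections import defaultdict


def filter_blast_hits(lines, max_evalue=1e-10, max_hits=5):
    """Keep, per query, at most max_hits hits with evalue <= max_evalue."""
    per_query = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            per_query[query].append((subject, evalue, bitscore))
    kept = {}
    for query, hits in per_query.items():
        # Rank by descending bit score, then ascending e-value.
        hits.sort(key=lambda h: (-h[2], h[1]))
        kept[query] = hits[:max_hits]
    return kept


def make_line(query, subject, evalue, bitscore):
    """Build a toy 12-column tabular BLAST record."""
    return "\t".join([query, subject, "90.0", "100", "0", "0",
                      "1", "100", "1", "100", str(evalue), str(bitscore)])


toy = [make_line("geneA", f"target{i}", 1e-50, 200 - i) for i in range(7)]
toy.append(make_line("geneA", "weak", 1e-5, 500))  # fails the e-value cutoff
toy.append(make_line("geneB", "t1", 1e-30, 150))

result = filter_blast_hits(toy)
print([hit[0] for hit in result["geneA"]])
print([hit[0] for hit in result["geneB"]])
```

Limiting the number of hits per gene keeps large gene families from flooding the collinearity search with spurious anchor pairs.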

Data Records

The sequencing dataset and the genome assembly are stored in public databases. The sequencing data used for genome assembly, including PacBio, Hi-C, and RNA-seq, have been deposited in the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (SRA) under BioProject PRJNA1163015⁴³. The complete genomic sequence data has been submitted to NCBI GenBank database with the accession number GCA_041950425.1⁴⁴. The results of the genome annotation have been made available in the Figshare database⁴⁵.

Technical Validation

For DNA intended for library preparation and sequencing, quality and purity were evaluated using 0.75% agarose gel electrophoresis, which showed that the main band size was greater than 23,130 bp. Additionally, purity and concentration were assessed using both a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA) and Qubit Fluorometer (Invitrogen, USA). The concentrations measured by both instruments were greater than 461.0 ng/µL. The NanoDrop results also indicated a 260/280 ratio of 1.87 and a 260/230 ratio of 2.43.

The RNA used for library preparation and sequencing had a concentration greater than 640.0 ng/µL as measured by NanoDrop and Qubit. Its integrity was assessed using the Agilent 2100 Bioanalyzer (Agilent, USA) with Agilent RNA 6000 Nano Kit (Agilent, USA), yielding an RNA Integrity Number (RIN) of 8.8, indicating good integrity.

BLAST v2.13.0 was used to align genomic sequences to the NT database, enabling the identification of protein-coding genes and the assessment of genomic sequence contamination. A stringent 1e-10 e-value cutoff was selected. The analysis confirmed that there were no bacterial or artificial contaminants in our constructed genome.

Code availability

All software used in this study is publicly available, and the parameters are described in the Methods. Where no parameters are specified for a tool, the default parameters suggested by its developers were used.

References

Chen, T., Wang, Y., Gardner, C. & Wu, F. Threats and protection policies of the aquatic biodiversity in the Yangtze River. Journal for Nature Conservation 58, 125931, https://doi.org/10.1016/j.jnc.2020.125931 (2020).
Zhang, H. et al. Extinction of one of the world’s largest freshwater fishes: Lessons for conserving the endangered Yangtze fauna. Science of the Total Environment 710, 136242, https://doi.org/10.1016/j.scitotenv.2019.136242 (2020).
Zhang, X. et al. The Yangtze River dolphin or baiji (Lipotes vexillifer): population status and conservation issues in the Yangtze River, China. Aquatic Conservation: Marine and Freshwater Ecosystems 13(1), 51–64, https://doi.org/10.1002/aqc.547 (2003).
Bin, W., Weiping, W., Haihua, W. & Gang, H. A retrospective analysis on the population viability of the Yangtze river Dolphin or Baiji (Lipotes vexillifer). Indian Journal of Animal Research 56(6), 775–779, https://doi.org/10.18805/ijar.B-1238 (2022).
Liu, F. et al. Changes in fish resources 5 years after implementation of the 10-year fishing ban in the Chishui River, the first river with a complete fishing ban in the Yangtze River Basin. Ecological Processes 12, 51, https://doi.org/10.1186/s13717-023-00465-6 (2023).
Weitao, C. et al. Genetic structure analysis of Ochetobius elongatus between Yangtze River and Pearl River using multiple loci. South China Fisheries Science 18(6), 19–25, https://doi.org/10.12131/20220007 (2022).
Yang, J., Li, C., Chen, W., Li, Y. & Li, X. Genetic diversity and population demographic history of Ochetobius elongatus in the middle and lower reaches of the Xijiang River. Biodiversity Science 26(12), 1289, https://doi.org/10.17520/biods.2018121 (2018).
Yang, J. et al. Development and characterization of 26 SNP markers in Ochetobius elongatus based on restriction site-associated DNA sequencing (RAD-seq). Conservation genetics resources 12, 53–55, https://doi.org/10.1007/s12686-018-1075-3 (2020).
Jiang, Z. et al. Red list of China’s vertebrates. Biodiversity Science 24(5), 500–551, https://doi.org/10.17520/biods.2016076 (2016).
Yin, D. et al. Telomere-to-telomere gap-free genome assembly of the endangered Yangtze finless porpoise and East Asian finless porpoise. GigaScience 13, giae067, https://doi.org/10.1093/gigascience/giae067 (2024).
Yuan, Y. et al. Genome sequence of the freshwater Yangtze finless porpoise. Genes 9(4), 213, https://doi.org/10.3390/genes9040213 (2018).
Hu, X. et al. Genomic deciphering of sex determination and unique immune system of a potential model species rare minnow (Gobiocypris rarus). Science Advances 8(5), eabl7253, https://doi.org/10.1126/sciadv.abl7253 (2022).
Wen, Z. et al. Chromosome-level genome assemblies of vulnerable male and female elongate loach (Leptobotia elongata). Scientific data 11, 924, https://doi.org/10.1038/s41597-024-03789-w (2024).
Eid, J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323, 133–138, https://doi.org/10.1126/science.1162986 (2009).
Rao, S. et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 159, 1665–1680, https://doi.org/10.1016/j.cell.2014.11.021 (2014).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36(9), 2896–2898, https://doi.org/10.1093/bioinformatics/btaa025 (2020).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100, https://doi.org/10.1093/bioinformatics/bty191 (2018).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760, https://doi.org/10.1093/bioinformatics/btp324 (2009).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078–2079, https://doi.org/10.1093/bioinformatics/btp352 (2009).
Tischler, G. & Leonard, S. biobambam: tools for read pair collation based algorithms on BAM files. Source Code for Biology and Medicine 9, 13, https://doi.org/10.1186/1751-0473-9-13 (2014).
Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics (Oxford, England) 39(1), btac808, https://doi.org/10.1093/bioinformatics/btac808 (2023).
Simao, F. A. et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212, https://doi.org/10.1093/bioinformatics/btv351 (2015).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6, 11, https://doi.org/10.1186/s13100-015-0041-9 (2015).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117(17), 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028, https://doi.org/10.1038/nbt.3988 (2017).
Doerks, T., Copley, R. R., Schultz, J., Ponting, C. P. & Bork, P. Systematic identification of novel protein domain families associated with nuclear functions. Genome Research 12, 47–56, https://doi.org/10.1101/gr.203201 (2002).
Gremme, G., Brendel, V., Sparks, M. E. & Kurtz, S. Engineering a software tool for gene structure prediction in higher organisms. Information and Software Technology 47(15), 965–978, https://doi.org/10.1016/j.infsof.2005.09.005 (2005).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15), 2114–2120, https://doi.org/10.1093/bioinformatics/btu170 (2014).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915, https://doi.org/10.1038/s41587-019-0201-4 (2019).
Haas, B. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature protocols 8(8), 1494–1512, https://doi.org/10.1038/nprot.2013.084 (2013).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Research 34, W435–439, https://doi.org/10.1093/nar/gkl200 (2006).
McGinnis, S. & Madden, T. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research 32, W20–25, https://doi.org/10.1093/nar/gkh435 (2004).
UniProt Consortium, T. UniProt: the universal protein knowledgebase. Nucleic Acids Research 46, 2699, https://doi.org/10.1093/nar/gky092 (2018).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29, https://doi.org/10.1038/75556 (2000).
Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Research 42, D199–205, https://doi.org/10.1093/nar/gkt1076 (2014).
Tatusov, R., Galperin, M., Natale, D. & Koonin, E. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28(1), 33–6, https://doi.org/10.1093/nar/28.1.33 (2000).
Finn, R. et al. Pfam: the protein families database. Nucleic Acids Research 42, D222–30, https://doi.org/10.1093/nar/gkt1223 (2014).
Buchfink, B., Xie, C. & Huson, D. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60, https://doi.org/10.1038/nmeth.3176 (2015).
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic acids research 40(7), e49–e49, https://doi.org/10.1093/nar/gkr1293 (2012).
Chen, C. et al. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Molecular plant 13(8), 1194–1202, https://doi.org/10.1016/j.molp.2020.06.009 (2020).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP533749 (2024).
Liang, X. Ochetobius elongatus isolate XLOel202024, whole genome shotgun sequencing project. GenBank. https://identifiers.org/ncbi/insdc:JBHHFX000000000 (2024).
Liang, X. The genomic annotation-related data for Ochetobius elongatus. figshare. Dataset. https://doi.org/10.6084/m9.figshare.27088858.v1 (2024).

Acknowledgements

This project was supported by the Jiujiang Academy of Agricultural Sciences, China, and the project ‘Research on Breeding Technology of Candidate Species for Guangdong Modern Marine Ranching’ (Project: 2024-MRB-00-001).

Author information

These authors contributed equally: Lekang Li, Xuanguang Liang.

Authors and Affiliations

Jiujiang Academy of Agricultural Sciences, Jiujiang, 332000, Jiangxi, China
Lekang Li, Qun Xu, Xiaoping Gao, Bao Zhang, Xianyong Wang, Zhen Luo & Chiping Kong
School of Marine Sciences, Sun Yat-sen University, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, 519082, Guangdong, China
Xuanguang Liang, Jiatong Zhang & Jianguo Lu
Agro-Tech Extension Center of Guangdong Province, Guangzhou, 510000, Guangdong, China
Li Wang
Ruichang Yangtze River Four Major Chinese Carps Original Strains Breeding Center, Ruichang, 332200, Jiangxi, China
Xiangchun Song, Dan Huang & Hong Wang

Contributions

Jianguo Lu: Conceptualization, Project administration, Supervision, Funding acquisition, Writing-review & editing; Chiping Kong: Conceptualization, Project administration, Supervision, Funding acquisition, Resources, Writing-review & editing; Lekang Li: Resources, Writing-review & editing; Xuanguang Liang: Formal analysis, Data curation, Validation, Visualization, Writing-original draft, Writing-review & editing; Jiatong Zhang: Data curation, Writing-review & editing; Qun Xu: Resources, Writing-review & editing; Li Wang: Writing-review & editing; Xiaoping Gao: Resources, Writing-review & editing; Xiangchun Song: Resources; Bao Zhang: Resources, Writing-review & editing; Dan Huang: Resources; Hong Wang: Resources; Xianyong Wang: Resources; Zhen Luo: Resources. Lekang Li and Xuanguang Liang contributed equally to this work.

Corresponding authors

Correspondence to Chiping Kong or Jianguo Lu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, L., Liang, X., Zhang, J. et al. A chromosome-level genome assembly of critically endangered Ochetobius elongatus. Sci Data 11, 1399 (2024). https://doi.org/10.1038/s41597-024-04223-x

Received: 01 October 2024
Accepted: 02 December 2024
Published: 19 December 2024
Version of record: 19 December 2024
DOI: https://doi.org/10.1038/s41597-024-04223-x