Background & Summary

The Yangtze River Basin, ranked as the world’s third-largest river basin, spans an extensive area of 1,800,000 square kilometers and is home to over 400 species of fish. However, due to intensive development, there has been a notable decline in biodiversity1. Surveys conducted throughout the basin between 2017 and 2018 identified 332 species of fish, yet 140 historically recorded species were not found, with most considered to be critically endangered2. For example, the Yangtze River dolphin (Lipotes vexillifer) saw its population plummet from approximately 400 individuals in the 1980s to fewer than 50 by the end of the 20th century, with no sightings reported after 20061,3, and it is now considered functionally extinct4. To restore the ecosystem, a ten-year fishing ban was implemented in the Yangtze River. Preliminary results indicate an improvement in fishery resources, with 11 native species that had long disappeared reappearing in the Chishui River, a tributary of the Yangtze5.

O. elongatus was once a significant freshwater economic fish in China, widely distributed in the Yangtze River6 (Fig. 1). However, due to environmental degradation and human activities, its population has declined rapidly. Under criterion A2acd of the IUCN Red List Categories and Criteria, the population was judged to have declined by at least 80% over the past decade, based on direct observation, a decline in extent of occurrence, area of occupancy, or habitat quality, and actual or potential levels of exploitation. Consequently, the species was listed as Critically Endangered (CR) in the Red List of China’s Vertebrates of Endangered Animals in 20167,8,9. Studies in 2018 indicated that the wild population of O. elongatus was still declining7. Following the implementation of the ten-year fishing ban in the Yangtze River, seven individuals of O. elongatus were discovered in the Gong’an section of the Yangtze River in December 2020. This was the third recorded sighting of the species, after those in June 2017 and November 2020, and the first instance in recent years of multiple individuals being recorded (https://www.cafs.ac.cn/info/1024/36845.htm). The reappearance of O. elongatus provides an opportunity for conservation. Conservation genetics and genomics have already been applied to several endangered species in the Yangtze River Basin: genomes have been sequenced and assembled for the two subspecies of Neophocaena asiaeorientalis (sunameri10 and asiorientalis11), the endangered species Gobiocypris rarus12, and the vulnerable species Leptobotia elongata13. In contrast, the critically endangered O. elongatus, which inhabits the same basin, has yet to have its genome assembled.
The assembly of a genome is crucial for understanding the genetic diversity, identifying unique adaptations, and developing effective conservation strategies for endangered species. The lack of a genome assembly for O. elongatus represents a significant obstacle to its conservation efforts.

Fig. 1

Photograph of the O. elongatus individual used for sequencing and assembly.

In this study, we assembled a chromosome-scale genome of O. elongatus using a combination of PacBio long reads and Hi-C technology (Table 1). The initial assembly spanned 898.4 Mb across 144 contigs with a contig N50 of 24.8 Mb (Table 2). Following redundancy removal and Hi-C scaffolding, 99.83% of the assembly, totaling 881.6 Mb, was anchored to 24 pseudo-chromosomes, yielding 40 scaffolds with a scaffold N50 of 35.1 Mb (Fig. 2, Tables 2, 3). BUSCO analysis with the actinopterygii_odb10 database identified 3,581 (98.3%) complete BUSCOs, comprising 3,536 (97.1%) single-copy and 45 (1.2%) duplicated BUSCOs, together with 21 (0.6%) fragmented and 38 (1.1%) missing BUSCOs (Table 4). Annotation revealed 394.1 Mb of repetitive sequences (Fig. 3, Table 5), of which DNA transposons were the largest class (221.10 Mb, 25.04%). A total of 28,674 protein-coding genes were predicted using a combination of homology-based, RNA-Seq-assisted, and ab initio methods, of which 28,637 (99.87%) were annotated (Table 6). Collinearity analysis between O. elongatus and two other species from the family Xenocyprididae revealed high collinearity, indicating good assembly quality (Fig. 4). This genome assembly provides a foundation for assessing the genetic diversity of O. elongatus and for formulating targeted conservation strategies.

Table 1 Sequencing data for the O. elongatus genome assembly.
Table 2 Assembly statistics of the O. elongatus genome.
Fig. 2

Characteristics of the O. elongatus genome. (a) Hi-C intra-chromosomal contact map of the O. elongatus genome assembly. (b) Genomic synteny circos plot between zebrafish (Danio rerio) and O. elongatus. (1) O. elongatus chromosomes at 100 kb scale; (2) GC content; (3) coding sequences drawn in a 100 kb sliding window with a 100 kb step.

Table 3 Statistical results of the 24 pseudochromosomes of O. elongatus genome.
Table 4 Genome quality assessment statistics of the O. elongatus genome.
Fig. 3

Length distribution of the different types of repetitive sequences in the O. elongatus genome (only repeats shorter than 3,000 in length are shown).

Table 5 Statistics of repetitive sequences in the O. elongatus genome.
Table 6 Summary of the functional gene annotation of the O. elongatus genome.
Fig. 4

Genomic collinearity analysis between O. elongatus and two Xenocyprididae species (C. idella and C. erythropterus).

Methods

Sample collection and sequencing

A healthy individual of O. elongatus was collected from the Jiujiang section of the Yangtze River, Jiangxi Province, China, on March 21, 2024. DNA was extracted from its muscle tissue for long-read Single Molecule Real-Time (SMRT) sequencing and Hi-C sequencing. For RNA sequencing, total RNA was prepared by pooling RNA extracted from ten distinct tissues (muscle, gill, liver, brain, heart, caudal fin, gut, gonad, skin). All samples were rapidly frozen in liquid nitrogen and subsequently stored at −80 °C to preserve their integrity.

The processes for DNA extraction, library preparation, and sequencing were executed by Nextomics Biosciences (Wuhan, China), strictly adhering to the protocols provided by the manufacturers. Genomic DNA of high molecular weight was obtained from muscle samples, and only DNA of high quality was employed for the construction of sequencing libraries and subsequent high-throughput sequencing.

For PacBio sequencing, SMRTbell libraries were constructed using a 20-kilobase preparation kit according to the standard protocols outlined by Pacific Biosciences (CA, USA)14. Library preparation included DNA shearing, end-repair, and ligation with hairpin adapters to generate circular templates compatible with SMRT sequencing technology. Sequencing was conducted on the PacBio Revio platform in Circular Consensus Sequencing (CCS) mode, in line with the manufacturer’s guidelines. Quality control of the raw data was performed using the ccs software (https://github.com/PacificBiosciences/ccs), with parameters set to “--min-passes 1 --min-rq 0.99 --min-length 100”. Sequencing a single SMRT cell yielded 105.0 Gb of PacBio CCS reads, corresponding to a coverage depth of 118.9×, which were used for subsequent genome assembly. The average length of the subreads was 18,447 base pairs (Table 1).

For Hi-C sequencing, freshly collected muscle tissues were stabilized through fixation with 2% formaldehyde to facilitate DNA-protein crosslinking. The preparation of Hi-C libraries involved steps including the digestion of cross-linked DNA, biotin labeling, proximity-based ligation, and DNA purification15. Subsequently, the Hi-C libraries were sequenced on the MGISEQ-2000 platform using 150 bp paired-end reads to capture the spatial interactions between chromosomal regions. This resulted in generating 81.4 Gb of Hi-C data with an average sequencing depth of 92.2× (Table 1).
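The coverage depths reported above follow from dividing the sequencing yield by the genome size. The short sketch below reproduces that arithmetic. It assumes that the denominator is the 883.1 Mb deduplicated assembly reported later in the Methods; the text does not state the denominator explicitly, and the function name is illustrative only.

```python
def coverage_depth(yield_gb: float, genome_size_mb: float) -> float:
    """Return mean fold coverage: total sequenced bases / genome size."""
    return yield_gb * 1000.0 / genome_size_mb


# Assumption: depth is computed against the 883.1 Mb deduplicated assembly.
GENOME_SIZE_MB = 883.1

pacbio_depth = coverage_depth(105.0, GENOME_SIZE_MB)  # PacBio CCS yield in Gb
hic_depth = coverage_depth(81.4, GENOME_SIZE_MB)      # Hi-C yield in Gb

print(f"PacBio CCS depth: {pacbio_depth:.1f}x")
print(f"Hi-C depth: {hic_depth:.1f}x")
```

Under this assumption the two values round to 118.9× and 92.2×, matching the depths given in the text.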

Total RNA was extracted from ground tissue using TRIzol reagent (Tiangen, Germany) under cryogenic conditions, following the manufacturer’s instructions. RNA from muscles, liver, intestine, head kidney, heart, spleen, kidney, gills, brain, and gonads was combined for RNA-Seq on the MGISEQ-2000 platform. This process generated 18.6 Gb of RNA-seq data, which was utilized for comprehensive genome-wide prediction of protein-coding genes (Table 1).

Genome survey and assembly

De novo genome assembly of the 105.0 Gb PacBio long-read dataset (Table 1) was performed using Hifiasm v0.16.116. This yielded an assembly of 898.4 Mb, comprising 144 contigs with a contig N50 of 24.8 Mb (Table 2).
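For readers unfamiliar with the N50 statistic used throughout this section, the following minimal sketch shows how it is commonly computed from a list of contig or scaffold lengths. The lengths used here are toy values, not data from this assembly, and the function is an illustration rather than the tool used to produce Table 2.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical contig lengths, arbitrary units).
toy_contigs = [10, 8, 6, 4, 2]
print(n50(toy_contigs))
```

In the toy example the total length is 30, and the two longest contigs (10 and 8) together reach 18, which exceeds half of the total, so the N50 is 8.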

The initial assembly was then refined using purge_dups v1.2.517 in conjunction with minimap2 v2.2218. Minimap2 was used to align the sequencing reads to the assembled contigs to assess coverage across different segments, and to perform self-alignments to identify repetitive structures. Purge_dups then used this alignment information to detect and classify duplicated sequences, differentiating primary assembly components from potential haplotigs and thereby removing redundancy. The final deduplicated genome size was 883.1 Mb, consisting of 90 contigs with a contig N50 of 24.8 Mb (Table 2).

Subsequently, Hi-C data were used to anchor and orient the draft genome contigs into chromosome-scale scaffolds. The deduplicated genome assembly was indexed using bwa v0.7.1719 and samtools v1.720. Hi-C reads were aligned to the genome using bwa mem, and the resulting alignment files were processed and sorted with samtools sort. PCR duplicates were removed from the alignments using bammarkduplicates2 in biobambam2 v2.0.8721. The assembly was then scaffolded using yahs v1.122, which leveraged the Hi-C data to order and orient contigs, yielding an updated assembly. Gap and telomere analyses of this assembly were conducted using PretextMap v0.1.9 (https://github.com/sanger-tol/PretextMap), followed by manual curation with PretextView v0.2.5 (https://github.com/sanger-tol/PretextView), in which the positions of sequence fragments were adjusted according to the visualized Hi-C interaction information. Scaffolds exhibiting strong interaction signals were clustered together, enabling demarcation of chromosomal boundaries.

Ultimately, 99.83% of the initially assembled sequences were anchored onto 24 pseudo-chromosomes ranging in size from 27.61 to 54.33 Mb, resulting in a total genome assembly length of 883.1 Mb comprising 40 scaffolds with a scaffold N50 of 35.1 Mb (Fig. 2, Table 3). Genome completeness was evaluated against the Actinopterygii reference dataset using BUSCO version 5.4.723 with parameters ‘-l actinopterygii_odb10 -g genome’. Of the 3,640 BUSCO groups searched, 3,581 (98.3%) were identified as complete, indicating a high-quality genome assembly, while 21 (0.6%) were fragmented and 38 (1.1%) were missing. The assembly was also highly contiguous, with inter-sequence gaps accounting for only 0.001% of its length (Table 4).
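As a simple sanity check on the BUSCO summary, the category counts reported in this paper should be internally consistent: single-copy plus duplicated BUSCOs give the complete count, and complete, fragmented, and missing BUSCOs together account for every group searched. The sketch below checks this using only the counts stated in the text. Percentages are deliberately not recomputed, since the tool's own rounding convention may differ from a naive calculation; the dictionary keys are illustrative names.

```python
# Counts as reported in the text (Table 4).
busco_counts = {
    "single_copy": 3536,
    "duplicated": 45,
    "fragmented": 21,
    "missing": 38,
}
TOTAL_GROUPS = 3640      # BUSCO groups searched
REPORTED_COMPLETE = 3581


def check_busco(counts, total_groups, reported_complete):
    """Return True if the BUSCO category counts are internally consistent."""
    complete = counts["single_copy"] + counts["duplicated"]
    accounted = complete + counts["fragmented"] + counts["missing"]
    return complete == reported_complete and accounted == total_groups


print(check_busco(busco_counts, TOTAL_GROUPS, REPORTED_COMPLETE))
```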

Repetitive sequence annotation

Repetitive sequences were annotated with RepeatMasker v4.1.624, using the Dfam database, which is built on profile Hidden Markov Models (HMMs), together with RepBase to identify known repeat families. To detect species-specific repeats not catalogued in public databases, RepeatModeler v2.0.525 was used to construct a de novo repeat library through iterative clustering and refinement of sequence data. The annotations derived from Dfam, RepBase, and RepeatModeler were then consolidated into a unified dataset, with overlapping annotations merged and redundant entries removed. The results of repetitive sequence annotation are summarized in Table 5; DNA transposons constitute 25.04%, making them the most prevalent type of repetitive sequence (Fig. 3).
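The consolidation step, in which overlapping annotations are merged and redundant entries removed, can be illustrated with a standard interval-merging routine. This is a minimal sketch, not the procedure actually used in this study: it assumes annotations are given as (sequence ID, start, end) tuples with 0-based, half-open coordinates, ignores repeat class, and treats directly adjacent intervals as mergeable. All names and coordinates are hypothetical.

```python
from collections import defaultdict


def merge_repeat_intervals(annotations):
    """Merge overlapping or adjacent intervals per sequence.

    annotations: iterable of (seq_id, start, end), 0-based half-open
    (an assumption made for this sketch).
    Returns a dict mapping seq_id to a sorted list of merged (start, end).
    """
    by_seq = defaultdict(list)
    for seq_id, start, end in annotations:
        by_seq[seq_id].append((start, end))

    merged = {}
    for seq_id, intervals in by_seq.items():
        intervals.sort()
        result = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = result[-1]
            if start <= last_end:
                # Overlaps (or touches) the previous interval: extend it.
                result[-1] = (last_start, max(last_end, end))
            else:
                result.append((start, end))
        merged[seq_id] = result
    return merged


def total_bases(merged):
    """Total number of bases covered by the merged intervals."""
    return sum(end - start for ivs in merged.values() for start, end in ivs)


# Hypothetical annotations from several sources, including one duplicate.
toy = [
    ("chr1", 0, 100), ("chr1", 50, 150), ("chr1", 50, 150),
    ("chr1", 200, 300), ("chr2", 10, 20),
]
merged = merge_repeat_intervals(toy)
print(merged, total_bases(merged))
```

Merging before summing avoids counting the same bases more than once when several libraries annotate the same region, which matters when a repeat fraction of the genome is reported.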

Protein-coding gene prediction and annotation

Gene prediction combined homology-based, transcriptome-based, and de novo prediction methods.

For homology-based gene prediction, full-genome protein sequences from three species of the family Xenocyprididae were used. The genomes of grass carp (Ctenopharyngodon idella, GCF_019924925.1), Wuchang bream (Megalobrama amblycephala, GCF_018812025.1), and predatory carp (Chanodichthys erythropterus, GCF_024489055.1) were downloaded from GenBank as sources of homologous proteins. These protein sequences were aligned against the target genome with MMseqs v15-6f45226, applying an initial filter of “identity > 0.1, evalue < 1e-3”. Overlapping High Scoring Segment Pairs (HSPs) resulting from alternative splicing were merged, and a stricter filter of “identity > 0.2, evalue < 1e-9, query coverage > 0.3” was then applied. To refine the alignments, genewise v2.4.127, gth v1.7.328, and exonerate v2.2.029 were used to perform precise spliced alignments of the matched proteins to the corresponding genomic regions, from which gene structures were predicted.

For transcriptome-based prediction, RNA-Seq datasets from ten tissues were subjected to quality control using Trimmomatic v0.3930, and the trimmed reads were aligned to the reference genome sequence using HISAT2 v2.1.031. Open Reading Frames (ORFs) were predicted from the assembled transcripts with TransDecoder v5.5.032.

Additionally, Augustus v3.5.033 was used for de novo prediction of gene structures. The predictions from the three approaches were combined to create a non-redundant reference gene set, resulting in a total of 28,674 protein-coding genes (Table 6).
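The two-stage filtering of homology hits described above can be sketched as a pair of predicates. This is an illustration only: it assumes that identity and query coverage are expressed as fractions between 0 and 1 (as the thresholds suggest), it omits the HSP merging that takes place between the two stages, and the `Hit` class and example values are hypothetical rather than part of any tool's output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A protein-to-genome alignment hit (hypothetical structure)."""
    query: str
    identity: float        # fraction of identical residues, 0-1 (assumed)
    evalue: float
    query_coverage: float  # fraction of the query aligned, 0-1 (assumed)


def passes_initial(hit: Hit) -> bool:
    """First-pass filter: identity > 0.1 and e-value < 1e-3."""
    return hit.identity > 0.1 and hit.evalue < 1e-3


def passes_strict(hit: Hit) -> bool:
    """Second-pass filter: identity > 0.2, e-value < 1e-9, coverage > 0.3."""
    return (hit.identity > 0.2
            and hit.evalue < 1e-9
            and hit.query_coverage > 0.3)


# Hypothetical hits for illustration.
hits = [
    Hit("protA", identity=0.85, evalue=1e-50, query_coverage=0.95),
    Hit("protB", identity=0.15, evalue=1e-5, query_coverage=0.60),
    Hit("protC", identity=0.40, evalue=1e-12, query_coverage=0.20),
    Hit("protD", identity=0.05, evalue=1e-20, query_coverage=0.90),
]

initial = [h for h in hits if passes_initial(h)]
# In the actual workflow, overlapping HSPs are merged before this step.
strict = [h for h in initial if passes_strict(h)]

print([h.query for h in initial])
print([h.query for h in strict])
```

A permissive first pass keeps divergent matches available for merging, while the stricter second pass retains only well-supported candidates for the more expensive spliced-alignment step.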

Protein-coding genes were annotated by aligning genomic sequences against the NT database using BLAST v2.13.034 with an e-value threshold of 1e-10. The predicted protein sequences were further compared against the NR, Uniprot35, GO36, KEGG37, COG38, and Pfam39 databases using diamond v2.1.840. Ultimately, 28,637 genes (99.87%) were successfully annotated (Table 6).

Collinearity analysis

Genome collinearity analysis and visualizations were performed using the MCScanX41 tool from TBtools v2.09642 with parameters set as follows: e-value was set to 1e-10, and the Number of BlastHits was set to 5. Collinearity analysis between O. elongatus and two other species from the family Xenocyprididae revealed high collinearity, indicating good assembly quality of O. elongatus (Fig. 4).

Data Records

The sequencing dataset and the genome assembly are stored in public databases. The sequencing data used for genome assembly, including PacBio, Hi-C, and RNA-seq, have been deposited in the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (SRA) under BioProject PRJNA116301543. The complete genomic sequence data has been submitted to NCBI GenBank database with the accession number GCA_041950425.144. The results of the genome annotation have been made available in the Figshare database45.

Technical Validation

For DNA intended for library preparation and sequencing, quality and purity were evaluated using 0.75% agarose gel electrophoresis, which showed that the main band size was greater than 23,130 bp. Additionally, purity and concentration were assessed using both a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA) and Qubit Fluorometer (Invitrogen, USA). The concentrations measured by both instruments were greater than 461.0 ng/µL. The NanoDrop results also indicated a 260/280 ratio of 1.87 and a 260/230 ratio of 2.43.

The RNA used for library preparation and sequencing had a concentration greater than 640.0 ng/µL as measured by NanoDrop and Qubit. Its integrity was assessed using the Agilent 2100 Bioanalyzer (Agilent, USA) with Agilent RNA 6000 Nano Kit (Agilent, USA), yielding an RNA Integrity Number (RIN) of 8.8, indicating good integrity.

BLAST v2.13.0 was used to align genomic sequences to the NT database, enabling the identification of protein-coding genes and the assessment of genomic sequence contamination. A stringent 1e-10 e-value cutoff was selected. The analysis confirmed that there were no bacterial or artificial contaminants in our constructed genome.