Background & Summary

Hypomesus nipponensis (Japanese smelt; NCBI Taxonomy ID: 182223), a small anadromous cold-water fish of the genus Hypomesus (family Osmeridae), has a short life cycle, high fecundity, and rapid population growth, adaptive traits that support its colonization of diverse water bodies and survival across heterogeneous habitats1. Before the 1980s, H. nipponensis was introduced to northeastern China, with Shuifeng Reservoir, the largest reservoir in Northeast China, recognized as one of its core introduction sites2. As a highly dispersive species, it is now distributed across Northeast Asia, spanning China, Japan, and the Korean Peninsula3.

Whole-genome information serves as the foundation for investigating biological characteristics. To date, several genomic resources for H. nipponensis have been reported. In 2019, the complete mitochondrial genome of H. nipponensis was sequenced4. Subsequently, a draft genome was generated in 2021, with a contig N50 of 464,523 bp5. Most recently, a chromosome-level genome assembly (designated HNIP-V2) with a contig N50 of 8.19 Mb was published6. These studies have provided critical genetic resources and established a robust foundation for breeding programs and biological research on H. nipponensis. However, the existing genome assemblies are limited by numerous gaps, particularly in repeat-rich regions such as telomeres and centromeres. Telomeric and centromeric DNA sequences are predominantly composed of satellite DNA and are known to evolve rapidly in eukaryotic genomes7,8. With advances in genome sequencing technologies and assembly methods, gap-free telomere-to-telomere (T2T) genome assemblies have become achievable, enabling the characterization of nearly the entire genome. Pacific Biosciences (PacBio) HiFi reads can resolve complex genomic regions, while Oxford Nanopore Technologies (ONT) ultra-long reads facilitate the resolution of tandem duplications9,10. Hifiasm, a high-performance assembly tool, has been successfully applied to gap-free T2T genome assembly in various aquatic vertebrates, including the Yangtze finless porpoise (Neophocaena asiaeorientalis)11 and the fishes Neosalanx taihuensis12, Asian icefish (Protosalanx chinensis)13, and Siniperca roulei14. Notably, its algorithm has recently been updated to specifically support T2T assembly using ONT data alone15.

In this study, we report the first gap-free T2T reference genome for H. nipponensis (designated HNIP-T2T), generated using multiple assembly strategies and integrating HiFi reads, ONT reads, MGI short reads, and chromatin conformation capture (Hi-C) data. The HNIP-T2T assembly spans approximately 526.31 Mb with an N50 of 20.23 Mb. Gene annotation identified 31,310 protein-coding genes, 97.67% of which were annotated in public biological databases. This high-quality, gap-free genome assembly will serve as an important resource for investigating the reproductive biology and ecological adaptability of H. nipponensis.

Methods

Sample collection and sequencing

In September 2024, a healthy H. nipponensis individual was collected from Shuifeng Reservoir on the Yalu River in Liaoning Province, China (Fig. 1a). High-quality, high-molecular-weight genomic DNA (gDNA) was extracted from muscle tissue using the cetyltrimethylammonium bromide method16. DNA purity was assessed by 1% agarose gel electrophoresis and quantified using a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA). DNA concentration was further determined using a Qubit 4.0 fluorometer (Invitrogen, USA). Following quality assessment, paired-end sequencing was performed on the DNBSEQ-T7 platform (MGI, Shenzhen, China), generating 143.95 Gb of raw reads (Table 1). Quality control of the sequencing data was conducted using fastp (v0.23.2)17 with default settings.

Fig. 1

A T2T genome assembly of H. nipponensis. (a) An image of the sequenced fish. (b) Estimated K-mer frequency distribution, showing the observed (raw) K-mer frequencies (grey), the K-mer frequencies fitted with a skew normal distribution model (blue), and the overall fit (red) that concatenates the observed and fitted K-mer frequencies. Genome size estimate for 19-mers: 525,729,808 bp. (c) Snail plot showing the features of the assembled H. nipponensis genome. The contiguity and completeness of the assembly after contamination screening are plotted as a circle that represents the full length of the assembly (~526.31 Mb). The N50 (20.23 Mb) is highlighted in dark orange and the N90 (13.86 Mb) in light orange. The longest contig was 27.41 Mb (highlighted in red). The assembly has a uniform GC content of 45.85%, and the BUSCO scores are shown in the top right corner in green. (d) Hi-C chromosome interaction heat map. The abscissa and ordinate represent the order of each bin on the corresponding chromosome group. The colour scale indicates interaction intensity from white (low) to red (high).

Table 1 Summary of DNA sequencing data of H. nipponensis genome.

For long-read sequencing, a SMRTbell library was constructed and sequenced using the PacBio Revio system (Pacific Biosciences, USA). Following preprocessing with the CCS program18, 36.32 Gb of high-quality Circular Consensus Sequencing (CCS) reads were generated, corresponding to a sequencing depth of ~69.01 × with an N50 value of 17,253 bp (Table 1).

For ONT sequencing, an ultra-long library was constructed and sequenced on one flow cell of a PromethION platform (Oxford Nanopore Technologies, UK). Raw reads were first filtered to remove bases with a quality value (QV) below 7. Adapter sequences were then trimmed using Porechop (https://github.com/rrwick/Porechop), and reads in which fewer than 90% of bases achieved QV ≥ 7 were removed using Filtlong (https://github.com/rrwick/Filtlong). In total, 28.63 Gb of clean reads were obtained, with an N50 length of 83.82 kb.

For Hi-C sequencing, we extracted gDNA, digested chromatin using the restriction enzyme MboI, and then conducted proximity ligation according to previously described protocols19. In brief, gDNA was cross-linked, digested, biotin-labeled, ligated, and fragmented to 350 bp, followed by purification with streptavidin magnetic beads. Library quality and insert size were assessed using a Qubit 3.0 Fluorometer and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), respectively. Libraries were sequenced on the DNBSEQ-T7 platform (MGI, Shenzhen, China), generating ~206.67 Gb of 150 bp paired-end reads (Table 1).
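The sequencing depths quoted in this section follow from dividing each library's yield by the genome length. The sketch below reproduces that arithmetic, assuming the final assembly length (526.31 Mb) is the denominator; the function name is illustrative. Only the HiFi depth is stated in the text, and the other values are derived here for comparison.

```python
def sequencing_depth(yield_gb: float, genome_mb: float) -> float:
    """Return fold coverage given a yield in Gb and a genome size in Mb."""
    return yield_gb * 1000.0 / genome_mb


GENOME_MB = 526.31  # HNIP-T2T assembly length reported in the text

# Yields (Gb) as reported for each library in this section
yields_gb = {
    "MGI short reads (raw)": 143.95,
    "PacBio HiFi (CCS)": 36.32,
    "ONT ultra-long (clean)": 28.63,
    "Hi-C": 206.67,
}

for library, gb in yields_gb.items():
    print(f"{library}: {sequencing_depth(gb, GENOME_MB):.2f}x")
```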

Genome size estimation

K-mers (K = 19) were counted from the clean MGI short reads with Jellyfish (v2.3.0)20, and the genome size of H. nipponensis was estimated at 525.73 Mb using findGSE (v1.94)21 (Fig. 1b).
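For intuition, the idea behind K-mer based genome size estimation can be sketched with a simple peak-based estimator: total K-mer occurrences divided by the depth of the main peak. This is not the findGSE method, which fits a skew normal model to the histogram; the function name, the error cutoff, and the toy histogram are assumptions made for illustration.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> float:
    """Peak-based genome size estimate from a K-mer histogram.

    histogram maps K-mer depth -> number of distinct K-mers at that depth.
    K-mers below min_depth are treated as sequencing errors and ignored
    (the cutoff is an illustrative assumption).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no K-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)             # depth of the main peak
    total_kmers = sum(d * n for d, n in solid.items()) # total K-mer occurrences
    return total_kmers / peak_depth


# Toy histogram: an error tail at low depth and a main peak at depth 20
toy = {1: 1000, 2: 200, 19: 50, 20: 100, 21: 50}
print(estimate_genome_size(toy))
```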

De novo genome assembly

Initially, contig assembly was performed using Hifiasm (v0.25.0-r726)22 on three distinct datasets: (1) PacBio HiFi data; (2) a combined dataset of error-corrected ONT data, PacBio HiFi data, and Hi-C data; and (3) raw ONT data (without error correction). Assembly using PacBio HiFi reads alone yielded 2,012 contigs, with a total length of 505.31 Mb and an N50 of 666,015 bp. When error-corrected ONT reads were combined with PacBio HiFi reads, 566 contigs were generated, with a total length of 554.94 Mb and an N50 of 4.46 Mb (Table 2). Assembly using raw ONT reads alone resulted in 76 contigs, with a total length of 774.05 Mb and an N50 of 18.76 Mb (Table 2). The ONT reads were error-corrected using NextDenovo (v2.5.0)23, producing 6.23 Gb of error-corrected reads with an N50 length of 143,361 bp. These error-corrected ONT reads were also assembled independently using NextDenovo (v2.5.0)23, yielding an assembly with a total length of 516.29 Mb and an N50 of 15.91 Mb (Table 2). Because it offered the highest contiguity, the Hifiasm assembly of raw ONT data (without error correction) was selected as the backbone for subsequent analyses.
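The candidate assemblies are compared mainly by N50, the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative.

```python
def nx(lengths: list[int], fraction: float = 0.5) -> int:
    """Return the Nx statistic (N50 when fraction=0.5, N90 when fraction=0.9)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return min(lengths)  # only reached if fraction > 1


contigs = [2, 3, 4, 5, 6]   # toy contig lengths
print(nx(contigs))          # N50
print(nx(contigs, 0.9))     # N90
```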

Table 2 Assembly statistics obtained with two different assemblers.

The contigs generated from the ONT-only assembly using Hifiasm (v0.25.0-r726)22 were polished with Pilon (v1.24)24 using clean short reads from MGI sequencing. Purge-Haplotigs (v1.1.2)25 was employed to reduce haplotypic duplication, thereby refining the assembly continuity and haploidy. Following the generation of non-redundant contigs, Hi-C clean reads were mapped to this assembly using Bowtie2 (v2.2.5)26 with the parameters --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder, and effective linkage products were then detected using HiC-Pro (v2.8.1)27 under default settings, retaining only valid contact pairs to support the anchoring of contigs to chromosomes. To orient, order, and cluster contigs into pseudochromosomes, we applied Juicer (v1.5)28 and 3D-DNA (v170123)29. Visualization and manual corrections were performed with Juicebox (v1.11.08)30 to adjust mis-assemblies and eliminate redundant contigs. Ultimately, 28 pseudochromosomes were obtained with only six gaps remaining (Table 3). The longest and shortest pseudochromosomes measured 27.40 Mb and 11.48 Mb, respectively. This chromosome number matches the count reported in the previously published HNIP-V2 assembly6 and is consistent with the karyotype of Hypomesus olidus (2n = 56)31.

Table 3 Pseudo-chromosome length statistics after Hi-C assisted assembly.

To achieve a gap-free, telomere-to-telomere (T2T)-level assembly, LR_GapCloser (v1.0)32 was applied sequentially to fill gaps using PacBio HiFi reads and error-corrected ONT long reads, with the following parameters: -m 1000000 -v 10000 -r 3. The resulting HNIP-T2T assembly comprises 28 anchored pseudochromosomes, with a total length of 526.31 Mb (Table 4). The N50 of these anchored chromosomes increased to 20.23 Mb (Table 4 and Fig. 1c). Notably, the Hi-C interaction heatmap exhibited high consistency across all pseudochromosomes, confirming the accuracy of sequencing data, contig ordering, and orientation in the HNIP-T2T assembly (Fig. 1d). The chromosome order and orientation of the HNIP-T2T assembly were adjusted to match the reference genome of Danio rerio (zebrafish; GenBank assembly accession: GCF_000002035.6), ensuring comparability with the genomic structure of this model species.
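Whether an assembly is gap-free can be checked by counting runs of N in each pseudochromosome, since scaffolding gaps are conventionally written as stretches of N. The sketch below illustrates such a check on in-memory sequences; it is not part of the published pipeline, and the function name and toy sequences are assumptions.

```python
import re

GAP_PATTERN = re.compile(r"[Nn]+")


def count_gaps(sequence: str, min_length: int = 1) -> int:
    """Count runs of N (assembly gaps) that are at least min_length long."""
    return sum(
        1 for match in GAP_PATTERN.finditer(sequence)
        if len(match.group()) >= min_length
    )


# Toy pseudochromosomes: one with two gaps, one gap-free
assembly = {
    "chr_with_gaps": "ACGTNNNNACGTNNAC",
    "chr_gap_free": "ACGTACGTACGT",
}
for name, seq in assembly.items():
    print(name, count_gaps(seq))
```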

Table 4 Summary statistics of H. nipponensis assembly.

The detailed assembly pipeline is illustrated in Fig. 2.

Fig. 2

Overview of the de novo genome assembly pipeline for H. nipponensis (T2T-level).

Identification of centromere and telomere sequences

Using the QuarTeT software33, we identified centromere and telomere sequences in the HNIP-T2T genome. QuarTeT's centromere prediction relies on three integrated signals: (1) tandem repeat enrichment (Tandem Repeats Finder parameters: match = 2, mismatch = 7, indel = 7, minimum score = 50); (2) CENH3 homolog co-localization; and (3) low recombination/high divergence signatures from read depth analysis. This strategy compensates for the lack of H. nipponensis karyotypic data. All 28 pseudochromosomes harbored intact telomeres and centromeres, giving 56 telomeres and 28 centromeres (average centromere length: 316,527 bp; Fig. 3). Centromere lengths varied widely, from 104,388 bp (pseudochromosome 15) to 1,690,782 bp (pseudochromosome 7). Future validation by fluorescence in situ hybridization (FISH, for chromosomal localization) and CENH3-targeted chromatin immunoprecipitation sequencing (ChIP-seq, for functional verification) would help confirm centromere positions and address the current gap in H. nipponensis karyotypic research.
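As a rough illustration of how telomeres can be recognized at chromosome ends, the sketch below counts copies of the canonical vertebrate telomeric repeat (TTAGGG, which appears as CCCTAA on the opposite strand) within a window at each end of a sequence. It is a deliberately simplified stand-in, not QuarTeT's algorithm; the window size, function names, and toy sequence are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def telomere_repeat_counts(seq: str, window: int = 1000,
                           motif: str = "TTAGGG") -> tuple[int, int]:
    """Count telomeric repeat copies in the first and last `window` bases.

    Both the motif and its reverse complement are tried at each end, and
    the larger non-overlapping count is reported.
    """
    rc = reverse_complement(motif)
    left = seq[:window].upper()
    right = seq[-window:].upper()
    left_count = max(left.count(motif), left.count(rc))
    right_count = max(right.count(motif), right.count(rc))
    return left_count, right_count


# Toy chromosome with repeats at both ends
toy = "CCCTAA" * 50 + "ACGT" * 100 + "TTAGGG" * 40
print(telomere_repeat_counts(toy, window=300))
```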

Fig. 3

Telomere and centromere locations in the H. nipponensis genome. The triangle represents the telomere region, and the circle represents the centromere region.

Repeat element annotation

In HNIP-T2T, repetitive elements were identified by integrating de novo and homology-based annotation methods. Homology-based searches for known repeat elements were performed against the RepBase database (http://www.girinst.org/repbase/)34 using RepeatMasker (v4.0.7)35 and RepeatProteinMask. For de novo annotation, we first used LTR_FINDER (v1.06)36 and RepeatModeler (v1.0.4)37 to construct a de novo repeat library. This library was then used to predict repetitive elements with RepeatMasker (v4.0.7)35 under default parameters. Additionally, Tandem Repeats Finder (v4.10.0)38 was applied to identify tandem repeats using the settings 2 7 7 80 10 50 2000 -d -h. In total, 206.18 Mb (39.17%) of repetitive sequences were identified. This proportion is higher than that in HNIP-V2 (33.59%)6. Among the interspersed repeats, DNA transposons were the most abundant type, representing 16.95% of the genome (Table 5).
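When annotations from several tools are combined, overlapping hits must be merged before the total repeat length is computed, otherwise shared bases are counted more than once. The sketch below shows one way to do this, assuming that the reported total is the non-redundant union of all hits; the function name and toy coordinates are illustrative.

```python
def merged_length(intervals: list[tuple[int, int]]) -> int:
    """Total bases covered by the union of half-open [start, end) intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before starting a new one
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current block
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


# Toy hits on one chromosome from two hypothetical annotation sources
hits = [(0, 100), (50, 150), (200, 250)]
chromosome_length = 1000
repeat_bases = merged_length(hits)
print(repeat_bases, f"{100 * repeat_bases / chromosome_length:.2f}%")
```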

Table 5 Statistics of interspersed repetitive sequences in H. nipponensis assembly.

Gene prediction and functional annotation

Gene structure annotation was performed following the established methodology from pig pan-genome research39. For transcriptome-based annotation, approximately 33.52 Gb of RNA-seq data from muscle tissues were mapped to the HNIP-T2T assembly using HISAT2 (v2.2.1)40 with the following parameters: --sensitive --no-discordant --no-mixed -I 1 -X 1000 --max-intronlen 1000000. The unique genome mapping rate ranged from 90.52% to 91.17% (Table 6). Subsequently, transcript assembly was performed using StringTie (v1.2.2)41 (parameters: -f 0.3 -j 3 -c 5 -g 100 -s 10000). Coding sequences (CDSs) were identified using TransDecoder (v5.7.1). Genes with complete structures were selected, with only the longest transcript retained for each gene. Single-exon genes were included only if a structural protein domain was detected. We excluded genes with ≥80% overlap between gene regions and repeat sequences, yielding a final transcriptome-derived candidate gene set. For homology-based prediction, genome sequences and annotation files were retrieved for five reference genomes: Danio rerio (zebrafish; GCF_000002035.6), HNIP-V26, Hypomesus transpacificus (GCF_021917145.1), Neosalanx taihuensis12, and Protosalanx chinensis13. Leveraging these RNA-seq and homology data, CDSs were predicted with GeMoMa (v1.9)42. Genes derived from transcriptome data but absent from homology predictions were incorporated into the gene set. Finally, untranslated regions and alternative splicing variants were annotated using the Program to Assemble Spliced Alignments (PASA; v2.4.1)43. The final comprehensive gene set comprised 31,310 genes, with means of 8.31 exons per gene, 191.90 bp per exon, and 1,593.89 bp per CDS.
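The filtering rules applied to the transcriptome-derived gene models (one longest transcript per gene, and removal of genes overlapping repeats by 80% or more) can be sketched as below. The record layout, the field names, and the use of CDS length to define the "longest" transcript are assumptions for illustration, not details taken from the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    gene_id: str
    transcript_id: str
    cds_length: int          # assumed criterion for "longest transcript"
    repeat_overlap: float    # fraction of the gene region covered by repeats


def select_candidates(transcripts: list[Transcript],
                      max_repeat_overlap: float = 0.8) -> dict[str, Transcript]:
    """Keep the longest transcript per gene, dropping repeat-dominated genes."""
    best: dict[str, Transcript] = {}
    for t in transcripts:
        if t.repeat_overlap >= max_repeat_overlap:
            continue  # gene region is >=80% repeats: excluded
        current = best.get(t.gene_id)
        if current is None or t.cds_length > current.cds_length:
            best[t.gene_id] = t
    return best


# Toy records
records = [
    Transcript("g1", "g1.t1", 900, 0.10),
    Transcript("g1", "g1.t2", 1500, 0.10),
    Transcript("g2", "g2.t1", 1200, 0.85),
]
kept = select_candidates(records)
print({gene: t.transcript_id for gene, t in kept.items()})
```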

Table 6 Summary of RNAseq sequencing data of H. nipponensis genome.

The protein-coding genes were functionally annotated by aligning them against several widely used protein databases. Briefly, amino-acid sequences were aligned to SwissProt44, Kyoto Encyclopedia of Genes and Genomes (KEGG)45, Eukaryotic Orthologous Groups (KOG)46, and the NCBI nonredundant database (NR) using DIAMOND (v2.1.10)47 with an E-value cutoff of 1e-05. Protein domains were identified using InterProScan (v5.30)48, and Gene Ontology (GO) terms for each gene were also extracted through InterProScan. Overall, 30,582 genes (97.67%) were functionally annotated (Fig. 4).
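The overall annotation rate corresponds to the genes with a hit in at least one database, that is, the union of the per-database hit sets divided by the total gene count. A minimal sketch follows, assuming per-database hits are available as sets of gene identifiers; the function name and toy data are illustrative.

```python
def annotated_fraction(hits_by_db: dict[str, set[str]], all_genes: set[str]) -> float:
    """Fraction of genes with a hit in at least one database."""
    annotated = set().union(*hits_by_db.values()) & all_genes
    return len(annotated) / len(all_genes)


# Toy example with four genes and two databases
genes = {"g1", "g2", "g3", "g4"}
hits = {"NR": {"g1", "g2"}, "KEGG": {"g2", "g3"}}
print(f"{100 * annotated_fraction(hits, genes):.2f}%")

# The reported rate: 30,582 annotated genes out of 31,310
print(f"{100 * 30582 / 31310:.2f}%")
```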

Fig. 4

UpSetR plot showing the distribution of gene functional annotations across databases. Note: NR, Non-Redundant Protein Sequence Database; SwissProt, Swiss-Prot Protein Knowledgebase; KEGG, Kyoto Encyclopedia of Genes and Genomes; KOG, Eukaryotic Orthologous Groups; TrEMBL, Translated EMBL Nucleotide Sequence Data Library; InterPro, Integrative Protein Signature Database; GO, Gene Ontology.

Ethics declarations

Both the sampling procedure and experimental workflow were conducted in strict accordance with the guidelines of the Animal Ethics Committee of the Institute of Hydrobiology, Chinese Academy of Sciences, and have obtained its official approval (Approval Number: IHB1LL12024044).

Data Records

The sequencing data of Hypomesus nipponensis presented in this study have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database under accession number PRJNA1282796 (ref. 49). This includes short-read data [RNA-seq data: SRR34259912–SRR34259916; DNA survey data: SRR34259908; and Hi-C data: SRR34259911] and long-read data [ONT data: SRR34259910; and PacBio HiFi data: SRR34259909]. The final genome assembly is available under the accession numbers JBTLQK000000000 and GCA_054491055.1 (ref. 50). Furthermore, the final genome assembly, annotated coding sequences, and protein sequences are available at Figshare51.

Technical Validation

To assess the accuracy and quality of the H. nipponensis HNIP-T2T assembly, we first mapped the multi-platform sequencing data back to it: MGI short reads, PacBio HiFi reads, and ONT long reads achieved mapping rates of 99.72%, 99.97%, and 99.86%, respectively (with 98.71–99.99% genome coverage; Fig. 5), confirming strong consistency with the raw sequencing data. Transcriptome alignment also showed a higher unique mapping rate for HNIP-T2T than for HNIP-V2 (Table 6), supporting superior structural accuracy for downstream analyses.

Fig. 5

Mapping rate and coverage of reads from different sequencing platforms.

BUSCO (v5.8.0)52 benchmarking against Actinopterygii_odb10 (3,640 orthologs) recovered 98.2% complete genes (97.5% single-copy) for HNIP-T2T, exceeding the 96.7% of HNIP-V2 (Table 4); protein-level BUSCO yielded 98.0% complete orthologs, validating structural integrity. Merqury53 (k = 19) assigned a QV score of 35.11 (Table 4), consistent with T2T-level accuracy.
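A consensus QV is a Phred-scaled error probability, QV = -10 log10(E), so the reported score can be converted into an expected per-base error rate. The sketch below performs that conversion; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value into a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability into a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


qv = 35.11  # Merqury QV reported for HNIP-T2T
error = qv_to_error_rate(qv)
print(f"error rate: {error:.2e}")
print(f"about one error every {1 / error:,.0f} bases")
print(f"base accuracy: {100 * (1 - error):.3f}%")
```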

HNIP-T2T also outperformed HNIP-V2 in contiguity: its contig N50 (20.23 Mb) was approximately 2.5 times that of HNIP-V2 (8.19 Mb), and it was gap-free (0 vs. 189 gaps in HNIP-V2; Table 4). Minimap2-derived54 read coverage plots (Fig. 6) showed uniform depth across all 28 chromosomes, resolving the fragmented coverage and gaps of HNIP-V2. MUMmer55 collinear alignment (Fig. 7) confirmed strict chromosome-level synteny between the assemblies (93.01%/95.43% aligned bases for HNIP-T2T/HNIP-V2): diagonal high-similarity hits verified that HNIP-T2T retained the chromosomal framework of HNIP-V2 while correcting local misassemblies (diagonal deviations in HNIP-V2 correspond to linearity improvements in HNIP-T2T). Presence-absence variation (PAV) analysis was performed using BWA (v0.7.17-r1188)56 with the MEM algorithm (parameters: -w 500 -M -t 16; Table 7). PAVs were defined as sequences that failed to align or showed <25% coverage. The PAV content increased from 1.75 Mb (0.34% of the genome) in HNIP-V2 to 7.55 Mb (1.43%) in HNIP-T2T. This expansion primarily reflects the successful filling of genomic gaps, while SNP rates remained similar between the two versions (~0.21–0.22%). BUSCO assessment of gene set completeness (Actinopterygii_odb10) confirmed the superiority of HNIP-T2T (3,569 complete orthologs) over HNIP-V2 (3,485) and H. transpacificus (3,334; Fig. 8).
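The comparative figures in this paragraph can be cross-checked with simple arithmetic. The sketch below converts the complete BUSCO counts into percentages of the 3,640 Actinopterygii_odb10 orthologs and computes the contig N50 ratio between the two assemblies; all inputs are values quoted in the text, and the helper name is illustrative.

```python
N_BUSCO = 3640  # orthologs in Actinopterygii_odb10


def percent_complete(complete: int, total: int = N_BUSCO) -> float:
    """Complete BUSCO orthologs as a percentage of the lineage dataset."""
    return 100 * complete / total


# Complete orthologs in each gene set, as reported in the text
complete_counts = {"HNIP-T2T": 3569, "HNIP-V2": 3485, "H. transpacificus": 3334}
for name, count in complete_counts.items():
    print(f"{name}: {percent_complete(count):.2f}%")

# Contig N50 of HNIP-T2T relative to HNIP-V2 (both in Mb)
n50_ratio = 20.23 / 8.19
print(f"contig N50 ratio: {n50_ratio:.2f}")
```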

Fig. 6

The genome-read coverage plot (ONT and PacBio HiFi reads mapped via Minimap2).

Fig. 7

Collinear alignment (MUMmer) between the HNIP-T2T and HNIP-V2 assemblies of the H. nipponensis genome.

Table 7 Summary of genome structure alignment data between the HNIP-T2T and HNIP-V2 assemblies of the H. nipponensis genome.

Fig. 8

BUSCO assessment of gene set completeness.

Overall, HNIP-T2T represents a substantial improvement over HNIP-V2, with higher completeness, longer contiguity, gap-free structure, and robust mapping/transcriptome alignment performance.