A telomere-to-telomere reference genome assembly of the Hypomesus nipponensis

Zhou, Yanfeng; Fang, Di’an; You, Yang; Tang, Fujiang; Bai, Yulin; Zhang, Minying; Li, Xuemei; Deng, Guoping; Xu, Dongpo

doi:10.1038/s41597-026-07078-6


Data Descriptor
Open access
Published: 27 March 2026


Yanfeng Zhou¹,
Di’an Fang ORCID: orcid.org/0000-0003-2151-1865¹,
Yang You¹,
Fujiang Tang²,
Yulin Bai¹,
Minying Zhang¹,
Xuemei Li³,
Guoping Deng⁴ &
Dongpo Xu¹

Scientific Data volume 13, Article number: 755 (2026)


Abstract

A small cold-water teleost endemic to Northeast Asia, Hypomesus nipponensis possesses a short lifecycle, high fecundity, and rapid population growth, with extensive introductions for aquacultural purposes across East Asia. In this study, we generated a gap-free, telomere-to-telomere (T2T) genome assembly of H. nipponensis using a combined sequencing strategy, incorporating MGI short reads, PacBio High-Fidelity (HiFi) reads, Oxford Nanopore Technologies (ONT) ultra-long reads, and Hi-C data. The final assembly spans 526.31 Mb with a contig N50 of 20.23 Mb, and all genomic sequences were successfully anchored to 28 pseudochromosomes. BUSCO assessment (Actinopterygii_odb10) confirms 98.19% completeness, including 3,548 single-copy and 26 duplicated orthologs out of 3,640 conserved genes. Repeat elements account for 39.17% (206.18 Mb) of the genome, and 31,310 protein-coding genes are annotated. This gap-free T2T assembly resolves previously uncharacterized genomic regions, providing a high-quality reference for molecular breeding, evolutionary analyses of the Hypomesus genus, and functional investigations into adaptive traits of cold-water fishes.


Background & Summary

Hypomesus nipponensis (NCBI Taxonomy ID: 182223), a small anadromous cold-water fish of the genus Hypomesus (family Osmeridae), has a short life cycle, high fecundity, and rapid population growth. These adaptive traits support its colonization of diverse water bodies and its survival across heterogeneous habitats¹. Prior to the 1980s, H. nipponensis (Japanese smelt) was introduced to northeastern China, with Shuifeng Reservoir, the largest reservoir in Northeast China, recognized as one of its core introduction sites². As a highly dispersive species, this fish now ranges across all of Northeast Asia, including China, Japan, and the Korean Peninsula³.

Whole-genome information is the foundation for investigating a species' biological characteristics. Several genomic resources for H. nipponensis have already been reported. In 2019, the complete mitochondrial genome was sequenced⁴. A draft genome with a contig N50 of 464,523 bp followed in 2021⁵. Most recently, a chromosome-level assembly (designated HNIP-V2) with a contig N50 of 8.19 Mb was published⁶. These studies provided critical genetic resources and a foundation for breeding programs and biological research on H. nipponensis. However, the existing assemblies contain numerous gaps, particularly in repeat-rich regions such as telomeres and centromeres. Telomeric and centromeric DNA consists predominantly of satellite DNA and is known to evolve rapidly in eukaryotic genomes⁷,⁸. With advances in sequencing technologies and assembly methods, gap-free telomere-to-telomere (T2T) assemblies are now achievable, enabling characterization of nearly the entire genome. Pacific Biosciences (PacBio) HiFi reads can resolve complex genomic regions, while ONT ultra-long reads facilitate the resolution of tandem duplications⁹,¹⁰. Hifiasm, a high-performance assembler, has been successfully applied to gap-free T2T genome assembly in various aquatic vertebrates, including the Yangtze finless porpoise (Neophocaena asiaeorientalis)¹¹, Neosalanx taihuensis¹², Asian icefish (Protosalanx chinensis)¹³, and Siniperca roulei¹⁴. Notably, its algorithm has recently been updated to support T2T assembly using ONT data alone¹⁵.

In this study, we report the first gap-free T2T reference genome for H. nipponensis (designated HNIP-T2T), generated by comparing multiple assembly strategies and integrating HiFi reads, ONT reads, MGI short reads, and chromatin conformation capture (Hi-C) data. The HNIP-T2T assembly spans approximately 526.31 Mb with a contig N50 of 20.23 Mb. Gene annotation identified 31,310 protein-coding genes, 97.67% of which were functionally annotated against public biological databases. This high-quality, gap-free genome assembly will serve as an important resource for investigating the reproductive biology and ecological adaptability of H. nipponensis.

Methods

Sample collection and sequencing

In September 2024, a healthy H. nipponensis individual was collected from Shuifeng Reservoir on the Yalu River in Liaoning Province, China (Fig. 1a). High-molecular-weight genomic DNA (gDNA) was extracted from muscle tissue using the cetyltrimethylammonium bromide (CTAB) method¹⁶. DNA integrity was checked by 1% agarose gel electrophoresis, purity was measured with a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), and concentration was determined with a Qubit 4.0 fluorometer (Invitrogen, USA). Following quality assessment, paired-end sequencing was performed on the DNBSEQ-T7 platform (MGI, Shenzhen, China), generating 143.95 Gb of raw reads (Table 1). Quality control of the sequencing data was conducted using fastp (v0.23.2)¹⁷ with default settings.

Table 1 Summary of DNA sequencing data of H. nipponensis genome.


For long-read sequencing, a SMRTbell library was constructed and sequenced on the PacBio Revio system (Pacific Biosciences, USA). After preprocessing with the CCS program¹⁸, 36.32 Gb of high-quality Circular Consensus Sequencing (CCS) reads were obtained, corresponding to a sequencing depth of ~69.01× with a read N50 of 17,253 bp (Table 1).
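The reported depth follows from dividing the total HiFi yield by the genome size. A minimal sketch of that arithmetic is given here, assuming the 526.31 Mb final assembly length is the denominator (the text does not state which size estimate was used); the function name is illustrative.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Mean fold-coverage: sequenced bases divided by genome size (both in bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# 36.32 Gb of HiFi reads over a 526.31 Mb genome (assumed denominator).
hifi_depth = sequencing_depth(36.32e9, 526.31e6)
print(f"HiFi depth: {hifi_depth:.2f}x")
```

The same function applies to the MGI, ONT, and Hi-C yields listed in Table 1.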

For ONT sequencing, an ultra-long library was constructed and sequenced on one flow cell of a PromethION platform (Oxford Nanopore Technologies, UK). Raw reads were first filtered at a quality value (QV) threshold of 7. Adapter sequences were then trimmed using Porechop (https://github.com/rrwick/Porechop), and reads in which fewer than 90% of bases achieved QV ≥ 7 were removed using Filtlong (https://github.com/rrwick/Filtlong). This yielded 28.63 Gb of clean reads with an N50 length of 83.82 kb.
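The per-read criterion described above (at least 90% of bases with QV ≥ 7) can be illustrated with a short sketch. This is not Filtlong's implementation, which uses its own scoring; it assumes Phred+33 encoded FASTQ quality strings, and the function names are hypothetical.

```python
def fraction_at_or_above(quality_string: str, min_qv: int = 7, offset: int = 33) -> float:
    """Fraction of bases whose Phred score (ASCII code minus offset) is >= min_qv."""
    if not quality_string:
        return 0.0
    passing = sum(1 for ch in quality_string if ord(ch) - offset >= min_qv)
    return passing / len(quality_string)


def keep_read(quality_string: str, min_qv: int = 7, min_fraction: float = 0.90) -> bool:
    """Keep a read only if enough of its bases reach the quality threshold."""
    return fraction_at_or_above(quality_string, min_qv) >= min_fraction


# 'I' encodes Q40 and "'" encodes Q6 under Phred+33.
print(keep_read("I" * 9 + "'"))   # 9 of 10 bases pass
print(keep_read("I" * 8 + "''"))  # 8 of 10 bases pass
```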

For Hi-C sequencing, chromatin was digested with the restriction enzyme MboI and subjected to proximity ligation according to a previously described protocol¹⁹. In brief, chromatin was cross-linked, digested, biotin-labeled, ligated, and fragmented to 350 bp, followed by purification with streptavidin magnetic beads. Library quality and insert size were assessed using a Qubit 3.0 Fluorometer and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), respectively. Libraries were sequenced on the DNBSEQ-T7 platform (MGI, Shenzhen, China), generating ~206.67 Gb of 150 bp paired-end reads (Table 1).

Genome size estimation

k-mers (k = 19) were counted in the clean MGI short reads with Jellyfish (v2.3.0)²⁰, and the H. nipponensis genome size was estimated at 525.73 Mb using findGSE (v1.94)²¹ (Fig. 1b).
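The idea behind k-mer based size estimation can be sketched with the simplest estimator: total k-mer count divided by the depth of the main peak. This is a deliberately simplified stand-in, not findGSE's model fitting; the histogram format (depth mapped to number of distinct k-mers) and the error cutoff `min_depth` are assumptions made for illustration.

```python
def estimate_genome_size(histogram: dict, min_depth: int = 5) -> float:
    """Estimate genome size as (total k-mer occurrences) / (peak depth).

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    solid = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)  # depth with the most distinct k-mers
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth


# Toy histogram: an error tail at depths 1-2 and a main peak at depth 20.
toy = {1: 1000, 2: 200, 19: 50, 20: 100, 21: 50}
print(estimate_genome_size(toy))
```

Model-based tools additionally account for heterozygosity and repeats, which this estimator ignores.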

De novo genome assembly

Contigs were first assembled with hifiasm (v0.25.0-r726)²² from three distinct datasets: (1) PacBio HiFi reads alone; (2) a combination of error-corrected ONT reads, PacBio HiFi reads, and Hi-C data; and (3) raw ONT reads without error correction. Assembly of PacBio HiFi reads alone yielded 2,012 contigs with a total length of 505.31 Mb and an N50 of 666,015 bp. Combining error-corrected ONT reads with PacBio HiFi reads produced 566 contigs with a total length of 554.94 Mb and an N50 of 4.46 Mb (Table 2). Assembly of raw ONT reads alone produced 76 contigs with a total length of 774.05 Mb and an N50 of 18.76 Mb (Table 2). Additionally, ONT reads were error-corrected using NextDenovo (v2.5.0)²³, producing 6.23 Gb of corrected reads with an N50 length of 143,361 bp. These error-corrected ONT reads were also assembled independently with NextDenovo (v2.5.0)²³, yielding an assembly with a total length of 516.29 Mb and an N50 of 15.91 Mb (Table 2). Because it had the highest contiguity, the hifiasm assembly of raw ONT reads (without error correction) was selected as the backbone for subsequent analyses.
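Because the candidate assemblies are compared by N50, a short sketch of how that statistic is computed may be useful. It uses the standard definition (the length of the contig at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half the total); the contig lengths below are toy values, not data from this study.

```python
def n50(lengths: list) -> int:
    """Return the N50 of a list of contig (or read) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches half the total")


toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # total 32, half 16
print(n50(toy_contigs))
```

A high N50 alone does not guarantee correctness, which is why the selected assembly is still polished, purged of haplotigs, and validated in later steps.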

Table 2 Assembly statistics using two different assembly software.


The contigs from the ONT-only hifiasm (v0.25.0-r726)²² assembly were polished with Pilon (v1.24)²⁴ using clean MGI short reads. Purge Haplotigs (v1.1.2)²⁵ was then used to remove haplotypic duplication, improving the continuity and haploid representation of the assembly. Hi-C clean reads were mapped to the resulting non-redundant contigs using Bowtie2 (v2.2.5)²⁶ with the parameters --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder. Valid interaction pairs were identified with HiC-Pro (v2.8.1)²⁷ under default settings, and only these pairs were retained to anchor contigs to chromosomes. Contigs were clustered, ordered, and oriented into pseudochromosomes with Juicer (v1.5)²⁸ and 3D-DNA (v170123)²⁹. Juicebox (v1.11.08)³⁰ was used for visualization and for manual correction of misassemblies and removal of redundant contigs. Ultimately, 28 pseudochromosomes were obtained with only six gaps remaining (Table 3). The longest and shortest pseudochromosomes measured 27.40 Mb and 11.48 Mb, respectively. This chromosome number matches that of the previously published HNIP-V2 assembly⁶ and is consistent with the karyotype of Hypomesus olidus (2n = 56)³¹.
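Gap counts such as the six reported here are conventionally obtained by counting runs of N in the scaffolded sequences. The sketch below shows that convention under the assumption that every gap is represented as a run of N characters; the function name and the `min_run` parameter are illustrative and not taken from any tool used in the study.

```python
import re


def count_gaps(sequence: str, min_run: int = 1) -> int:
    """Count runs of at least min_run consecutive N (or n) characters."""
    if min_run < 1:
        raise ValueError("min_run must be at least 1")
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    return sum(1 for _ in pattern.finditer(sequence))


toy_scaffold = "ACGTNNNNACGTNNACGT"
print(count_gaps(toy_scaffold))             # counts both N runs
print(count_gaps(toy_scaffold, min_run=3))  # ignores the 2-bp run
```

A gap-free assembly is one for which this count is zero on every pseudochromosome.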

Table 3 Pseudo-chromosome length statistics after Hi-C assisted assembly.


To achieve a gap-free, T2T-level assembly, LR_Gapcloser (v1.0)³² was applied sequentially with PacBio HiFi reads and error-corrected ONT long reads to fill the remaining gaps, using the parameters -m 1000000 -v 10000 -r 3. The resulting HNIP-T2T assembly comprises 28 pseudochromosomes with a total length of 526.31 Mb (Table 4), and its contig N50 increased to 20.23 Mb (Table 4 and Fig. 1c). The Hi-C interaction heatmap showed high consistency across all pseudochromosomes, supporting the contig ordering and orientation in the HNIP-T2T assembly (Fig. 1d). Chromosome order and orientation were adjusted with reference to the Danio rerio (zebrafish) genome (RefSeq assembly accession: GCF_000002035.6) to ensure comparability with the genomic structure of this model species.

Table 4 Summary statistics of H. nipponensis assembly.


The detailed assembly pipeline is illustrated in Fig. 2.

Identification of centromere and telomere sequences

Centromere and telomere sequences in the HNIP-T2T genome were identified using quarTeT³³. Its centromere prediction relies on three integrated signals: (1) tandem repeat enrichment (Tandem Repeats Finder parameters: match = 2, mismatch = 7, indel = 7, minimum score = 50); (2) CENH3 homolog co-localization; and (3) low-recombination/high-divergence signatures from read-depth analysis. This strategy compensates for the lack of karyotypic data for H. nipponensis. All 28 pseudochromosomes harbored intact telomeres and centromeres, giving 56 telomeres and 28 centromeres in total (average centromere length: 316,527 bp; Fig. 3). Centromere length varied widely, from 104,388 bp (pseudochromosome 15) to 1,690,782 bp (pseudochromosome 7). Future validation by fluorescence in situ hybridization (FISH, for chromosomal localization) and CENH3-targeted chromatin immunoprecipitation sequencing (ChIP-seq, for functional verification) will be needed to confirm centromere positions and to address the current gap in H. nipponensis karyotypic research.
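Telomere detection at chromosome ends can be illustrated with a simple motif count. The sketch below is not quarTeT's algorithm; it assumes the canonical vertebrate telomeric repeat TTAGGG (the text does not state the motif searched), and the window size and copy-number threshold are arbitrary illustrative values.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def telomere_calls(chromosome: str, motif: str = "TTAGGG",
                   window: int = 10_000, min_copies: int = 50) -> tuple:
    """Return (start_has_telomere, end_has_telomere) for one chromosome.

    Each terminal window is scanned for the motif on either strand, and an
    end is called telomeric when the non-overlapping copy count reaches
    min_copies.
    """
    rc_motif = reverse_complement(motif)
    calls = []
    for end_seq in (chromosome[:window].upper(), chromosome[-window:].upper()):
        copies = max(end_seq.count(motif), end_seq.count(rc_motif))
        calls.append(copies >= min_copies)
    return tuple(calls)


# Toy chromosome: 100 repeat copies at each end around a non-telomeric body.
toy_chromosome = "CCCTAA" * 100 + "ACGT" * 5000 + "TTAGGG" * 100
print(telomere_calls(toy_chromosome))
```

Under this scheme, a T2T chromosome is one for which both ends are called, which gives 56 telomeres across 28 pseudochromosomes.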

Repeat element annotation

Repetitive elements in HNIP-T2T were identified by combining de novo and homology-based annotation. For homology-based annotation, known repeat elements were identified by searching the RepBase database (http://www.girinst.org/repbase/)³⁴ with RepeatMasker (v4.0.7)³⁵ and RepeatProteinMask. For de novo annotation, LTR_FINDER (v1.06)³⁶ and RepeatModeler (v1.0.4)³⁷ were first used to construct a de novo repeat library, which was then used to predict repetitive elements with RepeatMasker (v4.0.7)³⁵ under default parameters. In addition, Tandem Repeats Finder (v4.10.0)³⁸ was used to identify tandem repeats with the settings 2 7 7 80 10 50 2000 -d -h. In total, 206.18 Mb of repetitive sequence was identified, accounting for 39.17% of the genome, a higher proportion than in HNIP-V2 (33.59%)⁶. Among interspersed repeats, DNA transposons were the most abundant class, representing 16.95% of the genome (Table 5).
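When several annotation methods are combined, their calls overlap, so a total repeat fraction is normally computed on merged intervals rather than by summing each tool's output. The sketch below shows that merging step under the assumption of half-open (start, end) coordinates on a single sequence; it is an illustration and not the pipeline used in this study.

```python
def merged_length(intervals: list) -> int:
    """Total bases covered by a set of possibly overlapping half-open intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before opening a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def repeat_fraction(intervals: list, genome_length: int) -> float:
    """Fraction of the genome covered by the merged repeat intervals."""
    return merged_length(intervals) / genome_length


toy_repeats = [(0, 10), (5, 15), (20, 30)]  # first two calls overlap
print(merged_length(toy_repeats), repeat_fraction(toy_repeats, 100))
```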

Table 5 Statistics of interspersed repetitive sequences in H. nipponensis assembly.


Gene prediction and functional annotation

Gene structure annotation followed the methodology established in pig pan-genome research³⁹. For transcriptome-based annotation, approximately 33.52 Gb of RNA-seq data from muscle tissue were mapped to the HNIP-T2T assembly using HISAT2 (v2.2.1)⁴⁰ with the parameters --sensitive --no-discordant --no-mixed -I 1 -X 1000 --max-intronlen 1000000. The unique mapping rate ranged from 90.52% to 91.17% (Table 6). Transcripts were then assembled using StringTie (v1.2.2)⁴¹ (parameters: -f 0.3 -j 3 -c 5 -g 100 -s 10000), and coding sequences (CDSs) were identified using TransDecoder (v5.7.1). Genes with complete structures were selected, retaining only the longest transcript for each gene. Single-exon genes were included only if a structural protein domain was detected. Genes whose gene region overlapped repeat sequences by ≥80% were excluded, yielding the final transcriptome-derived candidate gene set. For homology-based prediction, genome sequences and annotation files were retrieved for five references: Danio rerio (zebrafish; GCF_000002035.6), H. nipponensis HNIP-V2⁶, Hypomesus transpacificus (GCF_021917145.1), Neosalanx taihuensis¹², and Protosalanx chinensis¹³. Using these RNA-seq and homology data, CDSs were predicted with GeMoMa (v1.9)⁴². Genes derived from transcriptome data but absent from the homology predictions were added to the gene set. Finally, untranslated regions and alternative splicing variants were annotated using the Program to Assemble Spliced Alignments (PASA, v2.4.1)⁴³. The final gene set comprised 31,310 genes, with a mean of 8.31 exons per gene, a mean exon length of 191.90 bp, and a mean CDS length of 1,593.89 bp.
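The repeat-overlap filter described above (exclude genes with ≥80% of their region covered by repeats) can be sketched as follows. The coordinates are half-open, the repeat intervals are assumed to be non-overlapping (for example, already merged), and the function names are hypothetical.

```python
def repeat_overlap_fraction(gene: tuple, repeats: list) -> float:
    """Fraction of a gene region (start, end) covered by repeat intervals."""
    gene_start, gene_end = gene
    gene_length = gene_end - gene_start
    if gene_length <= 0:
        raise ValueError("gene must have positive length")
    covered = sum(
        max(0, min(gene_end, rep_end) - max(gene_start, rep_start))
        for rep_start, rep_end in repeats
    )
    return covered / gene_length


def passes_repeat_filter(gene: tuple, repeats: list, max_fraction: float = 0.80) -> bool:
    """Keep a gene only if its repeat overlap is strictly below max_fraction."""
    return repeat_overlap_fraction(gene, repeats) < max_fraction


toy_gene = (100, 200)
print(passes_repeat_filter(toy_gene, [(90, 150)]))              # 50% overlap
print(passes_repeat_filter(toy_gene, [(90, 150), (160, 190)]))  # 80% overlap
```

A filter of this kind removes candidate genes that are likely transposable-element fragments rather than host protein-coding genes.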

Table 6 Summary of RNAseq sequencing data of H. nipponensis genome.


The protein-coding genes were functionally annotated by alignment against several widely used protein databases. Amino-acid sequences were aligned to SwissProt⁴⁴, Kyoto Encyclopedia of Genes and Genomes (KEGG)⁴⁵, Eukaryotic Orthologous Groups (KOG)⁴⁶, and the NCBI non-redundant (NR) database using DIAMOND (v2.1.10)⁴⁷ with an E-value cutoff of 1e-05. Protein domains were identified using InterProScan (v5.30)⁴⁸, which was also used to extract Gene Ontology (GO) terms for each gene. Overall, 30,582 genes (97.67%) were functionally annotated (Fig. 4).
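The overall annotation rate counts each gene once if it has a hit in at least one database, which amounts to taking the union of the per-database hit sets. A minimal sketch follows, with toy gene identifiers and a hypothetical function name; the real per-database hit lists are not given in the text.

```python
def annotated_summary(all_genes: list, hits_by_database: dict) -> tuple:
    """Return (number of annotated genes, percentage annotated).

    A gene counts as annotated when it appears in at least one database's
    hit set; hits for genes outside all_genes are ignored.
    """
    gene_set = set(all_genes)
    if not gene_set:
        raise ValueError("all_genes must not be empty")
    annotated = set().union(*hits_by_database.values()) & gene_set
    return len(annotated), 100.0 * len(annotated) / len(gene_set)


toy_genes = ["g1", "g2", "g3", "g4"]
toy_hits = {"SwissProt": {"g1", "g2"}, "KEGG": {"g2", "g3"}, "NR": {"g1"}}
print(annotated_summary(toy_genes, toy_hits))

# The reported figure follows from the same ratio.
print(f"{100.0 * 30582 / 31310:.2f}%")
```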

Ethics declarations

Both the sampling procedure and experimental workflow were conducted in strict accordance with the guidelines of the Animal Ethics Committee of the Institute of Hydrobiology, Chinese Academy of Sciences, and have obtained its official approval (Approval Number: IHB1LL12024044).

Data Records

The sequencing data for Hypomesus nipponensis presented in this study have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject accession PRJNA1282796⁴⁹. This includes short-read data [RNA-seq data: SRR34259912–SRR34259916; DNA survey data: SRR34259908; and Hi-C data: SRR34259911] and long-read data [Oxford Nanopore Technologies (ONT) data: SRR34259910; and PacBio HiFi data: SRR34259909]. The final genome assembly is available under the accession numbers JBTLQK000000000 and GCA_054491055.1⁵⁰. Furthermore, the final genome assembly, annotated coding sequences, and protein sequences are available at Figshare⁵¹.

Technical Validation

To assess the accuracy and quality of the HNIP-T2T assembly, we first mapped the multi-platform sequencing data back to it. MGI short reads, PacBio HiFi reads, and ONT long reads achieved mapping rates of 99.72%, 99.97%, and 99.86%, respectively (with 98.71–99.99% genome coverage; Fig. 5), confirming strong consistency between the assembly and the raw sequencing data. RNA-seq reads also showed a higher unique mapping rate against HNIP-T2T than against HNIP-V2 (Table 6), supporting its greater structural accuracy for downstream analyses.

BUSCO (v5.8.0)⁵² benchmarking against Actinopterygii_odb10 (3,640 orthologs) recovered 98.2% complete genes (97.5% single-copy) for HNIP-T2T, exceeding the 96.7% of HNIP-V2 (Table 4); protein-level BUSCO yielded 98.0% complete orthologs, supporting the completeness of the gene annotation. Merqury⁵³ (k = 19) assigned a QV of 35.11 (Table 4), corresponding to roughly one base error per 3,200 bp and consistent with T2T-level accuracy.
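Both quality metrics reduce to short formulas. The sketch below computes BUSCO completeness from the ortholog counts given in the Abstract, and a consensus QV from k-mer counts using the relation described for Merqury, E = 1 - (1 - K_asm / K_total)^(1/k) and QV = -10 log10(E), where K_asm is the number of k-mers found only in the assembly. The function names are illustrative, and no k-mer counts from this study are used, since they are not reported.

```python
import math


def busco_completeness(single_copy: int, duplicated: int, total: int) -> float:
    """Percentage of BUSCO orthologs found complete (single-copy + duplicated)."""
    return 100.0 * (single_copy + duplicated) / total


def merqury_qv(asm_only_kmers: int, total_kmers: int, k: int) -> float:
    """Consensus QV from assembly-only and total assembly k-mer counts."""
    if total_kmers <= 0 or not 0 <= asm_only_kmers <= total_kmers:
        raise ValueError("k-mer counts out of range")
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers: error rate estimated as zero
    error_rate = 1.0 - (1.0 - asm_only_kmers / total_kmers) ** (1.0 / k)
    return -10.0 * math.log10(error_rate)


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled QV."""
    return 10.0 ** (-qv / 10.0)


print(f"BUSCO complete: {busco_completeness(3548, 26, 3640):.2f}%")
print(f"Error rate at QV 35.11: {qv_to_error_rate(35.11):.2e}")
```

Read this way, a QV of 35.11 implies a per-base error probability of about 3 in 10,000.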

HNIP-T2T also outperformed HNIP-V2 in contiguity: its contig N50 (20.23 Mb) was about 2.5 times that of HNIP-V2 (8.19 Mb), and it was gap-free (0 vs. 189 gaps in HNIP-V2; Table 4). Read coverage plots derived from Minimap2⁵⁴ alignments (Fig. 6) showed uniform depth across all 28 chromosomes, in contrast to the fragmented coverage and gaps of HNIP-V2. MUMmer⁵⁵ collinearity alignment (Fig. 7) confirmed chromosome-level synteny between the assemblies (93.01% and 95.43% aligned bases for HNIP-T2T and HNIP-V2, respectively). The high-similarity hits along the diagonal indicate that HNIP-T2T retains the chromosomal framework of HNIP-V2 while correcting local misassemblies, which appear as deviations from the diagonal in HNIP-V2.

Presence-absence variation (PAV) analysis was performed using BWA (v0.7.17-r1188)⁵⁶ with the MEM algorithm (parameters: -w 500 -M -t 16; Table 7). PAVs were defined as sequences that failed to align or showed <25% coverage. PAV content increased from 1.75 Mb (0.34% of the genome) in HNIP-V2 to 7.55 Mb (1.43%) in HNIP-T2T. This expansion primarily reflects the successful filling of genomic gaps, while SNP rates remained similar between the two versions (~0.21–0.22%).

BUSCO assessment of gene set completeness (Actinopterygii_odb10) identified 3,569 complete orthologs in HNIP-T2T, more than in HNIP-V2 (3,485) and H. transpacificus (3,334; Fig. 8).
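The PAV criterion (sequences that fail to align or show less than 25% coverage) can be expressed as a simple classification over sequence segments. The sketch below assumes each segment is summarized as a (length, aligned bases) pair; how the study segmented the genome is not described, so this representation and the function names are assumptions.

```python
def is_pav(segment_length: int, aligned_bases: int, max_coverage: float = 0.25) -> bool:
    """A segment is a PAV if its aligned fraction is below max_coverage."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return aligned_bases / segment_length < max_coverage


def pav_summary(segments: list, genome_size: int) -> tuple:
    """Return (total PAV length in bp, PAV percentage of the genome)."""
    pav_bases = sum(length for length, aligned in segments if is_pav(length, aligned))
    return pav_bases, 100.0 * pav_bases / genome_size


# Toy segments: (length, aligned bases). Only the first two fall below 25%.
toy_segments = [(1000, 0), (1000, 200), (1000, 250), (1000, 900)]
print(pav_summary(toy_segments, 100_000))

# The reported HNIP-T2T percentage follows from the same ratio.
print(f"{100.0 * 7.55 / 526.31:.2f}%")
```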

Table 7 Summary of genome structure alignment data between HNIP-T2T and HNIP-V2 of H. nipponensis genome.


Overall, HNIP-T2T represents a substantial improvement over HNIP-V2, with higher completeness, longer contiguity, gap-free structure, and robust mapping/transcriptome alignment performance.

Data availability

All data supporting this study are publicly available. Raw sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject ID PRJNA1282796⁴⁹, including RNA-seq data (SRR34259912 to SRR34259916), MGI genome survey data (SRR34259908), Hi-C reads (SRR34259911), Nanopore long-read data (SRR34259910), and PacBio long-read data (SRR34259909). The genome assembly has been deposited in NCBI GenBank under accession number GCA_054491055.1⁵⁰. The genome assembly and gene structure annotation are also available on Figshare (https://doi.org/10.6084/m9.figshare.29672606.v1)⁵¹.

Code availability

All scripts and pipelines used for the genome assembly and gene annotation followed the standard manuals and protocols of the applied bioinformatics software. No specific code was developed for this study.

References

Sakamoto, D. et al. Population size estimation of the pond smelt Hypomesus nipponensis in Lake Kasumigaura and Lake Kitaura, Japan. Fisheries Science 80, 907–914, https://doi.org/10.1007/s12562-014-0791-1 (2014).
Xie, Y. et al. The fishes of genus Hypomesus and utilization of its resource (in Chinese) (Liaoning Science and Technology Press, 1992).
Yin, C., Chen, Y., Guo, L. & Ni, L. Fish Assemblage Shift after Japanese Smelt (Hypomesus nipponensis McAllister, 1963) Invasion in Lake Erhai, a Subtropical Plateau Lake in China. Water 13, 1800, https://doi.org/10.3390/w13131800 (2021).
Choi, S. & Kim, E. B. Complete mitochondrial genome sequence and SNPs of the Korean smelt Hypomesus nipponensis (Osmeriformes, Osmeridae). Mitochondrial DNA Part B 4, 1844–1845, https://doi.org/10.1080/23802359.2019.1613178 (2019).
Xuan, B. et al. Draft genome of the Korean smelt Hypomesus nipponensis and its transcriptomic responses to heat stress in the liver and muscle. G3 (Bethesda) 11, https://doi.org/10.1093/g3journal/jkab147 (2021).
Zhu, C., Kuang, Y., Li, Z. & Tang, F. Chromosome-level draft genome assembly of Hypomesus nipponensis reveals transposable element expansion reshaping the genome structure. Front Genet 16, 1502681, https://doi.org/10.3389/fgene.2025.1502681 (2025).
Shay, J. W. & Wright, W. E. Telomeres and telomerase: three decades of progress. Nat Rev Genet 20, 299–309, https://doi.org/10.1038/s41576-019-0099-1 (2019).
Wu, M. et al. Segrosome assembly at the pliable parH centromere. Nucleic Acids Res 39, 5082–5097, https://doi.org/10.1093/nar/gkr115 (2011).
Jain, M. et al. Linear assembly of a human centromere on the Y chromosome. Nature Biotechnology 36, 321–323, https://doi.org/10.1038/nbt.4109 (2018).
Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965, https://doi.org/10.1126/science.abj6965 (2022).
Yin, D. et al. Telomere-to-telomere gap-free genome assembly of the endangered Yangtze finless porpoise and East Asian finless porpoise. GigaScience 13, https://doi.org/10.1093/gigascience/giae067 (2024).
Zhou, Y. et al. Gap-free genome assembly of Salangid icefish Neosalanx taihuensis. Scientific Data 10, 768, https://doi.org/10.1038/s41597-023-02677-z (2023).
Zhou, Y. et al. Telomere-to-telomere genome and resequencing of 231 individuals reveal evolution, genomic footprints in Asian icefish, Protosalanx chinensis. GigaScience 14, https://doi.org/10.1093/gigascience/giaf067 (2025).
Jiang, M. et al. The telomere-to-telomere gap-free reference genome and taxonomic reassessment of Siniperca roulei. GigaScience 14, https://doi.org/10.1093/gigascience/giaf068 (2025).
Cheng, H. et al. Efficient near telomere-to-telomere assembly of Nanopore Simplex reads. bioRxiv, https://doi.org/10.1101/2025.04.14.648685 (2025).
Healey, A., Furtado, A., Cooper, T. & Henry, R. J. Protocol: a simple method for extracting next-generation sequencing quality genomic DNA from recalcitrant plant species. Plant Methods 10, 21, https://doi.org/10.1186/1746-4811-10-21 (2014).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13, 278–289, https://doi.org/10.1016/j.gpb.2015.08.002 (2015).
Zhu, W. et al. Altered chromatin compaction and histone methylation drive non-additive gene expression in an interspecific Arabidopsis hybrid. Genome Biology 18, 157, https://doi.org/10.1186/s13059-017-1281-4 (2017).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770, https://doi.org/10.1093/bioinformatics/btr011 (2011).
Sun, H., Ding, J., Piednoël, M. & Schneeberger, K. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics (Oxford, England) 34, 550–557, https://doi.org/10.1093/bioinformatics/btx637 (2018).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Hu, J. et al. NextDenovo: an efficient error correction and accurate assembly tool for noisy long reads. Genome Biology 25, 107, https://doi.org/10.1186/s13059-024-03252-4 (2024).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963, https://doi.org/10.1371/journal.pone.0112963 (2014).
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, 460, https://doi.org/10.1186/s12859-018-2485-7 (2018).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359, https://doi.org/10.1038/nmeth.1923 (2012).
Servant, N. et al. HiC-Pro: An optimized and flexible pipeline for Hi-C data processing. Genome Biology 16, https://doi.org/10.1186/s13059-015-0831-x (2015).
Durand, N. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Systems 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, eaal3327, https://doi.org/10.1126/science.aal3327 (2017).
Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell systems 3, 99–101, https://doi.org/10.1016/j.cels.2015.07.012 (2016).
Wang, G. & Yu, W. J. A preliminary study on the karyotype of Hypomesus olidus. Salmon Fishery 2(1), n.p. (in Chinese) (1989).
Xu, G. C. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. Gigascience 8, https://doi.org/10.1093/gigascience/giy157 (2019).
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic Res, https://doi.org/10.1093/hr/uhad127 (2023).
Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110, 462–467, https://doi.org/10.1159/000084979 (2005).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics (Oxford, England) 21(Suppl 1), i351–358, https://doi.org/10.1093/bioinformatics/bti1018 (2005).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35, W265–268, https://doi.org/10.1093/nar/gkm286 (2007).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4, 4.10.1–4.10.14, https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573–580, https://doi.org/10.1093/nar/27.2.573 (1999).
Liu, L. et al. Multiomics analysis reveals signatures of selection and loci associated with complex traits in pigs. Imeta 3, e250, https://doi.org/10.1002/imt2.250 (2024).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12, 357–360, https://doi.org/10.1038/nmeth.3317 (2015).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, 278, https://doi.org/10.1186/s13059-019-1910-1 (2019).
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods Mol Biol 1962, 161–177, https://doi.org/10.1007/978-1-4939-9173-0_9 (2019).
Haas, B. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31, 5654–5666, https://doi.org/10.1093/nar/gkg770 (2003).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Research 27, 49–54, https://doi.org/10.1093/nar/27.1.49 (1999).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28, 27–30, https://doi.org/10.1093/nar/28.1.27 (2000).
Tatusov, R., Galperin, M., Natale, D. & Koonin, E. The COG Database: A Tool for Genome-Scale Analysis of Protein Functions and Evolution. Nucleic Acids Research 28, https://doi.org/10.1093/nar/28.1.33 (2000).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60, https://doi.org/10.1038/nmeth.3176 (2015).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240, https://doi.org/10.1093/bioinformatics/btu031 (2014).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP595455 (2025).
NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_054491055.1 (2026).
Zhou, Y. Telomere-to-telomere genome assembly of Hypomesus nipponensis. figshare. Dataset. https://doi.org/10.6084/m9.figshare.29672606.v1 (2025).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212, https://doi.org/10.1093/bioinformatics/btv351 (2015).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574, https://doi.org/10.1093/bioinformatics/btab705 (2021).
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14, e1005944, https://doi.org/10.1371/journal.pcbi.1005944 (2018).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics https://doi.org/10.48550/arXiv.1303.3997 (2013).

Acknowledgements

This work was financially supported by the Earmarked Fund for the National Key R&D Program of China (Grant No. 2023YFD2400900) and the Modern Agricultural Technology System Grant (CARS-46).

Author information

Authors and Affiliations

Key Laboratory of Freshwater Fisheries and Germplasm Resources Utilization, Ministry of Agriculture and Rural Affairs, Freshwater Fisheries Research Center, Chinese Academy of Fishery Sciences, Wuxi, 214081, China
Yanfeng Zhou, Di’an Fang, Yang You, Yulin Bai, Minying Zhang & Dongpo Xu
Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, 150070, China
Fujiang Tang
Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Wuhan, China
Xuemei Li
Dalian Ocean University, Dalian, 116023, China
Guoping Deng

Contributions

D. Xu conceived and designed the study. Y. Zhou, D. Fang, Y. You and X. Li collected the samples and conducted the experiments. F. Tang, Y. Bai and M. Zhang performed the bioinformatics analysis. Y. Zhou, G. Deng and D. Xu wrote and revised the manuscript. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Yanfeng Zhou or Dongpo Xu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhou, Y., Fang, D., You, Y. et al. A telomere-to-telomere reference genome assembly of the Hypomesus nipponensis. Sci Data 13, 755 (2026). https://doi.org/10.1038/s41597-026-07078-6

Received: 15 September 2025
Accepted: 12 March 2026
Published: 27 March 2026
Version of record: 20 May 2026
DOI: https://doi.org/10.1038/s41597-026-07078-6