Telomere-to-telomere gapless genome assembly of Triplophysa yaopeizhii

Ma, Li; Zeng, Xu; Wang, Jixiao; Xiong, Hao; Yu, Yongyao; Liu, Haiping; Yang, Qing-Yong; Yang, Ruibin; Yang, Xuefen

doi:10.1038/s41597-025-04943-8


Data Descriptor
Open access
Published: 10 April 2025


Li Ma ORCID: orcid.org/0009-0008-4467-6981¹^na1,
Xu Zeng¹^na1,
Jixiao Wang²,
Hao Xiong²,
Yongyao Yu¹,
Haiping Liu³,
Qing-Yong Yang ORCID: orcid.org/0000-0002-3510-8906⁴,
Ruibin Yang¹ &
…
Xuefen Yang¹

Scientific Data volume 12, Article number: 597 (2025)


Abstract

The genus Triplophysa exhibits remarkable adaptability to the unique environment of the Qinghai-Tibet Plateau (QTP). High-quality genomes facilitate the study of adaptation to the extreme environment of the plateau. This study used PacBio HiFi, ultra-long ONT, and Hi-C sequencing of Triplophysa yaopeizhii to construct the first telomere-to-telomere (T2T) gapless genome assembly for the genus Triplophysa. The genome size is 671.58 Mb, with a contig N50 length of 26.04 Mb. The sequences were anchored onto 25 chromosomes, with all centromeres and telomeres identified. Furthermore, 293.98 Mb (43.77%) of repetitive sequences and 26,487 protein-coding genes were identified. Comparative analyses with the genomes of closely related species demonstrated the high completeness, continuity, and accuracy of the genome. The genomic quality was further substantiated by a QV of 31.82 and a BUSCO completeness of 96.60%. This study provides a valuable genetic resource for the genus Triplophysa and serves as an essential reference for elucidating the genetic mechanisms by which plateau fish adapt to high altitude.


Background & Summary

The Qinghai-Tibet Plateau (QTP), situated in China’s western region, has a mean elevation exceeding 4,000 meters. Recognized as the planet’s most elevated plateau, it is commonly referred to as the ‘Roof of the World’¹. The genus Triplophysa, belonging to the family Cobitidae, is the largest of the three major fish taxa on the QTP. Triplophysa fishes are mostly small to medium-sized and are mainly distributed in the rivers, lakes, and other waters of the QTP and its surroundings, from plains and basins to high altitudes^2,3. Triplophysa is distinguished by biological characteristics that enable adaptation to the extreme environments of high-altitude plateaus. These fish exhibit a cylindrical body shape and a well-developed caudal peduncle, facilitating efficient movement in rapidly flowing plateau waters^4,5. Their enlarged mouth gape and specialized digestive system enhance feeding and nutrient absorption in food-scarce high-altitude ecosystems⁶. Furthermore, these fishes employ distinctive reproductive strategies, including prolonged reproductive periods and higher fecundity, to ensure population sustainability in cold plateau conditions⁷. The evolutionary adaptations of these species are shaped by the specific geographical and environmental constraints of the Qinghai-Tibet Plateau, resulting in the emergence of distinctive genetic mechanisms that facilitate adaptation to this extreme environment^1,8. Triplophysa yaopeizhii is predominantly distributed in the upper reaches of the Jinsha River. This species exhibits a preference for rivers with sandy substrates and high flow rates^9,10. In recent years, the development and construction of hydropower projects in the Jinsha River basin and other anthropogenic factors have led to a decline in the populations of indigenous fishes such as T. yaopeizhii¹¹. T. yaopeizhii is well adapted to the complex environment of the QTP and is an important target for studying the mechanisms of species formation and evolution in fishes of the genus Triplophysa, as well as a special model for understanding the genetic basis of plateau adaptation. However, there are currently few studies related to T. yaopeizhii, and its genetic characteristics are still unclear.

Whole-genome sequencing has become a common tool for studying genetic information and evolutionary mechanisms in fish, with applications in areas such as fish breeding and biodiversity conservation¹². As an indigenous fish living on the QTP, T. yaopeizhii has a genome that can provide powerful insights into the genetic basis of high-altitude adaptation. However, the currently published genomes of plateau fishes still contain many gaps and incomplete regions, especially in regions rich in repetitive sequences, such as telomeres and centromeres¹³. Telomeres reflect the health and longevity of organisms and play a crucial role in genome stability and DNA damage repair^14,15. Variations in telomere length may be associated with genetic characteristics and environmental stress¹⁶. Centromeres are chromosomal regions that attach to the spindle during cell division, ensuring the equitable distribution of genetic material between daughter cells¹⁷. For plateau fishes such as T. yaopeizhii, the genetic information in these regions may play an important role in evolutionary adaptation to the unique plateau environment. With the continuous development of genome sequencing and assembly methods, telomere-to-telomere (T2T) gapless assembly of chromosome sequences has become possible¹⁸. T2T genome assembly has made it possible to explore the previously uncharted territories of telomeres, centromeres, and genome gaps, opening more in-depth directions for biological research¹⁹. T2T genome assemblies have been reported for several animals, including the East Friesian sheep (Ostfriesisches Milchschaf)²⁰ and the goose Anser cygnoides domesticus²¹, and, among fishes, Lateolabrax maculatus²² and Pampus argenteus²³. However, a T2T genome of a fish adapted to the extreme environment of the QTP has not been reported so far.

In this study, Pacific Biosciences (PacBio) HiFi sequencing, Oxford Nanopore Technologies (ONT) ultra-long sequencing and Hi-C assisted assembly technology were used to assemble the first high-quality T2T genome of a high-altitude adapted fish, T. yaopeizhii. The telomeres and centromeres of each chromosome were also detected. This is the first T2T genome in the genus Triplophysa. This study not only contributes to the population genetic and evolutionary analysis of T. yaopeizhii, but also provides important data for the study of the genetic mechanism of plateau adaptation in fish.

Methods

Ethics statement

All experimental protocols utilized in this study have been approved by the Animal Experimental Ethical Inspection of Laboratory Animal Center, Huazhong Agricultural University, Wuhan, China.

Sample collection and sequencing

The sample of T. yaopeizhii was obtained from the Anning River (Fig. 1A), a secondary tributary of the Jinsha River in Xichang City, Sichuan Province. High-quality genomic DNA (gDNA) was isolated from a muscle sample using the conventional sodium dodecyl sulfate (SDS) method, followed by purification using the QIAGEN® genomic kit (Cat# 13343, QIAGEN). DNA integrity and contamination were assessed by 0.75% agarose gel electrophoresis. DNA purity was then measured with a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA); the OD 260/280 and OD 260/230 absorbance ratios ranged from 1.8 to 2.0 and from 2.0 to 2.2, respectively. Finally, the DNA was quantified using the Qubit® 3.0 Fluorometric system (Invitrogen, USA).

The SMRTbell target library was prepared in accordance with the established protocol (Pacific Biosciences, CA, USA). Library preparation entailed several key steps. Initially, gDNA was sheared into small fragments using a g-TUBE (Covaris, USA). The DNA fragments then underwent enzymatic repair, followed by blunt-end ligation of hairpin adapters from the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA). The target molecules were size-selected with BluePippin (Sage Science, USA), and the library was finally purified using AMPure PB beads (Pacific Biosciences, CA, USA). The fragment size distribution of the library was verified by electrophoretic analysis on an Agilent 2100 Bioanalyzer system (Agilent Technologies, USA). Circular consensus sequencing (CCS) was then performed on the PacBio Sequel II platform using the Sequel II Sequencing Kit 2.0 (Nextomics, Wuhan).

Magnetic beads were used to enrich and purify large gDNA fragments (>15 kb). Damage repair and end repair were then performed on the DNA fragments. After purification, the ONT ultra-long sequencing library was prepared by A-tailing the DNA fragments, followed by adapter ligation using the SQK-LSK109 kit (Oxford Nanopore Technologies, Oxford, UK)²⁴. The processed library was loaded into R10 Spot-On Flow Cells, and nanopore sequencing was carried out on a PromethION P48 sequencer (Oxford Nanopore Technologies, Oxford, UK). For data preprocessing, Porechop (v0.2.4)²⁵ was used to remove adapter sequences, while Filtlong (v0.2.4) (https://github.com/rrwick/Filtlong) was used for quality-based read selection. Reads with a length of ≥30 kb and a mean quality score exceeding 90% were retained for downstream assembly.
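
The length and quality thresholds described above can be illustrated with a minimal Python sketch. It assumes that reads are available as (name, sequence, quality string) tuples with Phred+33 encoding, and it interprets "mean quality exceeding 90%" as a mean per-base accuracy above 0.90. Filtlong computes its scores differently in detail, so this only illustrates the thresholds and does not reimplement the tool; the function names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, Phred+33 quality string)

MIN_LENGTH = 30_000        # >= 30 kb, as stated in the text
MIN_MEAN_ACCURACY = 0.90   # assumption: "90%" read as mean per-base accuracy


def mean_accuracy(quality: str) -> float:
    """Mean per-base accuracy implied by a Phred+33 quality string."""
    if not quality:
        return 0.0
    # Phred score Q corresponds to an error probability of 10^(-Q/10).
    accuracies = [1.0 - 10 ** (-(ord(char) - 33) / 10) for char in quality]
    return sum(accuracies) / len(accuracies)


def filter_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield only reads that pass both the length and the quality threshold."""
    for name, seq, qual in reads:
        if len(seq) >= MIN_LENGTH and mean_accuracy(qual) > MIN_MEAN_ACCURACY:
            yield name, seq, qual
```

In this sketch a read is discarded as soon as either criterion fails, which mirrors the "and" in the stated retention rule.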

To attain a chromosome-level genome assembly, a Hi-C library was constructed from isolated high-quality gDNA²⁶. The workflow comprised the following steps: cell lysis after crosslinking; enzymatic cleavage of DNA with the DpnII restriction enzyme; biotinylation of fragment termini; blunt-end ligation; and DNA purification to generate Hi-C templates. Then, biotin was removed from the ends of the Hi-C fragments, the fragments were sheared by sonication, the ends were repaired, an A base was added, and sequencing adapters were ligated to form the ligation product. After that, suitable PCR conditions were selected and the product was amplified to obtain the Hi-C library. High-throughput paired-end sequencing (PE150) was performed on an Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA). The raw Hi-C data were then quality-controlled with Juicer (v1.6)²⁷ using default parameters to obtain clean Hi-C data.

To aid genome annotation, RNA was extracted from 10 tissue samples (muscle, brain, kidney, liver, stomach, intestine, fin, gills, gonads and skin) using the TRIzol Universal total RNA Extraction kit (Tiangen). RNA concentration and integrity were evaluated with the Agilent 2100 Bioanalyzer system and the Agilent RNA 6000 Nano Kit. Libraries were then constructed with the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) and sequenced on the Illumina HiSeq X Ten platform. In total, the sequencing described above yielded 36.96 Gb (~55.04 × coverage) of ONT data with a read N50 of 100 kb and 27.19 Gb (~44.49 × coverage) of PacBio HiFi CCS data with a read N50 of 18.54 kb. Additionally, 66.47 Gb of raw Hi-C data (~98.97 × coverage) and 65.33 Gb of clean Hi-C data (~97.28 × coverage) were obtained (Table 1).
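
Sequencing depth is conventionally the total number of sequenced bases divided by the genome size. The short Python sketch below applies that relation to the ONT and Hi-C yields quoted above, assuming that the data volumes are counted in bases and that the denominator is the final assembly size of 671.58 Mb (the text does not state which genome size was used). Because the inputs are rounded, small differences from the reported coverage values are expected.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Average depth of coverage = sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Assumption: coverage is computed against the final assembly size.
GENOME_SIZE_BP = 671.58e6

# Data volumes taken from the text, interpreted as numbers of bases.
datasets_bp = {
    "ONT ultra-long": 36.96e9,
    "Hi-C raw": 66.47e9,
    "Hi-C clean": 65.33e9,
}

for name, bases in datasets_bp.items():
    depth = sequencing_depth(bases, GENOME_SIZE_BP)
    print(f"{name}: ~{depth:.2f}x")
```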

Table 1 Summary of sequencing data of T. yaopeizhii genome.


Genome assembly

The genome was assembled from a combination of PacBio HiFi reads, ultra-long ONT data, and Hi-C data (Fig. 1B). NextDenovo (v2.4.0)²⁸ and Hifiasm (v0.15.1)²⁹ were used with default parameters to assemble the ONT and PacBio HiFi sequencing data, respectively. Clean Hi-C reads were mapped to the corresponding draft contigs using BWA (v0.7.17)³⁰. Trimmomatic (v0.40)³¹ was applied to remove low-quality reads with the parameters LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50. Processed Hi-C data were analyzed with the 3D-DNA pipeline (https://github.com/aidenlab/3d-dna)³² for automatic clustering, ordering, and orientation of contigs. JuiceBox (v2.13.07)²⁷ was used to visualize assembly errors and to examine the interaction frequencies between chromosomes. The interaction heatmap was used to identify and correct errors in contig order and orientation, as well as misassemblies within contigs and chromosome regions (Fig. 1C). Gaps were closed using Winnowmap (v1.11)³³ (parameters: k = 15, -MD) by aligning the contigs from the NextDenovo and Hifiasm assemblies to the unresolved genomic regions. Gap sequences were replaced with the longest and most consistent sequence among the aligned reads. Finally, a gapless T2T genome was obtained, with a genome size of 671.58 Mb, a contig N50 length of 26.04 Mb, and a GC content of 39.11%. The genomic sequences were anchored without gaps onto 25 chromosomes (Table 2).
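
For readers unfamiliar with the summary statistics quoted above, the following Python sketch shows how N50 and GC content are commonly defined. It is a generic illustration under the usual definitions, not the software the authors used to produce Table 2, and the function names are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return ordered[-1]  # not reached for positive lengths


def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc / (gc + at)
```

For example, `n50([10, 8, 6, 4, 2])` returns 8, because the two longest sequences (10 + 8 = 18) already cover at least half of the total length of 30.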

Table 2 Summary statistics of T. yaopeizhii genome assembly.


Telomere and centromere identification

Ultra-long ONT data were mapped to the genome using Winnowmap (v1.11) with parameters: k = 15, -MD. Reads that aligned uniquely within 50 bp of chromosome ends were collected, and the number of telomere repeat motifs in each read was counted. The read with the highest motif count was defined as the reference (ref), and the others were defined as queries. Using Medaka_consensus (v1.7.2) (https://github.com/nanoporetech/medaka) with the parameter -m r941_min_high_g360, the ref telomere read and the query telomere reads were reassembled to obtain a consensus sequence. Finally, Nucmer (v3.1)³⁴ was used to align the telomere consensus sequences to each chromosome to determine whether the contigs at the chromosome ends contained telomeric repeat motifs. The terminal telomere sequences were replaced with the best alignment results. No replacement was made if the identity value was ≤ 80 or the aligned region was not within the last 20 kb of the chromosome. Telomeres were detected at both ends of all 25 chromosomes in the genome. TRASH was used to identify all tandem repeat monomers, and the most frequent monomer was selected as the representative centromeric monomer based on its period and copy number. Subsequently, StringDecomposer (v1.1.2)³⁵ was used to map the representative monomer sequence to the chromosomes and locate all centromeric repeats; the centromere position on each chromosome was taken as the interval from the first to the last repeat, extended by 10 kb on each side (Table 3, Fig. 2A).
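
The step of counting telomere repeat motifs and choosing a reference read can be sketched in Python as follows. The text does not name the motif, so the sketch assumes the canonical vertebrate repeat TTAGGG and its reverse complement CCCTAA; the function names and the dictionary-based input are likewise hypothetical, and the consensus step performed with Medaka is not reproduced here.

```python
from typing import Dict, List, Tuple

# Assumption: canonical vertebrate telomere repeat and its reverse complement.
TELOMERE_MOTIFS = ("TTAGGG", "CCCTAA")


def motif_count(sequence: str, motifs: Tuple[str, ...] = TELOMERE_MOTIFS) -> int:
    """Count non-overlapping occurrences of the motifs in either orientation."""
    seq = sequence.upper()
    return sum(seq.count(motif) for motif in motifs)


def split_ref_and_queries(reads: Dict[str, str]) -> Tuple[str, List[str]]:
    """Pick the read with the most motif copies as ref; the rest are queries."""
    if not reads:
        raise ValueError("no reads supplied")
    # max() keeps the first read encountered when several reads tie.
    ref = max(reads, key=lambda name: motif_count(reads[name]))
    queries = [name for name in reads if name != ref]
    return ref, queries
```

Given reads collected near one chromosome end, `split_ref_and_queries` returns the name of the reference read and the names of the remaining query reads, which would then be passed to a consensus tool.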

Table 3 Telomere and centromere positions of T. yaopeizhii genome.


Repetitive sequences annotation

Repetitive sequences were annotated by combining homology-based and de novo prediction. Homology prediction was performed using RepeatMasker (v4.0.9)³⁶ and RepeatProteinMask (v4.0.9)³⁶ based on the RepBase library (http://www.girinst.org/repbase)³⁷. De novo prediction was performed with RepeatModeler (v1.0.11)³⁸ and LTR-FINDER (v1.0.5)³⁹, which rely on self-alignment of the sequence and on the structural characteristics of repeats. Additionally, tandem repeats were identified with Tandem Repeats Finder (v4.09)⁴⁰. The annotation shows that repetitive sequences total 293.98 Mb, accounting for 43.77% of the genome (Table 4). Among these repeats, SINEs accounted for 0.54% of the genome size, LINEs for 7.02%, LTRs for 10.40%, and DNA elements for 20.90% (Table 5).

Table 4 Statistical results of repetitive sequences in T. yaopeizhii genome.


Table 5 Types of repetitive sequences in T. yaopeizhii genome.


Prediction and functional annotation of protein-coding genes

In this study, protein-coding genes in the genome of T. yaopeizhii were predicted by combining de novo prediction, homology prediction, and transcriptome-based prediction. De novo prediction of gene structure was performed using Augustus (v3.3)⁴¹ and GlimmerHMM (v3.0.4)⁴². Homology prediction was conducted with Exonerate (v2.4)⁴⁷ using six closely related species: T. longibarbata (unpublished data), T. bombifrons (unpublished data), T. dalaica⁴³, T. tibetana⁴⁴, T. rosa⁴⁵ and T. yarkandensis⁴⁶. The protein-coding sequences of these species were aligned to the genome of the target species to predict genes. RNA-seq reads aligned to the genome were assembled into transcripts with StringTie (v2.1.1)⁴⁸, and coding regions were predicted using PASA (v2.4.1)⁴⁹. MAKER (v3.00)⁵⁰ integrated the gene sets predicted by the different methods into a non-redundant and more complete gene set, which was then corrected by PASA (v2.4.1) using the transcriptome data. For functional annotation, BLASTP (v2.6.0)^51,52 was used to compare the predicted genes against the Kyoto Encyclopedia of Genes and Genomes (KEGG)⁵³, Gene Ontology (GO)⁵⁴, NCBI non-redundant protein (NR)⁵⁵, Swiss-Prot⁵⁶, TrEMBL⁵⁶, and InterPro⁵⁷ databases. Overall, 26,487 protein-coding genes were predicted in the genome. The average gene length was 12,653.76 bp, the average coding sequence length was 1,512.01 bp, and the average number of exons per gene was 8.82 (Table 6). Functional annotation showed that 25,686 genes were annotated in at least one database, accounting for 96.98% of the predicted genes (Fig. 2B, Table 7).

Table 6 Statistical results of gene prediction.


Table 7 Statistical results of gene function annotation.


Annotation of non-coding RNAs

tRNAscan-SE (v1.3.1)⁵⁸ was used with default parameters to identify tRNA sequences in the genome based on the structural characteristics of tRNA. Using rRNA sequences from closely related species as references, rRNAs in the genome were searched with BLASTN (v2.6.0)⁵⁹ with E-value < 1e-5, identity ≥85%, and match length ≥50 bp. In addition, Infernal (v1.1.2)⁶⁰ was used with default parameters together with the covariance models of the Rfam database (v14.1; http://xfam.org/)⁶¹ to predict miRNA and snRNA sequences in the genome. As a result, we annotated 406 miRNAs, 23,001 tRNAs, 124 rRNAs, and 1,480 snRNAs (Table 8).
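
The three rRNA search thresholds above can be expressed as a simple filter over BLASTN hits. The Python sketch below assumes that the hits were written in the standard 12-column tabular format (BLAST+ `-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that the 85% threshold refers to percent identity; neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import Iterable, List

MAX_EVALUE = 1e-5     # E-value must be below this value
MIN_IDENTITY = 85.0   # percent identity (assumed meaning of the 85% threshold)
MIN_LENGTH = 50       # alignment length in bp


def passes_thresholds(line: str) -> bool:
    """Check one tab-separated BLAST hit (outfmt 6 column order assumed)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated columns")
    identity = float(fields[2])
    length = int(fields[3])
    evalue = float(fields[10])
    return evalue < MAX_EVALUE and identity >= MIN_IDENTITY and length >= MIN_LENGTH


def filter_hits(lines: Iterable[str]) -> List[str]:
    """Keep non-empty, non-comment hit lines that satisfy all three thresholds."""
    return [
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.startswith("#") and passes_thresholds(line)
    ]
```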

Table 8 Statistical results of non-coding RNA annotation.


Data Records

All raw data of the whole genome have been deposited in the National Center for Biotechnology Information (NCBI) SRA database under BioProject accession number PRJNA1195554. The genomic PacBio sequencing data, the ultra-long ONT sequencing data, the Hi-C sequencing data, and the RNA sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRP551230⁶². The genome assembly has been deposited at GenBank under the accession GCA_048296945.1⁶³. The files for repetitive sequence annotation, gene structure annotation, predicted coding sequences, protein sequences, and ncRNA annotation have been deposited at Figshare⁶⁴.

Technical Validation

Multiple methods were used to verify the accuracy and integrity of the T. yaopeizhii genome. Firstly, the Hi-C heatmap of the T. yaopeizhii genome showed a high degree of consistency among all chromosomes, reflecting the accuracy of the sequencing and of the ordering and orientation of contigs in the genome assembly (Fig. 1C). Secondly, a total of 25 centromeres were mapped on the 25 chromosomes, and all telomeres were identified. These results provided significant evidence for the integrity of the chromosomes (Fig. 2A). Subsequently, the Illumina sequencing data were aligned to the genome using BWA (v0.7.17), achieving a mapping rate of 98.81%. According to the alignment results from Winnowmap (v1.11), 99.91% of ONT reads and 99.96% of HiFi reads could be aligned to the T2T assembly. In addition, gene features, including the distribution of genes, CDS, exons, and introns, were compared statistically between T. yaopeizhii and related species. The distributions were consistent among these species, supporting the accuracy of the genome (Fig. 3). Notably, a comparison of genome integrity computed with KAT⁶⁵ showed that the T2T genome has higher integrity than the genomes of closely related species (Table 9). Finally, the quality value (QV) of the assembly was quantified using Merqury (v1.3)⁶⁶, resulting in a QV of 31.73 (Table 2). The completeness of the genome assembly and of the protein-coding gene set was assessed using BUSCO (v4.0.5)⁶⁷ with the actinopterygii_odb10 single-copy ortholog set from the OrthoDB database. The results showed that 96.62% and 92.50% of the 3,640 single-copy orthologs were identified in the assembly and the gene set, respectively (Fig. 4). In conclusion, the T2T genome of T. yaopeizhii is highly complete and accurate. This high-quality genome provides a robust foundation for investigating the evolutionary and adaptive mechanisms of plateau fish in response to the unique environmental conditions of the plateau.
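
A consensus quality value is a Phred-scaled error rate, QV = -10 log10(E), so the reported QV can be translated into an approximate per-base error probability. The Python sketch below performs that conversion in both directions. It illustrates the general Phred relation only; it does not reproduce the k-mer-based estimation that Merqury carries out, and the function names are hypothetical.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# QV reported in the Technical Validation text above.
reported_qv = 31.73
print(f"QV {reported_qv} ~ {qv_to_error_rate(reported_qv):.2e} errors per base")
```

Under this relation, a QV of 31.73 corresponds to an error rate of roughly 6.7 × 10⁻⁴, that is, on the order of seven errors per 10,000 bases.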

Table 9 Integrity assessment of genomes.


Code availability

All commands and pipelines used in data processing were executed according to the manual and protocols of the corresponding bioinformatic software. No specific code has been developed for this study.

References

Gao, Y. & Liu, Y. C. Conservation of fishes of Triplophysa in the plateau. Xizang Sci Technol, 35–39 (2021).
He, C. L., Song, Z. B. & Zhang, E. Triplophysa fishes in China and the status of its taxonomic studies. Sichuan J Zool 30, 150–155 (2011).
Xiao, H. & Dai, Y. G. A Review of Study on Diversity of Triplophysa in China. Fish Sci 30, 53–57, https://doi.org/10.16378/j.cnki.1003-1111.2011.01.016 (2011).
Zhang, L. R., Ji, B. W., Nie, Z. L. & Wei, J. Age Structure and Growth Characteristics of Loach Triplophysa tenuis in Muzati River, Xinjiang. Chin. J. Fish. 37, 30–37 (2024).
Xu, X., Chen, Y. R., Li, T. T., Ren, Y. L. & Nie, Z. L. Morphological characteristics and their correlation analysis of Triplophysa bombifrons from Keriya River in Xinjiang area. Jiangsu Agric Sci 48, 192–197, https://doi.org/10.15889/j.issn.1002-1302.2020.08.036 (2020).
Liu, M. Y., Yang, R. B., Yang, X. F., Fan, Q. X. & Wei, K. J. Characteristics of the morphology and histology of digestive tract of Triplophysa tibetana, Triplophysa stenura and Triplophysa stewarti. Acta Hydrobiol. Sin. 42, 342–348 (2018).
Xie, J., Xia, Y., Yan, Y., Liang, W. & Ren, C. Reproductive cycle of Triplophysa stenura (Herzenstein, 1888) (Balitoridae: Nemacheilinae) from the Yarlung Tsangpo River in the Tibetan Plateau, China. J. Appl. Ichthyol 33, 37–41 (2017).
Zhao, Y. H., Zhang, J. & Zhang, C. G. Fish diversity in the Tibetan Plateau. Biol. Bull 43, 8–10, https://doi.org/10.3969/j.issn.0006-3193.2008.07.003 (2008).
Zhang, C. G. et al. Fishes in the Jinsha Jiang River Basin, the Upper Reaches of the Yangtze River, China (Science Press, 2019).
Xu, T. Q. & Zhang, C. G. A New Species of Cobitid Fish from Tibet, China (Cypriniformes: Cobitidae). Zool. Syst., 377–379 (1996).
Sun, H. Y., Sui, X. Y., He, D. K., Li, X. Q. & Chen, Y. F. Fish Systematic Conservation Planning in the JinSha River Basin. Acta Hydrobiol Sin., 110–118 (2019).
Xu, G. C., Du, F. K., Bian, C., Shi, Q. & Xu, P. Research Progress on Fish Genomics. Biotech Bull. 33, 23–31, https://doi.org/10.13560/j.cnki.biotech.bull.1985.2017-0290 (2017).
Li, H. & Durbin, R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 25, 658–670, https://doi.org/10.1038/s41576-024-00718-w (2024).
Chan, S. R. & Blackburn, E. H. Telomeres and telomerase. Philos Trans R Soc Lond B Biol Sci 359, 109–121, https://doi.org/10.1098/rstb.2003.1370 (2004).
O’Sullivan, R. J. & Karlseder, J. Telomeres: protecting chromosomes against genome instability. Nat Rev Mol Cell Biol 11, 171–181, https://doi.org/10.1038/nrm2848 (2010).
Bauch, C., Boonekamp, J. J., Korsten, P., Mulder, E. & Verhulst, S. High heritability of telomere length and low heritability of telomere shortening in wild birds. Mol Ecol 31, 6308–6323, https://doi.org/10.1111/mec.16183 (2022).
Kursel, L. E. & Malik, H. S. Centromeres. Curr Biol 26, R487–R490, https://doi.org/10.1016/j.cub.2016.05.031 (2016).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53, https://doi.org/10.1126/science.abj6987 (2022).
Hou, X., Wang, D., Cheng, Z., Wang, Y. & Jiao, Y. A near-complete assembly of an Arabidopsis thaliana genome. Mol Plant 15, 1247–1250, https://doi.org/10.1016/j.molp.2022.05.014 (2022).
You, X. et al. A near complete genome assembly of the East Friesian sheep genome. Sci Data 11, 762, https://doi.org/10.1038/s41597-024-03581-w (2024).
Zhao, H. et al. Telomere-to-telomere genome assembly of the goose Anser cygnoides. Sci Data 11, 741, https://doi.org/10.1038/s41597-024-03567-8 (2024).
Sun, Z. et al. Telomere-to-telomere gapless genome assembly of the Chinese sea bass (Lateolabrax maculatus). Sci Data 11, 175, https://doi.org/10.1038/s41597-024-02988-9 (2024).
Hu, J. et al. Two high quality chromosome-scale genome assemblies of female and male silver pomfret (Pampus argenteus). Sci Data 11, 1100, https://doi.org/10.1038/s41597-024-03914-9 (2024).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol 38, 1044–1053, https://doi.org/10.1038/s41587-020-0503-6 (2020).
Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom 3, e000132, https://doi.org/10.1099/mgen.0.000132 (2017).
Belton, J. M. et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268–276, https://doi.org/10.1016/j.ymeth.2012.05.001 (2012).
Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255, https://doi.org/10.1093/bioinformatics/btz891 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760, https://doi.org/10.1093/bioinformatics/btp324 (2009).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120, https://doi.org/10.1093/bioinformatics/btu170 (2014).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, https://doi.org/10.1126/science.aal3327 (2017).
Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118, https://doi.org/10.1093/bioinformatics/btaa435 (2020).
Marcais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14, e1005944, https://doi.org/10.1371/journal.pcbi.1005944 (2018).
Dvorkina, T., Bzikadze, A. V. & Pevzner, P. A. The string decomposition problem and its applications to centromere analysis and assembly. Bioinformatics 36, i93–i101, https://doi.org/10.1093/bioinformatics/btaa454 (2020).
Tarailo-Graovac, M. & Chen, N. S. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4, 4.10.1–4.10.14, https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6, 11, https://doi.org/10.1186/s13100-015-0041-9 (2015).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 35, W265–268, https://doi.org/10.1093/nar/gkm286 (2007).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573–580, https://doi.org/10.1093/nar/27.2.573 (1999).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34, W435–439, https://doi.org/10.1093/nar/gkl200 (2006).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879, https://doi.org/10.1093/bioinformatics/bth315 (2004).
Zhou, C. et al. The Chromosome-Level Genome of Triplophysa dalaica (Cypriniformes: Cobitidae) Provides Insights into Its Survival in Extremely Alkaline Environment. Genome Biol Evol 13, https://doi.org/10.1093/gbe/evab153 (2021).
Yang, X. et al. Chromosome-level genome assembly of Triplophysa tibetana, a fish adapted to the harsh high-altitude environment of the Tibetan Plateau. Mol Ecol Resour 19, 1027–1036, https://doi.org/10.1111/1755-0998.13021 (2019).
Zhao, Q., Shao, F., Li, Y., Yi, S. V. & Peng, Z. Novel genome sequence of Chinese cavefish (Triplophysa rosa) reveals pervasive relaxation of natural selection in cavefish genomes. Mol Ecol 31, 5831–5845, https://doi.org/10.1111/mec.16700 (2022).
She, J., Chen, S., Liu, X. & Huo, B. Chromosome-level assembly of Triplophysa yarkandensis genome based on the single molecule real-time sequencing. Sci Data 11, 39, https://doi.org/10.1038/s41597-023-02900-x (2024).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654–5666, https://doi.org/10.1093/nar/gkg770 (2003).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491, https://doi.org/10.1186/1471-2105-12-491 (2011).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403–410, https://doi.org/10.1016/S0022-2836(05)80360-2 (1990).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421, https://doi.org/10.1186/1471-2105-10-421 (2009).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27–30, https://doi.org/10.1093/nar/28.1.27 (2000).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat Genet 25, 25–29, https://doi.org/10.1038/75556 (2000).
Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35, D61–65, https://doi.org/10.1093/nar/gkl842 (2007).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res 27, 49–54, https://doi.org/10.1093/nar/27.1.49 (1999).
Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res 43, D213–221, https://doi.org/10.1093/nar/gku1243 (2015).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25, 955–964, https://doi.org/10.1093/nar/25.5.955 (1997).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402, https://doi.org/10.1093/nar/25.17.3389 (1997).
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337, https://doi.org/10.1093/bioinformatics/btp157 (2009).
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33, D121–124, https://doi.org/10.1093/nar/gki081 (2005).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP551230 (2025).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_048296945.1 (2025).
Ma, L. & Yang, R. B. Telomere to-telomere gapless genome assembly and annotation of Triplophysa yaopeizhii. Figshare https://doi.org/10.6084/m9.figshare.28127846.v1 (2025).
Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576, https://doi.org/10.1093/bioinformatics/btw663 (2017).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol 38, 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).

Acknowledgements

This study was supported by the National Natural Science Foundation of China (Grant Number: 31971421) and the Breeding Program for Endemic Fish Species in the Jinsha River (T-2022-04). The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript.

Author information

These authors contributed equally: Li Ma, Xu Zeng.

Authors and Affiliations

College of Fisheries, Huazhong Agricultural University, Wuhan, 430070, China
Li Ma, Xu Zeng, Yongyao Yu, Ruibin Yang & Xuefen Yang
Yebatan Branch of Huadian Jinshajiang Upstream Hydropower Development Co., Ltd., Ganzi, 627153, China
Jixiao Wang & Hao Xiong
College of Fisheries, Southwest University, Chongqing, 402460, China
Haiping Liu
College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
Qing-Yong Yang

Contributions

R.Y. and X.Y. conceived the project. J.W., H.X. and H.L. coordinated the project and collected the samples. L.M. and X.Z. processed the samples and performed the experiments. Y.Y. and Q.Y. analyzed the data. L.M. and X.Z. drafted the manuscript with significant contributions. R.Y. and X.Y. revised and finalized the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xuefen Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ma, L., Zeng, X., Wang, J. et al. Telomere-to-telomere gapless genome assembly of Triplophysa yaopeizhii. Sci Data 12, 597 (2025). https://doi.org/10.1038/s41597-025-04943-8

Received: 15 January 2025
Accepted: 01 April 2025
Published: 10 April 2025
Version of record: 10 April 2025
DOI: https://doi.org/10.1038/s41597-025-04943-8