Abstract
The genus Triplophysa exhibits remarkable adaptability to the unique environment of the Qinghai-Tibet Plateau (QTP). High-quality genomes facilitate the study of adaptation to the extreme plateau environment. In this study, PacBio HiFi, ultra-long ONT, and Hi-C sequencing of Triplophysa yaopeizhii were used to construct the first telomere-to-telomere (T2T) gapless genome assembly for the genus Triplophysa. The genome size is 671.58 Mb, with a contig N50 length of 26.04 Mb. The sequences were anchored onto 25 chromosomes, with all centromeres and telomeres identified. Furthermore, 293.98 Mb (43.77%) of repetitive sequences and 26,487 protein-coding genes were identified. Comparative analyses with the genomes of closely related species demonstrated the high completeness, continuity, and accuracy of the genome. The genomic quality was further substantiated by a QV of 31.82 and a BUSCO completeness of 96.60%. This study provides a valuable genetic resource for the genus Triplophysa and serves as an essential reference for elucidating the genetic mechanisms by which plateau fish adapt to high altitude.
Background & Summary
The Qinghai-Tibet Plateau (QTP), situated in western China, has a mean elevation exceeding 4,000 meters. As the highest plateau on Earth, it is commonly referred to as the ‘Roof of the World’1. The genus Triplophysa, which belongs to the family Cobitidae, is the largest of the three major fish taxa on the QTP. Triplophysa fishes are mostly small to medium-sized and are mainly distributed in the rivers, lakes, and other waters of the QTP and its surroundings, from plains and basins to high altitudes2,3. Triplophysa is distinguished by biological characteristics that enable adaptation to the extreme environments of high-altitude plateaus. These fish have a cylindrical body shape and a well-developed caudal peduncle, facilitating efficient movement in rapidly flowing plateau waters4,5. Their enlarged mouth gape and specialized digestive system enhance feeding and nutrient absorption in food-scarce high-altitude ecosystems6. Furthermore, these fishes employ distinctive reproductive strategies, including prolonged reproductive periods and higher fecundability, to ensure population sustainability in cold plateau conditions7. Their evolutionary adaptations have been shaped by the specific geographical and environmental constraints of the QTP, resulting in distinctive genetic mechanisms that facilitate adaptation to this extreme environment1,8. Triplophysa yaopeizhii is predominantly distributed in the upper reaches of the Jinsha River and prefers rivers with sandy substrates and high flow rates9,10. In recent years, hydropower development in the Jinsha River basin and other anthropogenic factors have led to a decline in populations of indigenous fishes such as T. yaopeizhii11. T. yaopeizhii is well adapted to the complex environment of the QTP. It is an important target for studying the mechanisms of species formation and evolution in the genus Triplophysa, as well as a special model for understanding the genetic basis of plateau adaptation. However, few studies of T. yaopeizhii are currently available, and its genetic characteristics remain unclear.
Whole-genome sequencing has become a common tool for studying genetic information and evolutionary mechanisms in fish, with applications in areas such as fish breeding and biodiversity conservation12. As an indigenous fish living on the QTP, T. yaopeizhii has a genome that can provide powerful insights into the genetic basis of high-altitude adaptation. However, the currently published genomes of plateau fishes still contain many gaps and incomplete regions, especially in regions rich in repetitive sequences, such as telomeres and centromeres13. Telomeres reflect the health and longevity of organisms and play a crucial role in genome stability and DNA damage repair14,15. Variations in telomere length may be associated with genetic characteristics and environmental stress16. Centromeres are chromosomal regions that attach to the spindle during cell division, ensuring the equitable distribution of genetic material between daughter cells17. For plateau fishes such as T. yaopeizhii, the genetic information in these regions may play an important role in evolutionary adaptation to the unique plateau environment. With the continuous development of genome sequencing and assembly methods, telomere-to-telomere gapless assembly of chromosome sequences has become possible18. T2T genome assembly makes it possible to explore the previously uncharted telomeres, centromeres, and genome gaps, and opens more in-depth directions for biological research19. T2T genome assemblies have been reported for several animals, including the East Friesian sheep (Ostfriesisches Milchschaf)20 and Anser cygnoides domesticus21, and, among fishes, Lateolabrax maculatus22 and Pampus argenteus23. However, no T2T genome has so far been reported for a fish adapted to the extreme environment of the QTP.
In this study, Pacific Biosciences (PacBio) HiFi sequencing, Oxford Nanopore Technologies (ONT) ultra-long sequencing and Hi-C assisted assembly technology were used to assemble the first high-quality T2T genome of a high-altitude adapted fish, T. yaopeizhii. The telomeres and centromeres of each chromosome were also detected. This is the first T2T genome in the genus Triplophysa. This study not only contributes to the population genetic and evolutionary analysis of T. yaopeizhii, but also provides important data for the study of the genetic mechanism of plateau adaptation in fish.
Methods
Ethics statement
All experimental protocols utilized in this study have been approved by the Animal Experimental Ethical Inspection of Laboratory Animal Center, Huazhong Agricultural University, Wuhan, China.
Sample collection and sequencing
The sample of T. yaopeizhii was obtained from the Anning River (Fig. 1A), a secondary tributary of the Jinsha River in Xichang City, Sichuan Province. High-quality genomic DNA (gDNA) was isolated from muscle tissue using the conventional sodium dodecyl sulfate (SDS) method, followed by purification with the QIAGEN® genomic kit (Cat# 13343, QIAGEN). DNA integrity and contamination were assessed by 0.75% agarose gel electrophoresis. DNA purity was then measured with a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA); the OD 260/280 and OD 260/230 absorbance ratios ranged from 1.8 to 2.0 and from 2.0 to 2.2, respectively. Finally, the DNA was quantified using the Qubit® 3.0 Fluorometric system (Invitrogen, USA).
Fig. 1 Sample photograph and genome map. (A) Photograph of T. yaopeizhii. (B) Circos plot of genomic features, arranged from outside to inside: (a) chromosomes, (b) gene density, (c) GC density, (d) TE density, (e) TRF density, and (f) collinearity within the genome. (C) Hi-C heatmap of chromosome interactions. Chr1–Chr25 denote the 25 chromosomes. The abscissa and ordinate represent the order of each bin on the corresponding chromosome. The color from light to dark indicates interaction strength from low to high.
The SMRTbell target library was prepared in accordance with the established protocol (Pacific Biosciences, CA, USA). Library preparation comprised several key steps. First, gDNA was sheared into fragments using a g-TUBE (Covaris, USA). The fragments then underwent enzymatic repair to restore their integrity, followed by blunt-end ligation of hairpin adapters from the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA). The target molecules were size-selected and purified with BluePippin (Sage Science, USA). Finally, the library was purified using AMPure PB beads (Pacific Biosciences, CA, USA). The fragment size distribution of the library was verified by electrophoretic analysis on an Agilent 2100 Bioanalyzer system (Agilent Technologies, USA). Circular consensus sequencing (CCS) was then performed on the PacBio Sequel II platform using the Sequel II Sequencing Kit 2.0 (Nextomics, Wuhan).
Magnetic beads were used to enrich and purify large gDNA fragments (>15 kb). Damage repair and end repair were then performed on the fragmented DNA. After purification, the ONT ultra-long sequencing library was prepared by adding an A base to the ends of the DNA fragments, followed by adapter ligation using the SQK-LSK109 kit (Oxford Nanopore Technologies, Oxford, UK)24. The library was loaded onto R10 Spot-On Flow Cells, and nanopore sequencing was carried out on a PromethION P48 sequencer (Oxford Nanopore Technologies, Oxford, UK). For data preprocessing, Porechop (v0.2.4)25 was used to remove adapter sequences, and Filtlong (v0.2.4) (https://github.com/rrwick/Filtlong) was used for quality-based read selection. Reads of ≥30 kb in length with mean quality scores exceeding 90% were retained for downstream assembly.
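The read-retention rule described above (length ≥30 kb, mean quality above 90%) can be illustrated with a minimal Python sketch. This is not Filtlong's scoring method. It assumes, purely for illustration, that the 90% threshold refers to the mean per-base accuracy implied by Phred+33 quality characters; the function names and the in-memory read tuples are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# A read is represented here as (name, sequence, quality string).
Read = Tuple[str, str, str]

MIN_LENGTH = 30_000        # >= 30 kb, as stated in the text
MIN_MEAN_ACCURACY = 0.90   # assumption: "90%" read as mean per-base accuracy


def mean_accuracy(quality: str, offset: int = 33) -> float:
    """Mean per-base accuracy implied by Phred-encoded quality characters."""
    if not quality:
        return 0.0
    # Phred score Q corresponds to an error probability of 10^(-Q/10).
    accuracies = [1.0 - 10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return sum(accuracies) / len(accuracies)


def filter_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield only reads that pass both the length and the quality threshold."""
    for name, seq, qual in reads:
        if len(seq) >= MIN_LENGTH and mean_accuracy(qual) > MIN_MEAN_ACCURACY:
            yield name, seq, qual
```

In practice the thresholds would be passed to the filtering tool itself; the sketch only makes the two criteria explicit.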
To attain a chromosome-level genome assembly, a Hi-C library was constructed from isolated high-quality gDNA26. The workflow comprised the following steps: cell lysis after crosslinking; digestion of DNA with the DpnII restriction enzyme; biotinylation of fragment termini; blunt-end ligation; and DNA purification to generate Hi-C templates. The biotin ends were then removed from the Hi-C fragments, the fragments were sheared by sonication, the ends were repaired, an A base was added, and sequencing adapters were ligated. Suitable PCR conditions were then selected, and the products were amplified to obtain the Hi-C library. High-throughput paired-end sequencing (PE150) was performed on an Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA). The raw Hi-C data were quality-controlled with Juicer (v1.6)27 using default parameters to obtain clean Hi-C data.
To aid genome annotation, RNA was extracted from 10 tissue samples (muscle, brain, kidney, liver, stomach, intestine, fin, gills, gonads and skin) using the TRIzol Universal total RNA Extraction kit (Tiangen). The Agilent 2100 Bioanalyzer system with the Agilent RNA 6000 Nano Kit was used to evaluate RNA concentration and integrity. Libraries were then constructed with the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) and sequenced on the Illumina HiSeq X Ten platform. In total, 36.96 GB (~55.04 × coverage) of ONT data with a read N50 of 100 kb and 27.19 GB (~40.49 × coverage) of PacBio HiFi CCS data with a read N50 of 18.54 kb were obtained. Additionally, 66.47 GB of raw Hi-C data (~98.97 × coverage) and 65.33 GB of clean Hi-C data (~97.28 × coverage) were obtained (Table 1).
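The coverage figures quoted above follow from dividing the sequenced data volume by the genome size. The short Python sketch below shows the arithmetic. It assumes that the assembled genome size of 671.58 Mb is used as the denominator (the size actually used for Table 1 may differ slightly), and the function name is illustrative.

```python
def sequencing_depth(data_gb: float, genome_mb: float) -> float:
    """Approximate fold coverage: total sequenced bases / genome size.

    data_gb   -- sequenced data volume in gigabases
    genome_mb -- genome size in megabases
    """
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb * 1000.0 / genome_mb


# Example with values from the text (assumed 671.58 Mb genome).
ont_depth = sequencing_depth(36.96, 671.58)       # ONT ultra-long reads
raw_hic_depth = sequencing_depth(66.47, 671.58)   # raw Hi-C reads
```

With these inputs the results agree with the reported ~55× and ~99× depths to within rounding.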
Genome assembly
The genome was assembled by combining PacBio HiFi reads, ultra-long ONT data, and Hi-C data (Fig. 1B). NextDenovo (v2.4.0)28 and Hifiasm (v0.15.1)29 were used with default parameters to assemble the ONT sequencing data and the PacBio HiFi sequencing data, respectively. Clean Hi-C reads were mapped to the corresponding draft contigs using BWA (v0.7.17)30. Trimmomatic (v0.40)31 was applied to remove low-quality reads with the parameters LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50. The processed Hi-C data were analyzed with the 3D-DNA pipeline (https://github.com/aidenlab/3d-dna)32 for automatic clustering, ordering, and orientation. JuiceBox (v2.13.07)27 was used to visualize errors and to examine the interaction frequencies between chromosomes. The interaction heatmap was used to identify and correct errors in contig order and orientation, as well as misassemblies within contigs and chromosome regions (Fig. 1C). Gaps were closed with Winnowmap (v1.11)33 (parameters: k = 15, -MD) by aligning contigs from the earlier NextDenovo and Hifiasm outputs to the unresolved genomic regions. Gap sequences were replaced by selecting the longest and most consistent sequence from aligned reads. Finally, a gapless T2T genome was obtained, with a genome size of 671.58 Mb, a contig N50 length of 26.04 Mb, and a GC content of 39.11%. The genomic sequences were clustered without gaps and anchored onto 25 chromosomes (Table 2).
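For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows the standard definition: the length of the sequence at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total assembly size. The function name and example lengths are illustrative and are not taken from the assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum covers half the total.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: five contigs with a total length of 20.
toy_n50 = n50([2, 3, 4, 5, 6])
```

For a gapless assembly with one contig per chromosome, the contig N50 is simply the length of the chromosome at which half of the genome has been accumulated.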
Telomere and centromere identification
Ultra-long ONT data were mapped to the genome using Winnowmap (v1.11) with parameters k = 15, -MD. Reads that aligned uniquely within 50 bp of chromosome ends were collected, and the number of telomere repeat motifs in each read was counted. The read with the highest number of motifs was defined as the reference (ref), and the others were defined as queries. Using Medaka_consensus (v1.7.2) (https://github.com/nanoporetech/medaka) with parameter -m r941_min_high_g360, the ref telomere read and the query telomere reads were reassembled to obtain a consensus sequence. Nucmer (v3.1)34 was then used to align the telomere consensus sequences to each chromosome to determine whether the contigs aligned to the chromosome ends contained telomeric repeat motifs. The terminal telomere sequences were replaced with the best alignment results. No substitutions were made if the identity value was ≤ 80% or the aligned region was not within the last 20 kb of the chromosome. Telomeres were detected at both ends of all 25 chromosomes. The software TRASH was used to identify all tandem repeat monomers, and the most frequent monomer was selected as the representative centromeric monomer based on its repeat period and copy number. Subsequently, StringDecomposer (v1.1.2)35 was used to map the representative monomer sequence to the chromosomes and locate all centromeric repeats, and the centromeric region on each chromosome was extended by 10 kb beyond the first and last repeat intervals (Table 3, Fig. 2A).
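The selection of the reference telomere read described above can be sketched in Python as follows. The text does not state the motif used; the sketch assumes the canonical vertebrate telomere repeat TTAGGG and counts it on both strands. All names and the dictionary of reads are hypothetical.

```python
from typing import Dict, List, Tuple

TELOMERE_MOTIF = "TTAGGG"  # assumption: canonical vertebrate telomere repeat

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_motif(read: str, motif: str = TELOMERE_MOTIF) -> int:
    """Count non-overlapping motif occurrences on both strands of a read."""
    read = read.upper()
    return read.count(motif) + read.count(reverse_complement(motif))


def pick_reference(reads: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (reference read name, query read names).

    The reference is the read with the most telomere motifs; all other
    reads become queries for the consensus step.
    """
    if not reads:
        raise ValueError("no reads supplied")
    ref = max(reads, key=lambda name: count_motif(reads[name]))
    queries = [name for name in reads if name != ref]
    return ref, queries
```

The reference and query reads chosen this way would then be passed to the consensus step; the sketch covers only the counting and selection logic.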
Fig. 2 Distribution of telomeres and centromeres in the T. yaopeizhii genome and Venn diagram of the functional annotation of protein-coding genes. (A) Distribution of telomeres and centromeres: triangles and circles represent telomeres and centromeres on chromosomes, respectively; red indicates high gene density and blue indicates low gene density. (B) Venn diagram of genes functionally annotated in five public databases: KEGG, GO, NR, InterPro, and SwissProt.
Repetitive sequences annotation
Repetitive sequences were annotated by combining homology prediction with de novo prediction. Homology prediction was performed using RepeatMasker (v4.0.9)36 and RepeatProteinMask (v4.0.9)36 based on the RepBase library (http://www.girinst.org/repbase)37. De novo prediction was performed with RepeatModeler (v1.0.11)38 and LTR-FINDER (v1.0.5)39, which rely on self-sequence alignment and the structural characteristics of repeat sequences. Additionally, tandem repeats were identified with Tandem Repeats Finder (v4.09)40. The annotation results show that repetitive sequences total 293.98 Mb, accounting for 43.77% of the genome (Table 4). Among these repeats, SINEs accounted for 0.54% of the genome size, LINEs for 7.02%, LTRs for 10.40%, and DNA elements for 20.90% (Table 5).
Prediction and functional annotation of protein-coding genes
Protein-coding genes in the genome of T. yaopeizhii were predicted through a combination of de novo prediction, homology prediction, and transcriptome-based prediction. De novo prediction of gene structure was performed using Augustus (v3.3)41 and GlimmerHMM (v3.0.4)42. Homology prediction was conducted with Exonerate (v2.4)47 using six closely related species: T. longibarbata (unpublished data), T. bombifrons (unpublished data), T. dalaica43, T. tibetana44, T. rosa45 and T. yarkandensis46. The protein-coding sequences of these species were aligned to the genome sequence of the target species to predict genes. The RNA-seq data were aligned to the genome, and transcripts were reconstructed with StringTie (v2.1.1)48. The coding regions were predicted using PASA (v2.4.1)49. MAKER (v3.00)50 integrated the gene sets predicted by the different methods into a non-redundant and more complete gene set. The gene predictions were then corrected by PASA (v2.4.1) in combination with the transcriptome data. For functional annotation, BLASTP (v2.6.0)51,52 was used to compare the predicted genes against the Kyoto Encyclopedia of Genes and Genomes (KEGG)53, Gene Ontology (GO)54, NCBI non-redundant protein (NR)55, Swiss-Prot56, TrEMBL56, and InterPro57 databases. Overall, 26,487 protein-coding genes were predicted in the genome. The average gene length was 12,653.76 bp, the average coding sequence length was 1,512.01 bp, and the average number of exons per gene was 8.82 (Table 6). Functional annotation showed that 25,686 genes were annotated in at least one database, accounting for 96.98% of the predicted genes (Fig. 2B, Table 7).
Annotation of non-coding RNAs
The software tRNAscan-SE (v1.3.1)58 was utilised with default parameters in order to identify tRNA sequences within the genome, in accordance with the structural characteristics of tRNA. Using rRNA sequences from closely related species as reference sequences, rRNAs in the genome were searched by BLASTN (v2.6.0)59 comparison with E-value < 1e-5, homology ≥85%, and match length ≥50 bp. In addition, the covariance model of the Rfam database (http://xfam.org/) was utilized along with Infernal (v1.1.2)60 and Rfam (v14.1)61 with default parameters, to predict the miRNA and snRNA sequence information in the genome. As a result, we annotated 406 miRNAs, 23,001 tRNAs, 124 rRNAs, and 1,480 snRNAs (Table 8).
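The rRNA hit thresholds stated above (E-value < 1e-5, homology ≥85%, match length ≥50 bp) can be applied to BLASTN tabular output as in the Python sketch below. It assumes the standard 12-column tabular format (outfmt 6) and interprets "homology" as the percent identity column; the function and type names are illustrative and not part of any BLAST tooling.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity (column 3 of outfmt 6)
    length: int      # alignment length (column 4)
    evalue: float    # expectation value (column 11)


def parse_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse tab-separated BLAST outfmt 6 lines, skipping blanks/comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield Hit(fields[0], fields[1], float(fields[2]),
                  int(fields[3]), float(fields[10]))


def passes_thresholds(hit: Hit) -> bool:
    """Apply the thresholds quoted in the text."""
    return hit.evalue < 1e-5 and hit.identity >= 85.0 and hit.length >= 50


def filter_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield only the hits that satisfy all three criteria."""
    return (hit for hit in parse_hits(lines) if passes_thresholds(hit))
```

Equivalent filtering can often be done with the search tool's own options; writing the criteria out separately makes each threshold easy to audit.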
Data Records
All raw data of the whole genome have been deposited in the National Center for Biotechnology Information (NCBI) SRA database under BioProject accession number PRJNA1195554. The genomic PacBio sequencing data, the ultra-long ONT sequencing data, the Hi-C sequencing data, and the RNA sequencing data were deposited in the NCBI Sequence Read Archive62 under accession number SRP551230. The genome assembly has been deposited at GenBank63 under accession GCA_048296945.1. The repetitive sequence annotation, gene structure annotation, predicted coding sequences, protein sequences, and ncRNA annotation files have been deposited at Figshare64.
Technical Validation
Multiple methods were used to verify the accuracy and integrity of the T. yaopeizhii genome. First, the Hi-C heatmap of the T. yaopeizhii genome showed a high degree of consistency among all chromosomes, reflecting the accuracy of the sequencing, ordering, and orientation of contigs in the genome assembly (Fig. 1C). Second, a total of 25 centromeres were mapped on the 25 chromosomes, and all telomeres were identified, providing strong evidence for the integrity of the chromosomes (Fig. 2A). Third, the Illumina sequencing data were aligned to the genome using BWA (v0.7.17), achieving a mapping rate of 98.81%. In alignments generated with Winnowmap (v1.11), 99.91% of ONT reads and 99.96% of HiFi reads aligned to the T2T assembly. In addition, gene statistics, including the length distributions of genes, CDS, exons, and introns, were compared between T. yaopeizhii and related species. The distributions are consistent among these species, supporting the accuracy of the genome annotation (Fig. 3). Notably, a comparison of genome completeness with that of closely related species using KAT65 showed that the T2T genome has higher completeness (Table 9). Finally, the quality value (QV) of the assembly was quantified using Merqury (v1.3)66, resulting in a QV of 31.73 (Table 2). The completeness of the genome assembly and of the protein-coding gene set was assessed using BUSCO (v4.0.5)67, based on the single-copy orthologous gene set actinopterygii_odb10 in the OrthoDB database. The results showed that 96.62% and 92.50% of the 3,640 single-copy orthologs were identified in the genome assembly and the protein-coding gene set, respectively (Fig. 4). In conclusion, the T2T genome of T. yaopeizhii is highly complete and accurate. This high-quality genome provides a robust foundation for investigating the evolutionary and adaptive mechanisms of plateau fish in response to the unique environmental conditions of the plateau.
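To help interpret the consensus quality value reported above, the Python sketch below converts between a Phred-scaled QV and the corresponding per-base error rate, using the standard relationship QV = -10 · log10(error rate). The function names are illustrative; the conversion is generic and does not reproduce Merqury's k-mer based estimation of the error rate itself.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# Example: the QV quoted in this section.
assembly_error_rate = qv_to_error_rate(31.73)
```

Under this relationship, a QV of about 31.7 corresponds to roughly 7 errors per 10,000 bases, that is, a consensus accuracy above 99.9%.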
Code availability
All commands and pipelines used in data processing were executed according to the manual and protocols of the corresponding bioinformatic software. No specific code has been developed for this study.
References
Gao, Y. & Liu, Y. C. Conservation of fishes of Triplophysa in the plateau. Xizang Sci Technol, 35–39 (2021).
He, C. L., Song, Z. B. & Zhang, E. Triplophysa fishes in China and the status of its taxonomic studies. Sichuan J Zool 30, 150–155 (2011).
Xiao, H. & Dai, Y. G. A Review of Study on Diversity of Triplophysa in China. Fish Sci 30, 53–57, https://doi.org/10.16378/j.cnki.1003-1111.2011.01.016 (2011).
Zhang, L. R., Ji, B. W., Nie, Z. L. & Wei, J. Age Structure and Growth Characteristics of Loach Triplophysa tenuis in Muzati River, Xinjiang. Chin. J. Fish. 37, 30–37 (2024).
Xu, X., Chen, Y. R., Li, T. T., Ren, Y. L. & Nie, Z. L. Morphological characteristics and their correlation analysis of Triplophysa bombifrons from Keriya River in Xinjiang area. Jiangsu Agric Sci 48, 192–197, https://doi.org/10.15889/j.issn.1002-1302.2020.08.036 (2020).
Liu, M. Y., Yang, R. B., Yang, X. F., Fan, Q. X. & Wei, K. J. Characteristics of the morphology and histology of digestive tract of Triplophysa tibetana, Triplophysa stenura and Triplophysa stewarti. Acta Hydrobiol. Sin. 42, 342–348 (2018).
Xie, J., Xia, Y., Yan, Y., Liang, W. & Ren, C. Reproductive cycle of Triplophysa stenura (Herzenstein, 1888) (Balitoridae: Nemacheilinae) from the Yarlung Tsangpo River in the Tibetan Plateau, China. J. Appl. Ichthyol 33, 37–41 (2017).
Zhao, Y. H., Zhang, J. & Zhang, C. G. Fish diversity in the Tibetan Plateau. Biol. Bull 43, 8–10, https://doi.org/10.3969/j.issn.0006-3193.2008.07.003 (2008).
Zhang, C. G. et al. Fishes in the Jinsha Jiang River Basin, the Upper Reaches of the Yangtze River, China (Science Press, 2019).
Xu, T. Q. & Zhang, C. G. A New Species of Cobitid Fish from Tibet, China (Cypriniformes: Cobitidae). Zool. Syst., 377–379 (1996).
Sun, H. Y., Sui, X. Y., He, D. K., Li, X. Q. & Chen, Y. F. Fish Systematic Conservation Planning in the JinSha River Basin. Acta Hydrobiol Sin., 110–118 (2019).
Xu, G. C., Du, F. K., Bian, C., Shi, Q. & Xu, P. Research Progress on Fish Genomics. Biotech Bull. 33, 23–31, https://doi.org/10.13560/j.cnki.biotech.bull.1985.2017-0290 (2017).
Li, H. & Durbin, R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 25, 658–670, https://doi.org/10.1038/s41576-024-00718-w (2024).
Chan, S. R. & Blackburn, E. H. Telomeres and telomerase. Philos Trans R Soc Lond B Biol Sci 359, 109–121, https://doi.org/10.1098/rstb.2003.1370 (2004).
O’Sullivan, R. J. & Karlseder, J. Telomeres: protecting chromosomes against genome instability. Nat Rev Mol Cell Biol 11, 171–181, https://doi.org/10.1038/nrm2848 (2010).
Bauch, C., Boonekamp, J. J., Korsten, P., Mulder, E. & Verhulst, S. High heritability of telomere length and low heritability of telomere shortening in wild birds. Mol Ecol 31, 6308–6323, https://doi.org/10.1111/mec.16183 (2022).
Kursel, L. E. & Malik, H. S. Centromeres. Curr Biol 26, R487–R490, https://doi.org/10.1016/j.cub.2016.05.031 (2016).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53, https://doi.org/10.1126/science.abj6987 (2022).
Hou, X., Wang, D., Cheng, Z., Wang, Y. & Jiao, Y. A near-complete assembly of an Arabidopsis thaliana genome. Mol Plant 15, 1247–1250, https://doi.org/10.1016/j.molp.2022.05.014 (2022).
You, X. et al. A near complete genome assembly of the East Friesian sheep genome. Sci Data 11, 762, https://doi.org/10.1038/s41597-024-03581-w (2024).
Zhao, H. et al. Telomere-to-telomere genome assembly of the goose Anser cygnoides. Sci Data 11, 741, https://doi.org/10.1038/s41597-024-03567-8 (2024).
Sun, Z. et al. Telomere-to-telomere gapless genome assembly of the Chinese sea bass (Lateolabrax maculatus). Sci Data 11, 175, https://doi.org/10.1038/s41597-024-02988-9 (2024).
Hu, J. et al. Two high quality chromosome-scale genome assemblies of female and male silver pomfret (Pampus argenteus). Sci Data 11, 1100, https://doi.org/10.1038/s41597-024-03914-9 (2024).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol 38, 1044–1053, https://doi.org/10.1038/s41587-020-0503-6 (2020).
Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom 3, e000132, https://doi.org/10.1099/mgen.0.000132 (2017).
Belton, J. M. et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268–276, https://doi.org/10.1016/j.ymeth.2012.05.001 (2012).
Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255, https://doi.org/10.1093/bioinformatics/btz891 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760, https://doi.org/10.1093/bioinformatics/btp324 (2009).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120, https://doi.org/10.1093/bioinformatics/btu170 (2014).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, https://doi.org/10.1126/science.aal3327 (2017).
Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118, https://doi.org/10.1093/bioinformatics/btaa435 (2020).
Marcais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14, e1005944, https://doi.org/10.1371/journal.pcbi.1005944 (2018).
Dvorkina, T., Bzikadze, A. V. & Pevzner, P. A. The string decomposition problem and its applications to centromere analysis and assembly. Bioinformatics 36, i93–i101, https://doi.org/10.1093/bioinformatics/btaa454 (2020).
Tarailo-Graovac, M. & Chen, N. S. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4, 4.10.1–4.10.14, https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6, 11, https://doi.org/10.1186/s13100-015-0041-9 (2015).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 35, W265–268, https://doi.org/10.1093/nar/gkm286 (2007).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573–580, https://doi.org/10.1093/nar/27.2.573 (1999).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34, W435–439, https://doi.org/10.1093/nar/gkl200 (2006).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879, https://doi.org/10.1093/bioinformatics/bth315 (2004).
Zhou, C. et al. The Chromosome-Level Genome of Triplophysa dalaica (Cypriniformes: Cobitidae) Provides Insights into Its Survival in Extremely Alkaline Environment. Genome Biol Evol 13, https://doi.org/10.1093/gbe/evab153 (2021).
Yang, X. et al. Chromosome-level genome assembly of Triplophysa tibetana, a fish adapted to the harsh high-altitude environment of the Tibetan Plateau. Mol Ecol Resour 19, 1027–1036, https://doi.org/10.1111/1755-0998.13021 (2019).
Zhao, Q., Shao, F., Li, Y., Yi, S. V. & Peng, Z. Novel genome sequence of Chinese cavefish (Triplophysa rosa) reveals pervasive relaxation of natural selection in cavefish genomes. Mol Ecol 31, 5831–5845, https://doi.org/10.1111/mec.16700 (2022).
She, J., Chen, S., Liu, X. & Huo, B. Chromosome-level assembly of Triplophysa yarkandensis genome based on the single molecule real-time sequencing. Sci Data 11, 39, https://doi.org/10.1038/s41597-023-02900-x (2024).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654–5666, https://doi.org/10.1093/nar/gkg770 (2003).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491, https://doi.org/10.1186/1471-2105-12-491 (2011).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403–410, https://doi.org/10.1016/S0022-2836(05)80360-2 (1990).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421, https://doi.org/10.1186/1471-2105-10-421 (2009).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27–30, https://doi.org/10.1093/nar/28.1.27 (2000).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat Genet 25, 25–29, https://doi.org/10.1038/75556 (2000).
Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35, D61–65, https://doi.org/10.1093/nar/gkl842 (2007).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res 27, 49–54, https://doi.org/10.1093/nar/27.1.49 (1999).
Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res 43, D213–221, https://doi.org/10.1093/nar/gku1243 (2015).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25, 955–964, https://doi.org/10.1093/nar/25.5.955 (1997).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402, https://doi.org/10.1093/nar/25.17.3389 (1997).
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337, https://doi.org/10.1093/bioinformatics/btp157 (2009).
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33, D121–124, https://doi.org/10.1093/nar/gki081 (2005).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP551230 (2025).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_048296945.1 (2025).
Ma, L. & Yang, R. B. Telomere to-telomere gapless genome assembly and annotation of Triplophysa yaopeizhii. Figshare https://doi.org/10.6084/m9.figshare.28127846.v1 (2025).
Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576, https://doi.org/10.1093/bioinformatics/btw663 (2017).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol 38, 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).
Acknowledgements
This study was supported by the National Natural Science Foundation of China (Grant Number: 31971421) and the Breeding Program for Endemic Fish Species in the Jinsha River (T-2022-04). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of this manuscript.
Author information
Contributions
R.Y. and X.Y. conceived the project. J.W., H.X. and H.L. coordinated the project and collected the samples. L.M. and X.Z. processed the samples and performed the experiments. Y.Y. and Q.Y. analyzed data. L.M. and X.Z. drafted the manuscript with significant contributions. R.Y. and X.Y. revised and finalized the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ma, L., Zeng, X., Wang, J. et al. Telomere-to-telomere gapless genome assembly of Triplophysa yaopeizhii. Sci Data 12, 597 (2025). https://doi.org/10.1038/s41597-025-04943-8