Background & Summary

The Qinghai-Tibet Plateau (QTP), situated in western China, has a mean elevation exceeding 4,000 meters. As the highest plateau on Earth, it is commonly referred to as the ‘Roof of the World’1. The genus Triplophysa, which belongs to the family Cobitidae, is the largest of the three major fish taxa on the QTP. Triplophysa fishes are mostly small to medium-sized and are mainly distributed in the rivers, lakes, and other waters of the QTP and its surroundings, from plains and basins to high altitudes2,3. Triplophysa is distinguished by biological characteristics that enable adaptation to the extreme environments of high-altitude plateaus. These fish have a cylindrical body and a well-developed caudal peduncle, which facilitate efficient movement in fast-flowing plateau waters4,5. Their enlarged mouth gape and specialized digestive system enhance feeding and nutrient absorption in food-scarce high-altitude ecosystems6. Furthermore, these fishes employ distinctive reproductive strategies, including prolonged reproductive periods and higher fecundity, to sustain their populations in cold plateau conditions7. Their evolutionary adaptations have been shaped by the specific geographical and environmental constraints of the QTP, resulting in distinctive genetic mechanisms that facilitate adaptation to this extreme environment1,8.

Triplophysa yaopeizhii is predominantly distributed in the upper reaches of the Jinsha River and prefers rivers with sandy substrates and high flow rates9,10. In recent years, hydropower development in the Jinsha River basin and other anthropogenic factors have led to population declines in indigenous fishes such as T. yaopeizhii11. T. yaopeizhii is well adapted to the complex environment of the QTP. It is therefore an important target for studying the mechanisms of speciation and evolution in the genus Triplophysa, as well as a useful model for understanding the genetic basis of plateau adaptation. However, few studies have addressed T. yaopeizhii, and its genetic characteristics remain unclear.

Whole-genome sequencing has become a common tool for studying genetic information and evolutionary mechanisms in fish, with applications in areas such as fish breeding and biodiversity conservation12. As T. yaopeizhii is an indigenous fish of the QTP, its genome can provide powerful insights into the genetic basis of high-altitude adaptation. However, the currently published genomes of plateau fishes still contain many gaps and incomplete regions, especially in regions rich in repetitive sequences, such as telomeres and centromeres13. Telomeres reflect the health and longevity of organisms and play a crucial role in genome stability and DNA damage repair14,15. Variations in telomere length may be associated with genetic characteristics and environmental stress16. Centromeres are the chromosomal regions that attach to the spindle during cell division, ensuring the equal distribution of genetic material between daughter cells17. For plateau fishes such as T. yaopeizhii, the genetic information in these regions may play an important role in evolutionary adaptation to the unique plateau environment. With the continuous development of genome sequencing and assembly methods, it has become possible to achieve telomere-to-telomere (T2T) gapless assembly of chromosome sequences18. T2T genome assembly has made it possible to explore the previously uncharted telomeres, centromeres and genome gaps, and has opened more in-depth directions for biological research19. T2T genome assemblies have been reported for several animals, including the East Friesian sheep (Ostfriesisches Milchschaf)20 and Anser cygnoides domesticus21, and for fishes such as Lateolabrax maculatus22 and Pampus argenteus23. However, no T2T genome has so far been reported for a fish adapted to the extreme environment of the QTP.

In this study, Pacific Biosciences (PacBio) HiFi sequencing, Oxford Nanopore Technologies (ONT) ultra-long sequencing and Hi-C assisted assembly technology were used to assemble the first high-quality T2T genome of a high-altitude adapted fish, T. yaopeizhii. The telomeres and centromeres of each chromosome were also detected. This is the first T2T genome in the genus Triplophysa. This study not only contributes to the population genetic and evolutionary analysis of T. yaopeizhii, but also provides important data for the study of the genetic mechanism of plateau adaptation in fish.

Methods

Ethics statement

All experimental protocols utilized in this study have been approved by the Animal Experimental Ethical Inspection of Laboratory Animal Center, Huazhong Agricultural University, Wuhan, China.

Sample collection and sequencing

The sample of T. yaopeizhii was obtained from the Anning River (Fig. 1A), a secondary tributary of the Jinsha River in Xichang City, Sichuan Province. High-quality genomic DNA (gDNA) was isolated from muscle tissue by the conventional sodium dodecyl sulfate (SDS) method and purified using the QIAGEN® genomic kit (Cat# 13343, QIAGEN). DNA integrity and contamination were assessed by 0.75% agarose gel electrophoresis. DNA purity was then measured with a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA); the OD 260/280 and OD 260/230 absorbance ratios ranged from 1.8 to 2.0 and from 2.0 to 2.2, respectively. Finally, the DNA was quantified using the Qubit® 3.0 Fluorometer (Invitrogen, USA).

Fig. 1

Sample photograph and genome map. (A) Photograph of T. yaopeizhii. (B) Circos plot of genomic features, arranged from outside to inside: (a) chromosomes, (b) gene density, (c) GC density, (d) TE density, (e) TRF density, and (f) collinearity within the genome. (C) Hi-C heatmap of chromosome interactions: Chr1 to Chr25 denote the 25 chromosomes. The abscissa and ordinate represent the order of each bin on the corresponding chromosome. The color from light to dark indicates the strength of the interaction from low to high.

The SMRTbell target library was prepared in accordance with the established protocol (Pacific Biosciences, CA, USA). Library preparation comprised several key steps. First, gDNA was sheared into fragments using a g-TUBE (Covaris, USA). The DNA fragments were then enzymatically repaired, and hairpin adapters from the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA) were attached by blunt-end ligation. The target molecules were size-selected and purified with BluePippin (Sage Science, USA), and the library was finally purified using AMPure PB beads (Pacific Biosciences, CA, USA). The fragment size distribution of the library was verified on an Agilent 2100 Bioanalyzer system (Agilent Technologies, USA). Circular consensus sequencing (CCS) was then performed on the PacBio Sequel II platform using the Sequel II Sequencing Kit 2.0 (Nextomics, Wuhan).

Magnetic beads were used to enrich and purify large gDNA fragments (>15 kb). Damage repair and end repair were then performed on the DNA fragments. After purification, the ONT ultra-long sequencing library was prepared by adding an A base to the ends of the DNA fragments, followed by adapter ligation using the SQK-LSK109 kit (Oxford Nanopore Technologies, Oxford, UK)24. The library was loaded onto R10 Spot-On Flow Cells and sequenced on a PromethION P48 sequencer (Oxford Nanopore Technologies, Oxford, UK). For data preprocessing, Porechop (v0.2.4)25 was used to remove adapter sequences, and Filtlong (v0.2.4) (https://github.com/rrwick/Filtlong) was used for quality-based read selection. Reads with a length of ≥30 kb and a mean quality score above 90% were retained for downstream assembly.
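
The read-selection thresholds above can be illustrated with a minimal Python sketch. It is not Filtlong itself: the function names are hypothetical, and the mean quality is assumed to be the average per-base accuracy derived from Phred scores and expressed as a percentage, which may differ in detail from the Filtlong calculation.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines (4 lines per record).

    Loads all lines into memory, which is fine for a sketch but not for
    a full ultra-long read set.
    """
    records = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(records) - 3, 4):
        yield records[i][1:], records[i + 1], records[i + 3]


def mean_accuracy_percent(qual, offset=33):
    """Average per-base accuracy (%), with each Phred score Q mapped to 1 - 10^(-Q/10)."""
    if not qual:
        return 0.0
    accuracies = [1.0 - 10 ** (-(ord(c) - offset) / 10) for c in qual]
    return 100.0 * sum(accuracies) / len(accuracies)


def keep_read(seq, qual, min_len=30_000, min_acc=90.0):
    """Apply the thresholds from the text: length >= 30 kb and mean quality > 90%."""
    return len(seq) >= min_len and mean_accuracy_percent(qual) > min_acc


def filter_reads(lines):
    """Return the names of reads that pass both thresholds."""
    return [name for name, seq, qual in parse_fastq(lines) if keep_read(seq, qual)]
```

In practice the same selection would be done with the Filtlong command-line tool; the sketch only makes the two criteria explicit.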

To obtain a chromosome-level genome assembly, a Hi-C library was constructed from isolated high-quality gDNA26. The workflow comprised the following steps: cell lysis after crosslinking; enzymatic cleavage of DNA with the DpnII restriction enzyme; biotinylation of fragment termini; blunt-end ligation; and DNA purification to generate Hi-C templates. The biotin ends were then removed from the Hi-C fragments, the fragments were sheared by sonication, the ends were repaired, an A base was added, and sequencing adapters were ligated. The product was then amplified by PCR under the selected conditions to obtain the Hi-C library. High-throughput paired-end sequencing (PE150) was performed on an Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA). The raw Hi-C data were quality-controlled with Juicer (v1.6)27 using default parameters to obtain clean Hi-C data.
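
As a small illustration of the DpnII digestion step, the sketch below locates DpnII recognition sites (GATC) in a sequence and derives the resulting fragment lengths. It is an in silico toy under the assumption that every site is cut, not part of the described workflow, and the function names are hypothetical.

```python
import re

DPNII_SITE = "GATC"  # DpnII recognition sequence; the enzyme cuts immediately 5' of the G


def cut_positions(seq, site=DPNII_SITE):
    """Return 0-based start positions of every occurrence of the site (overlaps allowed)."""
    return [m.start() for m in re.finditer(f"(?={site})", seq.upper())]


def fragment_lengths(seq, site=DPNII_SITE):
    """Lengths of fragments obtained if the sequence is cut at the start of every site."""
    cuts = [0] + [p for p in cut_positions(seq, site) if p > 0] + [len(seq)]
    return [end - start for start, end in zip(cuts, cuts[1:]) if end > start]
```

For example, `fragment_lengths("AAGATCTTTGATCGG")` splits the 15 bp toy sequence at the two GATC sites.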

To aid genome annotation, RNA was extracted from 10 tissue samples (muscle, brain, kidney, liver, stomach, intestine, fin, gills, gonads and skin) using the TRIzol Universal total RNA Extraction kit (Tiangen). RNA concentration and integrity were evaluated with the Agilent 2100 Bioanalyzer system and the Agilent RNA 6000 Nano Kit. Libraries were then constructed with the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) and sequenced on the Illumina HiSeq X Ten platform. In total, 36.96 GB (~55.04 × coverage) of ONT data with a read N50 of 100 kb and 27.19 GB (~44.49 × coverage) of PacBio HiFi CCS data with a read N50 of 18.54 kb were obtained. Additionally, 66.47 GB of raw Hi-C data (~98.97 × coverage) and 65.33 GB of clean Hi-C data (~97.28 × coverage) were obtained (Table 1).
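
The coverage figures can be related to the data volumes with a short calculation. The sketch below assumes that the reported volumes are in gigabases and that coverage is computed against the final 671.58 Mb assembly; both are assumptions about how the values were derived, and the function name is hypothetical.

```python
def sequencing_depth(total_bases_gb, genome_size_mb):
    """Approximate fold coverage: sequenced bases divided by genome size."""
    return total_bases_gb * 1000.0 / genome_size_mb


GENOME_MB = 671.58  # assembly size reported in the Genome assembly section

datasets = [("ONT ultra-long", 36.96), ("Hi-C raw", 66.47), ("Hi-C clean", 65.33)]
for label, volume_gb in datasets:
    print(f"{label}: {sequencing_depth(volume_gb, GENOME_MB):.2f}x")
```

Under these assumptions the computed depths agree with the reported ONT and Hi-C values to within rounding.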

Table 1 Summary of sequencing data of T. yaopeizhii genome.

Genome assembly

The genome was assembled by combining PacBio HiFi reads, ultra-long ONT data, and Hi-C data (Fig. 1B). NextDenovo (v2.4.0)28 and Hifiasm (v0.15.1)29 were used with default parameters to assemble the ONT sequencing data and the PacBio HiFi sequencing data, respectively. Clean Hi-C reads were mapped to the corresponding draft contigs using BWA (v0.7.17)30. Trimmomatic (v0.40)31 was then applied to remove low-quality reads with the parameters LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50. The processed Hi-C data were analyzed with the 3D-DNA pipeline (https://github.com/aidenlab/3d-dna)32 for automatic clustering, ordering, and orientation. JuiceBox (v2.13.07)27 was used to visualize assembly errors, and the interaction frequencies between different chromosomes were analyzed. The interaction heatmap was used to identify and correct errors in contig order and orientation and misassemblies within contigs and chromosome regions (Fig. 1C). Gaps were closed by aligning the contigs from the NextDenovo and Hifiasm assemblies to the unresolved genomic regions with Winnowmap (v1.11)33 (parameters: k = 15, -MD). Each gap was replaced with the longest and most consistent of the aligned sequences. Finally, a gapless T2T genome was obtained, with a genome size of 671.58 Mb, an N50 length of 26.04 Mb, and a GC content of 39.11%. The genomic sequences were anchored without gaps to 25 chromosomes (Table 2).
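
For readers less familiar with the summary statistics quoted above, the following minimal sketch shows how N50 and GC content are conventionally computed from a set of sequences. The function names are hypothetical and the code is illustrative, not the tool used in this study.

```python
def n50(lengths):
    """Length L such that sequences of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def gc_percent(seq):
    """GC content (%) over unambiguous bases only; N and other symbols are ignored."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt
```

Applied to the 25 chromosome lengths of a gapless assembly, `n50` returns the chromosome-level N50.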

Table 2 Summary statistics of T. yaopeizhii genome assembly.

Telomere and centromere identification

Ultra-long ONT data were mapped to the genome using Winnowmap (v1.11) with parameters: k = 15, -MD. Reads that aligned uniquely within 50 bp of the chromosome ends were collected, and the number of telomere repeat motifs in each read was counted. The read with the most motif occurrences was defined as the reference (ref), and the others were defined as queries. Using Medaka_consensus (v1.7.2) (https://github.com/nanoporetech/medaka) with parameter -m r941_min_high_g360, the ref telomere read and the query telomere reads were reassembled to obtain a consensus sequence. Finally, Nucmer (v3.1)34 was used to align the telomere consensus sequences to each chromosome to determine whether the contigs aligned to the chromosome ends contained telomeric repeat motifs. The terminal telomere sequences were replaced with the best alignment results. No replacement was made if the identity was ≤ 80% or the aligned region was not within the last 20 kb of the chromosome. Telomeres were detected at both ends of all 25 chromosomes in the genome. TRASH was used to identify all tandem repeat monomers, and the monomer with the highest occurrence was selected as the representative centromeric monomer based on its period and copy number. Subsequently, StringDecomposer (v1.1.2)35 was used to map the representative monomer sequence to the chromosomes and locate all centromeric repeats, and the centromeric region on each chromosome was extended by 10 kb beyond the first and last repeat intervals (Table 3, Fig. 2A).
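
Two steps of this procedure lend themselves to a short sketch: choosing the reference read by telomere motif count, and extending a centromeric interval by 10 kb. The motif used here (TTAGGG, the canonical vertebrate telomere repeat) is an assumption, since the text does not name the motif, and all function names are hypothetical. The extension is interpreted as adding 10 kb on each side, clipped to the chromosome bounds.

```python
TELOMERE_MOTIF = "TTAGGG"  # assumed motif (canonical vertebrate repeat), not stated in the text


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def motif_count(read, motif=TELOMERE_MOTIF):
    """Non-overlapping motif occurrences, taking the better of the two strands."""
    read = read.upper()
    return max(read.count(motif), read.count(revcomp(motif)))


def split_ref_and_queries(reads):
    """Pick the read with the most motif copies as ref; all other reads are queries."""
    ref = max(reads, key=lambda name: motif_count(reads[name]))
    queries = [name for name in reads if name != ref]
    return ref, queries


def extend_interval(first_start, last_end, chrom_len, flank=10_000):
    """Extend a repeat interval by `flank` bp on each side, clipped to the chromosome."""
    return max(0, first_start - flank), min(chrom_len, last_end + flank)
```

The consensus step itself (Medaka) and the alignment step (Nucmer) are external tools and are not reproduced here.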

Table 3 Telomere and centromere positions of T. yaopeizhii genome.
Fig. 2

Distribution of telomeres and centromeres in T. yaopeizhii genome and Venn diagram for functional annotation of protein-coding genes. (A) Distribution of telomeres and centromeres: triangles and circles represent telomeres and centromeres on chromosomes; Red indicates high gene density; Blue indicates low gene density. (B) Venn diagram: Five public databases KEGG, GO, NR, InterPro, and SwissProt were used for gene function annotation to obtain statistical Venn diagram.

Repetitive sequences annotation

Repetitive sequences were annotated by combining homology-based and de novo prediction. Homology-based prediction was performed using RepeatMasker (v4.0.9)36 and RepeatProteinMask (v4.0.9)36 with the RepBase library (http://www.girinst.org/repbase)37. De novo prediction was performed with RepeatModeler (v1.0.11)38 and LTR-FINDER (v1.0.5)39, which rely on self-alignment of the genome sequence and on the structural characteristics of repeats. Additionally, tandem repeats were identified with Tandem Repeats Finder (v4.09)40. The annotation results showed that repetitive sequences total 293.98 Mb, accounting for 43.77% of the genome (Table 4). Among these repeats, SINEs accounted for 0.54% of the genome size, LINEs for 7.02%, LTRs for 10.40%, and DNA elements for 20.90% (Table 5).
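
The repeat proportion follows directly from the reported sizes, as the short check below shows. The variable and function names are hypothetical, and the percentages are taken from the text.

```python
GENOME_MB = 671.58   # assembly size from the Genome assembly section
REPEAT_MB = 293.98   # total repetitive sequence reported above


def percent_of_genome(size_mb, genome_mb=GENOME_MB):
    """Share of the genome (%) occupied by a feature of the given size."""
    return 100.0 * size_mb / genome_mb


# Transposable element classes reported in the text (% of genome).
te_classes = {"SINE": 0.54, "LINE": 7.02, "LTR": 10.40, "DNA": 20.90}
te_total = sum(te_classes.values())

print(f"repeats: {percent_of_genome(REPEAT_MB):.2f}% of the genome")
print(f"four TE classes combined: {te_total:.2f}% of the genome")
```

The four listed classes do not sum to the total repeat content; the remainder presumably corresponds to other repeat categories reported in Tables 4 and 5, such as tandem repeats and unclassified elements.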

Table 4 Statistical results of repetitive sequences in T. yaopeizhii genome.
Table 5 Types of repetitive sequences in T. yaopeizhii genome.

Prediction and functional annotation of protein-coding genes

Protein-coding genes in the T. yaopeizhii genome were predicted by combining de novo, homology-based, and transcriptome-based prediction. De novo prediction of gene structure was performed using Augustus (v3.3)41 and GlimmerHMM (v3.0.4)42. Homology-based prediction was conducted with Exonerate (v2.4)47 using six closely related species: T. longibarbata (unpublished data), T. bombifrons (unpublished data), T. dalaica43, T. tibetana44, T. rosa45 and T. yarkandensis46. The protein-coding sequences of these species were aligned to the genome sequence of the target species to predict genes. For transcriptome-based prediction, the RNA-seq data were aligned to the genome and the transcripts were reconstructed with StringTie (v2.1.1)48, and the coding regions were predicted using PASA (v2.4.1)49. MAKER (v3.00)50 integrated the gene sets predicted by the different methods into a non-redundant and more complete gene set, which was then corrected with PASA (v2.4.1) in combination with the transcriptome data. For functional annotation, BLASTP (v2.6.0)51,52 was used to compare the predicted genes against the Kyoto Encyclopedia of Genes and Genomes (KEGG)53, Gene Ontology (GO)54, NCBI non-redundant protein (NR)55, Swiss-Prot56, TrEMBL56, and InterPro57 databases. Overall, 26,487 protein-coding genes were predicted in the genome. The average gene length was 12,653.76 bp, the average coding sequence length was 1,512.01 bp, and the average number of exons per gene was 8.82 (Table 6). Functional annotation showed that 25,686 genes were annotated in at least one database, accounting for 96.98% of the predicted genes (Fig. 2B, Table 7).
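
The statement "annotated in at least one database" corresponds to the union of the per-database hit sets shown in the Venn diagram. A minimal sketch follows; the gene identifiers in the usage example are hypothetical placeholders, not identifiers from the released annotation.

```python
def annotated_in_any(hits_by_db):
    """Union of gene IDs that received an annotation in at least one database."""
    return set().union(*hits_by_db.values())


def annotation_rate(n_annotated, n_total):
    """Percentage of predicted genes with at least one functional annotation."""
    return 100.0 * n_annotated / n_total


# Toy example with hypothetical gene IDs.
toy_hits = {
    "KEGG": {"gene1", "gene2"},
    "GO": {"gene2", "gene3"},
    "SwissProt": {"gene1"},
}
print(sorted(annotated_in_any(toy_hits)))

# Counts reported in the text: 25,686 of 26,487 genes annotated.
print(f"{annotation_rate(25_686, 26_487):.2f}%")
```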

Table 6 Statistical results of gene prediction.
Table 7 Statistical results of gene function annotation.

Annotation of non-coding RNAs

tRNAscan-SE (v1.3.1)58 was used with default parameters to identify tRNA sequences in the genome based on the structural characteristics of tRNA. Using rRNA sequences from closely related species as references, rRNAs in the genome were identified by BLASTN (v2.6.0)59 with E-value < 1e-5, identity ≥85%, and match length ≥50 bp. In addition, miRNA and snRNA sequences in the genome were predicted using Infernal (v1.1.2)60 with the covariance models of the Rfam database (v14.1; http://xfam.org/)61 and default parameters. As a result, we annotated 406 miRNAs, 23,001 tRNAs, 124 rRNAs, and 1,480 snRNAs (Table 8).
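
The rRNA search thresholds can be expressed as a filter over BLASTN hits. The sketch assumes the standard 12-column tabular output (`-outfmt 6`), in which column 3 is percent identity, column 4 is alignment length, and column 11 is the E-value; the text does not state which output format was used, and the function names are hypothetical.

```python
def passes_rrna_thresholds(line, max_evalue=1e-5, min_identity=85.0, min_length=50):
    """Check one BLAST tabular line against the E-value, identity and length cutoffs."""
    fields = line.rstrip("\n").split("\t")
    identity = float(fields[2])   # percent identity
    length = int(fields[3])       # alignment length in bp
    evalue = float(fields[10])    # expect value
    return evalue < max_evalue and identity >= min_identity and length >= min_length


def filter_hits(lines):
    """Keep only non-comment hit lines that satisfy all three thresholds."""
    return [ln for ln in lines if ln.strip() and not ln.startswith("#") and passes_rrna_thresholds(ln)]
```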

Table 8 Statistical results of non-coding RNA annotation.

Data Records

All raw data of the whole genome have been deposited in the National Center for Biotechnology Information (NCBI) SRA database under BioProject accession number PRJNA1195554. The genomic PacBio sequencing data, the ultra-long ONT sequencing data, the Hi-C sequencing data, and the RNA sequencing data were deposited in the Sequence Read Archive at NCBI with accession number SRP55123062. The genome assembly has been deposited at GenBank under the accession GCA_048296945.163. The files containing the repetitive sequence annotation, the gene structure annotation, the predicted coding sequences, the protein sequences, and the ncRNA annotation have been deposited at Figshare64.

Technical Validation

Multiple methods were used to verify the accuracy and completeness of the T. yaopeizhii genome. Firstly, the Hi-C heatmap of the T. yaopeizhii genome showed a high degree of consistency among all chromosomes, reflecting the accuracy of the clustering, ordering, and orientation of contigs in the genome assembly (Fig. 1C). Secondly, a total of 25 centromeres were mapped on the 25 chromosomes, and all telomeres were identified, providing strong evidence for the completeness of the chromosomes (Fig. 2A). Thirdly, the Illumina sequencing data were aligned to the genome using BWA (v0.7.17), achieving a mapping rate of 98.81%. According to the Winnowmap (v1.11) alignments, 99.91% of ONT reads and 99.96% of HiFi reads could be aligned to the T2T assembly. In addition, gene features, including the length distributions of genes, CDS, exons, and introns, were compared between T. yaopeizhii and related species. The distributions were consistent among these species, supporting the accuracy of the annotation (Fig. 3). Notably, a comparison of assembly completeness between the T2T genome and the genomes of closely related species using KAT65 showed that the T2T genome was more complete (Table 9). The quality value (QV) of the assembly was estimated using Merqury (v1.3)66, giving a QV of 31.73 (Table 2). Finally, the completeness of the genome assembly and of the protein-coding gene set was assessed using BUSCO (v4.0.5)67 with the single-copy ortholog set actinopterygii_odb10 from the OrthoDB database. The results showed that 96.62% and 92.50% of the 3,640 single-copy orthologs were identified in the assembly and the gene set, respectively (Fig. 4). In conclusion, the T2T genome of T. yaopeizhii is highly complete and accurate. This high-quality genome provides a robust foundation for investigating the evolutionary and adaptive mechanisms of plateau fish in response to the unique environmental conditions of the plateau.
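
The QV reported above is on the Phred scale, so it can be converted to an approximate per-base error rate with the standard relation QV = -10 log10(E). The sketch below shows this conversion; the function names are hypothetical.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Phred-scaled quality value for a given per-base error probability."""
    return -10 * math.log10(error_rate)


error = qv_to_error_rate(31.73)
print(f"error rate ~ {error:.2e}, base accuracy ~ {100 * (1 - error):.3f}%")
```

A QV of 31.73 therefore corresponds to roughly 6.7 errors per 10,000 bases under this definition.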

Fig. 3

Length distributions of (A) genes, (B) CDS, (C) exons and (D) introns in T. yaopeizhii and closely related species.

Table 9 Integrity assessment of genomes.
Fig. 4

BUSCO assessments of assembly and annotation.