Background & Summary

Major paleogeographic and paleoclimatic shifts have repeatedly reshaped biotic communities, driving rapid lineage divergence that underlies current biodiversity patterns1,2,3,4,5,6,7. The uplift of the Tibetan Plateau and the intensification of the East Asian monsoon drove the formation of eastward-flowing rivers such as the Yangtze, fostering high cyprinid diversity and rapid speciation in East Asian lineages8. In turn, the phylogeographic dynamics of cyprinid fishes have provided crucial evidence for the spatiotemporal evolution of their fluvial systems, including river connectivity, drainage changes, and habitat fragmentation8,9. The same two processes also shaped East Asia’s distinctive landscapes and hydrological regimes, exemplified by karst topography10,11,12. The South China Karst is widely recognized for its importance in biodiversity conservation10,13, and the abundant underground river networks of karst regions have nurtured diverse cave fishes14,15,16. Cave fishes have been discovered continuously since 1976, and hypogean members of the Nemacheilidae now account for nearly 90 described species15,16. Recent studies link cavefish diversification and speciation to paleoclimatic shifts, especially the Tibetan Plateau uplift and East Asian monsoon intensification15,17,18, making cavefishes ideal models for exploring how these changes shaped karst biodiversity and the evolution of subterranean ecosystems. However, current evidence remains inadequate to resolve the evolutionary mechanisms driving rapid radiation in cave-adapted Triplophysa or their adaptive strategies for surviving extreme subterranean environments.
The fragility of karst landscapes (e.g., susceptibility to cave collapse and to groundwater pollution in underground rivers) means that ecological conservation must take priority in human activities19,20,21. Consequently, large-scale phylogeographic studies of cave-adapted fishes face significant challenges in reconstructing the spatiotemporal evolution of their subterranean river systems. Nevertheless, cave-adapted fishes hold significant conservation value, which makes research into their survival mechanisms and adaptive strategies in subterranean environments particularly important16,22,23,24. A telomere-to-telomere (T2T) genome enables the resolution of structural variation in complex genomic regions, the discovery of novel genes, and the functional annotation of “genomic dark matter” regions (e.g., centromeres and telomeres)25,26,27. These capabilities are critical for elucidating how species adapt to environmental change.

Triplophysa erythraea (Fig. 1A), a cavefish species described in 2019, exhibits extreme troglomorphic adaptations: complete absence of eyes, a scaleless body, transparent integument, blood-red trunk pigmentation, and elongated barbels28. This species inhabits subterranean rocky pools at depths of 0.3–1.0 m and represents a significant taxonomic addition to the cavefish fauna of South China. Cave-dwelling Triplophysa exhibit troglomorphic traits while retaining genetic similarity to epigean congeners15,29. This combination of extreme subterranean adaptation with genetic traits conserved from surface-dwelling congeners makes the clade an exemplary model for exploring the evolutionary mechanisms of cavefish adaptation.

Fig. 1

(A) Triplophysa erythraea. (B) Circos plot of the T. erythraea genome. The rings, from outermost to innermost, represent GC content (a), gene density (b), repeat density (c), LTR density (d), LINE density (e), and DNA-TE density (f), calculated in 300-kb genomic windows. (C) Chromosomal Hi-C heatmap of the T. erythraea genome assembly.

In this study, we generated the first chromosome-level, telomere-to-telomere (T2T) genome assembly for T. erythraea by integrating Pacific Biosciences (PacBio) HiFi sequencing, Oxford Nanopore Technologies (ONT) ultra-long sequencing, and Hi-C-assisted scaffolding. This high-quality genomic resource advances evolutionary insight into cave adaptation and informs genome-driven conservation strategies for imperiled subterranean fauna. It also provides vital genomic data for taxonomic and evolutionary studies within the family Nemacheilidae and establishes a robust foundation for comparative genomics of Triplophysa, thereby improving our understanding of how the uplift of the Tibetan Plateau, the intensification of the East Asian monsoon, and Pleistocene glacial oscillations influenced the rapid radiation of Triplophysa stone loaches.

Methods

Ethics statement

All experimental protocols used in this study were approved by the Laboratory Animal Ethics Committee of the Centre for Applied Aquatic Genomics at the Chinese Academy of Fishery Sciences. The sample collection process complied with the guidelines of the Chinese Academy of Fishery Sciences.

Sample collection and processing

In the present study, two T. erythraea individuals were sampled from the underground river in Dalong Cave, Huayuan County, Xiangxi Tujia and Miao Autonomous Prefecture, Hunan Province, China. Multiple tissues (muscle, brain, skin, gill, intestine, pectoral fin, spleen, and heart) were collected, snap-frozen in liquid nitrogen, and stored at −80 °C until DNA or RNA extraction. Total RNA was extracted and used for transcriptome sequencing and genome annotation. Muscle tissue was used for DNA extraction, including the extraction for ONT ultra-long sequencing. High-molecular-weight genomic DNA (gDNA) was extracted with an SDS-based method and then purified with a QIAGEN® genomic kit (Cat# 13343, QIAGEN). Genomic DNA integrity and purity were validated by: (1) agarose gel electrophoresis (intact high-molecular-weight DNA without smearing), (2) NanoDrop™ UV-Vis spectrophotometry (concentration and purity via A260/A280/A230 ratios), and (3) Qubit™ fluorometry (high-sensitivity quantification).

High-quality RNA was extracted from all sampled tissues using TRIzol reagent (Invitrogen, MA, USA). RNA integrity (RIN > 8.0) and concentration (≥500 ng/μL) were validated with Agilent Bioanalyzer and Qubit™ assays. Poly-A-selected RNA (10–15 μg/sample) was used for strand-specific library preparation with the NEBNext® Ultra™ II Kit (NEB, USA), including UMIs to correct for PCR duplicates. Indexed libraries were sequenced on an Illumina NovaSeq 6000 (PE150, 50 M reads/sample).

Library preparation and sequencing

First, the SMRTbell target library was prepared in compliance with the established protocol (Pacific Biosciences, CA, USA). Genome sequencing was then performed at OneMore-tech Co., Ltd. (Wuhan, China) using three complementary approaches: (1) PacBio HiFi reads (10–50 kb insert size) were generated from SMRTbell libraries (v2.0) on the Sequel II system, yielding 124.92 Gb of data with N50 = 20.21 kb; (2) Oxford Nanopore ultra-long reads (N50 = 100 kb) were obtained from DNA prepared with an SDS-based lysis protocol, generating 14.98 Gb of sequence; (3) Hi-C libraries were constructed following the protocol of Belton et al.30, producing 62.16 Gb of clean data for phased assembly (Table 1).

Table 1 Statistics of sequencing reads data.
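The yields above can be translated into approximate fold coverage by dividing the total bases sequenced by the genome size. The sketch below is illustrative only: it assumes the 757.23 Mb final assembly size reported later in the Methods as the denominator, and the function and variable names are hypothetical. Under that assumption the three datasets correspond to roughly 165×, 20×, and 82× coverage.

```python
def sequencing_depth(yield_gb: float, genome_mb: float) -> float:
    """Mean fold coverage = total bases sequenced / genome size."""
    return (yield_gb * 1e9) / (genome_mb * 1e6)

# Assumption: the final T2T assembly size is used as the genome size.
GENOME_MB = 757.23

# Yields (Gb) as reported in the text for each data type.
yields_gb = {"PacBio HiFi": 124.92, "ONT ultra-long": 14.98, "Hi-C": 62.16}

depths = {name: sequencing_depth(gb, GENOME_MB) for name, gb in yields_gb.items()}
for name, depth in depths.items():
    print(f"{name}: ~{depth:.0f}x")
```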

Genome assembly and gap filling

The initial hybrid genome assembly was performed with hifiasm (v0.16.0) by integrating PacBio HiFi reads, ONT ultra-long reads, and Hi-C contact maps31, yielding a draft genome of 760.43 Mb with a contig N50 of 27.53 Mb (Table 2). For chromosome-level scaffolding, quality-controlled Hi-C reads were aligned to the contig-level assembly using Bowtie2 (v2.3.4.3) in paired-end mode32, yielding 97.20 million uniquely mapped reads (43.41% valid inter-chromosomal pairs; Tables 3, 4). The 3D-DNA pipeline v180922 was used for chromatin interaction frequency analysis and scaffolding error correction33, followed by iterative refinement using JuiceBox v1.11.0834. This approach produced 25 pseudo-chromosomes spanning 97.5% of the genome assembly (contig N50 = 27.32 Mb) (Fig. 1C).

Table 2 Statistics for the T. erythraea preliminary genome assembly.
Table 3 Statistics of alignment results of clean paired-end reads.
Table 4 Statistics of valid paired-end reads.
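The contig and read N50 values quoted in this section follow the standard definition: the length L such that sequences of length at least L contain half of all bases. A minimal sketch of that calculation is given below; the toy lengths are hypothetical and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length

# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [40, 30, 20, 5, 5]
print(n50(toy_contigs))
```

In practice the input would be the per-contig lengths read from the assembly FASTA index rather than a hand-written list.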

The T2T genome assembly was accomplished through a multi-step workflow: (1) ONT ultra-long reads were mapped to the pseudo-chromosomes using minimap235 with the --secondary=no option to exclude multi-mapping artifacts; (2) TGS-GapCloser v1.1.136 filled gaps by leveraging long-read continuity; (3) iterative refinement was performed with three rounds of Pilon v1.2437 correction. This pipeline produced a 757.23 Mb telomere-to-telomere genome (contig N50 = 27.63 Mb) containing 19 fully resolved chromosomes (Fig. 2, Table 5), achieving 98.38% completeness as validated by Merqury (QV = 51.03).
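The consensus quality value reported by Merqury is Phred-scaled, so it can be converted to a per-base error probability with E = 10^(−QV/10). The short sketch below applies that conversion to the reported QV of 51.03, which corresponds to roughly 8 errors per million bases; the function names are hypothetical.

```python
import math

def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)

def error_rate_to_qv(error_rate: float) -> float:
    """Inverse conversion: per-base error probability to Phred-scaled QV."""
    return -10 * math.log10(error_rate)

error = qv_to_error_rate(51.03)  # QV value taken from the text
print(f"per-base error rate: {error:.2e}")
print(f"errors per Mb: {error * 1e6:.1f}")
```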

Fig. 2

The contigs in the chromosomes of the T. erythraea genome. The blue sections at the ends of each chromosome and the black sections within each chromosome represent the identified telomeres and centromeres, respectively.

Table 5 Statistics for assembled chromosome, telomeres and centromeres.

Telomere and centromeric regions analysis

Telomere and centromere characterization was conducted using quarTeT v1.1.438, a specialized toolkit for T2T genome analysis. Telomere detection employed motif scanning to identify TTAGGG/CCCTAA repeats with a minimum of four contiguous units, leveraging the TeloExplorer module’s optimized threshold algorithms. Centromere prediction integrated genome annotations with automated tandem repeat detection through the CentroMiner module, which clusters satellite DNAs (≥5 repeats) and prioritizes regions with >72% repeat density. This pipeline generated 42 telomeric regions (17 pairs) and 19 centromere candidates across all chromosomes (Fig. 2, Table 5).
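The motif-scanning criterion described above (at least four contiguous TTAGGG or CCCTAA units) can be illustrated with a regular expression. This is a simplified sketch of the idea, not the quarTeT TeloExplorer implementation; the function names, the `window` parameter, and the toy sequence are all hypothetical.

```python
import re

def telomere_runs(seq, min_units=4):
    """Return (start, end, n_units) for every run of >= min_units contiguous
    TTAGGG or CCCTAA repeats (0-based, end-exclusive coordinates)."""
    pattern = re.compile(
        rf"(?:TTAGGG){{{min_units},}}|(?:CCCTAA){{{min_units},}}"
    )
    return [
        (m.start(), m.end(), (m.end() - m.start()) // 6)
        for m in pattern.finditer(seq.upper())
    ]

def terminal_telomeres(seq, window=50, min_units=4):
    """Report whether a repeat run lies within `window` bases of each end.

    `window` is a hypothetical parameter chosen for this toy example.
    """
    runs = telomere_runs(seq, min_units)
    left = any(start < window for start, _, _ in runs)
    right = any(end > len(seq) - window for _, end, _ in runs)
    return left, right

# Toy sequence: CCCTAA repeats at the start, TTAGGG repeats at the end.
toy = "CCCTAA" * 5 + "ACGT" * 100 + "TTAGGG" * 6
print(telomere_runs(toy))
print(terminal_telomeres(toy))
```

A chromosome for which both flags are true would count as one telomere pair in the sense used above, whereas a chromosome with a single flag contributes only one telomeric region.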

Repetitive sequences analysis

The repetitive landscape was characterized using de novo (RepeatModeler v1.0.11 + LTR-FINDER_parallel v1.0.7) and homology-based (RepeatMasker v4.09 + TRF v4.09) approaches39,40,41,42, revealing 378.05 Mb repetitive sequences (49.93% genome coverage) dominated by 23.83% DNA transposons, 6.93% LINEs, and 8.99% LTR retrotransposons (Table 6, Fig. 1B). Full annotations, including element distribution and evolutionary dynamics, are detailed in Table 7. Repetitive sequences comprise nearly half of the genome, a notable feature given their established roles in shaping genome stability, modulating gene expression, and generating phenotypic diversity. These functions are critical for understanding the molecular basis of adaptation in T. erythraea to extreme environments.

Table 6 Transposable elements statistics for the T. erythraea genome.
Table 7 Repetitive sequences statistics for the T. erythraea genome.

Prediction and functional annotation of protein-coding genes

The T. erythraea genome assembly underwent ab initio gene prediction using a multi-tool pipeline. De novo predictors included AUGUSTUS v3.3.243 for specific splicing patterns, Genscan v1.0 for gene architectures and GlimmerHMM v3.0.444,45 for prokaryotic-derived eukaryotic gene models. Evidence-based refinement employed GeneWise v2.4.146 to align homologous proteins with E-value ≤ 1e-10, resolving splice junctions with ≤5% false discovery rate.

Transcriptomic validation integrated RNA-seq data (Illumina NovaSeq 6000) using HISAT2 v2.2.147 with the --dta option for splice-aware alignment, followed by StringTie v2.2.048 for transcript quantification and PASA v2.3.249 for consensus isoform assembly. Hybrid annotation merged these predictions via MAKER2 v2.31.1050 and HiFAP, generating 25,179 protein-coding genes (Table 8) with 97.69% BUSCO completeness.
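As a rough illustration of the alignment-to-assembly step, the sketch below builds the command lines for HISAT2 with --dta followed by StringTie. It only constructs the commands and does not run them. The intermediate samtools sort step, the thread count, and all file names are assumptions added for the example, since the text does not report the exact invocation used.

```python
import shlex

def rnaseq_commands(index, reads_1, reads_2, prefix, threads=8):
    """Build HISAT2 -> samtools sort -> StringTie commands for one sample.

    All paths and the thread count are hypothetical placeholders.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    gtf = f"{prefix}.gtf"
    return [
        # --dta reports alignments tailored for transcript assemblers.
        ["hisat2", "--dta", "-p", str(threads), "-x", index,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # StringTie expects coordinate-sorted alignments (assumed step).
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        ["stringtie", bam, "-p", str(threads), "-o", gtf],
    ]

commands = rnaseq_commands("genome_index", "muscle_R1.fq.gz",
                           "muscle_R2.fq.gz", "muscle")
for cmd in commands:
    print(shlex.join(cmd))
```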

Table 8 Statistics on transposable elements in the T. erythraea genome.

TBLASTN-based comparative genomics (E-value ≤ 1e-5) identified 3,663 conserved coding regions across the related species51, including Triplophysa yaopeizhii, T. tibetana, T. dalaica, T. rosa, and Ctenopharyngodon idella. Gene structures were compared with those of these species (Fig. 3). The four structural features compared (gene, CDS, exon, and intron length) are highly consistent across the five Triplophysa species. Notably, the Triplophysa species differ from the outgroup (C. idella) in two respects: shorter overall gene length and shorter introns. Shorter genes and introns contribute to enhanced transcriptional efficiency, while the relatively shorter coding sequences (CDSs) and exons help maintain stable gene function52. Collectively, these structural features may favor the survival and reproduction of Triplophysa in low-temperature and hypoxic environments.
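The structural features compared here can be derived directly from exon coordinates in a gene annotation. The sketch below shows one way to compute gene, exon, and intron lengths for a single transcript, assuming 1-based inclusive, non-overlapping exon coordinates as in GFF3; the function name and the toy coordinates are hypothetical.

```python
def gene_structure(exons):
    """Summarize one transcript from its exon intervals.

    `exons` is a list of (start, end) pairs, 1-based and inclusive,
    assumed not to overlap.
    """
    exons = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in exons]
    # An intron spans the bases strictly between two consecutive exons.
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "gene_length": exons[-1][1] - exons[0][0] + 1,
        "exon_lengths": exon_lengths,
        "intron_lengths": intron_lengths,
    }

# Hypothetical three-exon transcript.
toy = gene_structure([(101, 200), (301, 450), (1001, 1100)])
print(toy)
```

Distributions like those in Fig. 3 would then be obtained by applying such a summary to every annotated gene in each species.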

Fig. 3

Distribution of genes in different related species.

Comprehensive functional annotation of protein-coding genes was executed through iterative database curation using InterProScan v5.61–93.053 for conserved domain/motif detection (99.09% annotated genes, 24,951 entries), followed by InterPro, GO, KEGG, and SwissProt enrichment analysis54,55,56,57,58. Multi-source validation integrated TrEMBL (98.41% coverage)58, Pfam (85.95% domain overlap), and KOG (75.55% orthology groups), with TF and NR databases resolving unannotated gene families (Table 9).

Table 9 Putative protein-coding gene functional annotations of the T. erythraea genome.

Annotation of non-coding RNAs

Non-coding RNAs were annotated using specialized bioinformatics pipelines. tRNAs were identified using tRNAscan-SE v1.3.159 with an E-value cutoff of ≤ 1e-5, and rRNAs were predicted by BLASTN alignment. miRNAs and snRNAs were identified with INFERNAL v1.1.4 and Rfam v14.860; the results are summarized in Table 10. The non-coding RNA classes vary widely in genomic abundance. For example, rRNAs (especially 18S) account for a comparatively high proportion of the genome (0.025811%), whereas scaRNAs represent a far smaller fraction (0.000281%). This marked disparity likely reflects their distinct functional roles in T. erythraea.

Table 10 Statistics of the noncoding RNA in the T. erythraea genome.

Data Records

The raw sequencing reads generated from three platform-specific sequencing runs, along with the final genome assembly, have been deposited in the NCBI Sequence Read Archive (SRA, accession number: SRR34067827 - SRR34067831) under BioProject accession number PRJNA127968561,62,63,64,65. The genome annotation files are available in figshare: https://doi.org/10.6084/m9.figshare.2936786066.

Technical Validation

Genome assembly quality was validated by aligning reads from all sequencing platforms to the assembly. Short reads aligned at a rate of 99.69% using BWA-MEM (v0.7.17-r1188)67, and HiFi and ONT reads mapped at rates of 99.94% and 99.91%, respectively, using Minimap2 v2.2435 (Tables 11, 12). These alignments indicate high congruence between the reads and the assembly, with BUSCO v5.4.368 analysis (actinopterygii_odb10) revealing 98.38% completeness across 3,581 single-copy orthologs (Table 13).

Table 11 The alignment of the short reads to the T. erythraea genome assembly.
Table 12 The alignment of the long reads to the T. erythraea genome assembly.
Table 13 Statistics of BUSCO analysis of the T. erythraea genome.
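The BUSCO completeness figure can be checked with simple arithmetic. The sketch below assumes that 3,581 is the number of complete BUSCOs recovered and that the actinopterygii_odb10 lineage set contains 3,640 BUSCO groups; the total is not stated in the text and is an assumption here. Under those assumptions the ratio reproduces the reported 98.38%.

```python
def percent(part: int, total: int) -> float:
    """Return part/total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * part / total

# Assumption: 3,640 BUSCO groups in actinopterygii_odb10 (not given in the text).
BUSCO_TOTAL = 3640
BUSCO_COMPLETE = 3581  # count quoted in the text

completeness = percent(BUSCO_COMPLETE, BUSCO_TOTAL)
print(f"BUSCO completeness: {completeness:.2f}%")
```

The same helper applies to the read mapping rates, given the mapped and total read counts from Tables 11 and 12.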