Background & Summary

Salmo trutta, a member of the Salmonidae family, is characterized by a grey upper body with spots distributed above and below the lateral line1. It is native to Europe, Western Asia, and North Africa2, and can be broadly categorized into anadromous and lacustrine populations3. Since the mid-19th century, S. trutta has been introduced to 24 countries, including Russia, the United States, the United Kingdom, Japan, and countries in South America. Its strong migratory abilities and adaptability to diverse environments have enabled it to rapidly establish itself as a global species3,4.

In 1866, S. trutta was introduced to the Yadong River in Tibet, China, where it has since become a localized population, colloquially known as Yadong trout5. The specific environmental conditions of the Yadong River result in slow growth and a slow reproductive cycle in the Yadong trout population, making it highly vulnerable to overfishing6,7. This vulnerability has led to a steady decline in population numbers, prompting its designation as a second-class protected aquatic species in the Tibet Autonomous Region in 19928. In recent years, the Yellow Sea Fisheries Research Institute of the Chinese Academy of Fishery Sciences has implemented large-scale aquaculture programs to support the conservation and commercial farming of Yadong trout.

Advancements in sequencing technologies have made telomere-to-telomere (T2T) level genome assemblies feasible. As a result, several fish species, including Neosalanx taihuensis9, Lateolabrax maculatus10, and Clarias gariepinus11, have had their genomes published at this level of detail. However, the only available genome (GCA_901001165.1) of S. trutta remains the chromosomal-level assembly published by the Wellcome Sanger Institute in 201912, and the number of published chromosomal-level genomes is still extremely limited compared to the global distribution of different populations of S. trutta.

In this study, we integrated PacBio high-fidelity (HiFi), high-throughput chromatin conformation capture (Hi-C), and Oxford Nanopore Technologies (ONT) reads to assemble the S. trutta genome, achieving a near-complete genome sequence (Fig. 1a,b). Compared to the published S. trutta genome12, our assembly shows significant improvements in continuity and completeness. This high-quality genome will provide a valuable resource for the molecular breeding of S. trutta and facilitate comparative genomic analyses among S. trutta populations from different regions.

Fig. 1

S. trutta genome snail plot and circos plot. (a) The snail plot presents the basic metrics of the genome assembly. (b) The circos plot represents the following metrics from outer to inner layers: (a) chromosomes, (b) gene density, (c) GC content, (d) DNA transposons, (e) LTRs, (f) LINEs, and (g) SINEs. The points on the chromosome backbone indicate detected telomeres, with red points representing chromosomes that are telomere-to-telomere with no gaps.

Methods

Sample collection and sequencing

An adult female S. trutta was sourced from the Yadong County Industrial Park, Shang Yadong Township, Shigatse City, Tibet, China. We used an SDS-based extraction method to obtain genomic DNA (gDNA) of sufficient quality and quantity. The sheared gDNA was purified using AMPure PB beads, followed by end repair, adapter ligation, and further purification to construct SMRTbell templates. These templates were then loaded into SMRT cells and sequenced on the PacBio Sequel II platform, yielding 94.96 Gb (~38×) of HiFi data (Table 1).

Table 1 Statistics of the sequencing data.

For ONT data, DNA was extracted from fin clip tissue using the NEB Monarch® HMW DNA Extraction Kit for Tissue. Libraries were prepared and sequenced on the PromethION platform, resulting in 72.72 Gb (~29×) of ONT data (Table 1).

For Hi-C data, muscle tissue was processed through formaldehyde crosslinking, followed by washing, lysis, enzymatic digestion, DNA end modification, fragment ligation, purification, DNA end repair, biotin labeling, and PCR amplification. Illumina PE150 sequencing was performed, yielding 255.69 Gb (~103×) of Hi-C data (Table 1).

For whole-genome sequencing (WGS), DNA extracted from muscle tissue was converted into Illumina libraries using the NEBNext® Ultra™ DNA Library Prep Kit. Cluster generation for these libraries was carried out on the cBot Cluster Generation System with the Illumina Paired-End Cluster Kit, following the manufacturer's guidelines. We ultimately obtained 99.63 Gb (~40×) of short-fragment data (Table 1).
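The approximate depths quoted above can be reproduced by dividing each sequencing yield by the genome size. The sketch below assumes the 2.49 Gb assembly size reported in this paper as the denominator; the text does not state which genome size was used, so that choice and the names `GENOME_SIZE_GB`, `YIELDS_GB`, and `depth` are illustrative only.

```python
# Assumption: depth is computed against the 2.49 Gb assembly size reported in this paper.
GENOME_SIZE_GB = 2.49

# Sequencing yields in Gb, as stated in the text (Table 1).
YIELDS_GB = {"HiFi": 94.96, "ONT": 72.72, "Hi-C": 255.69, "WGS": 99.63}


def depth(yield_gb, genome_gb=GENOME_SIZE_GB):
    """Approximate fold coverage: total bases sequenced divided by genome size."""
    return yield_gb / genome_gb


for name, gb in YIELDS_GB.items():
    print(f"{name}: ~{depth(gb):.0f}x")
```

Under this assumption the rounded values match the depths given in the text for all four data types.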

For RNA-seq data, we extracted RNA from nine tissues: muscle, liver, intestine, ovary, brain, spleen, gill, kidney, and heart. RNA-seq libraries were prepared following the protocol of the library preparation kit. Illumina sequencing yielded an average of 6.83 Gb of data per tissue sample (Table 1). Additionally, we extracted RNA from a mixed tissue sample, and qualified RNA samples underwent reverse transcription, end repair, DNA fragmentation, adapter ligation, and amplification to construct the library. Sequencing was then carried out on the PacBio Sequel IIe platform, resulting in 10.11 Gb of Isoform Sequencing data (Table 1).

Genome assembly and telomere identification

To obtain a high-quality genome for S. trutta, we integrated ONT data with HiFi data and used Hifiasm13 (v0.19.9) to assemble a draft genome. We then used CRAQ14 (v1.0.9) to identify chimeric fragments and generate a CRAQ-corrected genome. Next, we removed redundant sequences with kmerDedup15 and used HapHic16 (v1.0.6) with the Hi-C data to anchor the draft genome to 40 chromosomes, consistent with the chromosome number of the previously published S. trutta genome12 (Fig. 2a). After minor manual refinement in Juicebox17 (v1.91), we filled gaps with TGS-GapCloser18 (v1.2.1) using the ONT data and polished the assembly with NextPolish219 (v0.2.1) using the HiFi and WGS data. The final genome was 2.49 Gb with a contig N50 of 47.99 Mb (Fig. 1a). We used the TeloExplorer module of quarTeT20 (v1.2.1) to identify telomeres (TTAGGG) at both ends of each chromosome: telomeres were detected at both ends of 18 chromosomes and at one end of 17 chromosomes, and 8 chromosomes reached gap-free telomere-to-telomere status (Fig. 1b and Table 2).
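To illustrate what detecting a telomere at a chromosome end involves, the following minimal sketch searches the terminal window of a sequence for a tandem run of TTAGGG or its reverse complement. It is not the quarTeT TeloExplorer algorithm; the function names, window size, and repeat threshold are assumptions chosen for illustration.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_telomere(seq, end, motif="TTAGGG", window=10_000, min_repeats=10):
    """True if the terminal window holds >= min_repeats tandem motif copies.

    `window` and `min_repeats` are illustrative defaults, not quarTeT settings.
    """
    window = min(window, len(seq) // 2)  # keep the two end windows disjoint
    if window == 0:
        return False
    region = (seq[:window] if end == "start" else seq[-window:]).upper()
    rc = reverse_complement(motif)
    pattern = re.compile(
        f"(?:{motif}){{{min_repeats},}}|(?:{rc}){{{min_repeats},}}"
    )
    return pattern.search(region) is not None


def classify(seq):
    """Label a chromosome as 'double', 'single', or 'none' by telomere ends found."""
    found = has_telomere(seq, "start") + has_telomere(seq, "end")
    return {2: "double", 1: "single", 0: "none"}[found]


# Toy sequences (not real data): telomeric repeats flanking a filler body.
body = "ACGT" * 1000
print(classify("CCCTAA" * 20 + body + "TTAGGG" * 20))
print(classify(body + "TTAGGG" * 20))
```

A real tool would also tolerate imperfect repeats and report repeat counts; this version only checks for exact tandem copies.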

Fig. 2

Hi-C heatmap and collinearity dot plot of the genome. (a) The Hi-C heatmap illustrates the chromosome interaction frequencies of the S. trutta genome, with each blue contour representing a chromosome. (b) The dot plot displays the collinearity relationship with the previously published genome assembly of S. trutta.

Table 2 Assembly statistics of chromosomes.

Additionally, we conducted a collinearity analysis using Minimap221 (v2.28) with the published S. trutta genome and visualized the results using pafCoordsDotPlotly.R (https://github.com/tpoorten/dotPlotly). The results showed that most chromosomes exhibited good collinearity, although some chromosomes displayed different arrangements (Fig. 2b). We also compared various assembly metrics between the two genomes, and found that our assembled genome demonstrated superior performance, with a total genome length of 2.49 Gb, an anchoring rate of 96.87%, a contig N50 of 47.99 Mb, a BUSCO completion rate of 99.50%, fewer genome gaps, 12 chromosomes being gap-free, and the detection of 53 telomeres (Table 3).
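For readers unfamiliar with the continuity metric, the sketch below shows the usual definition of N50 (the length of the shortest sequence in the smallest set of longest sequences covering at least half the total length) and checks the telomere count against the per-chromosome categories given earlier. The function name `n50` and the toy lengths are illustrative; no real contig lengths from this assembly are used.

```python
def n50(lengths):
    """N50: length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy contig lengths (not from the assembly).
print(n50([10, 8, 6, 4, 2]))

# Telomere arithmetic from the text: 18 chromosomes with telomeres at both
# ends plus 17 chromosomes with a telomere at one end.
double_end, single_end = 18, 17
print(double_end * 2 + single_end)
```

The second printed value should equal the 53 telomeres reported in the comparison.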

Table 3 Assembly statistics of S. trutta.

Repetitive sequence annotation

To identify and characterize repetitive elements, we combined de novo prediction with homology-based annotation. We used LTR-Finder22 (v1.07) to predict LTR retrotransposons and RepeatModeler23 (v2.0.5) to predict repetitive elements de novo; the results of both tools were combined with the Repbase database24 (v202101) to build a repeat library specific to the S. trutta genome. Homology-based annotation was conducted mainly with the RepeatProteinMask and Repbase modules of RepeatMasker25 (v4.1.0) using default parameters. The predictions from both strategies were then integrated and filtered. The annotation results indicated that repetitive sequences account for approximately 63.24% of the genome, with the largest portion occupied by interspersed elements: long interspersed nuclear elements (LINEs) made up 32.34%, short interspersed nuclear elements (SINEs) 0.55%, and long terminal repeats (LTRs) 10.80% (Fig. 1b and Table 4).
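When predictions from several tools are integrated, overlapping annotations must be counted once before a genome-wide repeat percentage is computed. The sketch below shows one common way to do this by merging intervals; it is an assumption about how such a figure can be derived, not a description of the filtering actually applied here, and the coordinates are toy values.

```python
def merged_length(intervals):
    """Total bases covered by half-open [start, end) intervals, overlaps counted once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


def repeat_percent(intervals, sequence_length):
    """Percentage of a sequence covered by the merged repeat intervals."""
    return 100.0 * merged_length(intervals) / sequence_length


# Toy example: two overlapping predictions and one separate prediction
# on a hypothetical 1,000 bp sequence.
toy = [(0, 100), (50, 150), (200, 250)]
print(merged_length(toy), repeat_percent(toy, 1000))
```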

Table 4 Statistics of repeat content.

Genomic structure and functional annotation

Building on the repeat-masked S. trutta genome, we used Augustus26 (v3.5.0) for de novo gene prediction with default parameters. We used Miniprot27 (v0.13) for homology-based annotation with protein sequences from four species: Salmo salar, S. trutta, Oncorhynchus mykiss, and Oncorhynchus tshawytscha. In parallel, we aligned RNA-seq data to the genome using HISAT228 (v2.2.1) and assembled transcripts with StringTie29 (v2.2.3). Open reading frames (ORFs) within the assembled transcripts were identified using TransDecoder (https://github.com/TransDecoder/TransDecoder), and the predicted coding regions served as annotation evidence. We also processed the full-length transcriptome data with the IsoSeq330 (v4.0.0) pipeline and aligned the resulting transcripts to the genome with GMAP (https://github.com/juliangehring/GMAP-GSNAP) to generate further annotation evidence. Finally, we integrated the four lines of evidence (de novo, homology, transcript-based, and full-length transcriptome) using EvidenceModeler31 (v1.1.1). The final gene set was further refined with Liftoff32 (v1.6.3), using the S. trutta genome annotation as a reference, resulting in the identification of 41,782 genes. To assess the accuracy of gene annotation, we compared the distribution of mRNA lengths and exon counts per mRNA in the S. trutta genome with gene data from S. salar and O. mykiss. The results demonstrated highly similar distributions of these gene features among the three species (Fig. 3a).
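The exon count per mRNA used in this comparison can be derived directly from an annotation file. The sketch below assumes a standard GFF3 file in which exon features name their transcript through the `Parent` attribute; the function names and the toy records are illustrative and are not taken from the released annotation.

```python
from collections import Counter


def exons_per_mrna(gff3_lines):
    """Count exon features per parent transcript ID in GFF3 text lines."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        # An exon shared by several transcripts lists comma-separated parents.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                counts[parent] += 1
    return counts


def exon_count_distribution(counts):
    """Map exon count -> number of transcripts having that many exons."""
    return Counter(counts.values())


# Toy GFF3 records (hypothetical IDs).
toy = [
    "##gff-version 3",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=m1;Parent=g1",
    "chr1\tsrc\texon\t1\t100\t.\t+\t.\tID=e1;Parent=m1",
    "chr1\tsrc\texon\t200\t300\t.\t+\t.\tID=e2;Parent=m1,m2",
]
print(exons_per_mrna(toy))
```

Applying the same function to the annotations of each species gives distributions that can be plotted side by side.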

Fig. 3

Comparative genome plot of closely related species and gene functional annotation upset plot. (a) Comparison of mRNA length distribution and exon count per mRNA among the three closely related species. (b) Upset plot of gene functional annotation using data from EggNOG, Pfam, KEGG, NR, Kofam, and SwissProt.

The predicted gene protein sequences were aligned against various functional databases using Diamond33 (v2.1.6) with an E-value threshold set at 1e-5. The databases encompassed Kyoto Encyclopedia of Genes and Genomes34 (KEGG), Swiss Institute of Bioinformatics Protein Database35 (Swiss-Prot), Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups36 (EggNOG), Protein Families Database37 (Pfam), KOfam Database38 (Kofam), and the Non-Redundant database39 (NR) to extract potential gene functional information, which was subsequently utilized for statistical analysis. A total of 41,629 genes, which account for 99.63% of the total estimated protein-coding genes, have been effectively annotated by a minimum of one of these databases (Fig. 3b and Table 5).
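The "annotated by at least one database" figure corresponds to the size of the union of the per-database hit sets. A minimal sketch follows, with hypothetical gene IDs and a hypothetical `annotation_rate` helper; the final line only checks the percentage implied by the two gene counts stated in the text.

```python
def annotation_rate(hits_by_db, total_genes):
    """Return (number of genes with a hit in >= 1 database, percentage of total)."""
    annotated = set().union(*hits_by_db.values())
    return len(annotated), 100.0 * len(annotated) / total_genes


# Toy hit sets (hypothetical gene IDs, not real results).
toy_hits = {"KEGG": {"g1", "g2"}, "Pfam": {"g2", "g3"}, "NR": {"g1"}}
print(annotation_rate(toy_hits, total_genes=4))

# Percentage implied by the counts reported in the text.
print(round(100.0 * 41_629 / 41_782, 2))
```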

Table 5 Statistics of gene annotation.

Data Records

The genome assembly data have been uploaded to GenBank under accession number JBICOJ000000000 (PRJNA1170013)40 and Figshare41. The genome annotation file has also been uploaded to Figshare41. The sequencing raw data have been uploaded to NCBI with the project number PRJNA1171601 (SRP540059)42.

Technical Validation

We used BUSCO43 (v5.3) with the Actinopterygii database (actinopterygii_odb10) to evaluate the completeness of the genome assembly. The BUSCO analysis revealed an overall completeness of 99.50%, with 56.48% single-copy and 43.02% duplicated, leaving only 0.28% fragmented and 0.22% missing (Fig. 1a and Table 3). We used Inspector44 (v1.0.1) to map the PacBio HiFi reads against the genome, obtaining a Quality Value (QV) of 44.91 and a read-to-contig alignment rate of 99.95%. Additionally, we applied CRAQ14 for genome assessment, which gave an Assembly Quality Index (AQI) of 96 (>90), indicating that our assembled genome is highly complete, of good quality, and meets a reference-quality standard.
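To put the QV in context, the sketch below converts it to an approximate per-base error rate, assuming the usual Phred-scaled definition QV = -10 log10(error rate), and confirms that the reported BUSCO categories are internally consistent. The helper names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_rate)


rate = qv_to_error_rate(44.91)
print(f"error rate ~{rate:.2e}, about one error per {1 / rate:,.0f} bases")

# BUSCO categories from the text: complete = single-copy + duplicated,
# and all four categories should sum to 100%.
single, duplicated, fragmented, missing = 56.48, 43.02, 0.28, 0.22
print(round(single + duplicated, 2), round(single + duplicated + fragmented + missing, 2))
```

Under that definition, a QV of about 45 corresponds to roughly three errors per 100,000 bases.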