Background & Summary

Salvelinus malma belongs to the genus Salvelinus (family Salmonidae, order Salmoniformes, class Osteichthyes) and is of ecological and economic importance1. It inhabits freshwater and marine ecosystems of northwest America and northeast Asia2,3. Characterized by its vibrant pigmentation and superior flesh quality4, S. malma has earned the epithet “King of Cold-Water Fishes” in high-altitude aquatic systems. Its substantial market value has positioned it as a premium aquaculture product in international trade5,6.

Recent decades have witnessed alarming declines of wild landlocked S. malma populations, primarily driven by anthropogenic disturbances and habitat alterations2,7,8. Combined with the exceptionally slow growth rate of the species, these pressures have made wild S. malma increasingly scarce in the freshwater systems of China9. Conserving such endangered species presents a significant challenge for biologists and ecologists. In this context, genomics has emerged as a powerful tool in conservation biology, offering insights into the genetic diversity of threatened species. Genomic resources also provide critical information on current and historical demographic trends, phylogenetic relationships, and the molecular mechanisms that underpin interactions between genetic and environmental factors. Moreover, they enable the development of rapid monitoring tools and inform conservation strategies grounded in genetic evidence. To provide such a resource for S. malma, we present the first chromosome-level genome assembly of the species, generated by integrating PacBio long-read sequencing and Hi-C scaffolding. This assembly will serve as a foundational resource for marker-assisted selection and evidence-based conservation management, and it establishes the genomic infrastructure needed to advance fundamental research on this ecologically vulnerable species (Fig. 1).

Fig. 1

The full-body view of S. malma.

Methods

Samples and sequencing

All procedures involving animals conformed to the ethical standards set by the Institutional Review Board at Ocean University of China (Permit Number: 20141201). An adult male S. malma (body length: 33.25 cm; body weight: 0.71 kg) was obtained from the Sifeng salmonid aquaculture farm in Yanji, Jilin Province, China. Following anaesthesia with 100 mg/L tricaine methanesulfonate (MS-222, Sigma-Aldrich), dorsal muscle tissue was aseptically collected, flash-frozen in liquid nitrogen, and stored at −80 °C. Genomic DNA was extracted using the QIAamp DNA Mini Kit (QIAGEN). DNA integrity was verified by electrophoresis and Agilent 4200 Bioanalyzer analysis (DNA integrity number >7.0; OD260/280 = 1.8–2.0). High-quality DNA samples were used to construct three types of sequencing library: (1) Illumina library for genome survey: libraries (350-bp insert size) were constructed using the TruSeq Nano DNA Kit and sequenced on the NovaSeq 6000 platform (150-bp paired-end), generating a total of 237.52 Gb (94.33× genome coverage) of raw data. Raw reads were quality-controlled using Fastp (v0.23.2)10 with the following thresholds: adapter contamination ≤5 bp, ambiguous bases ≤5%, Q20 ≥90%. (2) SMRTbell library for de novo assembly: SMRTbell libraries were prepared using the Template Prep Kit 1.0 with size selection (15–20 kb fragments) via BluePippin™ (Sage Science). PacBio Sequel II sequencing generated 78.21 Gb (31× genome coverage) of circular consensus sequencing (CCS) reads (≥99% accuracy) from 30-hour movies. (3) Hi-C library for chromosome anchoring: muscle samples were crosslinked with formaldehyde to preserve chromatin spatial interactions, followed by quenching with glycine. Crosslinked chromatin was digested with a restriction enzyme to generate cohesive ends. The digested DNA fragments underwent end repair and biotin labelling. Blunt-end fragments were proximity-ligated with T4 DNA ligase under dilute conditions to favour intramolecular ligation events. Purified DNA was randomly sheared by ultrasonication into 300–500 bp fragments. Biotinylated DNA fragments were enriched by streptavidin magnetic bead capture to selectively retain proximity-ligated junctions. Sequencing libraries were constructed using the Illumina TruSeq Nano DNA Library Prep Kit and sequenced on the Illumina NovaSeq 6000 platform (150-bp paired-end), generating 372.50 Gb of data (147.93× genome coverage) for chromosome anchoring (Table 1).
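The fold-coverage values quoted above follow from dividing each sequencing yield by the genome size. The sketch below illustrates that arithmetic; the yields are taken from the text, whereas the 2.52 Gb denominator is an assumption (the final assembly size reported later), because the exact genome size used for the reported coverage is not stated. The results therefore agree with the reported values only approximately.

```python
# Sequencing depth = total sequenced bases / genome size.
# ASSUMPTION: the denominator is the final assembly size (2.52 Gb); the text
# does not say which genome size was used for the reported coverage values.
ASSUMED_GENOME_SIZE_GB = 2.52

# Yields in Gb, as reported in the text.
yields_gb = {
    "Illumina": 237.52,
    "PacBio HiFi": 78.21,
    "Hi-C": 372.50,
}


def depth(yield_gb: float, genome_size_gb: float) -> float:
    """Return the average fold coverage for a given sequencing yield."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return yield_gb / genome_size_gb


coverage = {
    name: depth(gb, ASSUMED_GENOME_SIZE_GB) for name, gb in yields_gb.items()
}

for name, cov in coverage.items():
    print(f"{name}: {cov:.1f}x")
```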

Table 1 Statistics of sequencing data in S. malma genome assembly.

Genome assembly

The genome assembly workflow began with a genome survey. A 21-mer frequency distribution was constructed from the Illumina sequencing data using Jellyfish (v2.3.0)11. The GenomeScope (v2.0)12 online tool was then used to evaluate the S. malma genome, giving an estimated genome size of 2.42 Gb with 0.35% heterozygosity and 42.7% repetitive content (Fig. 2). Subsequently, PacBio HiFi reads were assembled de novo with Hifiasm (v0.19.6)13,14, yielding 7,979 primary contigs (N50 = 1.29 Mb). Hi-C scaffolding was performed using the Juicer and 3D-DNA pipelines15, followed by manual curation in Juicebox (v1.11.08)16 to generate chromosome-level scaffolds. A total of 3,558 contigs were anchored to 42 chromosomes (Fig. 3c, Table 2). Gaps in the initial assembly were then filled with all contigs and HiFi reads using quarTeT (v1.2.5)17. The assembled genome was further polished with HiFi reads using the T2T-Polish workflow (https://github.com/arangrhie/T2T-Polish)18. The final genome assembly was 2.52 Gb in size with a GC content of 43.43% (Fig. 4, Tables 2, 3). Additionally, the assembly achieved 98.6% BUSCO completeness based on the Actinopterygii_db12 gene set.
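For readers unfamiliar with the contig N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed from a list of sequence lengths. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy example (hypothetical lengths): total = 300, half = 150.
# The two longest sequences (100 + 80) pass the halfway mark, so N50 = 80.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```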

Fig. 2

The 21-mer analysis for genome survey of S. malma. (a) Linear scale. (b) Exponential scale. The estimated genome sizes (len), unique k-mer ratios (uniq), heterozygosity (het) ratios, k-mer coverage values (kcov), read errors (err), and duplication (dup) rates are displayed on the top side of each panel.
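The idea behind a k-mer spectrum can be illustrated with a toy counter. The sketch below is not the Jellyfish or GenomeScope implementation; it counts canonical k-mers in memory and applies the simplest possible size estimate (total k-mer occurrences divided by the depth of the main peak), whereas GenomeScope fits a mixture model that also accounts for heterozygosity, errors, and repeats. All function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k=21):
    """Count canonical k-mers, then return {multiplicity: number of distinct k-mers}."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


def genome_size_estimate(spectrum, peak_depth):
    """Naive estimate: total k-mer occurrences divided by the main-peak depth."""
    total = sum(multiplicity * n for multiplicity, n in spectrum.items())
    return total / peak_depth
```

For real data the counting step needs a dedicated tool, because holding every 21-mer of a multi-gigabase genome in a Python dictionary is impractical.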

Fig. 3

Chromosome number and morphology of S. malma revealed by cytogenetic and Hi-C analyses. (a) Metaphase chromosome spread showing the diploid chromosomal morphology. (b) Karyotype arrangement based on and centromeric position. (c) Hi-C contact map and chromosome anchoring of S. malma genome.

Table 2 Summary statistics of S. malma genome assembly.
Fig. 4

Statistics of the S. malma genome assembly. (a) Physical map of S. malma chromosomes (Mb scale); different colours represent different chromosomes. (b) Gene density, shown as the number of genes per 1 Mb window. (c) GC content, shown as the percentage of G/C bases per 1 Mb window. (d) Distribution of repeated sequences per 1 Mb window. (e) Distribution of DNA transposon sequences per 1 Mb window. (f) Distribution of LINE sequences per 1 Mb window. (g) Distribution of LTR sequences per 1 Mb window.
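Windowed tracks such as the GC content in panel (c) can be derived with a simple non-overlapping sliding window. The sketch below is one possible way to do this and is not the authors' script; in particular, excluding N bases from the denominator is an assumption, and the function name is hypothetical.

```python
def gc_percent_by_window(sequence: str, window: int = 1_000_000):
    """Return the GC percentage for consecutive, non-overlapping windows.

    ASSUMPTION: G/C are counted over A/C/G/T only, so runs of N (gaps)
    do not dilute the value. A window without any A/C/G/T yields NaN.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    values = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        values.append(100.0 * gc / acgt if acgt else float("nan"))
    return values


# Toy usage with an 8-bp window instead of 1 Mb.
print(gc_percent_by_window("GGCCAATT" + "ATATGCNN", window=8))
```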

Table 3 Statistics of length of chromosome in S. malma genome.

Repetitive sequence annotation

First, RepeatModeler (v2.06)19 was used to construct a de novo repeat library for the S. malma genome. This library was then merged with the salmonid-specific repeat library from RepBase20 and used as the reference for repetitive sequence annotation of the S. malma genome with RepeatMasker (v4.1.3)21. The results revealed 1.66 Gb of repetitive sequences, accounting for 55.55% of the S. malma genome. Among these, DNA transposons dominated at 23.60%, followed by long interspersed nuclear elements (LINEs) at 14.04% and long terminal repeats (LTRs) at 6.50% (Fig. 4, Table 4). Kimura substitution levels of the repetitive sequences were calculated using the calcDivergenceFromAlign.pl script from the RepeatMasker package. The repeat landscape, plotted with the createRepeatLandscape.pl script, was used to visualize the genomic distribution and evolutionary dynamics of repetitive elements in S. malma (Fig. 5).

Table 4 Classification statistics of repeated sequences in S. malma genome.
Fig. 5

Evolutionary dynamics of transposable elements in the S. malma genome. The repeat landscape, generated through Kimura substitution analysis of transposable element copy divergence, reflected the historical transposable element accumulation phases.
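The Kimura substitution level underlying a repeat landscape is based on the Kimura two-parameter distance, K = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitions and transversions between a repeat copy and its consensus. The sketch below implements only this basic formula for a pair of aligned sequences; it is not the RepeatMasker script, which may apply further corrections (for example for CpG sites) that are not reproduced here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P model")
    return -0.5 * math.log(term1 * math.sqrt(term2))
```

Low values correspond to recently inserted copies and high values to older ones, which is why the landscape can be read as a record of past transposable element activity.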

Genome annotation

Protein-coding genes in the S. malma genome were annotated through a comprehensive strategy that integrated RNA evidence, homologous proteins, ab initio prediction and the NCBI Eukaryotic Genome Annotation Pipeline (EGAPx). For RNA evidence, we collected RNA-seq datasets from various tissues of Salvelinus sp., including liver, gonad, gill, stomach, head kidney, hind kidney, brain, muscle, gut, heart, and eye (accession numbers SRS2043860-SRS2043871). These RNA-seq datasets were aligned to the S. malma genome with HISAT2 (v2.1.0)22 using default parameters. The SAM files generated from the alignments were sorted using Samtools (v1.12)23. StringTie (v2.2.1)24 was used to perform de novo transcript assembly on the merged BAM file. The LongOrfs module of TransDecoder (v5.7.1, https://github.com/TransDecoder/TransDecoder) was used to predict potential open reading frames in the cDNA sequences. For homologous protein evidence, the protein sequences of related species, including Oncorhynchus keta (GCA_023373465.1)25, O. mykiss (GCA_013265735.3)26, Coregonus clupeaformis (GCA_020615455.1)27, O. nerka (GCA_034236695.1)28, O. kisutch (GCA_002021735.2)29, O. gorbuscha (GCA_021184085.1)30, and Salmo trutta (GCA_901001165.2)31, were downloaded from the public NCBI database and aligned against the S. malma genome using miniprot (v0.13)32. For ab initio prediction, Helixer (v0.3.3)33, which combines deep learning with a hidden Markov model, was used to predict gene structures. Predictions from RNA evidence, homologous proteins and ab initio prediction were consolidated with EVidenceModeler (EVM, v2.1.0)34 at a weight ratio of 5:1:1. In addition, the NCBI EGAPx (v0.3.2) pipeline (https://github.com/ncbi/egapx) was run for gene prediction using Nextflow (v24.10.5). Evidence from the NCBI pipeline and EVM was integrated, yielding 45,385 high-confidence protein-coding genes. These genes had an average gene length of 28,448 bp and an average coding sequence (CDS) length of 1,814 bp. Furthermore, the similar distributions of mRNA, exon, and intron lengths between S. malma and closely related species indicated evolutionary conservation of gene structure (Fig. 6).
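Summary statistics such as the average gene length and average CDS length can be recomputed from a GFF3 annotation. The sketch below is a minimal illustration, assuming a standard nine-column GFF3 file in which CDS features carry a Parent attribute naming their transcript; it is not the authors' script, and the function name is hypothetical.

```python
from collections import defaultdict


def gene_and_cds_means(gff3_lines):
    """Return (mean gene length, mean CDS length per transcript) in bp."""
    gene_lengths = []
    cds_by_parent = defaultdict(int)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature, start, end, attrs = cols[2], int(cols[3]), int(cols[4]), cols[8]
        length = end - start + 1  # GFF3 coordinates are 1-based and inclusive
        if feature == "gene":
            gene_lengths.append(length)
        elif feature == "CDS":
            fields = dict(
                field.split("=", 1) for field in attrs.split(";") if "=" in field
            )
            # Sum all CDS segments that belong to the same transcript.
            cds_by_parent[fields.get("Parent", "")] += length

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(gene_lengths), mean(list(cds_by_parent.values()))
```

If a gene has several transcripts, this sketch averages over transcripts rather than over genes; whether the reported CDS length was computed per gene or per transcript is not stated in the text.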

Fig. 6

The comparative patterns of protein-coding genes among S. malma, O. mykiss, S. salar and D. rerio, including gene length, mRNA length, exon length, and intron length.

Phylogeny analysis

Seven Salmonidae species were selected, and their reference genomes were downloaded from the NCBI database (O. mykiss: GCA_013265735.3, ref. 26; O. gorbuscha: GCA_021184085.1, ref. 30; S. namaycush: GCA_016432855.1, ref. 35; S. fontinalis: GCA_029448725.1, ref. 36; S. salar: GCA_905237065.2, ref. 37; S. trutta: GCA_901001165.2, ref. 31; C. clupeaformis: GCA_020615455.1, ref. 27). Esox lucius (GCA_011004845.1)38 was set as the outgroup for phylogenetic tree construction. Single-copy orthologous genes were identified by sequence-similarity clustering with the OrthoFinder (v2.3.11)39 pipeline. The protein sequences encoded by the single-copy orthologous genes were aligned with MUSCLE (v3.8.1551)40, and non-conserved sites were filtered using GBLOCKS (v0.91b). The filtered alignments were then concatenated into a “supergene” using Perl scripts. The best-fitting model for phylogenetic tree construction was determined with ModelTest-NG (v0.1.7), and PROTGAMMAIJTTF was selected as the optimal model. The phylogenetic tree was constructed by the maximum likelihood method using RAxML (v8.2.12)41. Divergence times among species were then estimated with MCMCtree in the PAML (v4.9)42 package, using two fossil calibration points acquired from TimeTree43 (https://timetree.org/): E. lucius and S. salar (61–121.5 MYA); S. trutta and S. salar (74.4–96.5 MYA). The species tree was visualized using FigTree (v1.4.4) (Fig. 7). The topology showed that S. malma clustered within the Salvelinus clade, distinct from the Oncorhynchus, Salmo, and Coregonus lineages.
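The concatenation of per-gene alignments into a single “supergene” can be sketched as follows. The authors used Perl scripts that are not shown; this Python version is an illustrative substitute, and the function name, the gap-padding behaviour, and the partition output are assumptions.

```python
def concatenate_alignments(alignments, taxa):
    """Join per-gene alignments into one supermatrix.

    alignments: list of {taxon: aligned sequence}, one dict per gene.
    taxa: ordered list of taxon names to include.
    Returns (supermatrix, partitions), where partitions holds the 1-based
    inclusive column range of each gene in the concatenated alignment.
    ASSUMPTION: a taxon missing from a gene is padded with gaps; with strict
    single-copy orthologues this padding should never be triggered.
    """
    supermatrix = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for index, gene in enumerate(alignments, start=1):
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {index}: sequences differ in length")
        width = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon].append(gene.get(taxon, "-" * width))
        partitions.append((position + 1, position + width))
        position += width
    joined = {taxon: "".join(parts) for taxon, parts in supermatrix.items()}
    return joined, partitions
```

Keeping the partition ranges is optional, but it allows gene-wise models to be fitted later without re-deriving the gene boundaries.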

Fig. 7

Phylogeny and time scale of S. malma compared with other species. The split between S. malma and its sister species S. fontinalis occurred about 3.5 Mya, and the split between Salvelinus and Oncorhynchus occurred about 22.2 Mya.

Synteny analysis

Synteny analysis of the S. malma genome was performed using WGDI (v0.74)44. Protein sequences were self-aligned using BLASTp (v2.2.31+) with an E-value cutoff of 1e-5. Syntenic dot plots were generated by integrating the BLASTp outputs, genome annotations, and chromosome lengths into a WGDI configuration file (default parameters; maximum of 5 homologous genes per locus). Subsequently, WGDI was used to identify syntenic blocks, calculate synonymous substitution rates (Ks values), integrate block information, and visualize Ks distributions among S. salar (GCA_905237065.2)37, Danio rerio (GCA_049306965.1)45, and S. malma. The analysis revealed conserved synteny and chromosomal inversions in the S. malma genome (Fig. 8a). Additionally, two distinct Ks peaks were observed between S. malma and D. rerio, suggesting the salmonid-specific fourth round of vertebrate whole-genome duplication (Fig. 8b).
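The hit-filtering logic described above (an E-value cutoff and a cap on the number of homologues per gene) can be illustrated with a short parser for 12-column tabular BLAST output. This is a sketch of the general idea rather than WGDI's internal procedure; the function name is hypothetical, and both the removal of self-hits and the ranking by bit score are assumptions.

```python
from collections import defaultdict


def filter_blast_hits(lines, max_evalue=1e-5, max_hits=5):
    """Keep up to max_hits best-scoring hits per query below the E-value cutoff.

    Expects 12-column tabular BLAST output, in which column 11 is the
    E-value and column 12 is the bit score.
    """
    by_query = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        # ASSUMPTION: self-hits are uninformative in a self-alignment and are dropped.
        if query == subject or evalue > max_evalue:
            continue
        by_query[query].append((bitscore, subject, evalue))
    kept = {}
    for query, hits in by_query.items():
        hits.sort(key=lambda hit: hit[0], reverse=True)  # best bit score first
        kept[query] = [(subject, evalue) for _, subject, evalue in hits[:max_hits]]
    return kept
```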

Fig. 8

Detection of whole-genome duplication (WGD) and genomic synteny analysis in the S. malma genome. (a) Synteny blocks of the S. malma genome. The axes refer to different chromosomes. (b) Distribution of Ks values between S. malma and S. salar, O. mykiss, and D. rerio; the curves represent Gaussian fits of the raw Ks counts.

Data Records

All sequencing data have been uploaded to the NCBI SRA database under BioProject accession number PRJNA1248052. Specifically, the Illumina sequencing data for the genome survey have been deposited in the NCBI SRA under accession number SRR33069232 (ref. 46). The PacBio genomic sequencing data have been deposited in the NCBI SRA under accession number SRR33069233 (ref. 47). The Hi-C data have been deposited in the NCBI SRA under accession numbers SRR35364755 (ref. 48) and SRR35364756 (ref. 49). The genome assembly has been deposited in GenBank under accession number JBQVVI000000000 (ref. 50), and the genome annotation has been deposited in the figshare database (https://doi.org/10.6084/m9.figshare.28788059.v1)51.

Technical Validation

Genome assembly and annotation assessment

BUSCO (v3.0.2) analysis was performed to evaluate the completeness of the S. malma genome assembly and annotation, using three reference datasets: Eukaryota_db12, Vertebrata_db12, and Actinopterygii_db12 (ref. 52). The final genome assembly achieved BUSCO completeness scores of 99.6% (Eukaryota: 52.6% single-copy, 45.9% duplicated, 1.0% fragmented, 0.4% missing), 98.8% (Vertebrata: 47.6% single-copy, 51.2% duplicated, 0.9% fragmented, 0.3% missing), and 98.6% (Actinopterygii: 52.6% single-copy, 45.9% duplicated, 1.0% fragmented, 0.4% missing). Similarly, the annotated protein-coding genes showed BUSCO completeness of 100% (Eukaryota: 39.6% single-copy, 60.4% duplicated), 99.0% (Vertebrata: 38.7% single-copy, 60.2% duplicated, 0.6% fragmented, 0.5% missing), and 98.8% (Actinopterygii: 42.6% single-copy, 56.2% duplicated, 0.6% fragmented, 0.5% missing), collectively confirming the high quality of the S. malma genome (Fig. 9a,b).
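A BUSCO completeness score is the sum of the single-copy and duplicated categories, and the four categories together should account for all searched orthologues. The sketch below shows this check using the Vertebrata values reported for the genome assembly; the function name is hypothetical, and small deviations from 100% can arise from rounding of the published percentages.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return (complete %, total %) from the four BUSCO category percentages."""
    complete = round(single + duplicated, 1)
    total = round(complete + fragmented + missing, 1)
    return complete, total


# Vertebrata values for the genome assembly, as reported in the text.
complete, total = busco_summary(47.6, 51.2, 0.9, 0.3)
print(complete, total)
```

The large duplicated fraction is consistent with the salmonid-specific whole-genome duplication discussed in the synteny analysis.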

Fig. 9

BUSCO statistical results of the S. malma genome assembly and annotation using three reference datasets. (a) The BUSCO completeness of the genome assembly was 99.6%, 98.8%, and 98.6% at the Eukaryota, Vertebrata, and Actinopterygii datasets, respectively. (b) The BUSCO completeness of the genome annotation was 100.0%, 99.0%, and 98.8% at the Eukaryota, Vertebrata, and Actinopterygii datasets, respectively.

Karyotype analysis of S. malma

To validate the accuracy of the Hi-C-based S. malma genome assembly, chromosome karyotyping was conducted using the Giemsa staining method. Initially, phytohemagglutinin (PHA, 10 μg/g fish weight) was administered, followed by colchicine injection (5 mg/g) 24 h later. Head kidney tissues were collected 5 h after colchicine treatment, rinsed with saline (85% NaCl), mechanically dissociated, and filtered through 100-mesh gauze. The cell suspension was centrifuged (1200 rpm, 8 min), treated with 6 mL hypotonic KCl solution (0.075 mol/L) for 50 min, and fixed three times with methanol:glacial acetic acid (3:1), with centrifugation (1200 rpm, 8 min) after each fixation. Cell suspensions were then dropped onto slides, air-dried over an alcohol lamp, stained with Giemsa for 30 min, and examined microscopically. Karyotype analysis revealed 42 chromosome pairs (2n = 84), consistent with the Hi-C assembly results, thereby confirming the genomic integrity (Fig. 3a,b).