Background & Summary

The genus Hippophae (family Elaeagnaceae) is broadly distributed across the Qinghai–Tibet Plateau and its adjacent regions. Members of this genus exhibit remarkable adaptability to harsh environments, including drought, cold, salinity, and nutrient-poor soils¹,². Furthermore, Hippophae species can form nitrogen-fixing root nodules with Frankia bacteria²,³, thereby improving soil conditions. Through clonal propagation via root suckers, they can establish stable vegetation communities and effectively mitigate soil erosion. Moreover, their fruits provide a vital food source for various wild animals, thereby contributing to the maintenance of ecosystem diversity. Beyond their ecological significance, these species hold substantial potential for applications in the food, pharmaceutical, and cosmetic industries. Their fruits and leaves are particularly rich in vitamin C, flavonoids, essential fatty acids, and various secondary metabolites, which exhibit potential antioxidant, anti-inflammatory, antibacterial, and cardiovascular protective activities¹,²,⁴.

Hippophae salicifolia D. Don, a dioecious deciduous tree named for its willow-like leaves⁵, is primarily distributed along riverbanks, slopes, and shrublands on the southern slopes of the Himalayas, including southeastern Tibet in China, as well as regions in Nepal, Bhutan, and northern India⁶. Previous studies have indicated that Hippophae species possess an XY sex determination system (2n = 24)⁷,⁸,⁹,¹⁰. Elucidating the mechanisms underlying sex determination and dioecy in Hippophae is crucial for gaining deeper insight into its adaptive evolution in the unique environment of the Qinghai–Tibet Plateau. However, research on sex-related differences in dioecious Hippophae species remains notably limited. Most previous studies have focused on chromosome-level genomes¹¹,¹²,¹³,¹⁴,¹⁵,¹⁶, chloroplast genomes¹⁷,¹⁸,¹⁹, mitochondrial genomes²⁰,²¹, sex-specific molecular markers⁷,⁹,¹⁰,²²,²³, fruit nutrient content²⁴,²⁵, and transcriptomes²⁶,²⁷. With advancements in third-generation sequencing technologies (e.g., PacBio HiFi), which offer higher accuracy and throughput at reduced costs²⁸,²⁹, it is now feasible to generate high-quality reference genomes for non-model species such as H. salicifolia, thereby facilitating comprehensive investigations into sex determination and adaptive evolution.

To date, chromosome-level genomes have been assembled for several Hippophae species, including H. rhamnoides¹¹,¹²,¹³, H. tibetana¹⁵,¹⁶, and H. gyantsensis¹⁴. These genomic resources have substantially advanced our understanding of gene diversity, phylogenetics, adaptive evolution, and sex determination mechanisms within the genus. Nonetheless, genomic research on H. salicifolia remains scarce, hindering systematic insights into its genetic background, evolutionary status, and sex determination processes.

In the present study, we employed PacBio HiFi long-read sequencing, Illumina short-read sequencing, and Hi-C data to generate a high-quality, chromosome-level reference genome of H. salicifolia. The assembled genome is approximately 1.11 Gb in size, with a scaffold N50 of 95.29 Mb (Table 1), and 99.94% of the sequences were successfully anchored to 12 putative chromosomes. A total of 42,547 genes were predicted, with repetitive elements accounting for approximately 45.25% of the genome. This is the first high-quality reference genome for H. salicifolia, and it provides crucial data for elucidating sex determination mechanisms, genome evolution, and adaptive evolution within the genus. Moreover, it lays a robust scientific foundation for future research in genetic improvement, resource conservation, and sustainable utilization of Hippophae species.

Table 1 Statistics of H. salicifolia genome assembly and annotation.

Methods

Sample collection and sequencing

In June 2024, samples of a female individual (XX) of H. salicifolia were identified and collected by La Qiong and Junwei Wang in the Lebu Valley Scenic Area of Cuona County, Shannan City, Tibet (27°55′25.98″ N, 91°48′49.62″ E) (Fig. 1a). A voucher specimen (No. LQ20240672) was deposited in the Herbarium of the College of Ecology and Environment, Xizang University. To ensure the acquisition of high-quality DNA and RNA, fresh young leaves and bark tissues were collected and promptly flash-frozen in liquid nitrogen on-site, followed by storage at −80 °C in the laboratory. Genomic DNA was extracted using a modified CTAB method³⁰, and its purity and concentration were measured using a NanoDrop 2000 (Thermo Fisher Scientific, USA) and Qubit 2.0 (Invitrogen, USA). Integrity was assessed by 1% agarose gel electrophoresis. Only DNA samples that satisfied the quality criteria were selected for subsequent library construction and sequencing.

Fig. 1 Genome survey of H. salicifolia. (a) Photograph of the H. salicifolia plant. (b) K-mer (k = 21)-based genome size estimation. The blue region depicts the observed 21-mer frequency distribution; the black curve illustrates the fitted model, and yellow and red portions correspond to unique and erroneous K-mer distributions, respectively.

High-quality genomic DNA was sheared into fragments of approximately 350 bp to construct Illumina sequencing libraries used for an initial genome survey. Library quality was verified using an Agilent 2100 Bioanalyzer and quantitative PCR. Paired-end sequencing (150 bp) was conducted on the Illumina NovaSeq 6000 platform, yielding approximately 62.78 Gb of high-quality short-read data. These short reads were primarily utilized to estimate genome size, GC content, and heterozygosity (Table 2). To obtain a high-accuracy de novo genome assembly, PacBio Sequel II was employed for HiFi sequencing. DNA was sheared into large fragments of 15–20 kb, and the SMRTbell Express Template Prep Kit 2.0 was applied to construct HiFi libraries. Approximately 31.80 Gb of HiFi data were obtained, with an average read length of 15.94 kb and a sequencing depth of about 28.69× (Table 2).
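The reported sequencing depth follows from dividing the total sequenced bases by the genome size. The short Python sketch below illustrates that arithmetic using the rounded values quoted in the text (31.80 Gb of HiFi data and a 1.11 Gb assembly); the function name is illustrative, and because the inputs are rounded the result is expected to differ slightly from the 28.69× given in Table 2.

```python
def sequencing_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Average depth of coverage = total sequenced bases / genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Rounded values taken from the text; the assembly length (1.11 Gb) is used
# as the genome size, which is an assumption of this sketch.
ASSEMBLY_SIZE_GB = 1.11
hifi_depth = sequencing_depth(31.80, ASSEMBLY_SIZE_GB)
illumina_depth = sequencing_depth(62.78, ASSEMBLY_SIZE_GB)

print(f"HiFi depth:     {hifi_depth:.2f}x")
print(f"Illumina depth: {illumina_depth:.2f}x")
```

The same function applies to the Hi-C and Illumina datasets when their yields are substituted.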

Table 2 DNA sequencing statistics.

To achieve a chromosome-level assembly, high-throughput chromosome conformation capture (Hi-C) technology was employed. Fresh leaf tissues were fixed in formaldehyde and digested with the restriction enzyme MboI, then subjected to ligation and purification to construct Hi-C libraries. Libraries that passed quality checks underwent sequencing on the Illumina NovaSeq 6000 platform (150 bp paired-end), generating approximately 102.43 Gb of Hi-C data (Table 2).

In addition, to support gene structure annotation and functional analysis, total RNA was extracted from leaves and bark of H. salicifolia to construct transcriptome libraries. The resulting libraries were sequenced on the Illumina NovaSeq 6000 platform using 150 bp paired-end reads, generating approximately 8 Gb of transcriptome data per sample (Table 3). These transcriptomic resources provide valuable data for subsequent gene prediction and functional annotation.

Table 3 RNA sequencing statistics.

Genome size estimation and survey analysis

Using high-quality Illumina short-read data, a k-mer analysis was performed to estimate the genome size, heterozygosity, and proportion of repetitive sequences in H. salicifolia. First, raw reads were processed with fastp v0.20.0³¹ to remove adapter contamination and low-quality reads. Next, Jellyfish v2.2.10³² was utilized to count the distribution of 21-mer frequencies, and GenomeScope v2.0³³ was employed to estimate genome characteristics. These analyses indicated that the H. salicifolia genome size is approximately 1.05 Gb, with a heterozygosity of about 0.49% (Fig. 1b).
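The principle behind k-mer-based genome size estimation can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the method used by Jellyfish or GenomeScope: it counts k-mers without canonicalizing reverse complements, and it estimates genome size naively as the total number of non-error k-mers divided by the peak multiplicity, whereas GenomeScope fits a statistical mixture model. The function names, the error cutoff, and the toy histogram are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (no reverse-complement merging)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_multiplicity=2):
    """Naive estimate: total solid k-mer occurrences / peak multiplicity.

    K-mers below min_multiplicity are treated as sequencing errors and ignored.
    """
    solid = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers above the error cutoff")
    peak = max(solid, key=solid.get)            # most common multiplicity
    total = sum(m * n for m, n in solid.items())
    return total / peak


# Toy histogram (hypothetical numbers): an error spike at multiplicity 1
# and a coverage peak at multiplicity 20.
toy_histogram = {1: 500, 19: 100, 20: 1000, 21: 100}
print(estimate_genome_size(toy_histogram))
```

In practice the histogram would come from a k-mer counter run on the quality-controlled reads, and heterozygosity and repeat content require a model-based fit rather than this single-peak approximation.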

Genome assembly

The H. salicifolia genome was de novo assembled using high-quality HiFi long-read data with hifiasm v0.19.9³⁴. Hifiasm was executed with default parameters to fully exploit the high accuracy and long fragment lengths of the HiFi data, resulting in a highly continuous initial assembly. Subsequently, Illumina short-read data were integrated for assembly polishing. Specifically, BWA-MEM v2.21³⁵ was applied to map quality-controlled Illumina reads back to the initial assembly, followed by two rounds of error correction with Pilon v1.24³⁶ ("--fix all") to address potential base errors and fill small gaps, thereby enhancing the assembly's accuracy and completeness.

To achieve a chromosome-level genome assembly, Hi-C data were integrated to further optimize the polished assembly. First, Juicer v2.0³⁷ was employed to preprocess and map the Hi-C reads to the corrected genome. Then, 3D-DNA³⁸ was executed with default parameters to scaffold the assembly, followed by manual inspection and adjustment in Juicebox v2.15.07³⁹ to ensure accuracy and integrity at the chromosome scale. Based on previously published H. rhamnoides genome information¹⁰, the chromosomes were designated as Chr01 through Chr12. The final chromosome-level assembly has a total length of 1.11 Gb, with a scaffold N50 of 95.29 Mb, a contig N50 of 38.57 Mb, and an L50 of 5, containing a total of 210 gaps and exhibiting a GC content of 29.76% (Table 1). Approximately 99.94% of the assembled sequences were successfully anchored onto 12 putative chromosomes (Table 4). The Hi-C interaction matrix revealed clear intra-chromosomal signals along the diagonal (Fig. 2), indicating a high level of continuity and accuracy, thus providing a robust foundation for subsequent gene annotation and functional analyses.
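For readers less familiar with the contiguity metrics quoted above, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly, and L50 is the number of sequences in that set. The Python sketch below computes both from a list of sequence lengths; the function name and the toy lengths are illustrative and are not taken from the H. salicifolia assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50; 'rank' sequences were needed, so it is the L50.
            return length, rank


# Toy example with hypothetical lengths (arbitrary units).
toy_lengths = [10, 8, 6, 4, 2]
n50, l50 = n50_l50(toy_lengths)
print(f"N50 = {n50}, L50 = {l50}")
```

Applied to scaffold lengths, the function yields the scaffold N50; applied to contig lengths (scaffolds split at gaps), it yields the contig N50.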

Table 4 Summary of the 12 pseudochromosomes.

Fig. 2 Heatmap of genome-wide Hi-C data for H. salicifolia. Hi-C interaction frequencies are depicted by colors ranging from orange (low frequency) to dark red (high frequency).

Repeat annotation

To comprehensively characterize the repetitive elements in the H. salicifolia genome, de novo repeat prediction was conducted using RepeatModeler v2.0.1⁴⁰ (http://www.repeatmasker.org/RepeatModeler/) to establish a species-specific repeat library. This customized library was subsequently merged with the RepBase database⁴¹ (v20181026, http://www.girinst.org/repbase). The combined library served as input to RepeatMasker v4.1.0⁴² (http://www.repeatmasker.org) for the identification and masking of repetitive sequences. RepeatMasker identifies and annotates repetitive elements by comparing sequences to a curated transposable element (TE) library, which defines the classification based on sequence similarity and structural features, including major types such as LTR, LINE, SINE, and DNA transposons.

The annotation indicated that the total length of repetitive sequences was approximately 501.59 Mb, representing 45.25% of the entire genome (Table 5). Among these, LTR retrotransposons comprised the highest proportion (32.54%, ~360.72 Mb), while DNA transposons (4.01%, ~44.50 Mb) and LINE elements (0.45%, ~4.97 Mb) were also present (Table 5). This comprehensive repetitive element profile provides a valuable foundation for investigating genome evolution, structural variation, and gene regulatory mechanisms.

Table 5 Summary of the repetitive sequences in H. salicifolia genome assembly.

Protein-coding gene prediction and functional annotation

To obtain a high-quality set of protein-coding genes, three complementary strategies were integrated: homology-based prediction, transcriptome-based prediction, and ab initio prediction. First, for homology-based prediction, protein sequences from sequenced Hippophae species (e.g., H. rhamnoides, H. tibetana, and H. gyantsensis) were aligned to the H. salicifolia genome using GeMoMa v1.9⁴³ to identify potential orthologous genes. Second, for transcriptome-based prediction, RNA-seq data from leaf and bark tissues were assembled into transcripts using StringTie v2.1.3⁴⁴, and coding regions were predicted using TransDecoder v5.1.0 (https://github.com/TransDecoder/TransDecoder). Lastly, ab initio predictions were performed with AUGUSTUS v3.3.3⁴⁵ (https://github.com/Gaius-Augustus/Augustus), GlimmerHMM v3.0.4⁴⁶, and GeneMark-ES v4.38⁴⁷. Species-specific parameters were applied to improve prediction accuracy. All these predictions were integrated using EVidenceModeler (EVM) v1.1.1⁴⁸ to generate a high-confidence gene set, and PASA v2.5.2⁴⁹ (https://github.com/PASApipeline/PASApipeline) was subsequently employed for further refinement, adding UTR information and identifying novel transcripts. In total, 42,547 protein-coding genes were predicted, with an average gene length of 3,663 bp and an average of 4.72 exons per gene (Table 6). TBtools v2.126⁵⁰ was utilized to visualize gene density, GC content, Gypsy and Copia element densities, and chromosomal synteny of the 12 chromosomes (Fig. 3).

Table 6 Summary of predicted protein-coding genes in H. salicifolia genome assembly.

Fig. 3 Genomic features of H. salicifolia. From outer to inner circles: 12 chromosomes (Chr01–Chr12), GC content, gene positions, gene density, and syntenic gene blocks represented by connecting lines within the genome.

To gain comprehensive insights into gene functions, functional annotation was performed by conducting BLASTp⁵¹ searches (E-value ≤ 1e-5) against multiple databases, including the Kyoto Encyclopedia of Genes and Genomes (KEGG)⁵², euKaryotic Orthologous Groups (KOG), the National Center for Biotechnology Information Non-Redundant database (NCBI-NR, https://www.ncbi.nlm.nih.gov/), Gene Ontology (GO)⁵³, Clusters of Orthologous Groups (COG)⁵⁴, and SwissProt⁵⁵. Additionally, HMMER v3.2.1 was employed, along with the Pfam database, to predict protein domains. Approximately 85.06% of the genes (36,189 genes) were functionally annotated in at least one database, thus providing a solid basis for subsequent functional genomics research (Table 7).

Table 7 Annotation results of functional genes in H. salicifolia.

Non-Coding RNA annotation

To identify non-coding RNAs (ncRNAs) in the H. salicifolia genome, multiple tools and databases were integrated. A total of 748 tRNA genes were detected using tRNAscan-SE v2.0.0 (http://lowelab.ucsc.edu/tRNAscan-SE/)⁵⁶ (Table 8). Using the Rfam v14.2 database⁵⁷ and Infernal v1.1.3⁵⁸, 5,924 rRNA genes, 196 miRNA genes, and 5,950 snRNA genes were identified. These ncRNA annotations will facilitate studies on transcriptional regulation and the functional mechanisms underlying the H. salicifolia genome.

Table 8 Annotation of Non-Coding RNAs in H. salicifolia.

Genome-wide synteny analysis

The Python version of MCScan implemented in JCVI v1.2.7.52⁵⁹ (default parameters) was employed to examine genomic synteny between H. salicifolia and its close relatives. The resulting synteny maps support the assessment of structural accuracy and completeness of the assembled genome through comparative analysis (Fig. 4).

Fig. 4 Genomic synteny relationships between H. salicifolia and its closely related species (H. rhamnoides and H. tibetana). Chromosomes of each species are highlighted in different colors, and the gray lines represent syntenic relationships between genomes.

Data Records

The raw sequencing data have been deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) under accession number SRP548553⁶⁰, including PacBio HiFi reads, Illumina PE150 reads, Hi-C reads, and RNA-seq data from various tissues. The final chromosome-scale assembled genome has been deposited in the NCBI GenBank under accession number JBJWFA000000000⁶¹. In addition, the genome assembly and annotation files have been stored in the Figshare database⁶².

Technical Validation

Several strategies were employed to assess the quality of the genome. The completeness of the non-redundant draft genome was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO v5.4.5)⁶³ with the embryophyta_odb10 dataset (1,614 single-copy genes) under default parameters. At the assembly level, BUSCO analysis showed that 98.8% of BUSCO genes were complete (89.8% single-copy, 9.0% duplicated), with only 0.4% fragmented and 0.8% missing (Table 9). At the annotation level, 99.3% of BUSCO genes in the predicted protein-coding gene set were complete (70.6% single-copy, 28.7% duplicated), with only 0.1% fragmented and 0.6% missing. These findings indicate that both the assembly and annotation are highly complete, meeting the standards of a high-quality reference genome.
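BUSCO reports its results as a one-line summary of the form C:…%[S:…%,D:…%],F:…%,M:…%,n:…, in which single-copy and duplicated percentages sum to the complete percentage, and complete, fragmented, and missing sum to 100%. The Python sketch below parses such a line and checks that internal consistency. The two summary strings are reconstructed here from the percentages reported in the text rather than copied from the original BUSCO output files, and the function names and tolerance are assumptions of this example.

```python
import re

# Pattern for the BUSCO one-line summary notation.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Extract the C/S/D/F/M percentages and n from a BUSCO summary line."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    summary = {key: float(match.group(key)) for key in "CSDFM"}
    summary["n"] = int(match.group("n"))
    return summary


def is_consistent(summary, tolerance=0.15):
    """Check S + D == C and C + F + M == 100, allowing for rounding."""
    complete_ok = abs(summary["S"] + summary["D"] - summary["C"]) <= tolerance
    total_ok = abs(summary["C"] + summary["F"] + summary["M"] - 100.0) <= tolerance
    return complete_ok and total_ok


# Strings reconstructed from the values reported in the text (assumption).
assembly = parse_busco_summary("C:98.8%[S:89.8%,D:9.0%],F:0.4%,M:0.8%,n:1614")
annotation = parse_busco_summary("C:99.3%[S:70.6%,D:28.7%],F:0.1%,M:0.6%,n:1614")

for label, summary in (("assembly", assembly), ("annotation", annotation)):
    print(label, summary, "consistent:", is_consistent(summary))
```

A tolerance slightly above 0.1 is used because BUSCO rounds each category to one decimal place independently.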

Table 9 BUSCO assessment results for H. salicifolia genome assembly and annotation.

To validate the accuracy and completeness of the assembly, both Illumina short reads and PacBio HiFi long reads were mapped back to the assembled reference genome. Using BWA-MEM v2.21³⁵, 97.82% of Illumina reads aligned successfully. With Minimap2 v2.17⁶⁴, the mapping rate of PacBio HiFi reads reached 98.53%. These high mapping rates reinforce the accuracy and completeness of the assembled genome.

We further assessed the base-level consensus accuracy of the assembled genome using Merqury v1.3⁶⁵, based on 21-mers derived from Illumina short reads. Merqury calculated the consensus quality value (QV) for each chromosome individually. The QV scores ranged from 36.22 to 38.80 across all 12 chromosomes, corresponding to base-level error rates between 1.32 × 10⁻⁴ and 2.39 × 10⁻⁴, or approximately one error per 4,200 to 7,600 bases (Table 10). These results demonstrate high base-level accuracy in the genome assembly.
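The relationship between a consensus QV and the per-base error rate is the standard Phred scaling, QV = −10 · log10(P), where P is the probability that a base is wrong. The Python sketch below applies this conversion to the lowest and highest chromosome QVs reported above; the function names are illustrative, and the reciprocal of the error rate gives the expected number of bases per error.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Per-base error probability -> Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# Lowest and highest per-chromosome QVs quoted in the text.
for qv in (36.22, 38.80):
    rate = qv_to_error_rate(qv)
    print(f"QV {qv:.2f}: error rate {rate:.2e}, one error per {1 / rate:,.0f} bases")
```

Because the scale is logarithmic, each increase of 10 in QV corresponds to a tenfold reduction in the error rate.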

Table 10 Chromosome-level QV and estimated error rates of the H. salicifolia genome assembly.

In addition, a k-mer spectra-cn plot was generated (Fig. 5) to visualize the multiplicity distribution of 21-mers from Illumina reads compared to their presence in the assembly. The plot exhibited a dominant peak centered around multiplicity 50, with very few low-frequency or missing k-mers. This indicates that the vast majority of high-quality k-mers from the raw reads were accurately incorporated into the assembly, further supporting its high completeness and low redundancy.

Fig. 5 K-mer spectra-cn plot generated by Merqury showing the distribution of 21-mers from Illumina reads according to their copy number in the assembly. The dominant red peak (~multiplicity 50) represents k-mers that appear once in the assembly, indicating well-represented unique sequences. Gray areas (read-only) correspond to k-mers present in the reads but absent in the assembly, reflecting sequencing errors or unassembled regions. The low abundance of blue (copy = 2) and other multi-copy k-mers suggests low redundancy and high assembly accuracy.