Background & Summary

Hippophae neurocarpa (H. neurocarpa) is a member of the sea buckthorn genus (Hippophae) within the Elaeagnaceae family1. All sea buckthorn species are diploid, with a chromosome count of 2n = 24 (refs. 2,3,4). Their distribution spans temperate regions of Eurasia4. These plants are drought tolerant and resilient, and their root systems can form nitrogen-fixing nodules, which contribute to soil enhancement5. Moreover, Hippophae species demonstrate robust ecological adaptability and stress resistance, playing a vital role in ecological preservation6,7. H. neurocarpa represents one of the most recently diverged and evolutionarily advanced species within this group3,8. As a result of Quaternary climatic fluctuations, natural populations of H. neurocarpa are predominantly found in high-altitude regions of the Qinghai-Tibet Plateau9. This shrub species, endemic to the eastern Qinghai-Tibet Plateau, favors moist environments and is sensitive to variations in precipitation10. The cylindrical fruits of H. neurocarpa are rich in bioactive compounds, including vitamins A, C, E, K, and P, making them a valuable resource for green foods and traditional medicine, with considerable research potential11,12.

Previous studies have extensively investigated the morphological characteristics, ecological traits, and origins of H. neurocarpa. The chloroplast genome of H. neurocarpa has been sequenced, and comparative chloroplast genomics and phylogenetic analyses have been conducted with other species in the Hippophae genus12,13. However, its genetic and evolutionary relationships remain unclear due to the absence of a complete genome sequence. In this study, we employed multiple sequencing technologies to achieve a high-quality genome assembly of H. neurocarpa. We anticipate that this research will not only elucidate the basic characteristics of the H. neurocarpa genome but also enhance the understanding of its adaptability and medicinal properties at the molecular level. Furthermore, these data will provide a robust foundation for future molecular, ecological, and economic research on the entire Hippophae genus.

Methods

Plant materials and genome sequencing

To examine the genomic features of H. neurocarpa, we collected samples from the Tibet Autonomous Region (N 31°14′10″, E 96°36′14″) and extracted high-quality genomic DNA from the leaves using the cetyl trimethylammonium bromide (CTAB) method14. DNA libraries were constructed following the protocols of the MGIEasy Universal DNA Library Preparation Kit and submitted to GrandOmics Bioscience Co., Ltd. for high-throughput sequencing on the DNBSEQ-T7RS platform to generate second-generation sequencing data. To ensure data quality, we used fastp v0.23.2 (ref. 15) (fastp -w 8 -n 0 -l 140) to filter the raw sequencing data generated by the DNBSEQ-T7RS platform, obtaining 66.79 Gb of high-quality clean next-generation sequencing (NGS) reads (Table 1). For Oxford Nanopore Technologies (ONT) sequencing, we followed the manufacturer’s guidelines, extracting DNA with a GrandOmics kit and using the Blue Pippin and Pippin HT systems to select long DNA fragments. Following damage repair, end repair, and A-tailing, sequencing adapters were ligated, and the library concentration was measured using a Qubit® 3.0 fluorometer. Finally, sequencing was conducted on the Nanopore PromethION platform, yielding 94.90 Gb of Nanopore long reads (Table 2).

Table 1 Characteristics of NGS data for genome assembly.
Table 2 Characteristics of ONT raw data for genome assembly.

Hi-C Sequencing

Hi-C sequencing is a high-throughput chromosome conformation capture technique that enables the analysis of the three-dimensional structure of chromosomes across the entire genome16. Following established research protocols, genomic DNA was extracted from young leaves of H. neurocarpa and sequenced using the DNBSEQ-T7RS platform. High-quality Hi-C samples were prepared through processes including formaldehyde cross-linking, cell lysis, chromatin digestion, and biotin labeling. The DNA quality was assessed, and high-quality DNA was retained for the standard library construction process. PCR conditions were optimized before amplification. Prior to constructing the sequencing library, a “Hi-C fragment ligation quality control test” was performed on the amplified products to ensure data accuracy. The DNBSEQ-T7RS platform ultimately generated 73.40 Gb of high-quality Hi-C data (Table 3).

Table 3 Characteristics of Hi-C raw data for genome assembly.
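The data volumes reported above can be translated into approximate sequencing depth by dividing by a genome size. The following is a minimal sketch, assuming the 684 Mb k-mer-based estimate reported in the genome survey of this paper as the denominator; the dictionary labels and the function name are illustrative and not part of the original workflow.

```python
# Data volumes in Gb, as reported in the text for each library type.
DATA_GB = {
    "NGS (DNBSEQ-T7RS)": 66.79,
    "ONT (PromethION)": 94.90,
    "Hi-C (DNBSEQ-T7RS)": 73.40,
}

# Assumption: the 684 Mb k-mer-based genome size estimate is the denominator.
GENOME_SIZE_MB = 684.0


def fold_coverage(data_gb, genome_size_mb):
    """Approximate depth = sequenced bases / genome size (both converted to Mb)."""
    return data_gb * 1000.0 / genome_size_mb


coverage = {
    name: round(fold_coverage(gb, GENOME_SIZE_MB), 1)
    for name, gb in DATA_GB.items()
}

for name, depth in coverage.items():
    print(f"{name}: ~{depth}x")
```

Using the assembled size (682.80 Mb) instead of the k-mer estimate changes these values only marginally.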

Genome survey and evaluation

We conducted k-mer analysis (k = 21) on the clean NGS data using Jellyfish v2.3.0 (ref. 17) (jellyfish count -t 22 -C -m 19 -s 1000 -o). We then used GenomeScope v2.0 (ref. 18) (genomescope.R -i k21.histo -o k21.gs -k 21) for model fitting and visualization, estimating the genome size and heterozygosity of H. neurocarpa at 684 Mb and 1.04%, respectively (Fig. 1).

Fig. 1

K-mer distribution (K = 21) of Hippophae neurocarpa genome analyzed using GenomeScope 2.
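To illustrate what the k-mer counting step produces, the following minimal sketch counts canonical k-mers (a k-mer and its reverse complement are treated as one, which is what the -C option of jellyfish count requests) and derives the frequency histogram that GenomeScope takes as input. It is a toy pure-Python illustration, not the Jellyfish implementation; the function names and the example reads are assumptions made for demonstration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical toy reads; real input would be the clean NGS reads with k = 21.
toy_reads = ["ACGTAC", "GTACGT"]
toy_counts = count_kmers(toy_reads, k=3)
toy_histogram = kmer_histogram(toy_counts)
```

In a real histogram, the position of the main peak approximates the k-mer depth, and the total number of k-mers divided by that depth gives a first-order genome size estimate; GenomeScope refines this with a mixture model that also estimates heterozygosity.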

De Novo genome assembly

We combined Nanopore long reads, MGI short reads, and Hi-C reads to perform a de novo assembly of the H. neurocarpa genome. Initially, NextDenovo (v2.5.0) (https://github.com/Nextomics/NextDenovo) was used to construct an initial genome framework from the Nanopore data. The preliminary assembly was then polished with the clean short-read data using NextPolish (v1.4.1)19 (nextpolish2 -t 5) to improve base-level accuracy. To remove potentially redundant sequences, Purge-dups (v1.2.5)20 (purge_dups -2 -T -c) was applied to the polished contigs. Hi-C data were then validated with HiC-Pro (v3.1.0)21 (hicpro -i reads -o out -c), and manual inspection and fine-tuning were conducted using Juicebox Assembly Tools22, with the high-quality Hippophae salicifolia genome as a reference23. This curation included cluster optimization, sequence reordering, and orientation correction.

Building on this foundation, we used 3D-DNA24 (run-3ddna-pipeline.sh -r -c) to anchor the contigs to chromosomes, resulting in a final chromosome-level assembly. The total assembly size was 682.80 Mb, comprising 38 scaffolds with an N50 length of 62.17 Mb. The 12 main chromosomes encompassed over 98.0% of the genome (Table 4). We used BUSCO (v5.4.5)25 (busco -i -o -c 70 -m geno -l) to assess genome quality; the results indicated a genome completeness of 97.6%. In addition, evaluation of long terminal repeat (LTR) retrotransposons with PlantLAI26 (https://bioinformatics.um6p.ma/PlantLAI/) yielded an LTR Assembly Index (LAI) of 11.61, indicating that the genome assembly is of reference quality. To visualize the distribution of genomic elements, we used ShinyCircos v2.0 (ref. 27) (https://venyao.xyz/shinyCircos/) to generate a Circos plot illustrating the structural features of the genome (Fig. 3). Additionally, the whole-genome Hi-C heatmap generated using HiCExplorer28 (hicPlotMatrix --title --matrix --dpi --colorMap --log1p --fontsize 8 --rotationX 30 --outFileName) revealed the interchromosomal interaction patterns, providing data support for further exploration of the genome’s structure and function (Fig. 2).
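For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is computed from a list of scaffold lengths. The toy lengths are hypothetical and are not taken from the H. neurocarpa assembly; the function name is illustrative.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the cumulative sum to >= 50%.
        if 2 * running >= total:
            return length


# Hypothetical toy scaffold lengths (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)
```

Applied to the scaffold lengths of the final assembly (for example, read from a FASTA index), the same function would be expected to return the reported scaffold N50.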

Table 4 Characteristics of the Hippophae neurocarpa genome at scaffold level.
Fig. 2

Heatmap of genome-wide Hi-C data of Hippophae neurocarpa chromosomes.

Annotation of repetitive sequences

To annotate repetitive sequences in the H. neurocarpa genome, we used the Extensive de novo TE Annotator (EDTA, v2.1.2)29 (edta --genome --sensitive 1 --threads), which integrates results from multiple prediction tools. The transposable element (TE) library obtained from EDTA was further classified using TEsorter (v1.33)30 (TEsorter -db rexdb-plant -p 56 -pre), specifically to reclassify elements in the “LTR-unknown” category. We then classified these reclassified elements in more depth using DeepTE31 (deepTE_domain.py -d -s -i -d). Finally, we merged the results from the three independent TE libraries and ran RepeatMasker (v4.1.2)32 (repeatmasker -pa 14 -s -xsmall -lib), which defaults to the RepBase RepeatMasker Edition (RMB, 20181026) repeat database, to identify homologous sequences against the merged TE library.

The analysis revealed that repetitive elements comprise 56.27% of the H. neurocarpa genome, with LTRs accounting for 36.26% and terminal inverted repeats (TIRs) accounting for 12.21% (Table 5, Fig. 3). Furthermore, we used MegaLTR33 (MegaLTR.sh -A 3 -F -G -T -P results -l 100 -L 7000 -d 1000 -D 15000 -S 85 -M 20 -B rexdb -C 20 -V 0.001 -Q 80-80-80 -E rexdb -R 0.000000015 -U 5000 -X 5000 -W 1000000 -N 12 -t 104) to classify LTRs more finely into superfamilies and lineages. The LTR retrotransposons in H. neurocarpa fall into four superfamilies: Ty1-copia, Ty3-gypsy, BARE-2, and TR-GAG. Additionally, we identified 13 distinct lineages, including Ale, Tork, Athila, Ivana, Galadriel, Ikeros, SIRE, TAR, CRM, and Reina (Tables 6, 7).

Fig. 3

Circos plot illustrating the genomic landscape of Hippophae neurocarpa. (a) Gene density. (b) Repeat sequence density. (c) Copia element density. (d) Gypsy element density. (e) GC content. (f) Interspecies collinearity.

Table 5 Summary of transposable elements in Hippophae neurocarpa genome.
Table 6 Classification of LTR-RT in Hippophae neurocarpa genome.
Table 7 Classification of LTR-RT Clade in Hippophae neurocarpa genome.

In addition, we used MISA (v2.1)34 (misa.pl genome.fa) to identify simple sequence repeat (SSR) loci across the H. neurocarpa genome. The analysis identified 205,053 SSR loci across 12 sequences. All examined sequences contained multiple microsatellite loci, with mononucleotide and dinucleotide repeats being predominant, accounting for 49.1% and 37.1% of the total SSR loci, respectively. The frequency of SSRs decreased markedly with increasing repeat unit length (3–6 nucleotides). Notably, 20,473 compound microsatellites were detected, representing 10% of the total SSRs, indicating that a considerable proportion of microsatellites occur adjacent to one another (Table 8).

Table 8 Summary of SSR in Hippophae neurocarpa genome.
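As an illustration of how perfect SSRs of unit size 1 to 6 can be detected, the following minimal sketch uses regular expressions. It is not the MISA implementation. The minimum repeat thresholds (10, 6, 5, 5, 5, 5) are an assumption, since the paper does not report the MISA configuration, and compound microsatellites are not handled. The function names and the toy sequence are hypothetical.

```python
import re

# Assumed thresholds (not reported in the paper): unit size -> minimum repeats.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit
    (e.g. 'AA' or 'ATAT' are rejected so they are not double counted)."""
    n = len(motif)
    for d in range(1, n):
        if n % d == 0 and motif[:d] * (n // d) == motif:
            return False
    return True


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.
    Coordinates are 0-based with an exclusive end."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in min_repeats.items():
        # One captured unit followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                repeats = (match.end() - match.start()) // unit
                hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)


# Hypothetical toy sequence containing one mono-, one di-, and one trinucleotide SSR.
toy = "CCGT" + "A" * 12 + "CGTC" + "AT" * 7 + "GCA" + "CAG" * 5 + "TT"
toy_ssrs = find_ssrs(toy)
```

Counting the returned tuples by motif length gives the kind of per-class breakdown summarized in Table 8.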

Annotation of protein-coding gene structure

For the structural annotation of protein-coding genes, we combined three prediction approaches: de novo prediction, homology-based protein sequence alignment, and RNA-seq data analysis. First, we soft-masked repetitive sequences in the genome using RepeatMasker. We then annotated the masked genome using BRAKER (v3.0.8)35 (braker.pl --genome --bam --prot_seq --species --threads 56 --workingdir --PROTHINT_PATH --TSEBRA_PATH). The prediction results were merged using TSEBRA36 (tsebra.py --cfg --gtf --keep_gtf --hintfiles --out). Finally, we integrated the annotation files using MAKER (v3.01.04)37 (mpirun -n -R --ignore_nfs_tmp -TMP) and EVidenceModeler (v1.1.1)38 to obtain non-redundant gene models, resulting in a GFF3 file that includes the locations of genes, coding sequences, proteins, and mRNA. In total, we predicted 36,844 protein-coding genes, with gene lengths ranging from 155 to 185,811 bp.

Non-coding region annotation

Non-coding RNAs in H. neurocarpa were identified using the Infernal (v1.1.4)39 (infernal-tblout2gff.pl --cmscan --fmt2 --desc) search tool with the Rfam40 database, employing default parameters. This analysis revealed 5,917 non-coding RNAs spanning 817,623 bp, comprising 724 transfer RNAs (53,476 bp), 4,194 small nucleolar RNAs (440,956 bp), 607 ribosomal RNAs (270,544 bp), 84 spliceosomal RNAs (11,934 bp), 182 microRNAs (23,063 bp), and 126 other RNA types (17,650 bp). snoRNAs dominate, accounting for 70.9% of all loci; they are primarily involved in the modification and processing of rRNAs and tRNAs. The tRNAs and rRNAs together support the protein synthesis system, the miRNAs participate in fine-tuning gene expression through post-transcriptional regulation, and the spliceosomal RNAs are responsible for precise pre-mRNA splicing. The functions of the remaining 126 ncRNAs remain to be elucidated. Together, these ncRNAs play crucial roles in key biological processes, including transcriptional regulation, protein synthesis, and RNA processing (Table 9).

Table 9 Classification of non-coding RNA in the Hippophae neurocarpa genome.
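The per-class counts reported above can be checked for internal consistency with a few lines of arithmetic. The following sketch simply re-derives the totals and the share of loci per class from the numbers given in the text; the dictionary layout and variable names are illustrative.

```python
# (number of loci, total length in bp) per ncRNA class, as reported in the text.
NCRNA = {
    "snoRNA": (4194, 440956),
    "tRNA": (724, 53476),
    "rRNA": (607, 270544),
    "miRNA": (182, 23063),
    "spliceosomal RNA": (84, 11934),
    "other": (126, 17650),
}

total_loci = sum(loci for loci, _ in NCRNA.values())
total_bp = sum(length for _, length in NCRNA.values())

# Percentage of all ncRNA loci contributed by each class, rounded to one decimal.
share_of_loci = {
    name: round(100.0 * loci / total_loci, 1)
    for name, (loci, _) in NCRNA.items()
}

for name, pct in sorted(share_of_loci.items(), key=lambda item: -item[1]):
    print(f"{name}: {pct}% of loci")
```

The totals reproduce the 5,917 loci and 817,623 bp stated in the text, and the snoRNA share matches the reported 70.9%.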

Functional annotation

To functionally annotate the predicted genes, we conducted a homology search using BLASTP41 against multiple public databases accessible through the Baimaike website (https://international.biocloud.net/zh/user/login), with an e-value threshold of 1e−10. The databases included the non-redundant database (NR), Swissprot42, TrEMBL43, KOG, Gene Ontology (GO)44, KEGG45, and COG46. The analysis resulted in functional annotations for 89.44% of the genes. The annotation percentages for individual databases were as follows: NR (89.22%), TrEMBL (92.01%), Swissprot (61.98%), KOG (50.90%), KEGG (15.22%), GO (5.43%), and COG (30.96%) (Table 10). Finally, we assessed the annotation using the OMArk47 website (https://omark.omabrowser.org/). Among the 10,551 conserved hierarchical orthologous groups (HOGs), 96.21% were present and only 3.79% were missing, indicating a highly complete and reliable annotation (Table 11).

Table 10 Statistical analysis of the functional gene annotations of the Hippophae neurocarpa genome.
Table 11 Annotation assessment of Hippophae neurocarpa.

Furthermore, we conducted GO functional classification, KEGG pathway enrichment analysis, and gene family distribution analysis of the annotated genes. Of the annotated genes, 2,001 were assigned GO terms and 5,606 were annotated by KEGG (Table 10). The GO annotations covered the cellular component, molecular function, and biological process categories, with biological process terms most prominent (Fig. 4). KEGG analysis showed significant enrichment in Metabolism and Genetic Information Processing pathways (Fig. 5). All functionally annotated genes were classified into 2,071 gene families, with predominant enrichment in families such as RVT_2, UBN2, NAM, and Lipase_GDSL, suggesting the potential presence of active transposable elements, important intracellular protein functions, and complex metabolic pathways in H. neurocarpa.

Fig. 4

GO functional classification of annotated genes in Hippophae neurocarpa.

Fig. 5

KEGG pathway classification of annotated genes in Hippophae neurocarpa.

Data Records

The raw sequencing data are publicly available in the Genome Sequence Archive (GSA) at the National Genomics Data Center (https://ngdc.cncb.ac.cn/gsa) under accession number CRA020687 (ref. 48). The genome assembly sequences and annotation files, including Gene Ontology (GO) annotation statistics, KEGG pathway analysis results, and gene family classification statistics, have been deposited in Figshare49 and the NCBI GenBank database50.

Technical Validation

The completeness of the non-redundant draft genome was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO)24 with the embryophyta_odb10 dataset, which consists of 1,614 single-copy genes, using default parameters. Of these genes, 97.6% were complete, including 89.0% that were complete and single-copy (Table 12).

Table 12 Statistics for genome assessment using BUSCO.
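The BUSCO percentages above imply two further quantities that follow by simple arithmetic: the share of complete but duplicated genes, and the combined share of fragmented and missing genes. The sketch below derives them from the reported values; the gene count is an approximation back-calculated from a rounded percentage, and the variable names are illustrative.

```python
N_BUSCOS = 1614        # size of the embryophyta_odb10 dataset, as stated in the text
COMPLETE_PCT = 97.6    # complete BUSCOs (single-copy + duplicated)
SINGLE_PCT = 89.0      # complete and single-copy BUSCOs

# Complete BUSCOs that are not single-copy are, by definition, duplicated.
duplicated_pct = round(COMPLETE_PCT - SINGLE_PCT, 1)

# Everything that is not complete is either fragmented or missing.
fragmented_plus_missing_pct = round(100.0 - COMPLETE_PCT, 1)

# Approximate count only: derived from a percentage rounded to one decimal.
approx_complete_genes = round(N_BUSCOS * COMPLETE_PCT / 100.0)

print(f"Duplicated: {duplicated_pct}%")
print(f"Fragmented + missing: {fragmented_plus_missing_pct}%")
print(f"Complete genes (approx.): {approx_complete_genes} of {N_BUSCOS}")
```

The exact split between fragmented and missing genes cannot be recovered from the text and should be read from Table 12.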