Background & Summary

Jujube (Ziziphus jujuba Mill.), the most important cultivated species in both the genus Ziziphus Mill. and the family Rhamnaceae, is a major fruit tree native to China, renowned for its tolerance to drought, poor soil, salinity, and alkalinity. These tolerances make it increasingly important globally1. Jujube fruit is rich in sugars and vitamins and can be consumed fresh, dried, or processed into various products2,3. Additionally, jujube fruit has significant medicinal value, with polysaccharides, cyclic nucleotides, and flavones exhibiting antioxidant, anti-tumor, and immunomodulatory properties4,5,6,7. ‘Huizao’, a leading jujube variety for dried fruit with excellent fruit quality, covers approximately 210,000 hectares and produces over 3 million tons annually, accounting for nearly 30% of global jujube production. Originating from the lower reaches of the Yellow River, the mother river of China, ‘Huizao’ is now predominantly cultivated in the oases surrounding the Taklamakan Desert, the second-largest desert in the world8,9.

In 2014 and 2023, our group published the first genome sequence and the first telomere-to-telomere (T2T) genome of jujube, using second- and third-generation sequencing technology, respectively, based on the cultivar ‘Dongzao’ (Z. jujuba Mill. ‘Dongzao’)10,11. Chromosome-level genome assemblies have also been reported for the multi-use jujube cultivar ‘Junzao’ (Z. jujuba Mill. ‘Junzao’)12, the wild sour jujube (Z. jujuba var. spinosa)13, and the table cultivars ‘Lingwuchangzao’ (Z. jujuba Mill. ‘Lingwuchangzao’) and ‘Shiguang’ (Z. jujuba Mill. ‘Shiguang’)14. However, a haplotype-resolved, chromosome-level genome assembly for the dried-fruit cultivar ‘Huizao’ is still lacking.

In this study, we report a high-quality, haplotype-resolved genome of ‘Huizao’, the leading jujube cultivar for dry fruit. The genome consists of two haplotypes: Hap1 (371,219,385 bp) and Hap2 (385,424,944 bp), with contig N50 values of 12.70 Mb and 10.68 Mb, and scaffold N50 values of 30.69 Mb and 31.26 Mb, respectively. This genome provides a valuable resource for studying functional genes related to key economic traits in jujube, accelerates the application of genomics in jujube molecular breeding, and facilitates studies on genomic diversity, allele-specific expression and the evolution of the Ziziphus genus.

Methods & Results

Sample preparation

Young leaves were collected from ‘Huizao’ jujube grown at the experimental base of Hebei Agricultural University (115.43°E, 38.83°N, 79.8 m altitude). A total of 15 g of healthy young leaf tissues was sampled. The leaves were immediately frozen in liquid nitrogen for subsequent PacBio HiFi and Hi-C library preparation and sequencing (Fig. 1a).

Fig. 1

Overview of the ‘Huizao’ plant and genome estimation using PacBio HiFi reads. (a) Leaves, flowers and fruits of ‘Huizao’ jujube. (b) Estimation of genome ploidy, size, and heterozygosity using GenomeScope2.

HiFi SMRTbell library construction and sequencing

High-quality DNA was extracted using the SDS method and purified with the QIAGEN® Genomic Kit (Cat# 13343, QIAGEN). DNA purity was assessed using a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA), and integrity was verified via agarose gel electrophoresis. The PacBio HiFi SMRTbell library was prepared using the SMRTbell Express Template Prep Kit 2.0 (PacBio, CA, USA). Long DNA fragments were sheared to 15–18 kb using a g-TUBE (Covaris, MA, USA), then concentrated and purified with AMPure PB beads (PacBio, CA, USA). Size selection for SMRTbell templates greater than 15 kb was performed using BluePippin (SageScience, MA, USA) to obtain large-insert SMRTbell libraries for sequencing. After data download, MD5 checksums were generated for the files to ensure data integrity.

Hi-C library construction and sequencing

For Hi-C library construction, approximately 2 grams of fresh leaves from the ‘Huizao’ jujube cultivar were used. Sample cells were fixed with formaldehyde to crosslink DNA with proteins, as well as proteins with each other. After crosslinking, the cells were lysed, and DNA quality was evaluated through sampling. Upon confirmation of sufficient quality, Hi-C fragment preparation was initiated.

Chromatin was digested using the restriction enzyme DpnII, which recognizes the GATC motif. The primer index used was CGCTCATT. The efficiency of enzymatic digestion was assessed by sampling. Following digestion, the DNA underwent biotin labeling, blunt-end ligation, and purification. DNA quality was re-evaluated at this stage, and upon meeting quality requirements, standard library construction proceeded.
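
As an illustration of the digestion step, the following minimal Python sketch locates DpnII recognition sites (GATC) in a sequence and reports the resulting fragment lengths. It assumes cleavage immediately 5′ of the G (^GATC), as written in the HiCUP parameter used later in this work; the function name and the toy sequence are hypothetical and not part of the original analysis.

```python
import re


def dpnii_fragments(seq: str, site: str = "GATC") -> list[int]:
    """Return fragment lengths after cutting immediately before each site."""
    seq = seq.upper()
    # A lookahead pattern finds every occurrence, including overlapping ones.
    cuts = [m.start() for m in re.finditer(f"(?={site})", seq)]
    # Fragment boundaries: sequence start, each internal cut, sequence end.
    bounds = [0] + [c for c in cuts if c > 0] + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy sequence with two GATC sites (positions 2 and 10, 0-based).
toy = "AAGATCCCGGGATCTTTT"
fragments = dpnii_fragments(toy)
print(fragments)
```

The fragment lengths always sum to the input length, which gives a quick sanity check when the same idea is applied to whole contigs.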

Library construction included the removal of biotin from unligated DNA ends, ultrasonic fragmentation, end repair, A-tailing, and adapter ligation to generate sequencing-ready fragments. PCR amplification was then optimized and performed. The amplified products underwent quality control to assess enrichment for Hi-C junctions. Libraries that passed QC were sequenced on the Illumina NovaSeq platform using a paired-end 150 bp (PE150) sequencing strategy.

In total, Hi-C sequencing generated approximately 54.3 Gb of data, consisting of 181 million paired-end reads, which were used for chromosome-level genome scaffolding.
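
The reported yield can be cross-checked with simple arithmetic: 181 million read pairs at 2 × 150 bp correspond to 54.3 Gb. The short Python sketch below reproduces this calculation and, assuming the k-mer-based haploid genome size of 361.46 Mb reported in this work, derives an approximate Hi-C sequencing depth. The function name is illustrative only.

```python
def paired_end_yield_gb(read_pairs: int, read_len: int = 150) -> float:
    """Total bases (in Gb) produced by paired-end sequencing."""
    return read_pairs * 2 * read_len / 1e9


hic_yield_gb = paired_end_yield_gb(181_000_000)

# Approximate depth relative to the estimated haploid genome size (361.46 Mb).
haploid_size_bp = 361.46e6
hic_depth = hic_yield_gb * 1e9 / haploid_size_bp

print(f"Hi-C yield: {hic_yield_gb:.1f} Gb, depth: ~{hic_depth:.0f}x")
```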

Genome size and ploidy estimation

The genome size and ploidy of the ‘Huizao’ jujube were estimated using 4.8 Gb of high-quality PacBio HiFi sequencing data (Table 1). To accurately assess genome size and heterozygosity, we performed GenomeScope modeling based on a series of odd-numbered k-mer sizes (k = 17 to 31). Among these, the 17-mer model yielded the best performance for our dataset, showing the lowest model error (0.116%), clear separation between homozygous and heterozygous peaks, and a more consistent estimation of repetitive content. Consequently, k = 17 was selected as the optimal parameter for k-mer analysis in this study, using K-Mer Counter (KMC, v3.0.0)15 (Fig. S1). The resulting k-mer frequency distribution was further analyzed with GenomeScope (v2.0)16 to estimate genome size, ploidy, and heterozygosity, with the parameters “-m64 -ci1 -cs10000 -cx10000 -p 2”. The analysis indicated that ‘Huizao’ jujube is diploid, with an estimated haploid genome size of approximately 361.46 Mb and a heterozygosity rate of 1.54% (Fig. 1b).
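
The principle behind k-mer-based genome size estimation can be sketched in a few lines of Python: the total number of k-mer occurrences divided by the depth of the main coverage peak approximates the genome size. This is a deliberate simplification and not the GenomeScope model, which fits a mixture distribution and handles heterozygous and repeat peaks explicitly. The histogram values, the function name, and the `min_depth` error cutoff are hypothetical.

```python
def estimate_genome_size(hist: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Estimate genome size from a k-mer histogram.

    hist maps k-mer multiplicity -> number of distinct k-mers seen that often.
    Multiplicities below min_depth are treated as sequencing errors.
    """
    usable = {depth: count for depth, count in hist.items() if depth >= min_depth}
    # Depth of the tallest remaining peak, taken here as the coverage peak.
    peak = max(usable, key=usable.get)
    total_kmers = sum(depth * count for depth, count in usable.items())
    return peak, total_kmers / peak


# Toy histogram: error k-mers at depth 1-2, main peak at depth 20.
toy_hist = {1: 5000, 2: 800, 10: 100, 19: 400, 20: 1000, 21: 400, 40: 50}
peak_depth, size = estimate_genome_size(toy_hist)
print(peak_depth, size)
```

In a heterozygous genome the tallest peak may be the heterozygous one at half the homozygous depth, which is one reason a model-based tool is preferred for real data.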

Table 1 Statistics of genomic sequencing data.

Genome assembly

De novo assembly of PacBio HiFi reads was performed using Hifiasm (v0.19.6-r595)17, with the following parameters: -o 04-HZ -t 80 --ul-cut 20000 -D10 --hom-cov 20. Both PacBio HiFi reads and Hi-C paired-end sequencing data were used to generate the initial assembly, resulting in two haplotype-resolved contig sequences.

The preliminary assemblies of Hap1 and Hap2 were 389.01 Mb and 393.82 Mb in size, containing 161 and 123 contigs, with contig N50 values of 11.77 Mb and 10.45 Mb, respectively. To eliminate haplotypic duplications and enhance assembly quality, we applied Purge_dups (v1.2.6) (https://github.com/dfguan/purge_dups). This refinement step produced final assemblies with improved contiguity: Hap1 was 371.65 Mb in size with 47 contigs and a contig N50 of 12.70 Mb, while Hap2 measured 385.33 Mb with 49 contigs and a contig N50 of 10.68 Mb.
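
For reference, contig N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set. A minimal Python sketch of this standard definition follows; the toy lengths are hypothetical and are not taken from the ‘Huizao’ assembly.

```python
def n50_l50(lengths: list[int]) -> tuple[int, int]:
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("no sequence lengths supplied")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for rank, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, rank
    raise AssertionError("unreachable for non-empty input")


toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # total = 300
print(n50_l50(toy_lengths))
```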

Chromosome anchoring by Hi-C

To evaluate the quality of the Hi-C libraries, we conducted alignment and statistical analysis for both haplotypes (Hap1 and Hap2) using HiCUP (v0.9.2)18 with the parameter “--re1 ^GATC,DpnII”. The results demonstrated high valid-pair percentages and reasonable ratios of intra- and inter-chromosomal interactions in both datasets (Table S1), indicating that the Hi-C libraries were of high quality and suitable for downstream chromosome-level genome assembly and analysis (Fig. S2). Raw Hi-C reads were first quality-filtered using fastp (v0.21.0)19 with default parameters, resulting in 54.3 Gb of clean data, comprising 181 million paired-end reads. These reads were then aligned to the preliminary genome assembly using BWA (v0.7.19-r1273)20 with the -5SP parameter to accommodate Hi-C-specific split reads. The alignment output was processed with samblaster (v0.1.26)21 using default parameters to remove PCR duplicates. Low-quality and invalid alignments were filtered using samtools (v1.21)22 with the -F 3340 parameter. To further refine the data, we applied the filter_bam script from the HapHiC toolkit (v1.0.5)23, using the --nm 3 parameter to allow a maximum of three mismatches. The resulting filtered alignments were used for subsequent scaffolding analysis.
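
The value 3340 passed to samtools -F is a sum of SAM flag bits, and any alignment with at least one of those bits set is excluded. The Python sketch below decodes the value using the bit definitions from the SAM specification; the helper names are illustrative.

```python
# SAM flag bits as defined in the SAM format specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list[str]:
    """List the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def passes_filter(flag: int, exclude: int = 3340) -> bool:
    """Mimic `samtools view -F exclude`: keep reads with none of the bits set."""
    return (flag & exclude) == 0


print(decode_flag(3340))
```

Decoding shows that -F 3340 discards unmapped reads, reads with an unmapped mate, secondary alignments, duplicates, and supplementary alignments.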

Scaffolding was performed using the HapHiC pipeline, with the restriction enzyme set to DpnII (recognition sequence: GATC), the chromosome number specified as 12, and the --processes 5 parameter enabled. The resulting scaffold structures were manually curated and refined using JuiceBox (v1.11.08)24 to adjust chromosome boundaries, resolve misjoins, and correct structural variations such as inversions and translocations (Fig. 2a). Subsequently, the juicer post tool was used to generate the final chromosome sequences and the corresponding AGP file. To assess the quality of the chromosome-level assembly, the Hi-C contact matrix was visualized using the HapHiC plot tool.

Fig. 2

Interaction heatmap of the two haplotype genomes and synteny between haplotypes. (a) Hi-C interaction heatmaps of the two haplotypes. (b) Collinearity relationship between the two haplotypes.

Both haplotypes were successfully clustered into 12 groups and ordered according to the reference genome11. The final assemblies anchored 371.65 Mb of contigs in Hap1 and 385.33 Mb in Hap2 to the chromosomes, achieving scaffold N50 values of 30.69 Mb and 31.26 Mb, respectively, with L50 values of 6 (Table 2). The completeness of single-copy genes was assessed using BUSCO (v5.8.2)25 with the embryophyta_odb10 database using default parameters. In Hap1, 2,326 genes were identified, of which 97.6% were complete and 0.5% were partial. Similarly, Hap2 also contained 2,326 genes, with 98.4% complete and 0.6% partial (Fig. 2b). These results demonstrate the successful assembly of a high-quality, haplotype-resolved, chromosome-scale genome for the ‘Huizao’ jujube cultivar (Fig. 3).

Table 2 Genome assembly statistics of the two haplotypes of ‘Huizao’ jujube.
Fig. 3

Circular maps of the two haplotypes of ‘Huizao’ jujube. (a) Chromosome name and size. (b) Gene density. (c) GC skew. (d) GC content. (e) Repeat sequence density. (f) Collinearity of CDS genes.

PacBio HiFi reads were mapped to the genome, achieving coverage of 99.90% for Hap1 and 99.98% for Hap2. The BUSCO scores and mapping statistics confirmed the high completeness and accuracy of the assemblies (Table 2).

Genome annotation

Repetitive sequences in the ‘Huizao’ genome were annotated using both de novo and homology-based methods. A custom repeat library was built with RepeatModeler (v2.0.2a)26, RepeatScout (v1.0.6)27, and LTR_retriever (v2.9.0)28 and used by RepeatMasker (v4.1.2-p1)29 to annotate repeats in GFF format. Repetitive sequences at both the DNA and protein levels were identified by mapping to the Repbase database30 using RepeatMasker and RepeatProteinMask. Tandem repeats were annotated de novo with TRF (v4.10.0)31. In total, repetitive elements spanned 203.4 Mb (54.79%) of Hap1 and 215.3 Mb (55.87%) of Hap2, with LTRs being predominant (26.01% in Hap1, 26.77% in Hap2) (Table 3).

Table 3 Transposable element (TE) information from genome annotation.

Protein-coding gene prediction was performed through a combination of de novo, homology-based, and transcriptome-based approaches. RNA-seq reads from leaf tissue were quality controlled and aligned to the assembled genome using STAR (v2.7.9a)32, followed by transcript assembly with StringTie (v2.1.7b)33 and structural annotation via PASA (v2.5.3)34. Protein sequences from six representative species35 (Malus domestica, Arabidopsis thaliana, Ziziphus jujuba, Prunus armeniaca, Populus, and Prunus persica) were retrieved from public NCBI databases and annotated with GeMoMa (v1.9)36. De novo gene prediction was performed using Augustus (v3.5.0)37.

The results were integrated using EVM (v2.1.0)38 with the parameters “--segmentSize 100000 --overlapSize 10000”, resulting in 32,065 protein-coding genes in Hap1 and 33,004 in Hap2. Functional annotation was carried out using InterProScan (v5.57–90.0)39 and eggNOG-mapper (v2.1.8)40, with data from TrEMBL, Swiss-Prot, InterPro, the NCBI Non-Redundant Protein Database (nr), eukaryotic orthologous groups, and Gene Ontology for comprehensive functional classification (Table 4). Except for EVM (v2.1.0), all software was used with default parameters.

Table 4 Assembly metrics of the two haplotypes of ‘Huizao’.

Genome collinearity analysis

MCScan (v1.0)41 was used with default parameters to examine the collinearity between the two haplotype genomes of ‘Huizao’ jujube, with plots generated using the option ‘--minspan=30’. A total of 50 collinear blocks were identified, encompassing 25,826 gene pairs. Of these, 78.67% of the genes were from Hap1 and 76.65% from Hap2. The genome collinearity analysis demonstrated a high degree of synteny between the two haplotype genomes (Fig. 2b).

Structural variation detection

Intra-species structural variations between the two haplotype genomes were identified using the SyRI (v1.7.0)42 pipeline. Minimap2 (v2.28)43 was used to align the two haplotype genomes with the parameters “--eqx -ax asm5 -c --secondary=no”. The resulting SAM files were converted to BAM format, sorted, and analyzed with SyRI using default settings. The identified variations were classified into two categories: genomic rearrangements and sequence variations. Seven types of structural variation sites were detected, including 329 collinear regions, 48 inversions, 333 translocations, 182,766 insertions, and 182,368 deletions (Fig. 4a).
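
Per-type counts such as those above can be tallied from SyRI's tab-separated output. The Python sketch below assumes the documented `syri.out` layout, in which the 11th column holds the annotation type (for example SYN, INV, TRANS, INS, DEL); this column position should be checked against the SyRI version in use. The toy records and the function name are hypothetical.

```python
import csv
import io
from collections import Counter


def count_syri_types(handle, type_col: int = 10) -> Counter:
    """Count records per annotation type in a SyRI-style TSV.

    type_col is the 0-based index of the annotation-type column
    (assumed to be the 11th column of syri.out).
    """
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) > type_col:
            counts[row[type_col]] += 1
    return counts


def toy_record(uid: str, kind: str) -> str:
    """Build a 12-column placeholder record; only the type column matters here."""
    fields = ["Chr1", "1", "100", "-", "-", "Chr1", "1", "100", uid, "-", kind, "-"]
    return "\t".join(fields)


toy_text = "\n".join(
    toy_record(uid, kind)
    for uid, kind in [("SYN1", "SYN"), ("INV1", "INV"), ("INS1", "INS"),
                      ("INS2", "INS"), ("DEL1", "DEL")]
)
toy_counts = count_syri_types(io.StringIO(toy_text))
print(dict(toy_counts))
```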

Fig. 4

Comparative analysis. (a) Structural variations between the two haplotype genomes of ‘Huizao’. (b) Collinearity and structural variations between the two haplotypes of ‘Huizao’ and the reference genome of ‘Dongzao’.

Data Records

The genome assembly and associated raw sequencing data are available at the National Genomics Data Center (NGDC) under GSA accession numbers CRA02191344 and CRA02194745, with BioProject number PRJCA036471. The haplotype genomes of ‘Huizao’ jujube have been uploaded to the GWH database, with the assembly number GWHFIKR00000000.1 for Hap1 and GWHFIKS00000000.146 for Hap2. The annotation files have been deposited in Figshare47. In addition, the raw data have also been deposited in the National Center for Biotechnology Information (NCBI) under BioProject accession number PRJCA036471, with the sequencing data available in the SRA48 and the genome assembly in GenBank49,50.

Technical Validation

The completeness of the genome was assessed from both the assembled genome sequence and the annotated protein sequence perspectives. For genome sequence validation, we compared the two haplotype assemblies with the published T2T genome assembly of ‘Dongzao’ jujube using MUMmer (v4.0.0beta2)51 to evaluate collinearity and identify differences (Fig. 4b). Coverage was calculated using a custom Python script, yielding 99.0% for haplotype 1 and 99.8% for haplotype 2 (Table 2). Various assembly metrics, including contig N50, scaffold N50, and GC content, were also computed to assess the quality of the assembled genomes. Combined with the BUSCO results, both haplotype genomes exhibited high completeness.
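
One common way to compute alignment coverage is to merge the aligned intervals on the reference and divide the merged length by the reference length. The Python sketch below illustrates that approach under the assumption of half-open (start, end) coordinates. It is not the custom script used in this study, whose details are not given here, and all names and values are hypothetical.

```python
def covered_fraction(intervals: list[tuple[int, int]], genome_length: int) -> float:
    """Fraction of a sequence covered by a set of half-open [start, end) intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


toy_alignments = [(0, 40), (30, 70), (80, 100)]
print(covered_fraction(toy_alignments, 100))
```

Merging before summing prevents overlapping alignments from being counted twice, which would otherwise inflate the coverage value.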

Additionally, MUMmer (v4.0.0beta2) was used to compare the ‘Huizao’ haplotypes with the T2T genome assemblies of ‘Junzao’52 and ‘Dongzao’ jujube as reference genomes. The alignment was performed using nucmer with parameters (-l 100 -c 100). The resulting files were processed with delta-filter using parameters (-1 -i 98 -l 500), and the plots were generated with mummerplot (Fig. S3). These comparisons confirmed the high quality and completeness of the ‘Huizao’ genome assemblies.