Background & Summary

Chinese cherry (Prunus pseudocerasus (Lindl.)) belongs to the family Rosaceae, genus Prunus, and subgenus Cerasus. It originates from Southwest China and is distributed in the temperate zone of the Northern Hemisphere1. Chinese cherry has been cultivated for more than 3000 years2. Most Chinese cherries are tetraploid, with a main karyotype formula of 2n = 4x = 32 = 28m + 4sm3. Karyotype analysis and rDNA distribution have shown that the Chinese cherry is more likely an autopolyploid than an allopolyploid4, a conclusion further supported by phylogenetic and comparative genomic analyses5.

Chinese cherry fruit is rich in nutrients and trace elements, such as proteins, carotene, vitamin C, saccharides, iron, and phosphorus1. Among 60 representative accessions, the soluble solids content ranged from 10.97% to 34.00%, and about 70% of these accessions were high-yielding3. In addition, the flowers, leaves, roots, bark, and core of Chinese cherry are of high medicinal value. Chinese cherries show good grafting affinity, well-developed roots, and tolerance to soil salinity; thus, they have also been used as rootstocks for sweet cherry6.

‘Duiying’, a Chinese cherry variety local to Beijing, is distributed in valleys and on slopes. It performs better than sweet cherry because of its resistance to leaf spot and crown gall diseases and its adaptability to the Chinese soil and climate7. By crossing ‘Duiying’ with sweet cherry and sour cherry, a series of sweet cherry rootstocks has been released that are resistant to crown gall and leaf spot diseases8. ‘Duiying’ therefore has great potential as a donor of resistance genes for sweet or sour cherry. However, the genomic features that underlie these important biological characteristics remain unclear. Several draft or high-quality genomes have been assembled and released for sweet cherry varieties (2n = 2x = 16)9,10,11,12,13,14, whereas no high-quality reference genome is available for Chinese cherry ‘Duiying’ to date.

To understand the genetic and molecular basis of Chinese cherry and to promote genomics-assisted breeding studies in cherry and Prunus crops, we present a high-quality chromosome-level genome assembly for Chinese cherry ‘Duiying’. The assembly was obtained using Illumina, Pacific Biosciences (PacBio) high-fidelity (HiFi), and Bionano data combined with 10× Genomics and high-throughput chromosome conformation capture (Hi-C) technologies. The genome sequence of P. pseudocerasus ‘Duiying’ reported here will be a valuable resource for genetic studies and breeding programs in cherry, both for exploring genome evolution and functional genomics in Rosaceae and Prunus and for mining the genes that underlie its favorable traits.

Methods

Sampling and whole genome sequencing

Leaf samples of ‘Duiying’ were collected from the cherry orchard of the Institute of Pomology and Forestry, Beijing Academy of Agriculture and Forestry Sciences, in Tongzhou District, Beijing. Genomic DNA of ‘Duiying’ was extracted from leaf samples using a plant genomic DNA extraction kit (TIANGEN, Beijing, China). The quality and quantity of the extracted DNA were assessed using NanoDrop 2000 (Thermo Fisher Scientific, Boston, MA, USA).

For Illumina paired-end sequencing, 1.5 μg of genomic DNA was used to construct a 350-bp DNA library using an Illumina TruSeq® Nano DNA library preparation kit (Illumina, San Diego, CA, USA). The refined library was subsequently sequenced using the Illumina Novaseq 6000 platform (Illumina, San Diego, CA, USA), generating 42.76 Gb of raw sequences. Fastp software (v0.23.4)15 was employed to filter out low-quality paired reads. The remaining 42.68 Gb (99.81%) of high-quality data, with 97.44% and 93.88% of the bases having a quality score of ≥Q20 and ≥Q30, respectively, was utilized for genome survey and assessment.
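The Q20 and Q30 percentages quoted above are simple per-base tallies over the FASTQ quality strings. The following minimal sketch shows how such fractions can be computed, assuming Phred+33 encoding; the function name and the toy quality strings are hypothetical and are not taken from the ‘Duiying’ data or from fastp itself. The last lines recompute the retained-data percentage from the two totals reported in the text.

```python
def quality_fractions(quality_strings, offset=33):
    """Return the fractions of bases with Phred score >= 20 and >= 30.

    Assumes Phred+33 encoded quality strings (an assumption, not stated
    in the text).
    """
    total = q20 = q30 = 0
    for qual in quality_strings:
        for ch in qual:
            score = ord(ch) - offset
            total += 1
            if score >= 20:
                q20 += 1
            if score >= 30:
                q30 += 1
    if total == 0:
        return 0.0, 0.0
    return q20 / total, q30 / total


# Toy quality strings (hypothetical): 'I' = Q40, '5' = Q20, '#' = Q2.
toy_quals = ["IIII5555", "II##"]
frac_q20, frac_q30 = quality_fractions(toy_quals)

# Retained fraction recomputed from the totals reported in the text (Gb).
raw_gb, clean_gb = 42.76, 42.68
retained_percent = 100 * clean_gb / raw_gb
```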

For long-read sequencing, a 40-kb SMRTbell library was constructed based on the PacBio protocol. PacBio polymerase reads were obtained using the PacBio Sequel II System (PacBio, Menlo Park, CA, USA) in circular consensus sequencing (CCS) mode. After the adapter sequences were removed from the raw polymerase reads, we derived subreads, with the parameter set to ‘Filtering subreads by minimum length = 50’. We then used ccs software (https://github.com/PacificBiosciences/ccs) to generate HiFi reads, using the ‘min-passes = 3 and min-rq = 0.99’ parameters. This process yielded 39.21 Gb of HiFi data, with a read N50 of 15,530 bp, which was then used for genome assembly (Table 1).
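N50 values are reported in this work for reads, optical maps, contigs, and scaffolds. As a reminder of what the statistic measures, the following minimal sketch computes N50 from a list of sequence lengths; the function name and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy lengths (hypothetical): total 20, so half is 10.
# Sorted descending: 6, 5, 4, 3, 2; the running sum reaches 11 at length 5.
toy_n50 = n50([2, 3, 4, 5, 6])
```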

Table 1 Summary of sequencing data for Chinese cherry ‘Duiying’ (Prunus pseudocerasus).

To generate Bionano optical mapping data, the Bionano official extraction kit16 was initially used to isolate long fragment molecules exceeding 150 Kb in length from high-quality DNA. Then, a single-enzyme cutting technique was applied with the DLE-1 (CTTAAG) endonuclease for digestion. Following standard Bionano protocols, the DNA molecules were labeled and subsequently imaged using the Bionano Irys system (Bionano Genomics, San Diego, CA, USA). The raw imaging data were transformed into BNX files, with the basic labeling and DNA length information converted via AutoDetect in the Bionano Solve package (v3.5.1) (https://bionanogenomics.com/support/software-downloads/). Following filtration based on molecule length and label density, we successfully produced optical mapping data for ‘Duiying’. We generated 584.546 Gb of data, with an average label density of 22.64 per 100 Kb and an N50 value of 366.4 Kb (Table 1).
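Label density is expressed as the number of DLE-1 recognition sites per 100 Kb. The sketch below computes the analogous in silico quantity for a DNA sequence by counting occurrences of CTTAAG. It is only an illustration: the value reported above was measured on imaged molecules, and the function name and toy sequence are hypothetical. Because CTTAAG is its own reverse complement, scanning one strand is sufficient.

```python
def label_density_per_100kb(sequence, motif="CTTAAG"):
    """Count motif occurrences and scale the count to sites per 100 kb."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count * 100_000 / len(seq)


# Toy sequence (hypothetical): 1000 bp containing a single CTTAAG site.
toy_sequence = "A" * 994 + "CTTAAG"
toy_density = label_density_per_100kb(toy_sequence)
```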

Hi-C libraries were constructed using leaf cells from ‘Duiying’. The process started with cell fixation using formaldehyde, followed by cell lysis. The cross-linked DNA was then digested with the DpnII enzyme. The resulting sticky ends were biotinylated and proximity ligated to form chimeric junctions. We then enriched DNA fragments of 300–500 bp using a physical shearing process. These chimeric fragments, which are indicative of the original long-distance physical interactions within the cross-linked DNA, were converted into paired-end sequencing libraries. The libraries were then sequenced using the Illumina NovaSeq platform (Illumina, San Diego, CA, USA), resulting in 145 million read pairs. To ensure data quality, we employed fastp software17 to filter out low-quality reads from the raw sequencing data. After removing duplicate reads, we obtained 127 million read pairs to assemble the chromosome-level genome.

Transcriptome sequencing and analysis

Total RNA was extracted from three tissues (leaf, stem, and root) using an RNA extraction kit (QIAGEN China (Shanghai) Co., Ltd., Shanghai, China). High-quality cDNA libraries were prepared using the TruSeq Stranded mRNA Sample Preparation Kit and sequenced on the NovaSeq 6000 platform by Novogene (Beijing, China). Quality control was performed using fastp software15. An average of 6.94 Gb of high-quality RNA-seq data per tissue was used as transcript evidence for gene structure annotation of the ‘Duiying’ genome (Table 1).

Genome survey

Before genome assembly, we conducted a genome survey using k-mer spectrum analysis. Specifically, we used Jellyfish (v2.3.0)18 to count k-mer frequencies in the high-quality paired-end reads, with k set to 17. We removed low-frequency k-mers (frequency below 3), which mostly arise from sequencing errors. The genome size was estimated by dividing the total number of retained k-mers by the k-mer depth at the main peak, and the shape of the k-mer frequency distribution was used to assess genome characteristics such as ploidy.
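The k-mer counting step can be illustrated with a small pure-Python sketch. It is not how Jellyfish is implemented and would be far too slow for real data; it counts k-mers on the forward strand only (whether canonical k-mers were counted is not stated here), and the function name and toy reads are hypothetical. The toy call uses k = 3 so the result can be checked by hand.

```python
from collections import Counter


def kmer_spectrum(reads, k=17):
    """Count k-mers in reads and return (counts, spectrum).

    counts[kmer]  = number of times the k-mer was observed
    spectrum[f]   = number of distinct k-mers observed exactly f times
    k-mers containing N are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    spectrum = Counter(counts.values())
    return counts, spectrum


# Toy reads (hypothetical), counted with k = 3 for readability.
toy_counts, toy_spectrum = kmer_spectrum(["ACGTACG", "ACGTT"], k=3)

# Total number of k-mer occurrences at or above a frequency threshold,
# mirroring the removal of low-frequency k-mers before size estimation.
min_freq = 2
retained = sum(f * n for f, n in toy_spectrum.items() if f >= min_freq)
```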

The k-mer frequency distribution graph displayed three distinct peaks (Fig. 1), suggesting that the ‘Duiying’ genome is a homologous tetraploid. Our analysis identified 30.08 billion k-mers, with a significant majority of 30.02 billion (98.05%) categorized as high frequency (≥3). The primary peak in the k-mer frequency distribution was observed at a depth of 27×. As a result, the genome size was estimated to be approximately 1118.42 Mb (Table 2).
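The size estimate follows from dividing the number of retained k-mers by the k-mer depth. The sketch below repeats that arithmetic with the rounded values quoted above. With these rounded inputs the result is about 1.11 Gb, close to but not identical to the reported 1118.42 Mb; the small difference presumably comes from rounding of the peak depth or from additional corrections, which is an assumption on our part. The variable and function names are illustrative.

```python
def estimate_genome_size(n_kmers, kmer_depth):
    """Genome size (bp) estimated as total k-mers divided by k-mer depth."""
    return n_kmers / kmer_depth


high_freq_kmers = 30.02e9   # k-mers with frequency >= 3, as reported
peak_depth = 27             # main peak depth, as reported (rounded)

size_mb = estimate_genome_size(high_freq_kmers, peak_depth) / 1e6

# Depth that would reproduce the reported estimate of 1118.42 Mb exactly.
implied_depth = high_freq_kmers / 1118.42e6
```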

Fig. 1 Frequency distribution of 17-mers.

Table 2 Genome survey of Chinese cherry ‘Duiying’.

In addition, we aligned the high-quality paired-end reads of ‘Duiying’ to the genome sequence of its closely related diploid species, Prunus avium ‘Tieton’ (GCA_014155035.1), using the BWA-MEM algorithm (v0.7.17-r1188)19. Of the ‘Duiying’ reads, 82.15% mapped to the P. avium genome and covered 95.78% of it (Table 2), supporting the conclusion that the ‘Duiying’ genome is a homologous tetraploid.

Genome assembly of Chinese cherry ‘Duiying’

PacBio HiFi reads were assembled into initial contigs using hifiasm (v0.19.5-r587)20 with default parameters. This process yielded a 1013.46 Mb assembly for Chinese cherry ‘Duiying’, with a contig N50 value of 4.18 Mb (Table 3). We then conducted hybrid scaffolding with the Bionano optical maps using the Bionano Solve software package (v3.5.1): the Bionano data were mapped to the initial contigs using RefAligner, the alignment results were visualized using IrysView, and the genome maps were combined with the initial contigs to generate hybrid scaffolds, with the parameters set to ‘-B 2 -N 2’. We obtained a scaffold-level assembly with a genome size of 1023.26 Mb and a scaffold N50 value of 11.68 Mb (Table 3).

Pseudochromosomes were then constructed from the scaffold-level assembly. The Hi-C reads were mapped onto the scaffolds using Bowtie2 (v2.4.1)21 in single-end mode. After discarding the invalid self-ligated and unligated fragments within the uniquely mapped pairs using the HiCUP pipeline (version 0.8.0)22, 91,274,501 interaction pairs were used to calculate the linkage frequency among all scaffolds via an agglomerative hierarchical clustering algorithm implemented in ALLHiC software (v0.9.8)23 (Table 1). We manually corrected placement and orientation errors that were evident from the chromatin interaction patterns. As a result, we produced a final assembly for ‘Duiying’ with a genome size of 1035.19 Mb and a scaffold N50 value of 28.99 Mb. A total of 978.61 Mb (94.54%) of the assembled sequences were anchored onto 32 pseudochromosomes (Tables 3, 4; Fig. 2). The pseudochromosomes were grouped into eight clusters based on their sequence similarity, indicating that our assembly effectively distinguished the sequences of the four haplotypes in the ‘Duiying’ genome (Fig. 2).

Synteny analysis indicated that the four haplotype sequences were highly syntenic, with a synteny rate exceeding 85%, which is substantially higher than the synteny between the ‘Duiying’ genome and its closely related species, P. avium ‘Tieton’ (68.15%) (Fig. 3).
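The scaffold grouping step described above relies on agglomerative hierarchical clustering of Hi-C linkage frequencies. The following sketch shows the general idea with a generic average-linkage clustering over pairwise link counts. It is not the ALLHiC implementation, and the function name, scaffold names, and link counts are hypothetical.

```python
def cluster_scaffolds(links, n_groups):
    """Greedy average-linkage agglomerative clustering of scaffolds.

    links: dict mapping frozenset({a, b}) -> number of Hi-C links
           between scaffolds a and b (missing pairs count as 0).
    Repeatedly merges the two clusters with the highest average link
    count until n_groups clusters remain.
    """
    names = sorted({name for pair in links for name in pair})
    clusters = [[name] for name in names]

    def average_link(c1, c2):
        total = sum(links.get(frozenset((a, b)), 0) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_groups:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = average_link(clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Toy link counts (hypothetical): s1/s2 and s3/s4 are strongly linked.
toy_links = {
    frozenset(("s1", "s2")): 90,
    frozenset(("s3", "s4")): 80,
    frozenset(("s1", "s3")): 5,
    frozenset(("s2", "s4")): 3,
}
groups = cluster_scaffolds(toy_links, n_groups=2)
```

In a real analysis the link counts would usually be normalized, for example by scaffold length or restriction site count, before clustering; that step is omitted here for brevity.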

Table 3 Assembly summary of Chinese cherry ‘Duiying’ in different assembly steps.
Table 4 Chromosome length of the assembled genome of Chinese cherry ‘Duiying’.
Fig. 2 Chromatin interactions in each chromosome of the ‘Duiying’ genome at a resolution of 1 Mb. The dark red dots show a high probability of interaction, and the light dots show a low probability of interaction.

Fig. 3 Synteny plot. The other three haplotype sequences of P. pseudocerasus ‘Duiying’ (Hap-b, Hap-c, Hap-d) and the genome of its diploid relative P. avium ‘Tieton’ were aligned to the P. pseudocerasus ‘Duiying’ Hap-a sequence.

Genome assessment

We evaluated the genome assembly quality from two perspectives: completeness and accuracy. For assembly completeness, complete Benchmarking Universal Single-Copy Orthologs (BUSCOs) were identified in the assembled genome by searching against the 1614 BUSCOs in embryophyta_odb10 using BUSCO (version 5.4.2)24, and the mapping ratio and coverage depth were calculated after realigning the Illumina paired-end reads to the assembled genome using BWA software19. For assembly accuracy, we detected homozygous SNPs from the realignment results, which represent single-base errors in the assembly.

Genome structure annotation for Chinese cherry ‘Duiying’

Repetitive sequences

We used both homology searching and ab initio prediction to annotate repeated sequences within the ‘Duiying’ genome. For ab initio prediction, we used four transposable element (TE) prediction software packages, LTR_FINDER (v1.0.7)25, PILER (v3.3.0)17, RepeatScout (v1.0.5)26, and RepeatModeler (v1.0.8)27, to build a candidate de novo library from the ‘Duiying’ genome. All software was run with default parameters. Following this, the de novo library and the Repbase database were used to annotate repeated sequences in the ‘Duiying’ assembly with RepeatMasker (v4.0.5)27. For homology searching, we used RepeatProteinMask (v4.0.5) with default parameters to predict TEs. We then merged these results, identifying 547.16 Mb (52.86%) of the ‘Duiying’ assembly as repeat sequences (Table 5). Among these repeat sequences, long terminal repeat (LTR) sequences were the most abundant, accounting for 46.46% of the whole genome sequence.
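When annotations from several tools are combined, overlapping intervals must be merged so that each base is counted once. The sketch below shows one common way to compute the non-redundant annotated length, assuming half-open (start, end) coordinates on a single sequence; the function name and toy intervals are hypothetical, and this is not necessarily the procedure used to produce Table 5. The last line rechecks the reported repeat percentage against the final assembly size.

```python
def merged_length(intervals):
    """Total length covered by a set of half-open (start, end) intervals.

    Overlapping or directly adjacent intervals are merged so that each
    position is counted only once.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy annotations (hypothetical) from two tools on one scaffold.
toy_repeats = [(0, 100), (50, 150), (200, 250)]
toy_covered = merged_length(toy_repeats)

# Reported repeat content relative to the final assembly size (Mb).
repeat_percent = 100 * 547.16 / 1035.19
```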

Table 5 Summary of the repetitive sequences in the ‘Duiying’ genome.

Protein-coding genes

We used homology-, de novo-, and transcriptome-based approaches to predict protein-coding genes within the ‘Duiying’ genome. For homology-based gene prediction, the protein sequences from eight Prunus genomes, namely P. avium ‘Bigstar’ (GCA_013416215.1)10, P. avium ‘Tieton’11, P. persica28, P. mume29, P. yedoensis30, P. armeniaca31, P. salicina32, and P. armeniaca33, were aligned against the ‘Duiying’ genome using TBLASTN (version 2.2.29+) with an e-value cut-off of 1e−5 (ref. 34). All remaining BLAST hits were concatenated using Solar software (version 0.9.6). We extracted the corresponding genomic regions, including 1000 bp upstream and downstream of each candidate gene, to predict the precise gene structure using wise2 (v2.4.1)35. The resulting predictions were designated as the ‘Homology set’.

For transcriptome-based prediction, RNA-seq data were assembled into transcript sequences using Trinity (v2.1.1)36. We aligned the transcript sequences against the ‘Duiying’ genome using the Program to Assemble Spliced Alignment (PASA)37, in which effective alignments were clustered based on their genome mapping location and assembled into gene structures. The gene models created by PASA were labeled as the ‘PASA Trinity set’. RNA-seq reads were also directly mapped to the ‘Duiying’ genome using TopHat (v2.0.13)38, and the mapped reads were assembled into gene models (the ‘Cufflinks set’) using Cufflinks (v2.1.1)39.

For de novo gene prediction, we employed Augustus (v2.5.5)40, GeneID (v1.4)41, Genscan (v1.0)42, GlimmerHMM (v3.0.1)43, and SNAP (version 2013-11-29)44 to predict genes in the repeat-masked genome. The parameters used in Augustus, SNAP, and GlimmerHMM were trained with the gene models from the PASA Trinity set.

All gene models from these sets were integrated using EVidenceModeler (v1.1.1), with the weights assigned to each type of evidence ranked as follows: PASA Trinity set > Homology set = Cufflinks set > Augustus > GeneID = SNAP = GlimmerHMM = Genscan. In addition, we filtered out genes that were shorter than 50 amino acids, supported only by ab initio evidence, and had an expression value below 1. As a result, 114,451 protein-coding genes were obtained in the ‘Duiying’ genome (Table 6). The length distribution of each element type in the gene structure annotated for ‘Duiying’ was similar to that of gene elements in other species within the Prunus genus (Fig. 4), supporting the reliability of the ‘Duiying’ gene structure annotation.
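The final filtering step can be sketched as a simple predicate over gene models. The wording of the criteria leaves open how the three conditions are combined; the sketch below assumes that a gene is removed if it is shorter than 50 amino acids, or if it is supported only by ab initio evidence and has an expression value below 1. That reading, the `GeneModel` class, its field names, and the toy genes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    gene_id: str
    protein_length: int   # length in amino acids
    evidence: frozenset   # e.g. {"ab_initio", "homology", "transcript"}
    expression: float     # expression value; the unit is not stated


def keep_gene(gene, min_length=50, min_expression=1.0):
    """Return True if the gene passes the assumed filtering rules."""
    # Rule 1 (assumed): drop very short predicted proteins.
    if gene.protein_length < min_length:
        return False
    # Rule 2 (assumed): drop models with only ab initio support
    # that also show no meaningful expression.
    ab_initio_only = gene.evidence == frozenset({"ab_initio"})
    if ab_initio_only and gene.expression < min_expression:
        return False
    return True


# Toy gene models (hypothetical).
toy_genes = [
    GeneModel("g1", 40, frozenset({"homology"}), 5.0),
    GeneModel("g2", 120, frozenset({"ab_initio"}), 0.2),
    GeneModel("g3", 120, frozenset({"ab_initio"}), 3.0),
    GeneModel("g4", 300, frozenset({"ab_initio", "transcript"}), 0.0),
]
kept_ids = [g.gene_id for g in toy_genes if keep_gene(g)]
```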

Table 6 Summary of gene structure in the ‘Duiying’ genome.
Fig. 4 Length comparison chart of gene elements in closely related species within the Prunus genus.

We annotated the functions of protein-coding genes within the ‘Duiying’ genome by homology searches against the SwissProt45, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway46, Non-Redundant Protein Sequence (NR, from NCBI), and InterPro databases. Pfam domain and Gene Ontology (GO) annotations were obtained from the InterPro database using the InterProScan tool47, which predicts them from conserved protein domains and functional sites. For the other databases, we used BLASTP with an e-value cut-off of 1e−4 (ref. 34). Consequently, 99.24% of the protein-coding genes were supported by at least one of these functional databases (Table 7).
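The overall annotation rate is the fraction of genes with a hit in at least one database, that is, the size of the union of the per-database hit sets divided by the number of genes. A minimal sketch of that calculation follows; the function name, gene identifiers, and hit sets are hypothetical.

```python
def annotated_fraction(all_genes, hits_by_database):
    """Fraction of genes annotated in at least one database."""
    genes = set(all_genes)
    if not genes:
        return 0.0
    annotated = set()
    for hits in hits_by_database.values():
        annotated |= set(hits)
    # Ignore any hit identifiers that are not in the gene set.
    annotated &= genes
    return len(annotated) / len(genes)


# Toy example (hypothetical): g4 has no hit in any database.
toy_gene_ids = ["g1", "g2", "g3", "g4"]
toy_hits = {
    "SwissProt": {"g1", "g2"},
    "KEGG": {"g2"},
    "NR": {"g1", "g2", "g3"},
    "InterPro": {"g3"},
}
toy_fraction = annotated_fraction(toy_gene_ids, toy_hits)
```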

Table 7 Gene function annotation of the Chinese cherry ‘Duiying’.

Noncoding RNA genes

We predicted the gene structures of noncoding RNAs in the ‘Duiying’ genome. Transfer RNAs (tRNAs) were predicted using the tRNAscan-SE tool (v1.3.1)48. Ribosomal RNA (rRNA) sequences were predicted by searching against the invertebrate rRNA database using BLAST, with an E-value cut-off of 1e−10 (ref. 49). Small nuclear RNAs (snRNAs), small nucleolar RNAs, and microRNAs (miRNAs) were annotated using Infernal (v1.1rc4) based on the Rfam database8. As a result, we identified 1635 miRNA, 6637 tRNA, 38,258 rRNA, and 169 snRNA genes (Table 8).

Table 8 Summary of noncoding RNA genes.

Data Records

The raw data (Illumina reads, PacBio HiFi reads, and Hi-C sequencing reads) used for genome assembly were deposited in the SRA at the National Center for Biotechnology Information (NCBI)50. The RNA-seq data were deposited in the SRA at NCBI with accession numbers SRR29660545 (ref. 51) and SRR29660546 (ref. 52). The assembled genome was deposited in the DDBJ/ENA/GenBank databases under the accession number JBFBPF000000000 (ref. 53), and the genome annotation files are available in the figshare repository54.

Technical Validation

Assembly assessment of Chinese cherry ‘Duiying’

The genome survey and assembly results showed that the Chinese cherry genome is a homologous tetraploid (Figs. 1, 2), supporting previous karyotype research on Chinese cherry chromosomes4. Our assembled ‘Duiying’ genome exhibited high completeness: 98.52% of the Illumina paired-end reads mapped to the assembly and covered 99.82% of the genome. In addition, the assembly recovered 99.4% of the 1614 conserved BUSCOs in the embryophyta_odb10 database9 (Table 9). The assembly also demonstrated high accuracy, with a single-base error rate of 9.08 × 10⁻⁸, which corresponds to only about 9 assembly error sites per 100 Mb of genome sequence.
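The relationship between the per-base error rate and the number of error sites per 100 Mb is a direct multiplication, shown in the short sketch below. The Phred-scaled consensus quality value (QV) is added only as a commonly used equivalent expression of the same rate; it is derived here for illustration and is not a value reported in this work.

```python
import math

error_rate = 9.08e-8          # single-base error rate, as reported

# Expected number of erroneous sites in 100 Mb of sequence.
errors_per_100mb = error_rate * 100e6

# Phred-scaled quality value: QV = -10 * log10(error rate).
# Illustrative conversion only; not reported in the text.
qv = -10 * math.log10(error_rate)
```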

Table 9 Assembly assessment for the genome of Chinese cherry ‘Duiying’.