Background & Summary

Oshima cherry [Cerasus speciosa (Koidz.) H. Ohba and Prunus speciosa (Koidz.) Nakai] is a cherry tree species endemic to Japan, found primarily in the wild on the Izu Islands1. The classification of flowering cherry regarding the nomenclature of the genera as Prunus or Cerasus remains a controversy. However, we discuss the Oshima cherry and other flowering cherries within the genus Cerasus based on recent molecular and population genetic analyses2. Flowering cherries have held significant cultural importance in Japan for millennia, not only utilized as a source of firewood, timber for buildings, and material for cords, but also cherished for their high ornamental value since the late 8th century3,4,5. This appreciation has continued to the contemporary “Hanami” culture, a tradition of cherry blossom viewing that persists in Japan. Japan is native to ten species of wild cherry trees1,6,7,8, and the number of named cultivars exceeds several hundred9,10,11. Numerous cultivars within the “Cerasus Sato-zakura Group” are presumed to be descendants of wild Oshima cherry12. Additionally, Cerasus × yedoensis ‘Somei-yoshino’, one of the most famous Japanese cherry cultivars and an interspecific hybrid of C. itosakura and C. speciosa, is speculated to have one haplotype derived from the wild Oshima cherry lineage13,14,15. The Oshima cherry stands out among Japan’s native wild species because of its diverse blooming periods, relatively large petal size, rapid growth, and strong fragrance derived from coumarins16. These characteristics have made the Oshima cherry highly favored, significantly influencing the development of cherry cultivars in Japan17. Both wild and cultivated Japanese cherry species are primarily diploids (2n = 2x = 16), possessing eight pairs of chromosomes similar to the karyotype of C. speciosa18. Analysis of the Oshima cherry genome may greatly aid in understanding the genetics of the Japanese cherry cultivar.

Izu Oshima, one of the Izu Islands, houses a giant Oshima cherry tree, referred to as “Sakurakkabu,” which is estimated to be over 800 years old19. This particular C. speciosa tree has been conserved as a Natural Monument of Japan since 1935. This tree was originally a single massive tree until the Edo period (1603–1868), but its main trunk eventually broke. While multiple sprout trunks emerged from the remaining main trunk, they were repeatedly toppled due to the impact of numerous typhoons. From its remains, branches of various sizes spread out in all directions, with some growing diagonally to the ground or lying horizontally and rooting from points of contact, before growing upward again, forming a unique tree shape. As a result, the tree now comprises the original trunk and three additional trunks, each pointing north, east, and west. “Sakurakkabu” is believed to be the oldest existing Oshima cherry tree in Izu, Japan. Decoding the genome of this Oshima cherry tree may contribute significantly to the research on the origin of Oshima cherry and Japanese cherry trees. Furthermore, our findings may benefit studies of cultivars in Japan and other regions.

Herein, we report the near complete genome assembly of the Oshima cherry, C. speciosa. Samples of C. speciosa were obtained from the leaf buds of the original trunk of “Sakurakkabu” on Izu Oshima, with the permission of the Tokyo Metropolitan Government and the Agency for Cultural Affairs. By extracting genomic DNA and sequencing it using Illumina short reads and PacBio long reads, we constructed a high-quality telomere-to-telomere reference genome for C. speciosa.

Methods

Sample collection and sequencing

Young leaves collected from the Oshima cherry tree “Sakurakkabu” in Izu Oshima, Tokyo, Japan, were used for genomic DNA extraction (Fig. 1a). The leaves were collected, rinsed with deionized water, blotted dry with paper towels, and stored at ‒80 °C until the extraction of genomic DNA. Extraction of genomic DNA from the leaves was performed with the method reported previously with some modifications20. Briefly, 0.1–0.5 g of leaves were ground into a powder in liquid nitrogen using a chilled pestle and mortar. This powder was suspended in 10 mL of Carlson lysis buffer (100 mM Tris-HCl pH 9.0, 2% CTAB, 1.4 M NaCl, 1% PEG 6000, 1% Polyvinylpyrrolidone K30, 20 mM EDTA) supplemented with 0.25% β-mercaptoethanol, and then incubated at 74 °C for 20 min with gentle mixing by inverting the tube slowly every 5 min. The mixture was cooled to room temperature, followed by the addition of 20 mL chloroform/isoamyl alcohol (24:1), careful and gentle mixing, and centrifugation at 1,800 G for 10 min. The supernatant was transferred to a new tube, and an equal volume of 2-propanol were added. The mixture was mixed uniformly and centrifuged at 2,500 G for 30 min. The supernatant was discarded, and the pellet was washed with 75% EtOH, followed by centrifugation for 10 min. The supernatant was discarded again, and the pellet was briefly air-dried. Subsequently, 1 mL of TE was added to dissolve the pellet. Then RNase was added to 10~100 µg/mL and the sample was stored overnight at 4 °C. Genomic DNA was purified using a QIAGEN Genomic-tip 500 G and QIAGEN DNA Buffer Set (QIAGEN, Hilden, Germany), following the manufacturer’s protocols.

Fig. 1
figure 1

(a) Oshima cherry tree “Sakurakkabu” of Oshima, Cerasus speciosa. (b) Genome size estimation and heterozygosity through k-mer analysis.

For sequencing using the Illumina NovaSeq6000 platform with 250 bp paired-end reads, the genomic DNA was fragmented using Covaris, followed by library preparation using the TruSeq DNA PCR-Free Library Prep Kit. Size selection was performed via agarose gel excision. Sequencing was performed for 500 cycles using a NovaSeq6000 Reagent Kit. For the sequencing of continuous long reads (CLR) using the Pacific Biosciences (PacBio) Sequel II platform, the genomic DNA was fragmented using g-TUBE. Library preparation was performed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Bioscience, CA, USA), followed by the removal of DNA fragments under 40 kbp using BluePippin. Sequencing was performed using Sequel II Binding Kit 2.0/Sequencing Kit 2.0. For the sequencing of high-fidelity (HiFi) reads using the Pacific Biosciences (PacBio) Sequel II platform, the genomic DNA was sheared into fragments of 10–30 kbp using a g-tube device (Covaris Inc., MA, USA). A SMRTbell library for HiFi reads was prepared using the SMRTbell Prep Kit 3.0 (Pacific Bioscience, CA, USA), according to the manufacturer’s instructions. The HiFi library was size-selected for 15–30 kbp DNA fragments using the Sage ELF system (Sage Science, MA, USA) and sequenced on one SMRT Cell (8 M) using the Sequel II system with the Binding Kit 3.2/Sequencing Kit 2.0 (Pacific Bioscience, CA, USA), with 1800 min collection times. Consensus HiFi reads were generated from raw full-pass subreads using the DeepConsensus v1.2 program21. The Hi-C library was prepared using the Dovetail Omni-C Kit (Dovetail Genomics, CA, USA), according to the manufacturer’s protocol. Briefly, the chromatin was fixed in nucleus by disuccinimidyl glutarate (DSG) and formaldehyde. The crosslinked chromatin was digested using DNase I, and the cells were lysed to extract chromatin DNA fragments. The chromatin DNA ends were repaired and ligated to a biotinylated bridge adapter, and then proximity ligated of the adapter-containing ends. Subsequently, the cross-linked DNA was reversed and converted into a sequencing library using Illumina-compatible adaptors. Biotin-containing fragments were enriched with streptavidin magnetic beads before PCR amplification. The concentration and quality of the libraries were evaluated using a Qubit 4 fluorometer (Thermo Fisher Scientific, MA, USA), 2100 Bioanalyzer System (Agilent Technologies, CA, USA), and 7900HT Fast Real-Time PCR System (Thermo Fisher Scientific, MA, USA). Finally, the Omni-C library was sequenced on an Illumina NovaSeq6000 system with a 2 × 150 bp read length, generating 61.2 Gbp of Omni-C reads.

RNA was extracted separately from the homogenized leaf buds, floral buds, stamens, and ovules of C. speciosa, which were processed in liquid nitrogen. The resulting frozen powder (10–20 mg) was used for RNA extraction, employing the Fruit-mate™ and NucleoSpin™ RNA Plant kits (Takara Bio Inc., Shiga, Japan), in accordance with the manufacturers’ instructions. RNA quality and concentration were assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA) with the RNA6000 Pico kit (Agilent Technologies, CA, USA). Subsequently, the RNA samples were reverse-transcribed to cDNA using the SMART-Seq HT Kit (Takara Bio USA, CA, USA) following the manufacturer’s guidelines. Sequencing libraries were generated from the cDNA using the Nextera XT DNA Library Preparation Kit (Illumina, CA, USA) and sequencing was conducted on an Illumina NovaSeq6000 system with a 2 × 100 bp read length.

Genome size and heterozygosity estimation

The genome size of C. speciosa was estimated using k-mer based method and Illumina short-read sequencing datasets. The acquired short reads were subjected to rigorous quality control using fastp (v0.23.2)22 with the parameter “-q 30 -f 5 -F 5 -t 5 -T 5 -n 0 -u 20.” To manipulate and count the k-mer in short reads dataset, we utilized KMC3 (v3.2.1)23 counting canonical 21-mers with the parameter “-k 21 -ci 1 -cs 10000.” Our decision to select k = 21 was based on recommendations for genome size estimation processes, where k = 21 is advocated. This recommendation is based on the premise that a k-mer length of k = 21 is sufficiently long to ensure that most k-mer are not repetitive, yet short enough to enhance the robustness of the analyses against sequencing errors. For generating a histogram of k-mer statistics, kmc_tools implemented in KMC3 were executed by operation “transform” and “histogram” with parameter “-cx 10000.” The obtained histogram was analyzed by the Rscript version of GenomeScope 2.0 (v2.0)24,25 to estimate the haploid genome size and heterozygosity from k-mer count dataset with parameter “-k 21 -p 2.” The majority of cherry blossoms in Japan are diploid, and karyomorphological research has also confirmed that C. speciosa is diploid18. Consequently, a diploid model (-p 2) was employed for genome size estimation using GenomeScope 2.0. Based on k-mer analysis, the haploid genome size of C. speciosa was estimated to be ~245.6 Mbp with a heterozygosity of 0.91% and a repeat content of 39.8% (Fig. 1b).

De novo genome assembly and quality assessments

Primary genome assembly was performed using Hifiasm (v0.19.5-r587)26,27, incorporating PacBio HiFi reads along with Omni-C Illumina short reads. HiFi reads were generated from subread output by the Sequel II system using DeepConsensus (v1.2)21, with default parameters and then assembled with Omni-C data using Hifiasm under default settings. The number of HiFi reads was adjusted to optimize the assembly, ultimately using 85% for improved contiguity. To improve the accuracy of the genome assembly and remove redundant sequences, we employed a de-redundancy approach using the purge_dups (v1.2.5)28, with the mapping process facilitated by minimap2 (v2.26-r1175)29,30. In the purge_dups process, we followed the pipeline recommended by the developer. However, during the initial step, where we aligned the PacBio data using minimap2, we employed the “-x map-pb” option. The workflow was applied iteratively across three rounds of processing. Redundancy in genome assemblies often arises from heterozygosity or collapsed haplotypes, especially in highly heterozygous species. The de-redundancy process is crucial for improving the contiguity and accuracy of the pseudo-haploid assembly by retaining only a single representative copy of duplicated regions. Contamination was removed by filtering out contigs with more than 10% of the length aligning to the NCBI Prokaryotic RefSeq Genomes or the organelle genome of C. speciosa using blastn (v2.10.0+)31, with an e-value threshold of “1e-10.” To demonstrate the accuracy of the location and orientation of our assembled genome, we verified it using a Hi-C contact matrix. We mapped the Hi-C Illumina short reads to our assembled pseudo-haploid genome, following the default pipeline recommended by HiC-Pro (v3.0.0)32. After creating the Hi-C contact matrix, we plotted the Hi-C contact map using HiCExplorer (v.2.2.3)33 (Fig. 2). We also conducted a haplotype-resolved genome assembly using the same pipeline. The haplotypes were resolved using the phased haplotype-1 and haplotype-2 output from Hifiasm. Since the haplotype-resolved assembly generated more contigs than the pseudo-haploid assembly, we used RagTag (v2.1.0)34 with default parameters to perform scaffolding, using the pseudo-haploid genome as the reference. The results of the haplotype-resolved genome assembly have also been deposited online. Additionally, the Hi-C contact map results for the haplotype-resolved genome assembly are available on Figshare.

Fig. 2
figure 2

Hi-C interaction heatmap for C. speciosa. The chr1–8 are the abbreviations of Chromosome 1–8.

Finally, we obtained a complete pseudo-haploid genome assembly spanning 269,259,082 bp, characterized by a contig N50 value of 32.0 Mbp and GC ratio of 38.1%, with no gaps within the eight pseudomolecules corresponding to the chromosomes (Fig. 3, Table 1).

Fig. 3
figure 3

Schematic illustration of the genetic features of the C. speciosa genome. The outermost circle represents the length of each chromosome, with telomere sequences at the ends (blue circles) and estimated centromeric regions (darker color). From the outermost to the innermost track, each circle represents the following genetic characteristics: guanine-cytosine (GC) ratio (a), gene distribution (b), non-coding RNA distribution (c), Long Terminal Repeats (LTR) (d), Terminal Inverted Repeats (TIR) (e), non-LTR elements (LINE and SINE) (f), and non-TIR elements (Helitron) (g).

Table 1 Summary statistics of the Cerasus speciosa genome assembly and other closely related species genomes.

To evaluate our newly assembled genome of C. speciosa, we used Benchmarking Universal Single-Copy Orthologs (BUSCO) (v5.5.0)35 to assess the completeness of the genome. The BUSCO lineage dataset was downloaded from OrthoDB v1036,37. To determine the dataset most suitable for the C. speciosa genome, the “--auto-lineage” option was specified, using the eukaryota_odb10 and eudicots_odb10. The BUSCO assessments revealed completeness of 99.6% (254 out of 255 genes) for the eukaryota dataset and 98.4% (2289 genes out of 2326 genes) for the eudicots dataset (Table 2). The BUSCO datasets were evaluated further using compleasm (v0.2.4)38 by “run” mode with “-l eudicots” option. The compleasm is recognized as a more rapid and accurate tool for reassessment than BUSCO. Using the same eudicots dataset as in the previous BUSCO analysis for evaluating genome assemblies, we observed a completeness of 99.14% (2,306 of 2,326 genes), with a composition of 97.03% single-copy complete genes (2,257 genes), 2.11% duplicated complete genes (49 genes), 0.39% fragmented genes (9 genes), and 0.47% missing genes (11 genes) (Table 2). In addition to the conventional evaluation of genome assemblies using BUSCO, we utilized the methodology known as Inspector (v1.0.1)39, which assesses genome assemblies using long reads. This approach involves aligning the long reads used for genome assembly with the assembled genome to evaluate the completeness of the genome structure. In this study, we conducted evaluations using PacBio HiFi reads (“-d hifi” option) employed in the assembly process. The mapping rate of the long reads to the assembled genome was 99.99%, with a quality value (QV) of 35.89, indicating an accuracy of 99.97% (Table 2). Furthermore, we used a k-mer-based approach to evaluate genome assembly with Merqury (v1.3)40. For the k-mer counting, Meryl v1.3, implemented in Merqury, was used to compute the k-mers from PacBio HiFi reads with k = 21 settings. Merqury was proceeded with its default parameters for pseudo-haplotypes using the k-mer datasets quantified by Meryl. The results indicated a QV > 67, corresponding to an error rate of 1.80 × 10−7 (Table 2).

Table 2 Genome assembly and annotation statistics for C. speciosa.

Assembled genome annotations

To annotate repetitive sequences, we employed EDTA (v2.0.0)41 to conduct annotation using de novo and curated libraries using a homology-based approach. For the curated libraries, we employed a collection known as nrTEplants (v0.3)42, which contains repetitive sequences annotated from various databases, including REdat43, RepetDB44, TREP45, and other collections. In conducting the EDTA analysis, coding sequences (CDS) from closely related species were provided, using the CDS of the C. speciosa haplotype from the already published genome of C. × yedoensis ‘Somei-yoshino’15. The parameters employed in the execution of EDTA were as follows: --genome [C. speciosa genome] --cds [C. speciosa haplotype of C. × yedoensis ‘Somei-yoshino’] --curatedlib [nrTEplants curated library] --overwrite 1 --sensitive 1 --anno 1 --evaluate 1. Through the annotation of repetitive sequences, approximately 128.8 Mbp of repetitive regions were identified, accounting for 47.8% of the C. speciosa genome (Fig. 3, Table 3). Each type of repetitive sequence was classified using RepeatMasker (v4.1.1) (https://www.repeatmasker.org/), which was implemented in an EDTA Docker image (oushujun/edta:2.0.0).

Table 3 Summary of repetitive contents in C. speciosa.

The non-coding RNAs (ncRNA) were annotated using three primary sources, tRNAscan-SE 2.0 (v2.0.12)46 for transfer RNA (tRNA); RNAmmer (v1.2)47 for ribosomal RNA (rRNA); and Rfam 14.1048 for ncRNAs, including microRNAs (miRNAs) from miRbase49. We detected tRNA using tRNAscan-SE 2.0 with default options. We used RNAmmer to detect rRNA sequences with the following options specified: “-S euk -m lsu,ssu,tsu -multi.” miRNAs and other ncRNAs were identified using Infernal (v1.1.4), which was predicted based on the Rfam database. The parameters for Infernal were specified in accordance with the guidelines documented in Rfam 14.10, and the final E-value cutoff for ncRNA was set to “1e-10.” Our predictions identified 572 tRNA genes encoding 20 amino acids, 24 tRNA pseudogenes, 794 rRNA genes, and 623 other ncRNAs (Fig. 3).

While annotating protein-coding genes, the initial step involved soft-masking repetitive sequences within the genome using RepeatModeler2 (v2.0.5)50 and RepeatMasker (v4.1.5), implemented in the Dfam Transposable Element Tools Docker container (v1.87) (https://github.com/Dfam-consortium/TETools). We constructed the repeat library using RepeatModeler2 incorporating the “-LTRStruct” option to engage the LTR discovery pipeline. For the soft-masking genome sequence using RepeatMasker, the following parameters were employed: “-pa 12 -a -html -gff -s -no_is -norna -nolow -div 40.” For the structural annotation of protein-coding genes, the BRAKER3 (v3.0.7.1)51,52 pipeline was employed, integrating predictions from three strategies: ab initio prediction, transcription evidence from RNA-seq, and homologous proteins from closely related species. To maximize the gene annotation predictions using RNA-seq, we implemented a noise-reduction process on the RNA-seq data. Specifically, the quality of the RNA-seq data was assessed using fastp with the following parameters: “-q 30 -f 5 -F 5 -t 5 -T 5 -n 0 -u 20.” We performed de novo transcriptome assembly using Trinity (v2.15.1)53 and mapped the RNA-seq data to this assembled genome using Hisat2 (v2.2.1)54. Only reads that were correctly mapped to the de novo assembled transcriptome genome were used for downstream analysis. During this process, we selected the genome-guided assembly option in Trinity, with a maximum intron size of 1000 bp. The rationale for setting the maximum intron size to 1000 bp was based on the average and median intron sizes of less than 1000 bp in the genomes of closely related species (Prunus persica, Prunus mume, and Cerasus avium (formerly named Prunus avium)) registered in the NCBI RefSeq database. Additionally, the “--very-sensitive” option was utilized in Hisat2 to enhance mapping sensitivity. The extracted reads were then re-mapped to the complete C. speciosa genome using Hisat2, which served as an input file for BRAKER3. For this remapping, Hisat2 was configured with the “--very-sensitive” and “--max-intronlen 15000” options. For the structural annotation of protein-coding genes using BRAKER3, the “--AUGUSTUS_ab_initio” option was used to perform ab initio gene predictions using AUGUSTUS55, implemented in BRAKER3. Furthermore, protein sequences from the C. speciosa haplotype of C. × yedoensis ‘Somei-yoshino’ were provided as the required protein information from a closely related species for BRAKER3. The RNA-seq data used as input files for BRAKER3 were obtained using previously described processes. To enhance the annotation results from BRAKER3, the “--busco_lineage = eudicots_odb10” option was selected for BUSCO assessment, utilizing compleasm38 implemented in BRAKER3. The annotation of protein-coding genes by BRAKER3 does not include information on untranslated regions (UTRs). Therefore, to supplement this deficiency, GUSHR (v1.0.0) (https://github.com/Gaius-Augustus/GUSHR) was executed using the RNA-seq data. To predict the functions of genes inferred by structural annotation using BRAKER3, functional annotation was performed using EnTAP (v1.0.1)56. EnTAP performed homologous alignments using DIAMOND (v2.1.8.162)57 across multiple databases, including SwissProt, TrEMBL, and the non-redundant plant protein database of NCBI (NR). By predicting the structural and functional annotations of protein-coding genes, we predicted 33,773 protein-coding genes from the C. speciosa genome, of which 29,776 genes were functionally annotated against public databases (Fig. 3). The BUSCO completeness of the protein sequences was 97.4% (n = 2267) in “eudicots_odb10” lineage (n = 2326), including 78.3% (n = 1822) single-copy, 19.1% (n = 445) duplicated, 0.3% (n = 8) fragmented, and 2.3% (n = 51) missing BUSCOs. The compleasm completeness of the protein sequence using BUSCO gene set was 97.5% (n = 2268) in “eudicots_odb10” lineage (n = 2326), including 69.7% (n = 1622) single-copy, 27.8% (n = 646) duplicated, 0.7% (n = 16) fragmented, and 1.8% (n = 42) missing BUSCOs. In both evaluations, the annotations were of very high quality.

Chloroplast genome assembly and annotation

The chloroplast genome of C. speciosa was de novo assembled using Illumina short reads. Short reads were filtered and trimmed using fastp, as previously mentioned. Clean reads were then assembled using the get_organelle_from_reads.py script implemented in GetOrganelle (v1.7.7.0)58, with the parameters set to “-R 15 -k 21,55,85,115,127 -F embplant_pt.” Among the assemblies, one with a k-mer size of k = 127 was selected, resulting in a complete circular chloroplast genome sequence of 157,895 bp in length. The GFA file of the chloroplast, representing graphical genomic information, is available online on Figshare. Gene annotation of the assembled complete chloroplast genome was performed using GeSeq (https://chlorobox.mpimp-golm.mpg.de/geseq.html)59. For the configuration of GeSeq, the default settings were primarily utilized. However, modifications were made to the following options: the “FASTA file to annotate” was set to circular, the “sequence source” was changed to “plastid (land plants),” and C. speciosa (NCBI RefSeq: NC_043921.1) was employed as the reference sequence for BLAT analysis. Furthermore, the results of gene annotation were visualized as vector files using the OGDRAW web service (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html)60 (Fig. 4a).

Fig. 4
figure 4

Schematic diagram of gene annotations for the chloroplast and mitochondrial genomes of C. speciosa. The diagram illustrates the positional relationships among genes within the circular genomes of the chloroplasts (a) and mitochondria (b). The gene types displayed in each genome are specified in the legends corresponding to each genome. The track within the inner circle indicates the guanine (GC) ratio (dark color). For the chloroplast genome, the small single-copy (SSC), large single-copy (LSC), and inverted repeat regions (IRA and IRB) are illustrated in the inner circle.

Mitochondrial genome assembly and annotation

Given the considerable length and abundance of repetitive sequences in plant mitochondria, which makes their assembly notably challenging compared with that of animal mitochondria, we adopted a methodology of de novo assembly based on PacBio long reads, with sequencing errors corrected using Illumina short reads. Initially, the subreads obtained using PacBio CLR were mapped to the complete mitochondrial genome of a closely related species using minimap2 (v2.1)29,30 with the parameter “-x map-pb”. The complete mitochondrial sequence of C. avium (NCBI RefSeq: NC_044768.1) was used as a reference. Subsequently, subreads accurately mapped to the reference were de novo assembled using Canu (v2.0)61, under the assumption that the complete mitochondrial length of C. speciosa is approximately 400 kbp, with the parameter set as genomeSize = 400 k. The resulting circular complete genome was then subjected to an polishing process using short reads, employing BWA-MEM (v0.7.17)62 with default parameters for mapping and Pilon (v1.23)63, as well as default parameters for error correction. Similar to the chloroplast genome, GeSeq was used for gene annotation. While the default settings of GeSeq were primarily employed for the annotation of the mitochondrial genome, the following adjustments were made: the “FASTA file to annotate” was set to circular, the “sequence source” was changed to “mitochondrial,” and complete mitochondrial genome of C. × yedoensis ‘Somei-yoshino’ (NCBI RefSeq: NC_065235.1), Cerasus × kanzakura (formerly named Prunus kanzakura) (NCBI RefSeq: NC_065230.1), C. avium (NCBI RefSeq: NC_044768.1), and Arabidopsis thaliana (NCBI RefSeq: NC_037304.1) were employed as the reference sequences for BLAT analysis. The final results of gene annotation were visualized as vector files in a manner similar to the complete chloroplast genome using the OGDRAW web service (Fig. 4b).

Detecting telomeric and centromeric regions

The repetitive sequence units in the telomeric regions of C. speciosa genome were detected using the Telomere Identification Toolkit (tidk, https://github.com/tolkit/telomeric-identifier) (v0.2.41) with the following parameters: “tidk explore --minimum 5 --maximum 12.” This approach assumes that the repetitive sequence units in telomeric regions range from 5 to 12 bp in length. A genome-wide search conducted with “tidk search,” using the simple telomeric repeat units identified by “tidk explore” as queries, revealed telomeric repeat regions positioned at the ends in 12 out of 16 identified regions (chromosomes 1, 2, 3, and 8 exhibited telomeric repeats at both ends, whereas chromosomes 4, 5, 6, and 7 demonstrated such repeats at one end only), characterized by the sequence “CCCTAAA/TTTAGGG.” The same telomeric repeat has been reported in the closely related species Cerasus campanulata (formerly named Prunus campanulata)64. These regions were identified as telomeres (Fig. 3).

To detect centromeric regions, we conducted a comprehensive scan of tandem repeat sequences of 30–1000 bp across the entire genome using the Tandem Repeats Finder (TRF) (v4.09.1)65 with the parameters set to “2 7 7 80 10 50 1000 -f -d -m.” The output data files generated by TRF were converted into gff files using the TRF2GFF (v0.0.5.dev2 + gbfe80a6) (https://github.com/Adamtaranto/TRF2GFF) software. We counted the number of repeats per period of the tandem repeat sequences detected by TRF and constructed histograms (Fig. 5a, b). In the C. speciosa genome, we found that repeats of 167 bp were the most common in the genome, with a total of 65,746.4 repetitions (copies ≥ 2), accounting for approximately 4.08% of the genome. Subsequently, repeats of 166 (3.03%), 332 (2.72%), 498 (2.71%), and 334 (2.34%) bp sequences were observed at high frequencies. Visualization of these detected repetitive sequences using the Integrative Genomics Viewer (IGV) (v2.17.0), suggested that the minimal repeat units comprising the centromeric repetitive sequences were either 166 or 167 bp in length. Furthermore, these units appeared to have distinct distributions across different chromosomes (Fig. 5c).

Fig. 5
figure 5

Identification of the centromeric regions via repetitive sequence analysis. (a) Frequency distribution of repetitive sequences across the genome binned by sequence length (period) in base pairs (bp). The horizontal axis represents the period of repetitive sequences, and the vertical axis represents the number of occurrences in each period. (b) Logarithmic scale representation of data shown in (a). The horizontal axis represents the period of repetitive sequences in bp, and the vertical axis represents the log-transformed count of occurrences for each period. (c) Distribution of genes, non-coding RNAs (ncRNA), LTR elements (Gypsy and Copia), LINE, SINE, and specific lengths of repetitive sequences considered to constitute centromeric structures. The horizontal axis represents the chromosomal position, and the vertical axis indicates the presence of these elements. Notably, repetitive sequences associated with centromeric structures lacked gene-rich regions, indicating putative centromeric regions.

To further investigate the detected centromeric repetitive sequences, we used the most frequently connected repetitive sequence unit (166 or 167 bp) identified in each chromosome as a monomer sequence input. We conducted a structural analysis of the regions considered as centromeres using StainedGlass (v0.5)66 and HiCAT (v1.1.0)67 (Fig. 6). The analyzed regions were specified based on the centromeric regions identified by the TRF results. For the StainedGlass analysis, a window size of 2000 bp was set. The results from StainedGlass and HiCAT revealed varying structural characteristics of the centromeric regions across chromosomes. These findings are similar to those previously reported for Arabidopsis68.

Fig. 6
figure 6

Structure and annotation of centromeric regions across chromosomes. The upper triangular sections of each chromosome represent the sequence identity of tandem repeat structures within the centromeric regions and are visualized as a heatmap. Higher similarity is indicated by warmer colors. At the bottom of the triangular sections of each chromosome, the hierarchical annotations of the estimated higher-order repeats (HORs) within the centromeric structure are displayed, showing up to the top five ranks (excluding repeats of fewer than 50 repeats). The x-axis denotes the genomic positions in kilobase pairs (kbp), illustrating the precise location and organization of the centromeric sequences along each chromosome.

Comparative genomic analysis

We collected the reference genome and protein sequences of P. persica (NCBI RefSeq GCF_000346465.2), P. mume (NCBI RefSeq GCF_000346735.1), C. avium (NCBI GenBank GCA_013416215.1), C. campanulata (NCBI GenBank GCA_034190965.1), and two haplotypes of C. × yedoensis ‘Somei-yoshino’ (NCBI GenBank GCA_005406145.1) from public databases for comparative genomic analysis. Among these, the P. mume sequence required chromosome sorting because of a mismatch in chromosome numbers compared to other related species. For genome comparison, we constructed dot plots using D-GENIES (v1.5.0)69 to compare each genome with C. speciosa, and aligned the sequences using Minimap2 (v2.24)30 implemented in D-GENIES, with the “Few repeats” option specified (Fig. 7). The genomes of related species lack regions presumed to be centromeres, resulting in observable gaps when compared to our genome. Structural differences among the compared species were generally minimal, with C. campanulata showing a remarkably similar genomic structure, suggesting a close phylogenetic relationship. In the comparisons with C. × yedoensis ‘Somei-yoshino’, some regions exhibited inversions and translocations, but these findings are likely artifacts due to the low quality of the pseudomolecule assembly of this genome. Furthermore, we performed an interspecies synteny analysis using the protein sequences of P. persica and C. campanulata, which are particularly well-assembled among closely related species. For the detection of synteny regions, we employed MCscanX70, preceded by a homology search of the protein sequences from each genome using Protein-Protein BLAST (v2.12.0+), with an e-value cutoff of “1e-10.” The results from MCscanX were visualized using SynVisio (https://synvisio.github.io) (Fig. 8).

Fig. 7
figure 7

Synteny analysis of C. speciosa and other reference genomes. The dot plot illustrates the syntenic relationship between the eight chromosomes of C. speciosa (x-axis) and the contigs of the reference genome (y-axis). Each dot represents a conserved genomic block between two genomes. The diagonal green line indicates collinear regions, suggesting high conservation and alignment among the compared genomes.

Fig. 8
figure 8

Comparative analysis of protein-coding genes in P. persica, C. campanulata, and C. speciosa, and chromosomal distributions of protein-coding genes in P. persica, C. campanulata, and C. speciosa. In the upper panel (a), each color represents a different chromosome, illustrating the collinearity patterns of genes across different chromosomes. In the lower panel (b), each line shows the orientation of the collinearity patterns of the genes. The figure shows whether the collinearity patterns of the genes were in forward or reverse orientation. Genes in red indicate a reverse orientation.

Data Records

All raw data used in this study for genome assembly (Illumina short reads, PacBio CLR reads, PacBio HiFi reads, and Hi-C reads) were deposited in the DNA Data Bank of Japan (DDBJ) under BioProject accession number PRJDB17512. As DDBJ is a member of the International Nucleotide Sequence Database Collaboration, the data are accessible via the National Center for Biotechnology Information (NCBI) and European Nucleotide Archive (ENA). The accession number for the SRA Study containing all raw data is DRP01169371, and its details are as follows: DRR532494 for Illumina short reads, DRR532493 for PacBio CLR reads, DRR529729 for PacBio HiFi reads, and DRR529730 for Hi-C reads. The RNA-seq data from the Illumina short reads used for annotating the assembled genome were also registered under the same SRA Study accession number DRP011693, with individual data for each tissue type under accession numbers DRR530280–DRR530285. The assembled pseudo-haploid genome sequences were deposited in the NCBI under the accession numbers GCA_041154625.172 for full assembly. The assembled haplotype-resolved genome sequences were also deposited in the NCBI under the accession numbers GCA_046000225.173 for principal haplotype assembly and the accession numbers GCA_046000245.174 for alternate haplotype assembly. The assembled genome sequences and gene annotation data were submitted to the Genome Database for Rosaceae (GDR; www.rosaceae.org) under analysis URL (https://www.rosaceae.org/Analysis/20425168). These datasets are publicly available from the cherry genome database (https://sakura.nig.ac.jp/genome/). The monomer sequence of the centromeric region used for the centromeric region structure and hierarchy annotation analyses is available on Figshare with the [https://doi.org/10.6084/m9.figshare.26046853]75. All the raw data generated from this analysis are compiled in a list (Table 4).

Table 4 Summary of all the raw data generated in this study.

Technical Validation

The quality of the C. speciosa genome assembly was assessed using four methodologies. The first method employed was an evaluation of completeness using the BUSCO assessment with the “eudicots_odb10” lineage dataset (n = 2326), representative of eudicots. The final BUSCO assessment revealed a completeness rate of 98.4% (n = 2289), comprising 95.9% (n = 2231) single-copy, 2.5% (n = 58) duplicated, 0.5% (n = 11) fragmented, and 1.1% (n = 26) missing BUSCOs. Additionally, a second method involves comfort evaluation, which is proposed to be faster and more accurate than BUSCO. Utilizing the same “eudicots_odb10” BUSCO gene set, evaluation of compleasm showed a completeness of 99.1% (n = 2306), with 97.0% (n = 2257) single-copy, 2.1% (n = 49) duplicated, 0.4% (n = 9) fragmented, and 0.5% (n = 11) missing BUSCOs. Although the evaluation using compleasm indicated a higher quality than that of BUSCO, the BUSCO assessment itself sufficiently demonstrated the high quality of the genome sequence produced in this study. Third, we used Inspector to assess our novel genome sequences by generating alignments of PacBio HiFi sequencing reads, which were used for genome assembly, with the assembled genome sequences. The mapping rates from reads-to-contigs and QV were 99.99% and 35.89 (corresponding to 99.97% accuracy), respectively. Finally, we employed Merqury, a reference-free, k-mer-based genome assembly evaluation approach, for validation. This method involves measuring the k-mer spectrum from a read set used for assembly, thereby enabling independent evaluation without the use of a reference. To construct the k-mer database, we specified k = 21 using PacBio HiFi reads from Meryl. The results from this approach further indicated that our assembly data exhibited high-quality metrics (QV > 67 with an error rate of 1.80 × 10−7). A comprehensive assessment using all these methodologies demonstrated the completeness of the C. speciosa genome assembly (Table 2).