Background & Summary

Ornamental crabapples (Malus spp., Rosaceae) are woody plants with fruit diameters of less than 5 cm, whose flowers, leaves, fruits and other traits have significant ornamental value1. Interspecific hybridization plays a pivotal role in breeding new ornamental crabapple varieties. For a considerable period, crabapple flowers were predominantly white or pink. Following the discovery of Malus niedzwetzkyana (MN), the red and purple colours available in crabapples became notably more intense, which significantly enhanced their ornamental value. The Rosybloom hybrids, which are descendants of MN, bear flowers that are dark, deep rose, red, or purple1 and have been widely used in garden landscapes around the world. Previous studies have demonstrated that the colour of flowers, foliage and fruit is primarily influenced by anthocyanins, a class of flavonoid compounds2,3,4. Genetic analysis of MN at the molecular level therefore offers a valuable opportunity to understand the genetic basis of flower colour and other ornamental traits in crabapples, and thereby to support crabapple genetics and breeding.

De novo genome assembly is a foundational tool in molecular genetics research, and the development of third-generation PacBio HiFi sequencing and Hi-C techniques has markedly improved the completeness of genome assemblies. Several chromosome-level genomes of crabapples are now accessible in public databases, including the European crabapple (M. sylvestris)5, the Pacific crabapple (M. fusca)6, M. prunifolia7, and two ornamental crabapple cultivars, ‘Royalty’ and ‘Flame’4. These chromosome-level sequences allow researchers to investigate the functional, regulatory, and evolutionary aspects of the genome in the genus Malus and to gain a more nuanced understanding of its important traits.

This study presents a high-quality genome assembly for MN, generated through an integrated approach using Illumina short-read, PacBio high-fidelity (HiFi) long-read, and high-throughput chromosome conformation capture (Hi-C) sequencing data. In total, we generated 40.02 Gb (~59 × coverage) of clean short reads, 29.42 Gb (~43 × coverage) of PacBio HiFi CCS reads with an N50 of 17.97 kb, and 63.95 Gb (~94 × coverage) of clean Hi-C data (Table 1). Based on 17-mer analysis, the estimated genome size, heterozygosity, and repetitive content were 678.26 Mb, 0.57% and 33.74%, respectively. The final assembled genome size was 672.64 Mb with a contig N50 of 36.45 Mb (Table 2). The assembled contigs were then anchored onto 17 pseudochromosomes, with an anchoring rate of 98.38% (Fig. 1). The quality and completeness of the assembly were validated using four approaches. First, the clean short reads and the PacBio HiFi CCS reads were mapped to the assembly, yielding mapping ratios of 99.32% and 99.95%, respectively (Table 3). Second, telomeres were identified at both ends of twelve chromosomes and at one end of four chromosomes (chr1, chr5, chr13, and chr16); only chromosome 15 had no telomere identified (Fig. 2). Third, the BUSCO analysis identified 1590 (98.6%) complete genes from the orthologue database, of which 1058 (65.6%) were single-copy and 532 (33.0%) were duplicated (Table 4). Finally, the long terminal repeat (LTR) assembly index (LAI) score of the MN genome assembly was 21.99, reaching the ‘gold standard’ level (LAI > 20).
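The fold-coverage values quoted above follow from dividing each data volume by the estimated genome size. The short sketch below reproduces that arithmetic from the figures in the text; the function and variable names are illustrative only, and using the 17-mer estimate (rather than the assembled size) as the denominator is an assumption.

```python
# Sequencing depth = total bases sequenced / estimated genome size.
GENOME_SIZE_MB = 678.26  # 17-mer estimate reported in the text (assumed denominator)

# Data volumes in Gb, as reported in the text
datasets_gb = {
    "short reads": 40.02,
    "PacBio HiFi": 29.42,
    "Hi-C": 63.95,
}


def depth(data_gb: float, genome_mb: float) -> float:
    """Return fold coverage for a data volume (Gb) and a genome size (Mb)."""
    return data_gb * 1000.0 / genome_mb


coverage = {name: depth(gb, GENOME_SIZE_MB) for name, gb in datasets_gb.items()}

for name, cov in coverage.items():
    print(f"{name}: ~{cov:.0f}x")
```

Rounded to whole numbers, the three values agree with the ~59 ×, ~43 × and ~94 × stated in the text.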

Table 1 Summary of MN sequencing data.
Table 2 Summary of MN genome assembly data.
Fig. 1

Hi-C interaction analysis and circos map. (a) Hi-C interaction heatmap of MN. (b) The circos map of MN. For the circos map, the tracks from inside to outside are chromosome ID and length (I), GC content (II), density of protein-coding genes (III), density of LTR elements (IV).

Table 3 Mapping ratio of short reads and PacBio HiFi reads on the MN genome assembly.
Fig. 2

Overview of the near T2T and gap-free MN genome. The triangle and block represent the telomere regions and gaps, respectively.

Table 4 BUSCO values of the assembly and annotation genes.

Repeat sequences accounted for 476.89 Mb, representing 70.90% of the assembly (Table 5). LTR retrotransposons were the most abundant repetitive elements, accounting for 52.82% (Table 6). Gene annotation identified a total of 43,813 protein-coding genes (Table 2). The predicted proteins attained a complete BUSCO score of 98.30%, which indicates high annotation quality (Table 4). A total of 42,972 (98.08%) protein-coding genes were successfully annotated in various databases, including InterPro, Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-Prot, TrEMBL (Translation of European Molecular Biology Laboratory), and the NCBI non-redundant database (NR) (Table 7). A total of 4,351 non-coding RNA genes (152 microRNAs [miRNAs], 770 transfer RNAs [tRNAs], 3,074 ribosomal RNAs [rRNAs], and 355 small nuclear RNAs [snRNAs]) were identified in the genome (Table 8).

Table 5 General statistics of repeats in the MN genome assembly.
Table 6 Transposable elements (TEs) in the MN genome assembly.
Table 7 Number of functional annotations for predicted genes in the MN genome assembly.
Table 8 Number of noncoding RNA genes in the MN genome assembly.

Methods

Sample collection, DNA and RNA extractions

We collected fresh leaves, flowers, bark from tender branches and young fruits from a mature MN tree growing in the China National Botanical Garden (Beijing, China). DNA was isolated from the leaf samples using a modified cetyltrimethylammonium bromide (CTAB) method, and RNA was isolated from all collected samples using Trizol reagent (Invitrogen, CA, USA). The quality of the DNA and RNA was examined using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and agarose gel electrophoresis. DNA and RNA quantities were determined using the Qubit dsDNA HS Assay Kit on a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and a Bioanalyzer 2100 system (Agilent Technologies, CA, USA), respectively.

Preparation and sequencing of short insert libraries

For the DNA short insert libraries, 1 μg of DNA was used as starting material and sequencing libraries were constructed using the VAHTS Universal DNA Library Prep Kit for MGI (Vazyme, Nanjing, China). For the RNA-seq libraries, 1 μg of RNA per sample served as the input material, with mRNA isolated from the total RNA using magnetic beads carrying poly-T oligonucleotides. Sequencing libraries were created using the VAHTS Universal V6 RNA-seq Library Kit for MGI (Vazyme, Nanjing, China). Unique index codes were incorporated to differentiate sequences among samples. The libraries were quantified and sized with the Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and the Bioanalyzer 2100 system (Agilent Technologies, CA, USA). Finally, both DNA and RNA libraries were sequenced on an MGISEQ-2000 platform to generate 150-bp paired-end reads.

Preparation and sequencing of long insert libraries

For the DNA PacBio long insert sequencing, a SMRTbell library was constructed following the manufacturer’s protocol using the SMRTbell Express Template Prep kit 2.0 (Pacific Biosciences). Initially, 15 μg of genomic DNA was processed in the first enzymatic step to eliminate single-stranded overhangs, followed by application of repair enzymes to mend any damage along the DNA backbone. The ends of the double-stranded DNA fragments were smoothed and then extended to create an A-overhang. The ligation with T-overhang SMRTbell adapters was conducted at 20 °C for 60 minutes. The SMRTbell library was cleaned with 1 × AMPure PB beads. The library’s size distribution and concentration were evaluated with the FEMTO Pulse automated pulsed-field capillary electrophoresis device (Agilent Technologies, Wilmington, DE) and the Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). Following library characterization, 3 μg DNA underwent size selection with the BluePippin system (Sage Science, Beverly, MA), to eliminate SMRTbells shorter than 15 kb. The library was again purified with 1 × AMPure PB beads. The library’s size and quantity were reassessed using the FEMTO Pulse and the Qubit dsDNA HS reagents Assay kit. The sequencing primer and Sequel II DNA Polymerase were annealed and bound to the final SMRTbell library, respectively. The library was loaded at an on-plate concentration of 120 pM via diffusion loading. SMRT sequencing was executed using a single 8 M SMRT Cell on the Sequel II System, using the Sequel II Sequencing Kit and 1800-minute movies.

Hi-C Library construction and sequencing process

The Hi-C library was prepared according to a previous report8. In summary, samples were fixed by vacuum infiltration with 3% formaldehyde for 30 min at 4 °C and then treated with 0.375 M glycine solution for 5 min to quench the cross-linking reaction. The fixed samples were lysed, and endogenous nucleases were neutralized with 0.3% SDS. Chromatin DNA was digested using 100 U of MboI (NEB), labeled with biotin-14-dCTP (Invitrogen) and subsequently ligated with 50 U of T4 DNA ligase (NEB). After reversing the cross-links, the ligated DNA was purified using the QIAamp DNA Mini Kit (Qiagen) as per the manufacturer’s guidelines. The extracted DNA was fragmented into 300–500 bp pieces, subjected to blunt-end repair, A-tailing and adaptor ligation, and then purified via biotin-streptavidin pull-down and amplified by PCR. The Hi-C libraries were then quantified and sequenced using the Illumina NovaSeq platform (San Diego, CA, USA).

Genome survey and analysis

After quality assessment and filtering, 40.02 Gb of clean DNA short reads were obtained for the genome survey. The genome size, heterozygosity, and percentage of repetitive sequences were estimated from the clean short reads using GCE (v1.0.2)9 with a k-mer size of 17.
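The basic idea behind a k-mer genome survey can be illustrated with the textbook relation: genome size ≈ total number of k-mers / depth of the main k-mer peak. The sketch below applies that relation to a toy histogram. It is not the GCE algorithm, which fits a more elaborate model; the histogram values, the `min_depth` cut-off for discarding likely error k-mers, and all names are hypothetical.

```python
from typing import Dict


def estimate_genome_size(hist: Dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size from a k-mer depth histogram.

    hist maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (an illustrative assumption, not a GCE parameter).
    """
    solid = {d: c for d, c in hist.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)             # depth of the main peak
    total_kmers = sum(d * c for d, c in solid.items())  # total k-mer occurrences
    return total_kmers / peak_depth


# Toy histogram (hypothetical values, not from the MN data)
toy_hist = {1: 1000, 2: 300, 28: 50, 29: 120, 30: 200, 31: 120, 32: 50}
toy_size = estimate_genome_size(toy_hist)
print(toy_size)
```

In real data the main peak is broadened by heterozygosity and repeats, which is why dedicated tools model several peaks rather than taking a single maximum.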

Genome assembly

The resulting 29.42 Gb of PacBio HiFi long reads were assembled into a draft genome using hifiasm (v0.16.1)10 with default parameters. The Hi-C reads (70.74 Gb) were then used to anchor the contigs onto 17 chromosomes with Juicer tools (v1.6)11 and 3D-DNA12, based on the Hi-C interaction data (Fig. 1). Finally, Juicebox (v1.11.08)13 was used to manually adjust the assembled genome. The BUSCO (Benchmarking Universal Single-Copy Orthologs) pipeline (v5.2.2)14 was used with the embryophyte_odb10 dataset, which contains 1,614 BUSCO genes, to assess the coverage of highly conserved genes and thereby validate the completeness of the genome assembly.
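Contig N50, the contiguity statistic reported for this assembly, is the length L such that contigs of length at least L together contain half or more of the assembled bases. A minimal sketch of that definition follows; the function name and the toy lengths are illustrative and are not taken from the MN assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # This contig is the one that crosses the 50% mark.
            return length
    return ordered[-1]  # not reached for positive lengths


# Toy contig lengths (hypothetical)
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

For the toy list the total is 150, so the 50% mark (75) is crossed by the second-longest contig, giving an N50 of 40.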

Repetitive sequences annotation and telomeres identification

We identified the repeat content of the MN genome by combining homology-based and de novo prediction. For the homology-based analysis, we used RepeatMasker (v4.1.2; http://repeatmasker.org) with the Repbase TE library15 to detect known transposable elements (TEs) within the genome. Additionally, RepeatProteinMask (v4.1.2) was employed to search against the TE protein database. For de novo prediction, we generated a custom repeat library for the genome using RepeatModeler (v2.0.2a; http://www.repeatmasker.org/RepeatModeler/), which automates the execution of RECON (v1.08)16 and RepeatScout (v1.0.5)17, two key tools for identifying, refining, and classifying potential interspersed repeats. Furthermore, LTR_FINDER (v1.0.7)18 was used for a de novo search of long terminal repeat (LTR) retrotransposons within the genome sequences. Tandem repeats were identified with Tandem Repeats Finder (v4.10.0)19, and non-interspersed repeat sequences, such as low-complexity repeats, satellites and simple repeats, were detected using RepeatMasker. The repeat libraries from both methods were then integrated to determine the repeat content. Telomeric sequences in the MN genome assembly were identified using quarTeT (v1.0.3)20 with the “-c plant” option, which revealed the telomere repeat monomer to be TTTAGGG.
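To illustrate what detecting a telomere at a chromosome end involves, the sketch below counts copies of the reported monomer TTTAGGG (and its reverse complement CCCTAAA) within a terminal window of a sequence. This is a simplified stand-in and not the quarTeT algorithm; the window size, the minimum copy number and all function names are hypothetical choices made for the example.

```python
import re
from typing import Tuple

MONOMER = "TTTAGGG"   # telomere repeat monomer reported in the text
REVCOMP = "CCCTAAA"   # its reverse complement
PATTERN = re.compile(f"{MONOMER}|{REVCOMP}")


def count_terminal_repeats(seq: str, window: int = 10_000) -> Tuple[int, int]:
    """Count monomer copies in the first and last `window` bases of seq."""
    seq = seq.upper()
    return len(PATTERN.findall(seq[:window])), len(PATTERN.findall(seq[-window:]))


def has_telomeres(seq: str, window: int = 10_000,
                  min_copies: int = 50) -> Tuple[bool, bool]:
    """Return (start_has_telomere, end_has_telomere) under an assumed cut-off."""
    left, right = count_terminal_repeats(seq, window)
    return left >= min_copies, right >= min_copies


# Toy chromosome (hypothetical): telomeric repeats at the start only
toy_chr = "CCCTAAA" * 60 + "ACGT" * 5000
print(count_terminal_repeats(toy_chr))
print(has_telomeres(toy_chr))
```

A sequence flagged at only one end by a check of this kind corresponds to the single-telomere chromosomes described in the text, whereas a sequence flagged at neither end corresponds to a chromosome with no telomere identified.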

Prediction and functional annotation of protein-coding genes

We used three strategies to predict genes: ab initio prediction, homology-based prediction, and transcriptome-based prediction, all applied to the repeat-masked, chromosome-scale genome. First, ab initio prediction was performed with Augustus (v3.3.1)21 and Genscan (v1.0)22 using models trained on a curated set of high-quality proteins derived from RNA-Seq data. Second, the genome sequences were aligned with the protein sequences of seven plants (M. domestica, M. sieversii, M. sylvestris, M. prunifolia, Pyrus pyrifolia, Crataegus pinnatifida, and Prunus persica), and gene structures were predicted with Exonerate (v2.2.0)23 with default parameters. Third, for transcriptome-based gene prediction, the gene structure was built using PASA (v2.4.1)24. For Iso-Seq-based gene prediction, the Iso-Seq reads were aligned to scaffolds using GMAP (v2017-11-15)25. The transcripts were used to predict open reading frames (ORFs) using PASA, and full-length cDNAs were screened as a training set. Finally, Maker (v3.00)26 integrated the predictions from the three approaches to generate a coherent, non-overlapping set of gene models. Gene functions were assigned based on the highest-scoring matches of the alignments to the National Center for Biotechnology Information (NCBI) Non-Redundant (NR), TrEMBL27, InterPro28 and Swiss-Prot27 protein databases using BLASTP (v2.6.0+)29 and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database30, all with an E-value threshold of 1E-5. Protein domains were annotated with InterProScan (v5.35-74.0)31 based on the InterPro protein databases. The Pfam database32 was used to identify the motifs and domains within gene models. Gene Ontology (GO)33 IDs were assigned to each gene by Blast2GO34.
Non-coding RNAs, including miRNAs, tRNAs, rRNAs, and snRNAs, were annotated as follows: tRNAs were predicted using tRNAscan-SE (v1.3.1)35 with default parameters; rRNAs were identified by mapping Arabidopsis thaliana rRNA sequences to the MN genome using BLASTN-short (v2.2.28); miRNAs and snRNAs were identified using INFERNAL (v1.1.3)36 against the Rfam database with default parameters.
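The functional assignment of protein-coding genes described above, in which each gene takes the annotation of its highest-scoring BLASTP match with an E-value no greater than 1E-5, can be sketched as a filter over tabular BLAST output. The example assumes the standard 12-column tabular format (BLAST+ `-outfmt 6`), which the text does not specify; the gene and protein identifiers are hypothetical.

```python
import csv
import io
from typing import Dict, Tuple

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def best_hits(tabular: str,
              cutoff: float = E_VALUE_CUTOFF) -> Dict[str, Tuple[str, float, float]]:
    """Map each query to (subject, evalue, bitscore) of its best passing hit.

    Assumes 12 tab-separated columns: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for row in csv.reader(io.StringIO(tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit fails the E-value threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy alignments (hypothetical identifiers and scores)
rows = [
    ["geneA", "protX", "95.0", "200", "10", "0", "1", "200", "1", "200", "1e-100", "380"],
    ["geneA", "protY", "60.0", "180", "72", "0", "1", "180", "5", "185", "1e-20", "120"],
    ["geneB", "protZ", "30.0", "50", "35", "0", "1", "50", "1", "50", "0.5", "25"],
]
toy_table = "\n".join("\t".join(r) for r in rows)
hits = best_hits(toy_table)
print(hits)
```

Here geneA is assigned to protX, its highest-bitscore hit, while geneB receives no assignment because its only hit fails the E-value threshold.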

Data Records

The DNA sequence reads of MN from the PacBio HiFi library (SRR30127381; ref. 37) and the Hi-C library (SRR30140708; ref. 38) have been deposited in the Sequence Read Archive (SRA) under BioProject accession PRJNA1143952. The genome assembly has been deposited in the GenBank database under the accession number JBHDYX000000000 (ref. 39). The gene structure annotation, repeat prediction and gene functional annotation files have been deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.26962936)40.

Technical Validation

We used several methods to assess the accuracy and completeness of the MN genome assembly. First, the Hi-C heatmap supported the accuracy of the genome assembly, with distinct Hi-C signals between the 17 pseudo-chromosomes, indicating their relative independence (Fig. 1). Second, the completeness and accuracy of the assembled genome were further substantiated by the benchmarking universal single-copy orthologues (BUSCO) analysis, which identified 1,590 complete plant orthologues (98.6%) (Table 4). Third, the long terminal repeat (LTR) assembly index (LAI) score of the present assembly was 21.99, reaching the ‘gold standard’ level (LAI > 20). Fourth, the accuracy was corroborated by the very high mapping rates of two types of reads, with 99.32% of short insert reads and 99.95% of HiFi reads mapping to the MN assembly (Table 3). Finally, the chromosome telomere location map showed that the assembled genome extended to the telomeres, with the exception of chromosome 15, and that the majority of the chromosomes were assembled with telomeres at both ends (Fig. 2).