Background & Summary

Lagerstroemia speciosa, a species of the genus Lagerstroemia in the family Lythraceae, originates from India and Oceania and is widely distributed across tropical and subtropical regions1. L. speciosa prefers warm and humid climates, shows strong resistance, and is a typical summer-flowering plant with a long flowering period, large flowers and bright colors2. The breeding goals for L. speciosa are to enrich its flower colors and patterns, improve its ornamental characteristics, and broaden its use in landscaping3. L. speciosa is highly fertile, has a high survival rate for cuttings, and is capable of interspecific hybridization with numerous Lagerstroemia species4. At present, studies on L. speciosa focus primarily on its medicinal properties, with over 40 compounds discovered in its foliage5,6. Numerous studies have shown that extracts from L. speciosa benefit bodily functions and overall health, making it a promising candidate for future therapeutic research on biologically active plant components7,8.

Using PacBio and Hi-C sequencing technologies, we produced a high-quality chromosome-level assembly of L. speciosa. The assembly is 306.76 Mb in size with a scaffold N50 of 13.03 Mb, and 98.75% of the sequence was anchored to 24 chromosomes. This genome assembly will facilitate research on the ornamental characteristics of L. speciosa and support molecular-assisted breeding.

Methods

Sample collection and sequencing

A superior diploid individual of L. speciosa was used as the sequencing material (Fig. 1A). It was obtained from wild germplasm collected in the nursery of Guangxi Forestry Research Institute, Nanning, China (22°92′ N, 108°36′ E). Fresh young leaves were collected, immediately frozen in liquid nitrogen, and then stored at −80 °C for subsequent genome sequencing and Hi-C analysis.

Fig. 1

Characteristics of L. speciosa genome assembly. (A) Inflorescence of L. speciosa. (B) Genome survey results based on K-mer analysis. (C) Chromosomal interaction heatmap of L. speciosa. Chr1 to Chr24 are marked with blue boxes from top to bottom. Red bar represents interaction intensity. (D) Circos map of L. speciosa genome. The tracks from outside to inside are: 24 chromosomes length (a), gene density (b), GC content (c), distribution of unknown bases (d) and syntenic blocks (e).

Genomic DNA was isolated from these samples using the CTAB method, and its concentration and purity were evaluated with a NanoDrop 2000 (Thermo Scientific, USA) and by gel electrophoresis. Short-read whole-genome sequencing (WGS) on the Illumina HiSeq2000 platform generated approximately 290.3 Gb of data. Long-read sequencing on the PacBio platform, which is based on Single Molecule Real Time (SMRT) sequencing technology, generated a total of 18.39 Gb of data. Before genome assembly, 17-mer frequencies were counted from the clean data with Jellyfish9 (v2.2.10) and used for genome evaluation with GenomeScope10. The analysis indicated an expected genome size of approximately 349.89 Mb, a repeat rate of about 43.65% and a heterozygosity rate of about 1.03% (Fig. 1B). Hi-C libraries were sequenced on the BGISEQ-500 platform at the Qingdao Huada Gene Research Institute, yielding approximately 62.24 Gb of data to assist scaffolding during genome assembly.

Total RNA was extracted using the OminiPlant RNA Kit. The mRNA was enriched with Oligo (dT) beads and subsequently fragmented into shorter segments. The resulting cDNA fragments were sequenced on an Illumina HiSeq4000 platform. Full-length RNA sequencing was performed on SMRTbell DNA libraries: three libraries of different lengths were constructed using the SMARTer™ PCR cDNA Synthesis Kit, and library sizes were verified using Qubit 2.0 and Agilent 2100. The cDNA libraries were sequenced on the PacBio RSII platform. Raw polymerase reads were obtained from each Zero-Mode Waveguide single-molecule sequencing reactor and processed to yield the final reads of insert, or Circular Consensus Sequences (CCS), and the raw data were filtered to remove redundancy.
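The idea behind a k-mer based genome size estimate can be illustrated with a simplified calculation: the total number of k-mer occurrences divided by the depth of the main peak in the k-mer frequency histogram. This is only a rough sketch. GenomeScope fits a statistical model to the histogram rather than using this ratio, and in a heterozygous genome the tallest peak may not be the homozygous one. The function name, the low-depth cutoff and the toy histogram below are assumptions for illustration, not data from this study.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Rough genome size estimate from a k-mer frequency histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    min_depth is a hypothetical cutoff used to drop likely error k-mers.
    """
    # Low-depth k-mers mostly come from sequencing errors, so ignore them.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is the depth shared by the most distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (illustrative values only).
toy_hist = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 500, 21: 300, 22: 100}
size, peak = estimate_genome_size(toy_hist)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

In practice the histogram would come from a k-mer counter such as Jellyfish, and the model-based estimate should be preferred over this ratio.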

Genome assembly

A draft contig-level genome was initially assembled from the PacBio sequencing data using CANU (v1.8). Pilon11 (v1.23) was then used to polish the preliminary assembly with the short-read data over two iterations. The resulting contig-level assembly was 306.17 Mb in size, with a contig N50 of 808.25 Kb. Subsequently, HiC-Pro (v2.8.0)12 was used to filter and process the raw Hi-C data. Hi-C uses endonuclease digestion and sequencing to capture interacting chromosomal regions based on their spatial proximity: the closer two chromosomal regions are, the stronger their interaction. This relationship allows scaffolds to be ordered, oriented and assembled at the chromosomal level. After data quality control, the valid data were assembled using Juicer (v1.5)13 and 3D-DNA14, and the results were compared with the initial assembly. From the chromosome interaction intensity statistics, a chromosome interaction heatmap was drawn with Juicebox (v1.11.08)15 using default settings (Fig. 1C). The alignment rate between the Hi-C reads and the initially assembled genome was approximately 73.48%. The final genome assembly of L. speciosa was 306.758 Mb in size and comprised 367 scaffolds and 1,539 contigs. The scaffold N50 was 13.03 Mb, and the maximum scaffold length was 18.28 Mb (Table 1). After Hi-C-assisted assembly, the sequences anchored to chromosomes represented 98.75% of the genome and formed 24 chromosomes ranging from 9.3 Mb to 18.28 Mb (Fig. 1D & Supplementary Table 1). BUSCO analysis against the embryophyta_odb10 reference gene set revealed a completeness of 92.8%, comprising 83.2% single-copy and 9.6% multicopy BUSCOs, with a further 1.9% fragmented BUSCOs (Table 2).
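For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows one common way to compute it: sort the sequence lengths from longest to shortest and return the length at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that brings the running total to at least
        # half of the assembly size is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: total length 30, so half is 15; 10 + 6 = 16 reaches it.
toy_lengths = [2, 3, 4, 5, 6, 10]
print(n50(toy_lengths))
```

Applied to the contig lengths of an assembly this gives the contig N50, and applied to the scaffold lengths it gives the scaffold N50.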

Table 1 Summary of the genome assembly.
Table 2 Genome BUSCO evaluation result.

Genome annotation

Genome annotation comprised repeat sequence annotation, gene structure annotation, gene function annotation and ncRNA annotation. Repeat sequences were annotated by combining homology-based and de novo methods. Homology-based annotation was performed with RepeatProteinMask/RepeatMasker (v4.0.9)16 against the RepBase17,18 database (http://www.girinst.org/repbase). De novo annotation used genome self-alignment with RepeatModeler (v1.0.11)19, Piler20 and RepeatScout, while TRF21 and LTR-Finder22 were used to identify repeats from their intrinsic sequence characteristics. In total, approximately 118.10 Mb of repeat sequences were identified in L. speciosa, accounting for 38.58% of the whole genome. Long terminal repeats (LTRs) made up the largest proportion, with a length exceeding 65.62 Mb (21.43% of the genome). The repetitive sequences also include 21.16 Mb (6.91%) of DNA transposons, 13.66 Mb (4.46%) of long interspersed nuclear elements (LINEs) and 0.016 Mb (0.053%) of short interspersed nuclear elements (SINEs) (Table 3).

Table 3 Classification of repetitive sequence.

Gene structure annotation combined several methods. First, Augustus (v3.3.4)23, Genscan24 and GlimmerHMM (v3.0.4)25 were used for de novo prediction. Then, homology-based annotation was performed using sequenced plant species, including Punica granatum, Eucalyptus grandis, Arabidopsis thaliana, Brassica napus, Camelina sativa, Cucumis melo, Eutrema salsugineum, Gossypium raimondii, Raphanus sativus and Theobroma cacao. RNA-seq data were aligned and assembled with HISAT2 (v2.2.0)28 and StringTie (v2.1.6)26,27 to complement and refine the predicted gene set. Finally, all annotation results were integrated and screened with EVM (v1.1.1)29, resulting in 31,378 genes with an average CDS length of 1.4 kb (Supplementary Table 2). BUSCO evaluation of the gene structure annotation showed that 92.1% of single-copy genes were completely annotated.

Using BLASTp with an E-value cutoff of 1E-5, the proteins in the gene set were functionally annotated against databases such as SwissProt, TrEMBL30, KEGG31, InterPro32 and GO33. In total, 93.54% of the genes in the L. speciosa genome were functionally annotated, and 93.31% of the genes were annotated in the TrEMBL database (Table 4).
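As an illustration of how an E-value cutoff such as 1E-5 can be applied, the sketch below filters BLAST hits in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and keeps the highest-scoring hit per query. The function name, the rule of keeping one best hit per query, and the toy hit lines are assumptions made for this example and do not describe the exact pipeline used here.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Keep the highest-bitscore hit per query among hits passing the cutoff.

    Expects BLAST tabular lines with the 12 default columns, tab-separated.
    Returns {query: (subject, evalue, bitscore)}.
    """
    best = {}
    for line in tabular_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Discard hits that are weaker than the E-value cutoff.
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy hits (hypothetical gene and protein identifiers).
toy_hits = [
    "geneA\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "geneA\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t230",
    "geneB\tP3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
]
hits = best_hits(toy_hits)
annotated_fraction = len(hits) / 2  # two query genes in the toy set
print(hits, annotated_fraction)
```

Here geneB is left unannotated because its only hit has an E-value above the cutoff, which is how the proportion of annotated genes can fall below 100%.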

Table 4 Genome function annotation result.

tRNA sequences in the genome were identified from their structural characteristics using tRNAscan-SE (v1.4)34. Because rRNA is highly conserved, rRNA sequences of closely related species were used as references to search for rRNA in the genome with BLASTN (v2.2.26) at an E-value of 1E-5. In addition, miRNA and snRNA sequences were predicted with Infernal (v1.0)36, a dedicated tool for predicting non-coding RNA, using covariance models from Rfam35 (Table 5).

Table 5 Genome ncRNA annotation result.

Data Records

The raw sequencing data and genome assembly of L. speciosa have been submitted to the National Center for Biotechnology Information (NCBI). The PacBio sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRP494381 (ref. 37). The Illumina raw data are accessible under accession number SRP494936 (ref. 38). Hi-C sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRP494313 (ref. 39). The second-generation RNA-seq data were deposited in the Sequence Read Archive under accession number SRP176400 (ref. 40). The full-length transcriptome data were deposited in the NCBI database under accession number SRP528501 (ref. 41). The genome assembly has been deposited at GenBank under accession number GCA_037672795.1 (ref. 42). The gene annotation, CDS sequences and protein sequences have been deposited at Figshare (https://doi.org/10.6084/m9.figshare.26861248) (ref. 43).

Technical Validation

To verify the integrity and accuracy of the chromosome-level assembly, we performed a BUSCO analysis using the embryophyta_odb10 dataset. In L. speciosa, a total of 92.8% of BUSCOs were found complete, indicating a relatively complete and high-quality genome (Table 2). The chromosome interaction heatmap shows that the interaction intensity within each chromosome is markedly stronger than that between chromosomes and that the chromosome boundaries are clear, indicating that the Hi-C-assisted assembly performed well.
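The heatmap-based check described above can also be expressed numerically by comparing intra-chromosomal (cis) and inter-chromosomal (trans) contact counts. The following minimal sketch sums both from a symmetric binned contact matrix. The function name, the bin labels and the toy matrix are hypothetical and are not derived from the Hi-C data of this study.

```python
def cis_trans_counts(contacts, bin_chrom):
    """Sum intra- (cis) and inter-chromosomal (trans) Hi-C contacts.

    contacts is a symmetric square matrix (list of lists) of contact
    counts between genomic bins; bin_chrom gives each bin's chromosome.
    """
    cis = trans = 0
    n = len(bin_chrom)
    for i in range(n):
        # Only the upper triangle (with diagonal) is visited so that each
        # bin pair of the symmetric matrix is counted once.
        for j in range(i, n):
            if bin_chrom[i] == bin_chrom[j]:
                cis += contacts[i][j]
            else:
                trans += contacts[i][j]
    return cis, trans


# Toy matrix with two bins on each of two chromosomes.
toy_bins = ["Chr1", "Chr1", "Chr2", "Chr2"]
toy_contacts = [
    [50, 20, 2, 1],
    [20, 60, 1, 3],
    [2, 1, 55, 25],
    [1, 3, 25, 45],
]
cis, trans = cis_trans_counts(toy_contacts, toy_bins)
print(f"cis = {cis}, trans = {trans}, cis fraction = {cis / (cis + trans):.3f}")
```

A high cis fraction is consistent with the pattern seen in the heatmap, where signal concentrates in blocks along the diagonal, although the heatmap additionally shows whether the chromosome boundaries are sharp.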