Background & Summary

Hibiscus is a genus of flowering plants belonging to the Malvaceae family, which encompasses over 300 species1. It is widely distributed in warm-temperate, tropical, and subtropical regions across the world and has relatively high economic value. These species are well-known for their large, colorful, and visually striking flowers, making them highly valued as ornamental plants. Certain species, such as H. syriacus (also known as Rose of Sharon) and H. rosa-sinensis (China rose), are extensively cultivated in gardens and landscapes for their aesthetic appeal2.

Besides its ornamental value, Hibiscus has long been utilized for diverse purposes. Studies have shown its significant medicinal properties, including antioxidant3,4, anti-inflammatory5,6, antitumor7, antihypertensive8, and antimicrobial activities9,10. The genus also has culinary applications, with flowers and other plant parts used in beverages such as hibiscus tea, which is globally popular for its refreshing flavor and health benefits, as well as in food products11. In addition, species such as H. tiliaceus provide fibers for textile production, including rope-making and construction materials12. Similarly, H. cannabinus is used in paper-making and supplies materials for crafting and construction13.

Hibiscus yunnanensis S.Y. Hu is a perennial shrub that primarily grows in subtropical regions on sunny, dry, and hot mountain slopes at elevations of 400–600 m. Since Hu (1955) established the species with Herry A. [13218] as the type specimen14, H. yunnanensis has been found only in Yuanjiang Hani, Yi and Dai Autonomous County, Yuxi City, Yunnan Province. It is classified as endangered (EN)15 because of its restricted distribution, limited number of mature individuals, and declining habitat quality. It exhibits unique morphological characteristics, such as a yellow, bell-shaped corolla with a central purplish-red coloration (Fig. 1a), which distinguish it from the predominantly white, pink, and red flowers of most other Hibiscus species. This distinct floral morphology imparts significant ornamental value, making H. yunnanensis a promising candidate for horticultural applications, including seed propagation and cultivation. Furthermore, it grows in dry and hot valleys and shows remarkable adaptation to high temperatures and limited water availability; the ecological characteristics of these habitats resemble those of savannas, with open landscapes, sparse vegetation, and seasonal climatic extremes16. Hibiscus yunnanensis has a long flowering period and can flower off-season in winter. Currently, no studies have specifically investigated the style curvature mechanism in H. yunnanensis. However, given its characteristic automatic selfing without inbreeding depression, we hypothesize that the species may use a distinctive style curvature mechanism to ensure reproductive success while avoiding inbreeding depression. A high-quality reference genome is therefore important for promoting the comprehensive study of H. yunnanensis. Such a genomic resource will facilitate the integration of genomic data with ecology, thereby enhancing our understanding and exploitation of this species.

Fig. 1

Morphology and genomic features of H. yunnanensis. (a) Floral morphology from a frontal view, showing the characteristic yellow, bell-shaped corolla with a central purplish-red coloration. (b) Lateral view showing the calyx morphology along with trichomes, which can emit a fetid smell to deter potential herbivores. (c) Circos plot of the H. yunnanensis genome. The tracks from outside to inside display the chromosomes, gene number, GC content, repeat density, LTR density, LTR/Copia density, LTR/Gypsy density, and self-vs-self collinearity blocks. (d) Hi-C interaction heat map.

Here, we present the genome of H. yunnanensis, assembled using ONT reads (263.7 Gb, 119.9×), NGS reads (315.4 Gb, 141.4×), Hi-C reads (210.8 Gb, 94.5×), and RNA-seq reads (134.3 Gb, 94.5×). The assembled contig size is close to the genome size of 2.2 Gb estimated from k-mer analysis, with a scaffold N50 length of 137.1 Mb. A total of 99.2% of the assembled sequences are anchored to 17 pseudo-chromosomes. The genome contains 42,085 protein-coding genes, 96.4% of which are functionally annotated. The high-quality genome assembly and annotation of H. yunnanensis will provide insights into ecology within the genus Hibiscus, laying the foundation for ecological and molecular genetics studies.

Methods

Sampling

Hibiscus yunnanensis individuals were collected from Yuanjiang Hani, Yi, and Dai Autonomous County in Yuxi City, Yunnan Province. These plants were then self-pollinated to produce second-generation individuals in the greenhouse of Kunming Botanical Garden (Fig. 1a). Fresh young leaves from these second-generation plants were collected and stored for DNA extraction. Additionally, we collected young leaves, mature leaves, stems, fruits, budding flowers, full-blooming flowers, and roots from the same plant for transcriptome sequencing. Three biological replicates for each sample were immediately frozen in liquid nitrogen and subsequently stored at –80 °C.

Library construction and sequencing

We employed a modified CTAB (cetyltrimethylammonium bromide) method17 to extract high-quality genomic DNA from young H. yunnanensis leaves. The concentration of the extracted DNA was measured using both a Nanodrop (Nanodrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). The purity and integrity of the DNA were checked by 1% agarose gel electrophoresis.

Genomic DNA fragments ranging from 200 to 400 base pairs (bp) were used for second-generation short-read library preparation. A total of 1 μg of genomic DNA was used, following the manufacturer’s protocol provided by BGI. Short-read libraries were then subjected to paired-end (PE) sequencing on the BGI-DNBSEQ platform18 (BGI Inc., Shenzhen, China) in PE150 mode, producing 315.4 Gb of raw data at approximately 141.4× coverage (Table 1; Supplementary Table S2).

Table 1 Summary of H. yunnanensis genome assembly.

For ONT library preparation and sequencing19, the Nanopore DNA library was prepared using the SQK-LSK108 Kit from Oxford Nanopore Technologies (Oxford, UK). The library was sequenced on a Nanopore GridION X5 sequencer using five flow cells. Base calling was carried out using Guppy v4.0.11 within the MinKNOW package, generating 151.2 Gb of data with roughly 67.8× coverage for assembly (Table 1; Supplementary Table S3).

We used the TIANGEN kit with DNase I to extract total RNA, which was then prepared into paired-end libraries with a 250 bp insert size using the NEBNext Ultra™ RNA Library Prep Kit (Supplementary Table S5). These libraries were then sequenced on the BGI-DIPSEQ platform. Low-quality data were filtered out using Trimmomatic v0.3920 with the parameters ILLUMINACLIP:adapter.fa:2:30:10 LEADING:5 TRAILING:5, generating 134.3 Gb of 100 bp paired-end data.
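To make the LEADING and TRAILING steps concrete, the sketch below mimics their effect on a single read in plain Python. It is an illustration only and not Trimmomatic's code: the function name `trim_ends` and the example read are hypothetical, and only the threshold of 5 is taken from the parameters above.

```python
def trim_ends(seq, quals, threshold=5):
    """Drop bases from both ends while their Phred quality is below threshold.

    Mirrors the idea behind LEADING:5 and TRAILING:5; low-quality bases in the
    middle of the read are left untouched.
    """
    start, end = 0, len(seq)
    # Trim from the 5' end (LEADING).
    while start < end and quals[start] < threshold:
        start += 1
    # Trim from the 3' end (TRAILING).
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]


# Hypothetical read with low-quality bases at both ends and one in the middle.
read = "ACGTACG"
phred = [2, 3, 30, 30, 4, 40, 1]
trimmed_seq, trimmed_quals = trim_ends(read, phred)
print(trimmed_seq, trimmed_quals)
```

Note that the internal base with quality 4 survives, because these two steps only act on the read ends.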

Hi-C library construction and sequencing

Hi-C library construction was carried out using the DpnII restriction enzyme and a method from the BGI QingDao Institute21. The chromatin was digested with DpnII and labeled at the ends with biotin-14-dATP (Thermo Fisher Scientific, Waltham, MA, USA). The DNA was then extracted, purified, and sheared using a Covaris S2 (Covaris, Woburn, MA, USA). The prepared Hi-C libraries were sequenced on the BGI-DIPSEQ platform, producing approximately 210.8 Gb (94.5×) of data with 150 bp paired-end reads (Table 1; Supplementary Table S4).
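The coverage values quoted for the sequencing data follow from dividing the data volume by the genome size. The short sketch below reproduces that arithmetic. The genome size of 2.23 Gb is an assumption back-calculated from the reported figures (the text rounds it to 2.2 Gb), and the function and variable names are illustrative.

```python
def coverage(total_bases_gb, genome_size_gb):
    """Sequencing depth = total sequenced bases / genome size."""
    return total_bases_gb / genome_size_gb


# Assumption: 2.23 Gb reproduces the reported depths; the text states ~2.2 Gb.
GENOME_SIZE_GB = 2.23

# Data volumes (Gb) as reported in the text.
datasets = {
    "short reads": 315.4,
    "ONT reads used for assembly": 151.2,
    "Hi-C reads": 210.8,
}

for name, gigabases in datasets.items():
    print(f"{name}: {coverage(gigabases, GENOME_SIZE_GB):.1f}x")
```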

Genome size estimation

To estimate the genome size of H. yunnanensis, we performed a 17-mer frequency analysis on 60× BGI-DIPSEQ short reads. Reads were quality-filtered with SOAPfilter v2.223 to reduce sequencing errors. The 17-mer frequency distribution, analyzed with GenomeScope222, gave an estimated genome size of 2.2 Gb. To assess genome heterozygosity, variants were called from whole-genome short-read data using the Genome Analysis Toolkit (GATK) v4.2.3.024, yielding a heterozygosity of 0.001%.
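The principle behind k-mer-based size estimation can be sketched in a few lines of Python. This is a deliberately simplified peak-based estimate (total k-mers divided by the depth of the main peak), not the model that GenomeScope fits; the function names, the `min_depth` cutoff used to discard error k-mers, and the toy histogram are all assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=17):
    """Map each depth to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=5):
    """Peak-based estimate: total k-mer occurrences / depth of the main peak.

    k-mers below min_depth are treated as sequencing errors and ignored.
    """
    solid = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram: an error peak at depth 1 and a main peak at depth 30.
toy_hist = {1: 1000, 28: 50, 29: 80, 30: 100, 31: 80, 32: 50}
print(estimate_genome_size(toy_hist))
```

In the toy histogram the main peak contains 360 distinct k-mers, and the estimate recovers that number once the error peak is excluded.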

Genome assembly and optimization

For genome assembly, four assemblers were tested with different parameter settings to optimize assembly quality. NextDenovo v2.225 was run with the parameters read_cutoff = 1k and seed_cutoff = 32937, with all other settings left at their defaults, on both 100× and 65× nanopore data. Canu v2.026 was run on 65× nanopore data with the parameters -d ./result merylThreads=40 genomeSize=2.32g minReadLength=1000 minOverlapLength=500 corOutCoverage=120 corMinCoverage=2 and otherwise default settings. Similarly, Wtdbg2 v2.527 was run on 65× nanopore data with the parameters -x ont -g 2.4g. Finally, Flye v2.9.128 was run in three configurations: 65× ONT data with min-overlap 5000; 65× ONT data with min-overlap 10,000; and 100× ONT data with min-overlap 10,000. After comparing the results of the different assemblers and parameters in terms of coverage, BUSCO completeness, and N50, the Flye assembly with a contig N50 of 11.6 Mb was selected as the primary contig assembly for subsequent analysis (Supplementary Table S6).
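Since N50 is one of the criteria used to compare the candidate assemblies, a minimal sketch of how it is computed may be useful. The function name and the toy contig lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together make up
    at least half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths supplied")


# Toy assembly: total length 20, so the half-way point is 10.
print(n50([2, 3, 4, 5, 6]))
```

Sorting from longest to shortest, the contigs of length 6 and 5 already cover 11 of the 20 bases, so the N50 of this toy set is 5.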

Subsequently, NextPolish v1.3.029 was used to refine the initial draft contigs (Flye) through six polishing rounds: two rounds with ONT long reads and four rounds with short reads. Following this, purge_dups v1.2.330 was employed to curate the contigs, generating a scaffold N50 length of 12,097,940 bp (Table 1; Supplementary Table S7). This selection process considered the mapped read coverage obtained from short-read data and alignments produced with Minimap231.

Hi-C paired-end reads were processed with Trimmomatic v0.3920 to remove low-quality bases and adapter sequences. To compute the contact frequency, all filtered reads were aligned to the contig assembly using Juicer v332. The 3D-DNA pipeline v18092233 was then executed with two iterative rounds of misjoin correction (-r 2), with all other parameters left at their defaults. Manual inspection and refinement of the draft assembly were performed using Juicebox v1.11.0834 (Fig. 1d).
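The contact frequencies used for scaffolding are, in essence, counts of read pairs falling into pairs of genomic bins. The sketch below shows that binning step for a single sequence. It is a conceptual illustration, not the Juicer or 3D-DNA implementation; the function name, the 0-based coordinates, and the toy read pairs are assumptions.

```python
def contact_matrix(pairs, length, bin_size):
    """Build a symmetric contact matrix from mapped read-pair positions.

    pairs    -- iterable of (pos1, pos2) 0-based mapping positions
    length   -- length of the sequence in bp
    bin_size -- bin width in bp
    """
    n_bins = -(-length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Three hypothetical read pairs on a 30 bp toy sequence, 10 bp bins.
toy_pairs = [(5, 15), (12, 18), (3, 27)]
for row in contact_matrix(toy_pairs, length=30, bin_size=10):
    print(row)
```

A heat map such as the one inspected in Juicebox is a visualization of a matrix of this kind, where strong signal near the diagonal reflects the higher contact frequency between loci that are close together on the same chromosome.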

Assessment of the assembled genome

We used BUSCO v3.0.235 to assess the quality and completeness of the assembled H. yunnanensis genome by comparing it against the embryophyta_odb10 dataset, capturing 99.6% of the 1,614 core embryophyte genes (Supplementary Table S8). Additionally, RNA reads were mapped to the draft assembly using Hisat2 v2.1.036, achieving a mapping rate of >95.1%. Whole-genome short reads were also mapped with BWA37, resulting in a 99.8% mapping rate and 99.1% coverage. Together, these results indicate a high-quality genome assembly.

Repetitive elements identification

Both de novo and homology-based approaches were used to identify repeat sequences in the genome. The de novo approach involved constructing a novel repeat library using LTR_retriever v2.838, LTR_FINDER v1.0.739, and RepeatModeler240. Following this, we used RepeatMasker v4.0.841 to annotate the repeat elements. Tandem repeats were specifically identified using Tandem Repeats Finder v4.0742. For the homology-based approach, repeat elements were predicted through comparative analysis using RepeatMasker v4.0.8 and RepeatProteinMask v4-0-741. Repetitive elements made up 73.2% of the H. yunnanensis genome assembly, with LTR/Gypsy elements contributing 42.5% (Fig. 1c, Table 2).
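When repeat annotations from several tools are combined, overlapping calls must be counted only once to obtain a total repeat fraction. The sketch below shows one plausible way to do this by merging intervals; it is an assumption about the general approach rather than a description of how the figure above was computed, and the function name and toy intervals are hypothetical.

```python
def masked_fraction(intervals, genome_length):
    """Fraction of the genome covered by the union of half-open intervals.

    intervals     -- iterable of (start, end) with 0-based start, exclusive end
    genome_length -- total genome length in bp
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent call: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Two overlapping calls and one separate call on a 100 bp toy genome.
toy_repeats = [(0, 10), (5, 20), (30, 40)]
print(masked_fraction(toy_repeats, genome_length=100))
```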

Table 2 Summary of genome annotation of H. yunnanensis.

Protein coding genes prediction

Protein-coding gene sets were predicted using de novo gene prediction, homology-based annotation, and transcriptome-based prediction methods. For the de novo approach, gene prediction was executed on the repeat-masked genome using Augustus v3.0.343, GlimmerHMM v3.0.244, and SNAP v11/29/201345. For homology-based gene prediction, amino acid sequences from Durio zibethinus, Gossypium raimondii, Theobroma cacao, and three related species (H. cannabinus, H. syriacus, and H. trionum) were compared using GeMoMa v1.3.146 and the UniProt database (release 2021_04). TBLASTN v2.2.18 (e-value cutoff: 1e-5)47 was employed to identify putative homologous genes by aligning protein sequences across the entire genome. Subsequently, GeneWise v2.2.048 was used to refine the alignment regions, providing accurate exon and intron information. In the RNA-seq-based gene prediction approach, clean RNA-seq reads were aligned to the assembled genome using Hisat2 v2.0.436. Gene prediction was performed by identifying cDNAs through a genome-guided method with StringTie v1.2.249, followed by mapping these cDNAs back to the genome using PASA v2.3.350. The cDNA sequences assembled with Trinity v2.6.651 were then aligned to the H. yunnanensis genome using BLAT v34x1252. Finally, a non-redundant gene set was generated using the MAKER v353 pipeline, resulting in the identification of 42,085 protein-coding genes. The distributions of mRNA length, CDS length, intron length, and exon number in H. yunnanensis align closely with those observed in the genomes of the other species compared (Fig. 2), which supports the high quality of the H. yunnanensis annotation.

Fig. 2

Comparison of the distribution of gene elements for each gene among seven representative species. (a) mRNA length. (b) CDS length. (c) Exon length. (d) Intron length. The x-axis represents the length (bp) and the y-axis represents the density of genes or exons or introns. The species compared are H. yunnanensis (HYUN), D. zibethinus (Dzib), G. raimondii (Grai), H. cannabinus (Hcan), H. syriacus (Hsyr), H. trionum (Htri), and T. cacao (Tcac).

Functional annotation

To functionally annotate the protein-coding genes, we performed sequence similarity and domain conservation analyses. BLASTP was used for the initial homolog search against public protein databases, with an e-value cutoff of 1e-5, retaining up to the top five hits per query that had amino acid identity >0.3 and match length >0.5. The databases included SwissProt (release-2020_05)54, KEGG (59.3)55, TrEMBL (release-2020_05)54, and the NCBI non-redundant protein NR database (20201015). InterProScan v5.28-67.056 was then used to detect and classify domains and motifs, providing comprehensive functional annotations. The annotation rate for H. yunnanensis was 96.4% (Table 3).
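The hit-filtering criteria can be expressed as a small function. The sketch below assumes that hits have already been parsed into dictionaries and that "match length >0.5" means the aligned fraction of the query; both are assumptions, as are the field names and the toy hits.

```python
from collections import defaultdict


def filter_hits(hits, max_evalue=1e-5, min_identity=0.3, min_cover=0.5, top_n=5):
    """Keep, per query, up to top_n hits passing the e-value, identity and
    coverage thresholds, ordered by ascending e-value.

    Assumption: coverage = aligned length / query length.
    """
    kept = defaultdict(list)
    for hit in hits:
        cover = hit["align_len"] / hit["query_len"]
        if (hit["evalue"] <= max_evalue
                and hit["identity"] > min_identity
                and cover > min_cover):
            kept[hit["query"]].append(hit)
    return {
        query: sorted(query_hits, key=lambda h: h["evalue"])[:top_n]
        for query, query_hits in kept.items()
    }


# Hypothetical hits: only the first one passes all three thresholds.
hits = [
    {"query": "g1", "subject": "s1", "evalue": 1e-50, "identity": 0.80, "align_len": 90, "query_len": 100},
    {"query": "g1", "subject": "s2", "evalue": 1e-3, "identity": 0.90, "align_len": 90, "query_len": 100},
    {"query": "g1", "subject": "s3", "evalue": 1e-20, "identity": 0.25, "align_len": 90, "query_len": 100},
    {"query": "g2", "subject": "s4", "evalue": 1e-30, "identity": 0.60, "align_len": 40, "query_len": 100},
]
best = filter_hits(hits)
print({query: [h["subject"] for h in hs] for query, hs in best.items()})
```

Genes such as `g2` in this toy example, which have no hit passing the thresholds, would remain unannotated by this step and count against the annotation rate.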

Table 3 Statistics of gene functional annotations of H. yunnanensis.

Data Records

The Nanopore, Hi-C, BGI-DIPSEQ, and RNA sequencing data used for genome assembly and annotation have been deposited in the Genome Sequence Archive (GSA) of the National Genomics Data Center (NGDC) under accession number CRA02220957. Additionally, all raw genomic sequencing data are available in the CNGB Nucleotide Sequence Archive (CNSA) under accession CNP000398558. The final contigs and chromosome assembly have been submitted to NCBI with the accession number of GCA_048544135.159. Annotation files, including predicted CDS and protein sequences as well as GFF files can be accessed on Figshare60. All other data produced or analyzed in this study are included within the article.

Technical Validation

We assessed genome completeness using BUSCO v3.0.235 with the embryophyta_odb10 database. Of the 1,614 core embryophyte genes, 99.6% were identified in the H. yunnanensis assembly (Table 1). To further validate the assembly’s completeness, we performed short-read mapping using clean data, in which 99.8% of reads mapped as proper pairs to the H. yunnanensis assembly. We used the Bridger tool61 to assemble the transcriptome sequences, which were then mapped to the scaffold assemblies using BLAT52, yielding a mapping rate of 95.1%. Subsequently, BUSCO analysis was repeated after the Hi-C assembly, giving results similar to those obtained for the ONT contig assemblies.