Background & Summary

Hoya R. Br. is the largest genus in the tribe Marsdenieae (Asclepiadoideae, Apocynaceae), comprising 350–450 species, most of which are epiphytic or hemi-epiphytic vines or subshrubs distributed in tropical and subtropical areas of the Asia-Pacific region [1,2,3,4,5]. As in many Asclepiadoideae genera, Hoya flowers exhibit highly specialized features, particularly the fusion of the androecium and gynoecium, which leads to the formation of novel structures such as the gynostegium, corona and pollinarium [6,7,8,9,10]. The coronas of Hoya, often star-shaped, originate from the stamens and vary considerably in morphology and color. Owing to their unique floral traits and distinctive scents, Hoya species are increasingly popular as ornamental plants [11]. However, the genetic regulation of floral organogenesis remains poorly understood, and the market is dominated by domesticated and mutation-bred varieties, with limited crossbreeding due to the lack of high-quality genomic data.

To address these challenges, we selected the type species H. carnosa for reference genome assembly, a crucial step toward understanding the regulation of floral morphological traits and advancing molecular breeding in the genus. H. carnosa is widely distributed across central and southern China, as well as southern Japan, Malaysia and Vietnam, where it typically grows along humid and shaded forest edges [12,13]. Despite its broad range and diverse habitats, it maintains a conserved diploid karyotype (2n = 22) [14]. The species has an extended flowering period and adapts well to a variety of growing conditions, which makes it a favored choice in horticulture (Fig. 1) and has led to the development of numerous cultivars.

Fig. 1 H. carnosa in garden and greenhouse.

In this study, we present a high-quality, near-complete H. carnosa genome assembly with ten telomeres and only five gaps. The assembled genome size was 465.7 Mb, with a contig N50 of 39.3 Mb. Of the assembled sequences, 464.3 Mb (99.7%) were anchored to 11 pseudochromosomes, indicating high completeness and accuracy. A total of 24,309 protein-coding genes were predicted, of which 21,927 (90.2%) were functionally annotated. This genome represents a valuable resource for further studies on the adaptive traits of Hoya plants and for molecular breeding within the genus.

Methods

Sampling and sequencing

The H. carnosa plants were originally collected from a natural population in Yangshan, Guangdong Province, China, in 2017 (accession number 20170662), and cultivated in the South China Botanical Garden (SCBG) of the Chinese Academy of Sciences (CAS).

Juvenile leaves were collected, gently rinsed with distilled water and dried on autoclaved filter paper. To minimize RNA/protein degradation, the samples were immediately frozen in liquid nitrogen and then stored at −80 °C until genomic DNA isolation. Genomic DNA was extracted using the CTAB method [15]. For short-read sequencing, libraries with 350 bp fragments were prepared and sequenced on the Illumina NovaSeq 6000 platform (PE 150), yielding approximately 66.8 Gb of clean data, equivalent to 147× genome coverage. Long-read sequencing was conducted on the PacBio Sequel II system in Circular Consensus Sequencing (CCS) mode, generating 45.4 Gb of high-fidelity (HiFi) reads (100× genome coverage). A chromatin conformation capture (Hi-C) library was constructed using DpnII restriction enzyme digestion and sequenced on the Illumina NovaSeq 6000 platform (PE 150), generating 69.2 Gb of clean data (152× genome coverage) after quality filtering.
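The coverage figures above can be checked with simple arithmetic. The sketch below assumes that coverage was computed against the k-mer-based genome size estimate of 454.3 Mb reported in the next subsection, which reproduces the stated values; the function and variable names are illustrative only.

```python
# Sequencing depth = total clean bases / estimated genome size.
# Assumption: the denominator is the k-mer-based estimate (454.3 Mb),
# not the final assembly size (465.7 Mb).
GENOME_SIZE_MB = 454.3


def fold_coverage(yield_gb, genome_size_mb=GENOME_SIZE_MB):
    """Return sequencing depth as a multiple of the genome size."""
    return yield_gb * 1000.0 / genome_size_mb


# Clean-data yields (Gb) as reported in the text.
yields_gb = {"Illumina": 66.8, "PacBio HiFi": 45.4, "Hi-C": 69.2}
coverage = {name: fold_coverage(gb) for name, gb in yields_gb.items()}

for name, depth in coverage.items():
    print(f"{name}: {depth:.0f}x")
```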

Transcriptomic profiling was performed on five tissues (root, stem, leaf, flower bud and mature flower) collected from the same cutting-propagated plants. Stranded mRNA-seq libraries were constructed from individual samples and sequenced on the Illumina NovaSeq 6000, yielding approximately 6 Gb of clean reads per sample.

Genome size estimation and de novo assembly

Basic genomic features, including genome size, heterozygosity and repeat content, were estimated with GCE (https://github.com/fanagislab/GCE) from the PacBio HiFi reads, with k = 17. The genome size was taken directly from the software output. The heterozygosity rate (H) was calculated as $H = \dfrac{a_{1/2}}{k\,(2 - a_{1/2})}$ with $k = 17$, where $a_{1/2}$ is the proportion of heterozygous k-mer types reported by the software. The repeat content was determined as the proportion of k-mers with a depth greater than twice the homozygous peak coverage. The genome size was estimated to be 454.3 Mb, with a heterozygosity of 0.90% and a repeat content of 60.0% (Fig. 2A).
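As a minimal sketch of the two calculations described above, the following Python functions implement the heterozygosity formula and a depth-threshold repeat estimate. Two points are assumptions rather than details from the text: the value of a_half used in the example is back-calculated to give about 0.90% and is not taken from the GCE output, and the repeat fraction is weighted by k-mer occurrences (depth × number of k-mer types), which is one possible reading of "proportion of k-mers". The toy histogram is hypothetical.

```python
def heterozygosity(a_half, k=17):
    """H = a_half / (k * (2 - a_half)), as given in the text."""
    return a_half / (k * (2.0 - a_half))


def repeat_fraction(hist, homo_peak):
    """Share of k-mer occurrences with depth > 2 * homozygous peak.

    hist maps depth -> number of distinct k-mers at that depth.
    Assumption: occurrences (depth * count) are used as weights.
    """
    total = sum(depth * count for depth, count in hist.items())
    repeats = sum(depth * count for depth, count in hist.items()
                  if depth > 2 * homo_peak)
    return repeats / total


# Illustrative value only: a_half chosen so that H is close to 0.90%.
h = heterozygosity(0.2654)
print(f"H = {h:.4%}")

# Hypothetical toy histogram with a homozygous peak at depth 50.
toy_hist = {10: 100, 50: 1000, 120: 50}
toy_repeat = repeat_fraction(toy_hist, homo_peak=50)
print(f"repeat fraction = {toy_repeat:.3f}")
```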

Fig. 2 Genome of H. carnosa. (A) k-mer frequency distribution curve. (B) Heatmap showing Hi-C interaction frequencies within and across chromosomes. The color gradient scale shows interaction strength, where red indicates strong interactions and yellow denotes weaker interactions. (C) Genomic features of H. carnosa. Concentric circles, from the outermost to the innermost, represent the 11 assembled pseudochromosomes (grey), GC content (red), transposable element density (blue), gene density (green), density of duplicates resulting from ancient polyploidization (purple), density of duplicates resulting from tandem duplication (cyan), density of transcription factors (yellow), and the H. carnosa flower. Red triangles and red bars on the outermost circle indicate gaps and telomeres, respectively.

The initial primary contigs (i.e., merged genome sequence) were generated using hifiasm v0.16.1-r375 [16], integrating PacBio HiFi reads and Hi-C data. To reduce redundancy in the assembly, Purge Haplotigs v1.1.2 [17] was applied with parameter ‘a’ = 85 (cutoff for identifying a contig as a haplotig). To achieve a pseudochromosome-level assembly, Hi-C reads were aligned to the contig assembly using Juicer v1.6 [18]. The resulting data (“merged_nodups.txt”) were then processed with the 3D-DNA pipeline [19] (parameter r = 0, which minimizes excessive fragmentation) to correct misassemblies and to anchor, order, and orient the contigs into pseudochromosomes. Final manual adjustments were performed using Juicebox v1.11.08 [20], and the Hi-C interaction map was visualized with PlotHiC (https://github.com/Jwindler/PlotHiC). These analyses resulted in an assembly of 465.7 Mb with a contig N50 of 39.3 Mb. Of this assembly, approximately 464.3 Mb of contigs were anchored to 11 pseudochromosomes, accounting for 99.7% of the total assembled genome size (Fig. 2B, C, Table 1). The remaining unanchored sequences comprised 32 scaffolds, representing 0.3% (1.4 Mb) of the genome.
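For readers unfamiliar with the contiguity statistics quoted above, the sketch below shows the standard N50 definition and the arithmetic behind the anchoring rate. The contig lengths are a hypothetical toy example, not values from the assembly; only the 464.3 Mb and 465.7 Mb totals come from the text.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths, for illustration only.
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print("toy N50:", toy_n50)

# Anchoring rate from the totals reported in the text (Mb).
anchored_pct = 100 * 464.3 / 465.7
print(f"anchored: {anchored_pct:.1f}%")
```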

Table 1 Statistics of the H. carnosa genome assembly and annotation.

Terminal telomeric repeats of all pseudochromosomes were systematically detected using the QuarTeT [21] pipeline with default parameters, revealing a total of ten telomeric regions on eight pseudochromosomes. Notably, pseudochromosome 11 carried telomeric repeats at both ends and contained no gaps (Fig. 2C).
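The idea behind terminal telomere detection can be illustrated with a much simpler check than the pipeline used here. The sketch below is not the QuarTeT algorithm: it only looks for a run of tandem repeat copies within a window at each sequence end. The motif (the Arabidopsis-type plant repeat TTTAGGG), the window size and the minimum copy number are all assumptions, since the text does not state them, and the sequences are hypothetical.

```python
import re

# Assumption: Arabidopsis-type plant telomere repeat, read 5'->3' at the 3' end.
MOTIF = "TTTAGGG"


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_telomere(seq, end, window=1000, min_copies=5):
    """Check one end ('left' or 'right') for >= min_copies tandem motif copies."""
    if end == "left":
        region, motif = seq[:window].upper(), revcomp(MOTIF)
    elif end == "right":
        region, motif = seq[-window:].upper(), MOTIF
    else:
        raise ValueError("end must be 'left' or 'right'")
    pattern = f"(?:{motif}){{{min_copies},}}"
    return re.search(pattern, region) is not None


# Hypothetical toy sequences (2,000 bp of filler between the ends).
both_ends = "CCCTAAA" * 6 + "ACGT" * 500 + "TTTAGGG" * 6
right_only = "ACGT" * 500 + "TTTAGGG" * 6

print(has_telomere(both_ends, "left"), has_telomere(both_ends, "right"))
print(has_telomere(right_only, "left"), has_telomere(right_only, "right"))
```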

To assess the completeness of the genome assembly, Benchmarking Universal Single-Copy Orthologues (BUSCO) v5.4.7 [22] was run in genome mode with the embryophyta_odb10 dataset, yielding a completeness score of 98.5% (Table 1). To evaluate the continuity of the assembly, the Long Terminal Repeat Retrotransposon Assembly Index (LAI), a reference-free metric for assessing the assembly quality of repeat sequences, was calculated with LTR_retriever [23], yielding an LAI value of 27.0, which surpasses the gold-standard threshold (LAI > 20). Furthermore, the accuracy of the assembly was assessed by calculating the Quality Value (QV) with Merqury v1.3 [24] based on HiFi reads, resulting in a QV of 74.1. In addition, Illumina short reads were mapped to the assembly using BWA v0.7.15-r1140 [25], achieving a mapping rate of 97.9% and a coverage of 99.99% (Table 1). These results collectively confirm that the H. carnosa genome assembly has reached high completeness and reliability.
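To put the QV in more intuitive terms, the sketch below converts it to a per-base error rate using the standard Phred relation QV = −10·log10(E). The expected-error count simply multiplies that rate by the reported assembly size and is a rough illustration, not a figure from the study.

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


error_rate = qv_to_error_rate(74.1)       # QV reported in the text
expected_errors = error_rate * 465.7e6    # assembly size reported in the text

print(f"error rate: {error_rate:.2e} per base")
print(f"expected erroneous bases: {expected_errors:.0f}")
```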

Gene prediction and genome annotation

Repeat sequences in the H. carnosa genome were annotated using EDTA v2.0.0 [26] with the parameters “--sensitive 1 --anno 1 --evaluate 1 --cds Hcar.cds”, where Hcar.cds is a species-specific coding sequence (CDS) library derived from RNA-Seq data of diverse tissues using Trinity v2.15.1 [27]. Based on the resulting comprehensive, non-redundant TE library for H. carnosa, both soft-masked and hard-masked genomes were generated with RepeatMasker v4.1.4 [28] for gene structure annotation. Overall, 258.4 Mb (55.5%) of the assembled genome was identified as repeat sequences. Copia and Gypsy constituted the main LTR elements, accounting for 20.3% and 12.6% of the genome size, respectively (Table 2).
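The difference between the two masked genomes mentioned above is a matter of representation: soft-masking writes repeat bases in lowercase, while hard-masking replaces them with N. The sketch below illustrates this on a hypothetical ten-base sequence with one hypothetical repeat interval; it does not reproduce RepeatMasker itself. The last lines recompute the repeat percentage from the totals given in the text.

```python
def mask(seq, intervals, hard=False):
    """Mask repeat intervals (0-based, half-open) in a sequence.

    Soft-masking lowercases the bases; hard-masking replaces them with 'N'.
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


# Hypothetical sequence and repeat interval, for illustration only.
toy_seq = "ACGTACGTAC"
toy_repeats = [(2, 5)]
soft = mask(toy_seq, toy_repeats)
hard = mask(toy_seq, toy_repeats, hard=True)
print(soft)
print(hard)

# Repeat percentage from the totals reported in the text (Mb).
repeat_pct = 100 * 258.4 / 465.7
print(f"{repeat_pct:.1f}%")
```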

Table 2 Statistics of repeat sequences in H. carnosa genome.

RNA sequencing data from a diverse set of tissues (root, stem, leaf, flower bud and mature flower) were mapped to the assembled reference genome using HISAT2 v2.2.1 [29], and the alignments were sorted and cleaned with SAMtools. Species-specific gene prediction was then performed with BRAKER2 v2.1.6 [30], which incorporates GeneMark-ET [31] and AUGUSTUS v3.5.0 [32]. Protein sequences from multiple plant species, including Arabidopsis thaliana [33], Coffea canephora [34], Calotropis gigantea [35], Catharanthus roseus [36], Marsdenia tenacissima [37], Voacanga thouarsii [38] and Vitis vinifera [39], were downloaded and merged, and redundant sequences were removed using CD-HIT v4.6.8 [40]. The cleaned homologous protein sequences, along with the RNA-seq alignments, served as input for the initial round of MAKER2 [41] to train gene models with SNAP [42]. The resulting gene models, combined with those trained by BRAKER2, were used in the second round of MAKER2 to generate new training models. Gene models with an Annotation Edit Distance (AED) score greater than 0.5 were removed to ensure high confidence. The final gene set was further polished using PASA [43], with two rounds of updates to improve model accuracy. In total, 24,309 protein-coding genes were predicted for H. carnosa, with average gene, intron, and exon lengths of 5,447.9 bp, 707.0 bp, and 326.0 bp, respectively (Table 1). BUSCO v5.4.7 [22] (embryophyta_odb10) analysis in protein mode indicated 87.1% completeness (Table 1).
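The AED filtering step can be sketched as a small GFF3 filter. This is an illustration under stated assumptions, not the authors' script: it assumes that each mRNA feature carries an `_AED` attribute in column 9, as MAKER-style GFF3 output typically does, and the example records, IDs and coordinates are hypothetical. Models with AED greater than 0.5 are dropped, so a model at exactly 0.5 is kept.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def transcripts_passing_aed(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED is <= max_aed."""
    kept = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            kept.append(attrs.get("ID", ""))
    return kept


# Hypothetical GFF3 records, for illustration only.
toy_gff = [
    "##gff-version 3",
    "\t".join(["chr1", "maker", "mRNA", "100", "900", ".", "+", ".",
               "ID=t1;Parent=g1;_AED=0.12"]),
    "\t".join(["chr1", "maker", "mRNA", "2000", "2900", ".", "-", ".",
               "ID=t2;Parent=g2;_AED=0.73"]),
    "\t".join(["chr1", "maker", "mRNA", "5000", "5900", ".", "+", ".",
               "ID=t3;Parent=g3;_AED=0.50"]),
]
kept_ids = transcripts_passing_aed(toy_gff)
print(kept_ids)
```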

For functional annotation of protein-coding genes, a BLASTP search with stringent criteria was conducted against publicly available protein databases, including SwissProt and the NCBI non-redundant protein database (Nr). KEGG pathways were identified using the online annotation tool KAAS (KEGG Automatic Annotation Server; http://www.genome.jp/tools/kaas/), while Gene Ontology (GO) terms were assigned with eggNOG-mapper v2.1.5 [44]. Together, these approaches annotated 21,927 protein-coding genes, representing 90.2% of the total predicted genes (Table 1). For non-coding RNA (ncRNA) identification, INFERNAL [45] was used to search against the Rfam database [46]. This analysis revealed 1,936 ncRNAs: 537 transfer RNAs (tRNAs), 913 ribosomal RNAs (rRNAs), 89 microRNAs (miRNAs), 183 small nucleolar RNAs (snoRNAs) and 214 other ncRNAs (Table 1).

Data Records

The raw data, including Illumina short reads, HiFi reads, Hi-C reads and RNA-seq reads, have been deposited in the Genome Sequence Archive (GSA) [47] at the National Genomics Data Center (NGDC) [48] with accession number CRA020558 [49] under BioProject accession number PRJCA032224. The genome assembly and annotation files are available in the Figshare database [50]. The genome assembly has also been submitted to the European Nucleotide Archive (ENA) with the accession number GCA_965601415 [51].

Technical Validation

The quality of the genome assembly was evaluated using several key metrics: (1) Ten telomeres were identified at the chromosome ends, with only five gaps remaining. (2) Genome completeness, assessed by BUSCO v5.4.7, showed that 98.5% of BUSCO genes were complete (95.9% single-copy and 2.6% duplicated), with a further 0.6% fragmented. (3) The LTR Assembly Index (LAI) of the genome assembly was 27.0, exceeding the threshold for gold-standard genomes. (4) The quality value (QV), calculated using Merqury v1.3, was 74.1. (5) Short-read mapping to the assembly showed a mapping rate of 97.9% and a genome coverage of 99.99%. Together, these results demonstrate the high quality of the H. carnosa genome assembly.