Abstract
Hoya R. Br. is the largest genus in the tribe Marsdenieae (Apocynaceae), comprising 350–450 species. Hoya species are popular in horticulture for their distinctive floral traits and fragrances, and most cultivated varieties have arisen through domestication and mutation breeding. However, the lack of molecular analysis of floral morphological traits has limited their cultivation and application. In this study, we assembled a near-complete reference genome for H. carnosa, the model species of the genus, using PacBio HiFi reads and Hi-C data. The assembled genome was approximately 465.7 Mb in size, with a contig N50 of 39.3 Mb. Of the assembled sequences, 99.7% were anchored to 11 pseudochromosomes, and the assembly achieved a BUSCO score of 98.5%. We predicted 24,309 protein-coding genes, of which 90.2% (21,927) were functionally annotated. This high-quality genome provides a valuable reference for research on evolution, conservation and molecular breeding in Hoya.
Background & Summary
Hoya R. Br. is the largest genus in the tribe Marsdenieae (Asclepiadoideae, Apocynaceae), comprising 350–450 species, most of which are epiphytic or hemi-epiphytic vines or subshrubs distributed in the tropical and subtropical areas of the Asia-Pacific region1,2,3,4,5. As in many Asclepiadoideae genera, Hoya flowers exhibit highly specialized features, particularly the fusion of the androecium and gynoecium, which leads to the formation of novel structures such as the gynostegium, corona and pollinarium6,7,8,9,10. The coronas of Hoya, often star-shaped, originate from the stamens and display significant variation in morphology and color. Owing to their unique floral traits and distinctive scents, Hoya species are increasingly popular as ornamental plants11. However, the genetic regulation of floral organogenesis remains poorly understood, and the market is dominated by domesticated and mutation-bred varieties, with limited crossbreeding due to the lack of high-quality genomic data.
To address these challenges, we selected the type species H. carnosa for reference genome assembly, a crucial step toward understanding the regulation of floral morphological traits and advancing molecular breeding in the genus. H. carnosa is widely distributed across central and southern China, as well as southern Japan, Malaysia and Vietnam, where it typically grows along humid and shaded forest edges12,13, yet it maintains a conserved diploid karyotype (2n = 22) despite its broad range and diverse habitats14. The species has an extended flowering period and adapts remarkably well to various growing conditions, which makes it a favored option in horticulture (Fig. 1) and has led to the development of numerous cultivars.
In this study, we present a high-quality H. carnosa genome assembly at a near-complete level, featuring ten telomeres and only five gaps. The assembled genome size was 465.7 Mb, with a contig N50 of 39.3 Mb. Of the assembled sequences, 464.3 Mb (99.7%) were anchored to 11 pseudochromosomes, achieving high completeness and accuracy. We predicted 24,309 protein-coding genes, of which 21,927 (90.2%) were functionally annotated. This genome represents a valuable resource for further studies on the adaptive traits of Hoya plants and molecular breeding within the genus.
Methods
Sampling and sequencing
The H. carnosa plants were originally collected from a natural population in Yangshan, Guangdong Province, China, in 2017 (accession number 20170662), and cultivated in the South China Botanical Garden (SCBG) of the Chinese Academy of Sciences (CAS).
Juvenile leaves were collected, gently rinsed with distilled water and dried on autoclaved filter paper. To minimize RNA/protein degradation, the samples were immediately frozen in liquid nitrogen and then preserved at −80 °C until genomic DNA isolation. Genomic DNA was extracted using the CTAB method15. For short-read sequencing, libraries with 350 bp fragments were prepared and sequenced on the Illumina NovaSeq6000 platform (PE 150), yielding approximately 66.8 Gb of clean data, equivalent to 147× genome coverage. Long-read sequencing was conducted on the PacBio Sequel II system in Circular Consensus Sequencing (CCS) mode, generating 45.4 Gb of high-fidelity (HiFi) reads at 100× genome coverage. A chromatin conformation capture (Hi-C) library was constructed using DpnII restriction enzyme digestion and sequenced on the Illumina NovaSeq6000 platform (PE 150), generating 69.2 Gb of clean data (152× genome coverage) after quality filtering.
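The reported coverage values can be reproduced with a short calculation. The sketch below assumes that depth was computed as data yield divided by the k-mer-based genome size estimate of 454.3 Mb given in the next section; the source does not state the denominator explicitly, and the function name `depth` is only illustrative.

```python
# Fold coverage = total sequenced bases / genome size.
# Assumption: the 454.3 Mb k-mer estimate is the denominator (not stated in the text).
GENOME_SIZE_MB = 454.3


def depth(yield_gb: float, genome_mb: float = GENOME_SIZE_MB) -> float:
    """Return fold coverage for a data yield in Gb and a genome size in Mb."""
    return yield_gb * 1000.0 / genome_mb


# Data yields reported in the text (Gb of clean data).
yields_gb = {"Illumina": 66.8, "HiFi": 45.4, "Hi-C": 69.2}

for library, gb in yields_gb.items():
    print(f"{library}: {depth(gb):.0f}x")
```

Under this assumption the three libraries round to 147×, 100× and 152×, matching the values stated above.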
Transcriptomic profiling was performed on five tissues (root, stem, leaf, flower bud and mature flower) collected from the same cutting-propagated plants. Stranded mRNA-seq libraries were constructed from individual samples and sequenced on the Illumina NovaSeq6000, yielding approximately 6 Gb of clean reads per sample.
Genome size estimation and de novo assembly
The basic genomic features, including genome size, heterozygosity and repeat content, were estimated with GCE (https://github.com/fanagislab/GCE) from PacBio HiFi reads, using k = 17. The genome size was provided directly by the software. The heterozygosity rate (H) was calculated as $H = \frac{a_{1/2}}{k\,(2 - a_{1/2})}$, where k = 17 and $a_{1/2}$ is the proportion of heterozygous k-mer types taken from the software output. The repeat content was determined as the proportion of k-mers with a depth exceeding twice the homozygous peak coverage. The genome size was thus estimated to be 454.3 Mb, with a heterozygosity of 0.90% and a repeat content of 60.0% (Fig. 2A).
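As a minimal sketch of the two derived quantities described above, the following Python functions implement the heterozygosity formula and one possible reading of the repeat-content definition. Two points are assumptions rather than statements from the source: the repeat proportion is weighted by k-mer depth (k-mer instances rather than distinct k-mers), and the value a_half = 0.2654 is a hypothetical input back-calculated to illustrate a heterozygosity near 0.90%. The toy histogram is also invented for illustration.

```python
def heterozygosity(a_half: float, k: int = 17) -> float:
    """H = a_half / (k * (2 - a_half)), with a_half the heterozygous k-mer proportion."""
    return a_half / (k * (2.0 - a_half))


def repeat_fraction(hist: dict[int, int], homo_peak: int) -> float:
    """Share of k-mer instances whose depth exceeds twice the homozygous peak.

    hist maps k-mer depth -> number of distinct k-mers at that depth.
    Assumption: the proportion is weighted by depth (instances, not k-mer types).
    """
    total = sum(d * n for d, n in hist.items())
    repeated = sum(d * n for d, n in hist.items() if d > 2 * homo_peak)
    return repeated / total


# Hypothetical a_half chosen so that H is close to the reported 0.90%.
print(f"H = {heterozygosity(0.2654):.2%}")

# Toy histogram (depth -> count), homozygous peak assumed at depth 100.
toy_hist = {10: 5, 50: 100, 100: 200, 250: 20}
print(f"repeat fraction = {repeat_fraction(toy_hist, homo_peak=100):.3f}")
```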
Fig. 2 Genome of H. carnosa. (A) k-mer frequency distribution curve. (B) Heatmap showing Hi-C interaction frequencies within and across chromosomes. The color gradient scale shows interaction strength, where red indicates strong interactions and yellow denotes weaker interactions. (C) Genomic features of H. carnosa. Concentric circles, from the outermost to the innermost, represent the 11 assembled pseudochromosomes (grey), GC content (red), transposable element density (blue), gene density (green), density of duplicates resulting from ancient polyploidization (purple), density of duplicates resulting from tandem duplication (cyan), density of transcription factors (yellow), and the H. carnosa flower. Red triangles and red bars on the outermost circle indicate gaps and telomeres, respectively.
The initial primary contigs (i.e., merged genome sequence) were generated using hifiasm v0.16.1-r375 (ref. 16), integrating PacBio HiFi reads and Hi-C data. To reduce redundancy in the assembly, Purge Haplotigs v1.1.2 (ref. 17) was applied with parameter ‘a’ = 85 (cutoff for identifying a contig as a haplotig). To achieve a pseudochromosome-level assembly, Hi-C reads were aligned to the contig assembly using Juicer v1.6 (ref. 18). The resulting data (“merged_nodups.txt”) were then processed with the 3D-DNA pipeline19 (parameter r = 0, which minimizes excessive fragmentation) to correct misassemblies and to anchor, order, and orient the contigs into pseudochromosomes. Final manual adjustments were performed using Juicebox v1.11.08 (ref. 20), and the Hi-C interaction map was visualized with PlotHiC (https://github.com/Jwindler/PlotHiC). These analyses resulted in an assembly of 465.7 Mb with a contig N50 of 39.3 Mb. Of this assembly, approximately 464.3 Mb of contigs were anchored to 11 pseudochromosomes, accounting for 99.7% of the total assembled genome size (Fig. 2B, C, Table 1). The remaining unanchored sequences comprised 32 scaffolds, representing 0.3% (1.4 Mb) of the genome.
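For readers less familiar with the contiguity statistics quoted above, the sketch below shows how an N50 value and an anchoring rate are commonly computed. The contig lengths are a toy example, not the H. carnosa contigs, and the helper names are illustrative.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")


def anchored_pct(anchored_mb: float, total_mb: float) -> float:
    """Percentage of the assembly placed on pseudochromosomes."""
    return 100.0 * anchored_mb / total_mb


# Toy contig lengths (arbitrary units) to illustrate the definition.
print(n50([50, 40, 30, 20, 10]))

# Anchoring rate from the values reported in the text.
print(f"{anchored_pct(464.3, 465.7):.1f}%")
```

With the reported figures (464.3 Mb anchored out of 465.7 Mb), the anchoring rate rounds to 99.7%, leaving about 1.4 Mb unplaced.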
Telomeric repeats at the ends of all pseudochromosomes were detected using the QuarTeT21 pipeline with default parameters, revealing a total of ten telomeric regions on eight pseudochromosomes. Notably, pseudochromosome 11 carried telomeric repeats at both ends and contained no gaps (Fig. 2C).
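The general idea behind terminal telomere detection can be illustrated with a simple tandem-repeat scan. This is not the QuarTeT algorithm; it is a simplified sketch that assumes the Arabidopsis-type plant telomere repeat (CCCTAAA at the 5′ end, TTTAGGG at the 3′ end), which the source does not specify for H. carnosa. The window size and minimum copy number are hypothetical thresholds.

```python
import re

# Assumption: Arabidopsis-type plant telomere repeat; not stated in the source.
MOTIF_5P = "CCCTAAA"  # expected orientation at the start of a pseudochromosome
MOTIF_3P = "TTTAGGG"  # expected orientation at the end of a pseudochromosome


def longest_tandem_run(seq: str, motif: str) -> int:
    """Return the largest number of consecutive copies of motif found in seq."""
    runs = re.findall(f"(?:{motif})+", seq.upper())
    return max((len(run) // len(motif) for run in runs), default=0)


def has_telomeres(chrom: str, window: int = 5000, min_copies: int = 10) -> tuple[bool, bool]:
    """Check both terminal windows for a sufficiently long telomeric repeat array."""
    start_ok = longest_tandem_run(chrom[:window], MOTIF_5P) >= min_copies
    end_ok = longest_tandem_run(chrom[-window:], MOTIF_3P) >= min_copies
    return start_ok, end_ok


# Toy sequence: a 20-copy array at the start, only 3 copies at the end.
toy = MOTIF_5P * 20 + "ACGT" * 500 + MOTIF_3P * 3
print(has_telomeres(toy))
```

A pseudochromosome for which both flags are true would correspond to the situation described for pseudochromosome 11.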
To assess the completeness of the genome assembly, Benchmarking Universal Single-Copy Orthologues (BUSCO) v5.4.7 (ref. 22) was run in genome mode with the embryophyta_odb10 dataset, yielding a completeness score of 98.5% (Table 1). To evaluate the continuity of the assembly, the Long Terminal Repeat (LTR) Assembly Index (LAI), a reference-free metric for assessing the assembly quality of repeat sequences, was calculated with LTR_retriever23, yielding an LAI value of 27.0, which surpasses the gold-standard threshold (LAI > 20). Furthermore, the accuracy of the assembly was assessed by calculating the Quality Value (QV) with Merqury v1.3 (ref. 24) based on HiFi reads, resulting in a high QV of 74.1. In addition, Illumina short reads were mapped to the assembly using BWA v0.7.15-r1140 (ref. 25), achieving a mapping rate of 97.9% and a coverage of 99.99% (Table 1). These results collectively confirm that the H. carnosa genome assembly is highly complete and reliable.
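To put the QV figure in perspective, a Phred-scaled quality value can be converted to a per-base error probability with the standard relation QV = −10·log10(E). The sketch below applies this relation to the reported QV; treating the result as an expected number of erroneous bases across the 465.7 Mb assembly is a rough back-of-the-envelope reading, not a figure from the source.

```python
import math


def qv_to_error(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_to_qv(error: float) -> float:
    """Convert a per-base error probability back to a Phred-scaled quality value."""
    return -10 * math.log10(error)


ASSEMBLY_BP = 465.7e6  # assembled genome size reported in the text

err = qv_to_error(74.1)
expected_errors = err * ASSEMBLY_BP

print(f"per-base error probability: {err:.2e}")
print(f"expected erroneous bases in the assembly: {expected_errors:.0f}")
```

Under this reading, a QV of 74.1 corresponds to an error probability on the order of 4 × 10⁻⁸ per base, or fewer than twenty expected errors genome-wide.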
Gene prediction and genome annotation
Repeat sequences in the H. carnosa genome were annotated using EDTA v2.0.0 (ref. 26) with the parameters “--sensitive 1 --anno 1 --evaluate 1 --cds Hcar.cds”, where Hcar.cds is a species-specific coding sequence (CDS) library derived from RNA-seq data of diverse tissues using Trinity v2.15.1 (ref. 27). Based on the resulting comprehensive, non-redundant TE library for H. carnosa, both soft-masked and hard-masked genomes were generated with RepeatMasker v4.1.4 (ref. 28) for gene structure annotation. Overall, 258.4 Mb (55.5%) of the assembled genome was identified as repeat sequences. Copia and Gypsy were the main LTR elements, accounting for 20.3% and 12.6% of the genome, respectively (Table 2).
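The difference between the soft-masked and hard-masked genomes mentioned above can be shown with a small example. This is an illustrative sketch of the two masking conventions (lowercase versus N), not RepeatMasker's implementation; the sequence and interval are invented, and intervals are assumed to be 0-based and half-open.

```python
def mask(seq: str, intervals: list[tuple[int, int]], hard: bool = False) -> str:
    """Mask repeat intervals: lowercase for soft masking, 'N' for hard masking."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


def masked_pct(masked_seq: str) -> float:
    """Percentage of bases that are masked (lowercase or N)."""
    masked = sum(1 for base in masked_seq if base.islower() or base == "N")
    return 100.0 * masked / len(masked_seq)


toy_seq = "ACGTACGTAC"
toy_repeats = [(2, 5)]  # hypothetical repeat annotation

soft = mask(toy_seq, toy_repeats)
hard = mask(toy_seq, toy_repeats, hard=True)
print(soft, hard, f"{masked_pct(soft):.0f}%")
```

Soft masking preserves the underlying bases, which is useful for gene predictors that can use repeat information as a hint, whereas hard masking removes the sequence from consideration entirely.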
RNA sequencing data from five tissues (root, stem, leaf, flower bud and mature flower) were mapped to the assembled reference genome using HISAT2 v2.2.1 (ref. 29), and the alignments were sorted and cleaned with SAMtools. Species-specific gene prediction was then performed with BRAKER2 v2.1.6 (ref. 30), which incorporates GeneMark-ET31 and AUGUSTUS v3.5.0 (ref. 32). Protein sequences from multiple plant species, including Arabidopsis thaliana33, Coffea canephora34, Calotropis gigantea35, Catharanthus roseus36, Marsdenia tenacissima37, Voacanga thouarsii38 and Vitis vinifera39, were downloaded and merged, and redundant sequences were removed using CD-HIT v4.6.8 (ref. 40). The cleaned homologous protein sequences, along with the RNA-seq alignments, served as input for the initial round of MAKER2 (ref. 41) to train gene models with SNAP42. The resulting gene models, combined with those trained by BRAKER2, were used in the second round of MAKER2 to generate new training models. Gene models with an Annotation Edit Distance (AED) score greater than 0.5 were removed to ensure high confidence. The final gene set was further polished using PASA43 through two rounds of updates. In total, 24,309 protein-coding genes were predicted for H. carnosa, with average gene, intron, and exon lengths of 5,447.9 bp, 707.0 bp, and 326.0 bp, respectively (Table 1). BUSCO v5.4.7 (ref. 22) analysis in protein mode (embryophyta_odb10) indicated 87.1% completeness (Table 1).
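The AED filtering step can be sketched as a pass over the annotation file. The example below assumes MAKER-style GFF3 output in which each mRNA feature carries an `_AED` attribute in column 9; the records shown are invented, and the function names are illustrative rather than part of any pipeline used in the study.

```python
def parse_attrs(col9: str) -> dict[str, str]:
    """Parse the GFF3 attribute column (key=value pairs separated by ';')."""
    return dict(item.split("=", 1) for item in col9.strip().split(";") if "=" in item)


def passing_mrna_ids(lines: list[str], max_aed: float = 0.5) -> list[str]:
    """Return IDs of mRNA features whose _AED does not exceed max_aed."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        attrs = parse_attrs(fields[8])
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            kept.append(attrs.get("ID", ""))
    return kept


def gff_row(feature: str, attrs: str) -> str:
    """Build a toy GFF3 row (coordinates are placeholders)."""
    return "\t".join(["chr1", "maker", feature, "1", "900", ".", "+", ".", attrs])


toy_gff = [
    "##gff-version 3",
    gff_row("gene", "ID=g1"),
    gff_row("mRNA", "ID=m1;Parent=g1;_AED=0.10;_eAED=0.10"),
    gff_row("mRNA", "ID=m2;Parent=g2;_AED=0.75;_eAED=0.80"),
    gff_row("mRNA", "ID=m3;Parent=g3;_AED=0.50;_eAED=0.55"),
]
print(passing_mrna_ids(toy_gff))
```

Models with an AED exactly equal to 0.5 are retained here, consistent with removing only those scoring greater than 0.5.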
For functional annotation of protein-coding genes, a BLASTP search with stringent criteria was conducted against publicly available protein databases, including SwissProt and the NCBI non-redundant protein database (Nr). KEGG pathways were identified using the online annotation tool KAAS (KEGG Automatic Annotation Server; http://www.genome.jp/tools/kaas/), while Gene Ontology (GO) terms were assigned with eggNOG-mapper v2.1.5 (ref. 44). Together, these approaches assigned functions to 21,927 protein-coding genes, representing 90.2% of the total predicted genes (Table 1). For non-coding RNA (ncRNA) identification, INFERNAL45 was used to search against the Rfam database46. This analysis identified 1,936 ncRNAs, comprising 537 transfer RNAs (tRNAs), 913 ribosomal RNAs (rRNAs), 89 microRNAs (miRNAs), 183 small nucleolar RNAs (snoRNAs) and 214 other ncRNAs (Table 1).
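The annotation summary above is internally consistent, as a quick arithmetic check shows. The sketch uses only the counts reported in the text; the variable names are illustrative.

```python
# ncRNA counts by class, as reported in the text.
ncrna_counts = {"tRNA": 537, "rRNA": 913, "miRNA": 89, "snoRNA": 183, "other": 214}
ncrna_total = sum(ncrna_counts.values())

# Functionally annotated versus predicted protein-coding genes.
annotated_genes = 21_927
predicted_genes = 24_309
annotated_pct = 100.0 * annotated_genes / predicted_genes

print(f"total ncRNAs: {ncrna_total}")
print(f"functionally annotated: {annotated_pct:.1f}%")
```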
Data Records
The raw data, including Illumina short reads, HiFi reads, Hi-C reads and RNA-seq reads, have been deposited in the Genome Sequence Archive (GSA)47 in the National Genomics Data Center (NGDC)48 with accession number CRA020558 (ref. 49) under BioProject accession number PRJCA032224. The genome assembly and annotation files are available in the Figshare database50. The genome assembly has also been submitted to the European Nucleotide Archive (ENA) with the accession number GCA_965601415 (ref. 51).
Technical Validation
The quality of the genome assembly was evaluated using several key metrics: (1) Ten telomeres were identified at the chromosome ends, with only five gaps remaining. (2) Genome completeness, assessed by BUSCO v5.4.34, showed that 98.5% of BUSCO genes were complete (95.9% single-copy and 2.6% duplicated), with 0.6% fragmented. (3) The LTR Assembly Index (LAI) of the genome assembly was 27.0, exceeding the threshold for gold-standard genomes. (4) The quality value (QV), calculated using Merqury v1.3, was 74.1. (5) Short-read mapping to the assembly showed a high mapping rate of 97.9% and a genome coverage of 99.99%. These results demonstrate the exceptional quality of the H. carnosa genome assembly.
Code availability
No custom code was used in this study. All software mentioned in the Methods section is publicly available. Where no detailed parameters are given for a program, the default parameters recommended by the developer were used.
References
Forster, P. I. & Liddle, D. Hoya. in Flora of Australia Vol. 28. (ed. Orchard AE) (CSIRO, 1996).
Endress, M. E., Liede-Schumann, S. & Meve, U. An updated classification for Apocynaceae. Phytotaxa 159, 175–194 (2014).
Rodda, M. Two new species of Hoya R.Br. (Apocynaceae, Asclepiadoideae) from Borneo. PhytoKeys 53, 83–93 (2015).
Endress, M. E., Meve, U., Middleton, D. J. & Liede-Schumann, S. Apocynaceae. in The Families and Genera of Vascular Plants 15. Flowering Plants. Eudicots. Apiales, Gentianales (except Rubiaceae) (eds. Kadereit, J. W., Bittrich, V.) (Springer Nature, 2019).
Middleton, D. J & Rodda, M. Apocynaceae. in Flora of Singapore Vol. 13. (eds. Middleton, D. J, Leong-Škorničková, J., Lindsay, S.) (National Parks Board, 2019).
Ollerton, J., Johnson, S. D., Cranmer, L. & Kellie, S. The pollination ecology of an assemblage of grassland asclepiads in South Africa. Ann. Bot. 92, 807–834 (2003).
Demarco, D. Secretory tissues and the morphogenesis and histochemistry of pollinarium in flowers of Asclepiadeae (Apocynaceae). Int. J. Plant Sci. 175, 1042–1053 (2014).
Endress, P. K. Development and evolution of extreme synorganization in angiosperm flowers and diversity: a comparison of Apocynaceae and Orchidaceae. Ann. Bot. 117, 749–767 (2016).
Ollerton, J. et al. The diversity and evolution of pollination systems in large plant clades: Apocynaceae as a case study. Ann. Bot. 123, 311–325 (2019).
Kuang, Y. F. et al. Morphological diversity and evolutionary changes of pollinaria in Hoya (Marsdenieae: Apocynaceae). Bot. J. Linn. Soc. 206, 29–54 (2024).
Lamb, A. & Rodda, M. A Guide to Hoyas of Borneo (Natural History Publications, 2016).
Yamazaki, T. Asclepiadaceae. in Flora of Japan. Vol. 3a, 168–183. (eds. Iwatsuki, K., Yamazaki, T., Boufford, D. E., Ohba, H.) (Kodansha Scientific, 1993).
Li, P. T., Gilbert, M. G. & Stevens, D. W. Asclepiadaceae. in Flora of China Vol. 16, 228–236 (Missouri Botanical Garden Press, 1995).
Nakamura, T. Speciation of Hoya carnosa (Asclepiadaceae). La Kromosomo II–71-72, 2479–2489 (1993).
Winnepenninckx, B., Backeljau, T. & De Wachter, R. Extraction of high molecular weight DNA from molluscs. Trends Genet 9, 407 (1993).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, 460 (2018).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 3, 99–101 (2016).
Lin, Y. Z. et al. QuarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic. Res. 5, 25 (2023).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Ou, S. J. & Jiang, N. LTR_retriever: A highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol 176, 1410–1422 (2018).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol 20, 275 (2019).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
Tarailo‐Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 25, 4.10.1–4.10.14 (2009).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
Gabriel, L., Hoff, K. J., Bruna, T., Borodovsky, M. & Stanke, M. TSEBRA: transcript selector for BRAKER. BMC Bioinformatics 22, 566 (2021).
Lomsadze, A., Burns, P. D. & Borodovsky, M. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42, e119 (2014).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34, W435–439 (2006).
Cheng, C. Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J 89, 789–804 (2017).
Denoeud, F. et al. The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345, 1181–1184 (2014).
Hoopes, G. M. et al. Genome assembly and annotation of the medicinal plant Calotropis gigantea, a producer of anti-cancer and anti-malarial cardenolides. G3. Genes Genom. Genet. 8, 385–391 (2018).
Xu, Z. P. et al. A near-complete genome assembly of Catharanthus roseus and insights into the biosynthesis of vinblastine and its high susceptibility to Huanglongbing pathogen. Plant Commun 4, 100661 (2023).
Zhou, Y. L. et al. The genome of Marsdenia tenacissima provides insights into calcium adaptation and tenacissoside biosynthesis. Plant J 113, 1146–1159 (2023).
Cuello, C. et al. Genome assembly of the medicinal plant Voacanga thouarsii. Genome Biol. Evol. 14, evac158 (2022).
Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654–5666 (2003).
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. EggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Kalvari, I. et al. Rfam 14: Expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 49, D192–D200 (2021).
Chen, T. T. et al. The Genome Sequence Archive Family: Toward explosive data growth and diverse data types. Genom. Proteomics Bioinformatics 19, 578–583 (2021).
Bao, Y. M. et al. Database resources of the National Genomics Data Center, China national center for bioinformation in 2024. Nucleic Acids Res 52, D18–D32 (2024).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA020558 (2025).
Kuang, Y. F., Ouyang, K. Y., Xia, M., Feng, C. & Kang, M. Near-complete reference genome assembly of Hoya carnosa. Figshare https://doi.org/10.6084/m9.figshare.27872502 (2025).
European Nucleotide Archive https://identifiers.org/insdc.gca:GCA_965601415.1 (2025).
Acknowledgements
This work was supported by the Guangdong Flagship Project of Basic and Applied Basic Research (2023B0303050001), the Youth Innovation Promotion Association CAS (2021348), and the Science & Technology Fundamental Resources Investigation Program (2022FY202200).
Author information
Contributions
M.K., Y.K. and C.F. conceived and designed the study. Y.K. and M.X. performed the sampling and experiments. C.F., Y.K., K.O. and M.X. analyzed the data and generated figures and tables. Y.K. and C.F. wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kuang, Y., Ouyang, K., Xia, M. et al. Near-complete reference genome assembly of Hoya carnosa. Sci Data 12, 1210 (2025). https://doi.org/10.1038/s41597-025-05587-4