A high-quality chromosome-scale genome assembly of the Cherokee rose (Rosa laevigata)

Wang, Yi; Yan, Huijun; Qiu, Xianqin; Zhang, Hao; Zhang, Yonghong; Jian, Hongying

doi:10.1038/s41597-025-04461-7

Data Descriptor
Open access
Published: 22 January 2025

Yi Wang¹,
Huijun Yan²,
Xianqin Qiu²,
Hao Zhang²,
Yonghong Zhang ORCID: orcid.org/0000-0001-6583-3255¹ &
…
Hongying Jian²

Scientific Data volume 12, Article number: 132 (2025)

Abstract

Rosa laevigata is an excellent rose germplasm that is highly resistant to aphids and immune to both rose black spot and powdery mildew. It is also a well-known edible plant with a long history of medicinal use in China, where it is used to improve kidney function, inhibit arteriosclerosis, and reduce inflammation. In this study, we assembled a high-quality chromosome-scale genome of R. laevigata by combining Illumina, PacBio, and Hi-C data; the assembly is approximately 494.2 Mb long, with a scaffold N50 of 68.6 Mb. A total of 493.2 Mb (99.8%) of the draft genome sequence was anchored on seven pseudochromosomes, two of which are gapless. A total of 37,117 protein-coding genes were predicted, 34,074 of which were functionally annotated. Repeat annotation revealed 659,558 repeat elements (285.6 Mb), accounting for 57.8% of the genome. The chromosome-scale genome provides valuable information for comparative genomic analysis of the rose family and will accelerate genome-guided breeding and germplasm improvement of both R. laevigata and modern roses.

Background & Summary

Rosa laevigata Michx. (2n = 2x = 14)¹ is an evergreen climbing shrub widely distributed in eastern and southern China at altitudes of 200–1,600 m. Commonly known as the Cherokee rose, R. laevigata is the only species in Rosa sect. Laevigatae of Rosaceae and has glabrous, leathery leaves and large, fragrant white flowers² (Fig. 1). On the one hand, R. laevigata is an excellent germplasm for the innovation and improvement of modern rose cultivars, being highly resistant to aphids³ and immune to both black spot and powdery mildew of rose⁴. On the other hand, it is an edible plant and a traditional herbal medicine in China⁵. The fruits of R. laevigata, known as “Jin-Ying-Zi” in Chinese, are a major ingredient in Traditional Chinese Medicine, prescribed as a kidney tonic for urinary diseases, including urinary incontinence and urinary frequency, as well as for menstrual irregularities, leucorrhea, and uterine prolapse⁶. The roots of R. laevigata are used to clear heat, detoxify, cool the blood, promote blood circulation, dispel stasis, and relieve pain⁷. They are also used to treat gynecological infections and diseases of the urinary system⁶. The leaves are used to treat burns, skin tumors, and ulcers, while the flowers are used to alleviate cold and heat and have insecticidal properties⁸.

Recent studies on R. laevigata have revealed phenolic acids, steroids, triterpenoids, phenylpropanoids, and other chemical components that exhibit diverse pharmacological effects, including antioxidant, anti-inflammatory, antibacterial, kidney function-improving, immunity-enhancing, blood sugar-lowering, and anti-tumor activities⁹,¹⁰. R. laevigata is also used in treatments for spermatorrhea, premature ejaculation, urinary incontinence, diarrhea, chronic bronchitis, and chronic kidney disease¹¹. Among the more than 123 ingredients isolated from different parts of R. laevigata, triterpenoids are regarded as the most significant bioactive substances¹² and have been reported to be useful in combating Alzheimer’s disease (AD)¹³,¹⁴,¹⁵.

There are more than 200 species in Rosa¹⁶, with ploidy levels ranging from 2n = 2x to 10x¹⁷,¹⁸. Although at least eight nuclear genome sequences are currently publicly available (https://www.plabipd.de/), some of them are draft genomes of relatively low quality, e.g., R. multiflora¹⁹. Draft genomes provide useful genomic information, but a high-quality genome assembly is a fundamental step toward dissecting the genomic variation that underlies the genetic and molecular basis of desirable traits in plants. In this study, we therefore assembled a high-quality chromosome-scale genome of R. laevigata using Illumina, PacBio, and Hi-C data. The final genome assembly spans 494.2 Mb, with a scaffold N50 of 68.6 Mb, and 99.8% (493.2 Mb) of the genome sequence was anchored on seven pseudochromosomes. Repeat elements make up 57.8% of the assembled genome sequence; the most abundant class is the long terminal repeats (LTRs), which account for 42.1% of the genome. A total of 37,117 protein-coding genes were predicted using ab initio gene prediction, RNA-seq, and homologous protein evidence, of which 34,074 were functionally annotated. We also detected 151 miRNAs, 1,115 tRNAs, 1,289 rRNAs, and 627 snRNAs in the genome of R. laevigata. This high-quality chromosome-scale genome provides a valuable resource for exploring the key genes and molecular regulatory mechanisms involved in high resistance to the pests and diseases of modern roses on the one hand, and in the biosynthesis of important compounds such as triterpenoids on the other, which will facilitate genome-guided breeding and improvement of both R. laevigata and modern roses.

Methods

Sample collection, DNA extraction and sequencing

Fresh young leaves of R. laevigata, which had been propagated from cuttings collected in Changshou District, Chongqing Municipality (107°12′53.598″E, 30°10′42.169″N, 350 m), were sampled at the Flower Research Institute of the Yunnan Academy of Agricultural Sciences, Yunnan Province, China, for DNA extraction. At the same time, tender roots, stems, leaves, and fruits of the same individual were collected for RNA extraction. All samples were frozen in liquid nitrogen, transported to the laboratory, and stored in an ultra-low-temperature freezer at −80 °C prior to DNA and RNA extraction.

Total high-molecular-weight genomic DNA was extracted from the leaves of R. laevigata using the Tiangen Extraction Kit (Tiangen Biotech, China), which is based on the cetyltrimethylammonium bromide (CTAB) method. The DNA concentration was determined with the Quant-iT PicoGreen assay (Invitrogen, Waltham, MA, USA), and DNA quality and quantity were assessed with an ultraviolet spectrophotometer at wavelengths of 260 nm and 280 nm. The DNA was fragmented with a Covaris M220 Focused-ultrasonicator. Genomic DNA sequencing was conducted at Novogene Co., Ltd., Beijing, China.

Three genomic DNA sequencing libraries were constructed in this study. First, PCR-free DNA libraries with an insert size of 350 bp were constructed using the NEBNext Ultra DNA Library Prep Kit for Illumina short-read sequencing. The resulting barcoded libraries were sequenced on an Illumina NovaSeq6000 platform to generate 150-bp paired-end reads. Quality control was performed on all reads by trimming adapters and removing low-quality reads with fastp v0.23.2²⁰. A total of 53 Gb of filtered short reads was obtained and used in subsequent data processing steps.

Second, single-molecule real-time (SMRT) PacBio libraries were constructed following the PacBio 15-kb protocol and sequenced on a PacBio Sequel IIe platform. The raw data were processed with the SMRT Analysis software suite v5.1.0 (https://www.pacb.com/products-and-services/analytical-software/smrt-analysis/), and consensus HiFi reads were produced with the CCS subprogram (https://github.com/PacificBiosciences/ccs) using default parameters, yielding approximately 36 Gb of filtered data.

Third, a Hi-C library was generated using the restriction endonuclease MboI. The MboI-digested chromatin was labeled with biotin-14-dATP, and in situ DNA ligation was performed. The DNA was then extracted, purified, and sheared. After A-tailing, pull-down, and adapter ligation, the DNA library was sequenced on an Illumina NovaSeq6000 platform. The Hi-C library yielded approximately 99 Gb of filtered data (Table 1).

Table 1 Information of sequencing data generated for the genome assembly of R. laevigata.
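The data volumes quoted above can be turned into approximate sequencing depths by dividing total bases by genome size. The following is a minimal illustrative sketch, not part of the authors' workflow; it assumes 1 Gb = 10⁹ bases and 1 Mb = 10⁶ bases, uses the 494.2 Mb final assembly length as the genome size, and the function name `depth` is hypothetical. Because the inputs are the rounded figures from the text, the results are only approximate.

```python
# Approximate fold coverage = total sequenced bases / genome size.
# Assumption: the final assembly length (494.2 Mb) stands in for genome size.
ASSEMBLY_SIZE_MB = 494.2

# Filtered data volumes (Gb) quoted in the Methods text.
filtered_data_gb = {
    "Illumina short reads": 53.0,
    "PacBio HiFi reads": 36.0,
    "Hi-C reads": 99.0,
}


def depth(data_gb: float, genome_mb: float) -> float:
    """Return approximate fold coverage for a data volume given in Gb."""
    return data_gb * 1000.0 / genome_mb


for library, gb in filtered_data_gb.items():
    print(f"{library}: ~{depth(gb, ASSEMBLY_SIZE_MB):.0f}x")
```

RNA-seq data are left out because transcriptome reads are not expected to cover the genome evenly.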

RNA extraction and sequencing

Fresh roots, stems, leaves, and fruits of the same R. laevigata individual were used for transcriptome sequencing. Total RNA was extracted using the TRIzol reagent (Thermo Fisher Scientific, MA, USA) according to the manufacturer’s protocol. A 150-bp paired-end RNA-seq library was constructed and sequenced on an Illumina NovaSeq6000 platform. After trimming adapters and removing low-quality reads with fastp, approximately 25 Gb of clean data were obtained (Table 1).

Genome survey and assembly

Previous studies have shown that R. laevigata is diploid²¹. Flow cytometry determined the nuclear DNA content of R. laevigata to be 0.51 pg²², corresponding to a genome size of approximately 498.78 Mb (1 pg = 978 Mb). In this study, Jellyfish v2.3.0²³ and GenomeScope2²⁴ were used with a k-mer size of 19 to estimate the genome size, heterozygosity, and repeat content from the Illumina short reads. The genome of R. laevigata was estimated to be 510.2 Mb, with a heterozygosity of 0.6% and a repeat content of 53.9% (Fig. 2), consistent with previous studies. For de novo assembly, bam2fastq v3.0.0 (https://github.com/PacificBiosciences/pbtk#bam2fastx) was first used to convert the raw data into FASTQ format. A primary assembly was then constructed using hifiasm v0.16.1²⁵ with default parameters, and redundant sequences were removed using purge_dups v1.2.5²⁶ with default parameters. The draft genome assembly comprised 56 contigs with a combined length of 494.2 Mb, a contig N50 of 63.1 Mb, and a GC content of 38.8% (Table 2).
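The conversion from nuclear DNA content to genome size, and the comparison of the three size figures given above, can be reproduced with a few lines of arithmetic. This is a minimal sketch for illustration only; the function names `pg_to_mb` and `percent_of` are hypothetical, and all numbers are taken from the text.

```python
MB_PER_PG = 978.0  # conversion factor quoted in the text (1 pg = 978 Mb)


def pg_to_mb(dna_pg: float) -> float:
    """Convert a nuclear DNA content in picograms to megabases."""
    return dna_pg * MB_PER_PG


def percent_of(value: float, reference: float) -> float:
    """Express value as a percentage of reference."""
    return 100.0 * value / reference


flow_cytometry_mb = pg_to_mb(0.51)  # flow cytometry estimate
kmer_estimate_mb = 510.2            # k-mer based estimate
assembly_mb = 494.2                 # draft assembly length

print(f"Flow cytometry estimate: {flow_cytometry_mb:.2f} Mb")
print(f"Assembly vs flow cytometry: {percent_of(assembly_mb, flow_cytometry_mb):.1f}%")
print(f"Assembly vs k-mer estimate: {percent_of(assembly_mb, kmer_estimate_mb):.1f}%")
```

With these inputs, the assembly length lies within a few percent of both independent estimates.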

Table 2 Information of the genome assembly of R. laevigata.

After removing adapter and low-quality sequences, we used Juicer v1.6²⁷ with default parameters to align the filtered Hi-C data to the primary assembly and generate a deduplicated list of Hi-C reads. Using 3D-DNA v201008²⁸, the primary contigs were anchored onto pseudochromosomes, and the Hi-C interaction heatmap was generated with the 3D-DNA visualize module. The heatmap was then inspected and manually curated with Juicebox v2.17.00²⁹. Seven pseudochromosomes, ranging in length from 49.9 Mb to 88.5 Mb, were identified from distinct interaction signals in the Hi-C interaction heatmap (Fig. 3). Two of them were gapless (Supplementary Table 1). The final genome assembly of R. laevigata was 494.2 Mb in size, with a scaffold N50 of 68.6 Mb (Table 2, Fig. 4).
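For readers unfamiliar with the N50 statistic reported here, it is the length of the sequence at which the cumulative length, counted from the longest sequence downwards, first reaches half of the total assembly length. The sketch below illustrates the definition; the function name `n50` and the toy lengths are hypothetical and are not the actual scaffold lengths of this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches the total")


# Toy example (hypothetical lengths): total = 30, half = 15.
# Cumulative sums from the longest: 10, 16 -> the N50 is 6.
print(n50([2, 3, 4, 5, 6, 10]))

# Fraction of the assembly anchored on pseudochromosomes, from the text.
print(f"{100.0 * 493.2 / 494.2:.1f}% anchored")
```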

Repeat annotation

Tandem repeats in the final genome assembly were identified using TRF v4.09³⁰ with default parameters. Using the masked genome sequence as the target, transposable elements (TEs) were identified by combining de novo and homology-based approaches. Miniature inverted-repeat transposable elements (MITEs) and long terminal repeat (LTR) elements were identified de novo with MITE-Hunter (11–2011)³¹ and LTR_retriever v2.8.7³², respectively; LTR_retriever incorporated the LTRs predicted by LTRharvest³³ and LTR_FINDER v1.0.7³⁴. A species-specific repeat library was also constructed using RepeatModeler v2.0.3³⁵, and homology-based predictions were performed using RepeatMasker v4.1.4 (http://repeatmasker.org/) with the Repbase v2018 (http://www.girinst.org/server/RepBase/index.php) database. All repeat sequences were then annotated with RepeatMasker using a reference library that combined the MITEs, LTRs, species-specific library, and homology library. The LTR Assembly Index (LAI), implemented in LTR_retriever, was used to assess the assembly quality of the genome. A total of 285.6 Mb of repetitive elements were identified, constituting 57.8% of the R. laevigata genome. The most abundant repeat class was the LTRs (42.14%), within which Copia and Gypsy were the two dominant superfamilies, accounting for 16.9% and 24.7% of the genome, respectively (Fig. 5, Table 3).

Table 3 Repetitive elements in the genome assembly of R. laevigata.

Gene prediction and annotation

Gene annotation was carried out on the repeat-masked genome. Protein-coding genes were annotated by integrating transcriptional evidence, homology-based prediction, and ab initio prediction. For transcriptional evidence, the RNA-seq data from each tissue were trimmed separately with fastp, and the clean data were aligned to the assembled R. laevigata genome using HISAT2 v2.2.1³⁶. Transcripts were assembled using StringTie v2.2.1³⁷ with default parameters; the transcripts from all samples were then merged and subjected to protein-coding sequence prediction and quality filtering with TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki), which is available in PASA v2.4.1³⁸. Only complete transcripts were retained for subsequent analysis. For homology-based prediction, protein sequences from Arabidopsis thaliana³⁹, Crataegus pinnatifida var. major⁴⁰, Eriobotrya japonica⁴¹, Fragaria vesca⁴², Malus domestica⁴³, Potentilla anserina⁴⁴, Prunus armeniaca⁴⁵, Pyrus communis⁴⁶, Rosa chinensis⁴⁷, Rosa rugosa⁴⁸, and Rubus idaeus⁴⁹ were mapped to the assembled genome using GeMoMa v1.9⁵⁰ to identify high-quality protein-coding gene structures. Ab initio gene prediction was carried out using Augustus v3.3.3⁵¹, GeneMark-ES Suite v4.57⁵², and SNAP v2006-07-28⁵³, each trained on the high-quality transcripts from the previous step and run according to its manual. All predicted gene structures were integrated into a nonredundant gene set using EVidenceModeler (EVM) v1.1.1³⁸, with weights of 10, 7, and 3 for high-quality RNA-seq transcripts, homologous proteins, and ab initio predictions, respectively (Table 4). The EVM-predicted genes were further refined with PASA v2.4.1³⁸ to identify untranslated regions and alternative splicing. In total, 37,117 protein-coding genes were predicted and annotated, with an average gene length of 3,047.31 bp (Table 5).
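EVidenceModeler reads its evidence weights from a tab-delimited file with three columns: evidence class, source label, and weight. The sketch below shows one way the weights stated above (10, 7, and 3) might be written in that layout. It is an illustration only: the authors' actual weights file is not given, and the source labels in the second column are hypothetical placeholders that would have to match the source column of the corresponding GFF3 evidence files.

```python
from typing import List, Tuple

# (evidence class, source label, weight)
# Source labels are hypothetical placeholders; weights follow the text.
WEIGHTS: List[Tuple[str, str, int]] = [
    ("TRANSCRIPT", "hypothetical_transcripts", 10),
    ("PROTEIN", "hypothetical_gemoma", 7),
    ("ABINITIO_PREDICTION", "hypothetical_augustus", 3),
    ("ABINITIO_PREDICTION", "hypothetical_genemark", 3),
    ("ABINITIO_PREDICTION", "hypothetical_snap", 3),
]


def format_weights(rows: List[Tuple[str, str, int]]) -> str:
    """Render weight rows as tab-delimited text, one row per line."""
    return "".join(f"{cls}\t{source}\t{weight}\n" for cls, source, weight in rows)


weights_text = format_weights(WEIGHTS)
print(weights_text, end="")
```

Giving transcript evidence the highest weight means that, where evidence types disagree, the consensus gene model tends to follow the RNA-seq-supported structure.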

Table 4 Number of genes annotated in the assembled genome of R. laevigata based on the different methods used.

Table 5 Statistics for the genetic structure in the R. laevigata genome.

Functional annotation of the protein-coding genes was based on sequence similarity and the identification of conserved domains. For sequence similarity, the protein-coding genes were searched against the Universal Protein Knowledgebase (UniProt) database⁵⁴, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database⁵⁵, and the eggNOG database⁵⁶ using DIAMOND v2.0.11⁵⁷ with an E-value cut-off of 1e-5. For conserved domains, InterProScan v5.52⁵⁸ was used to detect and classify domains and motifs against the Pfam⁵⁹, SMART⁶⁰, PANTHER⁶¹, PRINTS⁶², and ProDom⁶³ databases. Together, the two approaches functionally annotated 34,074 (91.8%) of the protein-coding genes (Table 6).
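The E-value filter described above can be illustrated on BLAST-style tabular output, in which the E-value and bit score are the 11th and 12th tab-separated columns. The sketch below is a hypothetical post-processing step, not the authors' procedure: it keeps, for each query, the highest-scoring hit with an E-value of at most 1e-5. The function name `best_hits` and the toy gene and protein identifiers are invented for illustration.

```python
from typing import Dict, Iterable, Tuple

EVALUE_CUTOFF = 1e-5  # cut-off stated in the text


def best_hits(
    lines: Iterable[str], cutoff: float = EVALUE_CUTOFF
) -> Dict[str, Tuple[str, float, float]]:
    """Map each query to its best (subject, evalue, bitscore) passing the cutoff."""
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > cutoff:
            continue  # hit is too weak to support an annotation
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy alignments with hypothetical identifiers (12 tabular columns each).
toy_hits = [
    "geneA\tprotein1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "geneA\tprotein2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t150",
    "geneB\tprotein3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30",
]
hits = best_hits(toy_hits)
print(sorted(hits))  # geneB is dropped because its only hit fails the cutoff

# Annotated fraction reported in the text: 34,074 of 37,117 genes.
annotated_pct = 100.0 * 34074 / 37117
print(f"{annotated_pct:.1f}% of genes annotated")
```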

Table 6 Information on the number and ratio of functionally annotated genes based on different sequence databases.

Noncoding RNAs were annotated using RNAmmer v1.2⁶⁴ for rRNA, tRNAscan-SE v2.0.9⁶⁵ for tRNA, and the cmscan module in Infernal v1.1.2⁶⁶ for miRNA and snRNA. All predictions were performed with default parameters, except that tRNAs were filtered to retain those with a score >40. In total, 3,182 noncoding RNAs were predicted (Table 7), including 1,115 transfer RNAs (tRNAs), 1,289 ribosomal RNAs (rRNAs), 151 microRNAs (miRNAs), and 627 small nuclear RNAs (snRNAs).

Table 7 The number and length of the non-coding RNAs annotated in the assembled genome of R. laevigata.

Data Records

The Illumina short reads, PacBio long-reads, Hi-C sequencing data, and RNA-seq data have been deposited in the National Center for Biotechnology Information Sequence Read Archive (SRA) database with the accession number SRP511807⁶⁷. The chromosome-scale genome assembly has been deposited in DDBJ/ENA/GenBank under the accession number JBEFKI000000000⁶⁸. The genome annotation files were submitted to Figshare⁶⁹.

Technical Validation

To assess the accuracy of the assembly, BWA v0.7.17⁷⁰ and Minimap2 v2.24⁷¹ were used to remap the Illumina short reads and PacBio long reads, respectively, to the final assembled genome. Mapping rates of 99.8% and 99.9% were achieved for the short and long reads, respectively. Benchmarking Universal Single-Copy Orthologs (BUSCO v5.4.7⁷²) was used to further evaluate genome completeness by searching against the embryophyta_odb10 database. The BUSCO analysis showed that the final genome sequence contained 99.0% (1,598) complete BUSCOs, comprising 95.4% (1,540) single-copy and 3.6% (58) duplicated BUSCOs, together with 0.6% (9) fragmented and 0.4% (7) missing BUSCOs (Table 8). The LTR Assembly Index (LAI) was used to assess the quality of the R. laevigata genome assembly. The LAI value of 25.21 indicates that the R. laevigata genome assembly meets the standard of a reference-quality genome (Table 9). Merqury v1.3⁷³ was used to assess the consensus quality value (QV) of the R. laevigata genome assembly. The QVs estimated with HiFi and Illumina k-mers were 64.8 and 55.3, respectively (Supplementary Figure 1). Together, these results indicate that the R. laevigata genome assembly has high accuracy and integrity.
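Two of the figures above lend themselves to a quick check. The BUSCO percentages follow from the reported counts, which sum to 1,614 searched orthologs, and a consensus QV is a Phred-scaled value, QV = −10·log₁₀(E), so it can be converted back to an approximate per-base error rate E. The sketch below is illustrative only; the function name `qv_to_error_rate` is hypothetical and the inputs are the numbers quoted in the text.

```python
# BUSCO category counts quoted in the text.
busco_counts = {
    "complete single-copy": 1540,
    "complete duplicated": 58,
    "fragmented": 9,
    "missing": 7,
}
total_buscos = sum(busco_counts.values())  # number of orthologs searched

for category, count in busco_counts.items():
    print(f"{category}: {100.0 * count / total_buscos:.1f}%")

complete = busco_counts["complete single-copy"] + busco_counts["complete duplicated"]
print(f"complete: {100.0 * complete / total_buscos:.1f}%")


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-qv / 10.0)


for label, qv in (("HiFi k-mers", 64.8), ("Illumina k-mers", 55.3)):
    print(f"QV {qv} ({label}): ~{qv_to_error_rate(qv):.1e} errors per base")
```

On this scale, every 10 QV units corresponds to a tenfold lower estimated error rate, so both values imply fewer than one error per 100,000 bases.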

Table 8 Result of the BUSCO assessment of R. laevigata genome.

Table 9 Statistics for genome completeness in the R. laevigata genome.

Furthermore, BUSCO was used to evaluate the completeness of the R. laevigata genome annotation by searching against the embryophyta_odb10 database. The BUSCO analysis showed that 95.1% of the conserved orthologous genes were complete in the predicted protein-coding genes, comprising 90.8% single-copy and 4.3% duplicated genes (Table 10).

Table 10 Result of the BUSCO assessment of R. laevigata protein coding genes.

Code availability

No in-house code or scripts were used in this study. Commands and pipelines used for data processing were executed using their corresponding default parameters.

References

Yokoya, K. Nuclear DNA Amounts in Roses. Ann. Bot. 85, 557–561 (2000).
Article CAS MATH Google Scholar
Gu, C. Z. & Kenneth R. R. Rosa. In Wu, Z.Y., Raven, P.H. and Hong, D. Y. (Eds.). Flora of China. Science Press, Beijing, China and Missouri Botanical Garden Press, St. Louis. 9, 339–381 (2003).
Fan, Y. L. et al. Screening of Rosa germplasm resources with resistance to aphids. J. Yunnan Univ. Nat. Sci. Ed. 43, 619–628 (2021).
MATH Google Scholar
Qiu, X. Q. et al. Powdery mildew resistance identification of wild Rosa germplasms. Acta Hortic. 1064, 329–335 (2015).
Article MATH Google Scholar
Gao, P. Y., Si, X. X., Liu, X. G. & Li, D. Q. Research Progress in Extraction, Purification of Triterpenoids From Rosa laevigata Michx. and Its Anti-Alzheimer’s Disease Activity. Contemp. Chem. Ind. 51, 1196–1200 (2022).
MATH Google Scholar
Yuan, J. Q. et al. New Triterpene Glucosides from the Roots of Rosa laevigata Michx. Molecules 13, 2229–2237 (2008).
Article CAS PubMed PubMed Central MATH Google Scholar
Dai, H. N. et al. Triterpenoids from roots of Rosa laevigata. Chin. Tradit. Herb. Drugs 47, 374–378 (2016).
CAS MATH Google Scholar
Yoshida, T., Tanaka, K., Chen, X. M. & Okuda, T. Tannis of rosaceous medicinal plants. V. Hydrolyzable tannis with dehydrodigalloyl group from Rosa laevigata Michx. Chem. Pharm. Bull. 37, 920–924 (1989).
Article CAS Google Scholar
Fan, X. R., Li, R. R., Lin, M. L., Liao, D. F. & Li, C. Research progress in medicinal parts of Rosa laevigata Michx. Chin. Pharm. J. 53, 1333–1341 (2018).
MATH Google Scholar
Huang, Y. L. & Liu, Y. Experimental study on the antineoplastic effect of polysaccharide from fructus Rosae Laevigatae in vitro. Genomics Appl. Biol. 34, 1848–1851 (2015).
MATH Google Scholar
An, D. Q., Yan, J. X. & Wang, X. L. UV determination of content of total flavonoids in Rosa laevigata Michx. from different places in different harvest period. J. Anhui Agric. Sci. 43, 79–80+83 (2015).
CAS Google Scholar
Li, B. L., Yuan, J. & Wu, J. W. A Review on the Phytochemical and Pharmacological Properties of Rosa laevigata: A Medicinal and Edible Plant. Chem. Pharm. Bull. 69, 421–431 (2021).
Article CAS MATH Google Scholar
Gao, P. et al. Extraction and isolation of polyhydroxy triterpenoids from Rosa laevigata Michx. fruit with anti-acetylcholinesterase and neuroprotection properties. RSC Adv. 8, 38131–38139 (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Choi, S. J. et al. Protective effect of Rosa laevigata against amyloid beta peptide-induced oxidative stress. Amyloid 13, 6–12 (2009).
Article MATH Google Scholar
Jung Choi, S. et al. Ameliorative effect of 1,2-benzenedicarboxylic acid dinonyl ester against amyloidbetapeptide-induced neurotoxicity. Amyloid 16, 15–24 (2009).
Article PubMed MATH Google Scholar
Fayaz, F., Singh, K., Gairola, S., Ahmed, Z. & Shah, B. A. A Comprehensive Review on Phytochemistry and Pharmacology of Rosa Species (Rosaceae). Curr. Top Med. Chem. 24, 364–378 (2024).
Article CAS PubMed Google Scholar
Jian, H. et al. Decaploidy in Rosa praelucens Byhouwer (Rosaceae) Endemic to Zhongdian Plateau, Yunnan, China. Caryologia 63, 162–167 (2014).
Article Google Scholar
Roberts, A. V., Gladis, T. & Brumme, H. DNA amounts of roses (Rosa L.) and their use in attributing ploidy levels. Plant Cell Rep. 28, 61–71 (2008).
Article PubMed Google Scholar
Nakamura, N. et al. Genome structure of Rosa multiflora, a wild ancestor of cultivated roses. DNA Res. 25, 113–121 (2018).
Article CAS PubMed MATH Google Scholar
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Article PubMed PubMed Central MATH Google Scholar
Jian, H. Y. et al. Karyological diversity of wild Rosa in Yunnan, southwestern China. Genet. Resour. Crop Evol. 60, 115–127 (2013).
Article MATH Google Scholar
Li, S. Q., Zhang, C. & Gao, X. F. Estimation of nuclear DNA content of 17 Chinese wild rose species by flow cytometry. Plant Sci. J. 35, 558–565 (2017).
MATH Google Scholar
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770, (2011).
Article PubMed PubMed Central MATH Google Scholar
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Article CAS PubMed PubMed Central Google Scholar
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
Article CAS PubMed PubMed Central MATH Google Scholar
Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst. 3, 95–98 (2016).
Article CAS PubMed PubMed Central MATH Google Scholar
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Article ADS CAS PubMed PubMed Central MATH Google Scholar
Robinson, J. T. et al. Juicebox.js Provides a Cloud-Based Visualization System for Hi-C Data. Cell Syst. 6, 256–258.e251 (2018).
Article CAS PubMed PubMed Central MATH Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Article CAS PubMed PubMed Central MATH Google Scholar
Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199–e199 (2010).
Article PubMed PubMed Central MATH Google Scholar
Ou, S. & Jiang, N. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol. 176, 1410–1422 (2018).
Article CAS PubMed MATH Google Scholar
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
Article PubMed PubMed Central Google Scholar
Ou, S. & Jiang, N. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. Mob. DNA 10, 48 (2019).
Article CAS PubMed PubMed Central MATH Google Scholar
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. 117, 9451–9457 (2020).
Article ADS CAS PubMed PubMed Central MATH Google Scholar
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360, (2015).
Article CAS PubMed PubMed Central Google Scholar
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Article CAS PubMed PubMed Central Google Scholar
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9, R7 (2008).
Article PubMed PubMed Central MATH Google Scholar
Lamesch, P. et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 40, D1202–D1210 (2012).
Article CAS PubMed Google Scholar
Zhang, T. et al. Cultivated hawthorn (Crataegus pinnatifida var. major) genome sheds light on the evolution of Maleae (apple tribe). J. Integr. Plant Biol. 64, 1487–1501 (2022).
Article CAS PubMed MATH Google Scholar
Wang, Y. A draft genome, resequencing, and metabolomes reveal the genetic background and molecular basis of the nutritional and medicinal properties of loquat (Eriobotrya japonica (Thunb.) Lindl). Hortic. Res. 8, 231 (2021).
Article PubMed PubMed Central Google Scholar
Edger, P. P. et al. Single-molecule sequencing and optical mapping yields an improved genome of woodland strawberry (Fragaria vesca) with chromosome-scale contiguity. GigaScience 7, 1–7 (2018).
Article CAS PubMed MATH Google Scholar
Qin, S. et al. A chromosome-scale genome assembly of Malus domestica, a multi-stress resistant apple variety. Genomics 115, 110627 (2023).
Article CAS PubMed MATH Google Scholar
Gan, X. et al. Chromosome-Level Genome Assembly Provides New Insights into Genome Evolution and Tuberous Root Formation of Potentilla anserina. Genes 12, 1993 (2021).
Article CAS PubMed PubMed Central MATH Google Scholar
Groppi, A. et al. Population genomics of apricots unravels domestication history and adaptive events. Nat. Commun. 12, 3956 (2021).
Article ADS CAS PubMed PubMed Central MATH Google Scholar
Yocca, A. et al. A chromosome-scale assembly for ‘d’Anjou’ pear. G3 14, jkae003 (2024).
Article CAS PubMed PubMed Central Google Scholar
Hibrand, L. et al. A high-quality genome sequence of Rosa chinensis to elucidate ornamental traits. Nat. Plants 4, 473–484 (2018).
Article Google Scholar
Chen, F. et al. A chromosome-level genome assembly of rugged rose (Rosa rugosa) provides insights into its evolution, ecology, and floral characteristics. Hortic. Res. 8, 141 (2021).
Article CAS PubMed PubMed Central Google Scholar
Sassa, H. et al. Chromosome-scale genome sequence assemblies of the ‘Autumn Bliss’ and ‘Malling Jewel’ cultivars of the highly heterozygous red raspberry (Rubus idaeus L.) derived from long-read Oxford Nanopore sequence data. Plos One 18, e0285756 (2023).
Article Google Scholar
Keilwagen, J., Hartung, F., Paulini, M., Twardziok, S. O. & Grau, J. Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics 19, 1–12 (2018).
Article Google Scholar
Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 33, W465–W467 (2005).
Article CAS PubMed PubMed Central MATH Google Scholar
Besemer, J. & Borodovsky, M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451–W454 (2005).
Article CAS PubMed PubMed Central MATH Google Scholar
Leskovec, J. & Sosič, R. SNAP: A General-Purpose Network Analysis and Graph-Mining Library. ACM Trans. Intell. Syst. Technol. 8, 1–20 (2016).
Article PubMed PubMed Central MATH Google Scholar
The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017).
Article Google Scholar
Kanehisa, M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
Article CAS PubMed PubMed Central MATH Google Scholar
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
Article CAS PubMed Google Scholar
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2014).
Article PubMed MATH Google Scholar
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Article CAS PubMed PubMed Central MATH Google Scholar
Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).
Article CAS PubMed MATH Google Scholar
Schultz, J. SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 28, 231–234 (2000).
Article CAS PubMed PubMed Central MATH Google Scholar
Mi, H. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 33, D284–D288 (2004).
Article PubMed Central MATH Google Scholar
Attwood, T. K. The PRINTS database: A resource for identification of protein families. Brief. Bioinform. 3, 252–263, (2002).
Article CAS PubMed MATH Google Scholar
Bru, C. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 33, D212–D215 (2004).
Article PubMed Central MATH Google Scholar
Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007).
Article ADS CAS PubMed PubMed Central MATH Google Scholar
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence. Nucleic Acids Res. 25, 955–964 (1997).
Article CAS PubMed PubMed Central MATH Google Scholar
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP511807 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc:JBEFKI000000000.1 (2024).
Zhang, Y. H. Genome assembly and annotation files of Rosa laevigata. Figshare https://doi.org/10.6084/m9.figshare.25998949 (2024).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Manni, M. et al. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012).


Acknowledgements

The authors thank Guodong Li and Chunlin Gao from Yunnan University of Chinese Medicine and Ticao Zhang from Kunming Institute of Botany, Chinese Academy of Sciences for their technical assistance and valuable discussions. This work was financially supported by the Technology Talents and Innovation Team Project of Yunnan Province (No. 202305AS350002 to Jian) and the National Natural Science Foundation of China (No. 31760048 to Zhang).

Author information

Authors and Affiliations

School of Life Sciences, Yunnan Normal University, Kunming, 650500, China
Yi Wang & Yonghong Zhang
Flower Research Institute, Yunnan Academy of Agricultural Sciences, Kunming, 650205, China
Huijun Yan, Xianqin Qiu, Hao Zhang & Hongying Jian

Authors

Yi Wang
Huijun Yan
Xianqin Qiu
Hao Zhang
Yonghong Zhang
Hongying Jian

Contributions

Jian H.Y., Yan H.J., Qiu X.Q. and Zhang H. conceived the research and collected materials. Wang Y. assembled the sequences and analyzed the data. Wang Y. and Zhang Y.H. prepared the manuscript. Zhang Y.H. and Jian H.Y. revised the manuscript. All authors read, edited and approved the final manuscript.

Corresponding authors

Correspondence to Yonghong Zhang or Hongying Jian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, Y., Yan, H., Qiu, X. et al. A high-quality chromosome-scale genome assembly of the Cherokee rose (Rosa laevigata). Sci Data 12, 132 (2025). https://doi.org/10.1038/s41597-025-04461-7

Received: 22 July 2024
Accepted: 13 January 2025
Published: 22 January 2025
DOI: https://doi.org/10.1038/s41597-025-04461-7