Background & Summary

The genus Rosa belongs to Rosaceae and comprises 150–200 species widely distributed in the temperate and subtropical regions of the northern hemisphere1,2. Roses are well known for their remarkable ornamental value in horticulture; for example, R. chinensis serves as an important parent source for modern ornamental rose varieties3. In addition, roses are rich in essential oils and other bioactive compounds such as tannins, flavonoids, and phenolic acids4,5,6,7,8,9, suggesting that roses have wide applications in the food, cosmetics, and pharmaceutical industries. So far, only a limited number of rose genomes have been sequenced and reported, including eight horticultural varieties: R. rugosa10,11, R. chinensis ‘Chilong Hanzhu’12, R. chinensis ‘Old Blush’3, R. chinensis ‘Samantha’13, R. wichuraiana ‘Basye’s Thornless’14, R. chinensis15, R. multiflora16, R. gigantea17, and two wild species: R. roxburghii and R. sterilis18. The lack of high-quality reference genomes has limited in-depth research on the breeding, cultivation, and utilization of wild Rosa species.

Rosa hugonis Hemsl. (Fig. 1) is a perennial shrub widely distributed across many provinces in western and northern China2. One study showed that R. hugonis retains atmospheric suspended particles, suggesting a potential role in air purification19. The strong ecological adaptability and population renewal ability of R. hugonis have been demonstrated in several studies, indicating that it is a suitable species for the restoration of dry valleys20,21. Because of its strong adaptability to nutrient-poor and arid environments, R. hugonis has also been shown to be a suitable rootstock for grafting roses22. The fragrance of R. hugonis petals has been shown to derive mainly from 40 different organic compounds23. Recently, petal extracts of R. hugonis (primarily phenolic compounds) have been shown to have neuroprotective effects24. In summary, R. hugonis holds significant potential for applications in ecological restoration, horticultural breeding, and the development and utilization of bioactive compounds. However, a high-quality genome for R. hugonis is still missing, which has hindered further research.

Fig. 1

Photographs illustrating the morphology of R. hugonis showing the flower (a) and leaves (b & c).

In this study, we report the first chromosome-level genome assembly of R. hugonis. The contig-level assembly generated with PacBio single-molecule DNA sequencing technology25 was 337.96 Mb in size, with a contig N50 of 28.14 Mb. To obtain a high-quality chromosome-level assembly, high-throughput chromatin conformation capture (Hi-C)26 was used to cluster the contigs into seven pseudochromosomes, which correspond to 99.68% of the total contig length. The final assembled genome size of R. hugonis was 337.92 Mb, with a scaffold N50 of 26.84 Mb. A total of 36,218 putative protein-coding genes (PCGs) were predicted in R. hugonis, of which 93.73% were annotated in publicly available databases. The chromosome-level genome assembly of R. hugonis provides a valuable platform for elucidating the mechanisms by which it adapts to adverse environments and for advancing its development and utilization.

Methods

Plant materials and polyploidy estimation

Plants of R. hugonis were transplanted from the wild into the greenhouse of the Chengdu Institute of Biology, Chinese Academy of Sciences (CIB, CAS). The materials were collected from Maoxian County, Sichuan Province (latitude: 31.520581, longitude: 103.692792, altitude: 1,625 m). Voucher specimens were deposited in the herbarium CDBI.

We used newly emerged roots to observe the chromosome number of R. hugonis. The roots were collected and immersed in a 0.002 mol/L aqueous solution of 8-hydroxyquinoline at a constant temperature of 15 °C for 3–4 h, then fixed in Carnoy’s solution for 0.5–24 h, macerated in 1 N HCl at 60 °C for 5 min, and squashed in carbol fuchsin. Cells in mitotic metaphase were observed and photographed using an Olympus BX43 microscope (Japan). The results showed that R. hugonis is diploid, with 14 chromosomes in mitotic metaphase (Fig. 2).

Fig. 2

Mitotic metaphase chromosomes of R. hugonis (2n = 14).

DNA extraction and sequencing

Fresh young leaves were collected from a mature R. hugonis plant and sent to Berry Genomics Company (Beijing, China) for genome sequencing. For PacBio sequencing, high-quality genomic DNA was extracted from fresh leaves using the CTAB method27. DNA quality and concentration were assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific), agarose gel electrophoresis, and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). A 15 kb library was constructed using a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA). We used the Agilent 2100 Bioanalyzer system to evaluate the size and quality of the library. The library was sequenced using a single 8M SMRT Cell on the PacBio Sequel II platform (Pacific Biosciences, CA, USA). The PacBio SMRT-Analysis Link (https://www.pacb.com) was used for quality control of the raw polymerase reads. For Hi-C sequencing, extracted DNA was first crosslinked with 40 ml of 2% formaldehyde solution to capture interacting DNA segments. Subsequently, the crosslinked DNA was digested with the DpnII restriction enzyme, and libraries were constructed and sequenced on an Illumina NovaSeq 6000 platform with paired-end 150 bp reads. For transcriptome sequencing, fresh tissue samples including stem, leaf, and flower were collected from the same R. hugonis plant and immediately frozen in liquid nitrogen. Total RNA was subsequently extracted using the RNAprep Pure Polysaccharide Polyphenols Plant Total RNA Extraction Kit (Tiangen, Beijing, China). The concentration and quality of RNA were assessed using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and a Bioanalyzer 2100/4200 system (Agilent Technologies, CA, USA). Subsequently, paired-end cDNA libraries were prepared from mRNA enriched with Oligo-dT magnetic beads, fragmented, circularized, and then subjected to PE150 sequencing on the Illumina NovaSeq 6000 platform.

Genome assembly and quality control

Hifiasm v0.15.228 was used to generate the draft assembly from CCS reads. To remove heterozygous contigs, we used minimap2 v2.1329 to align the CCS reads to the assembled genome sequences. The heterozygous contigs were then removed based on the coverage distribution of the aligned reads and their alignment scores. To remove pseudo contigs, we used minimap2 to align the CCS reads to the genome sequences from which the heterozygous contigs had been removed. Based on the alignment results, contigs with an average coverage depth of less than 5× were considered potential pseudo contigs and were removed. After polishing, the contig assembly had a total size of ~337.96 Mb, with a contig N50 value of 28.14 Mb (Table 1). Next, to construct a high-quality reference genome, the Hi-C library was prepared using the method described previously30,31. We removed low-quality reads to avoid artificial bias. The filtered Hi-C reads were aligned to the initial draft genome with BWA, which is integrated into Juicer v1.6.232. Only uniquely mapped and validated paired-end reads were used for assembly with the 3D-DNA pipeline33. Juicebox v1.9.834 was used to manually order the scaffolds to obtain the final chromosome assembly. Contact maps were plotted with HiCExplorer v3.335. We obtained 49 high-quality contigs (contig N50 = 26.84 Mb), with a total assembly size of 337.92 Mb, and anchored 331.31 Mb onto seven pseudochromosomes using Hi-C data (Table 2; Fig. 3).
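The depth filter and the N50 statistic described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the contig names, lengths, and aligned-base counts are hypothetical toy values, and `drop_low_depth` and `n50` are illustrative helpers rather than functions of hifiasm or minimap2. Mean depth is assumed here to be the total number of aligned bases divided by the contig length.

```python
def drop_low_depth(contigs, min_depth=5.0):
    """Keep contigs whose mean coverage depth is at least min_depth.

    contigs maps a contig name to (length_bp, aligned_bases).
    Mean depth is assumed to be aligned_bases / length_bp.
    """
    return {
        name: (length, bases)
        for name, (length, bases) in contigs.items()
        if bases / length >= min_depth
    }


def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy data: name -> (length, aligned bases)
toy_contigs = {
    "ctg1": (50, 1500),  # mean depth 30x
    "ctg2": (40, 1200),  # mean depth 30x
    "ctg3": (30, 60),    # mean depth 2x, below the 5x threshold
    "ctg4": (20, 700),   # mean depth 35x
}

kept = drop_low_depth(toy_contigs)
kept_lengths = [length for length, _ in kept.values()]
toy_n50 = n50(kept_lengths)
print(sorted(kept), toy_n50)
```

With these toy values, `ctg3` is dropped and the N50 of the remaining lengths (50, 40, 20) is 40, because the two longest contigs are needed to reach half of the total length of 110.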

Table 1 Statistics of R. hugonis genome assembly.
Table 2 Statistics of the chromosome distribution of the R. hugonis genome.
Fig. 3

Hi-C interaction heatmap of R. hugonis showing that contigs were assembled into seven pseudochromosomes.

Genome assembly completeness was assessed with BUSCO (Benchmarking Universal Single-Copy Orthologs) using the embryophyta_odb10 database36. The analysis revealed a completeness of 98.6% (94.7% single-copy BUSCOs) (Table 1). In addition, the CCS reads were mapped to the assembled genome sequence using minimap2 v2.13 with default settings, which gave a mapping rate of 99.96% (Table 1).
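In BUSCO summaries, the complete category is the sum of the single-copy and duplicated categories, so the duplicated fraction follows from the two percentages reported above. The short sketch below works through that arithmetic. The function name is hypothetical, and the derived value is an inference from the rounded percentages rather than a figure reported in this section.

```python
def duplicated_busco_percent(complete_pct, single_copy_pct):
    """Derive the duplicated BUSCO percentage from the complete and
    single-copy percentages (complete = single-copy + duplicated)."""
    if single_copy_pct > complete_pct:
        raise ValueError("single-copy cannot exceed complete")
    return round(complete_pct - single_copy_pct, 1)


# Percentages reported in the text for embryophyta_odb10
dup = duplicated_busco_percent(98.6, 94.7)
print(f"Duplicated BUSCOs: about {dup}%")
```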

Genome annotation

We used MITE-Hunter v1.037 to identify miniature inverted-repeat transposable elements (MITEs), which are widely present in the genome. LTRharvest38 and LTR Finder v1.0739 were used to detect long terminal repeat sequences (LTRs) in the genome, and LTR retriever v2.8.240 was used to integrate the predictions of these two tools. For homology-based evidence, RepeatMasker v4.1.041 was used to search the genome for sequences similar to known repeats in the RepBase repetitive sequence database (http://www.girinst.org/repbase). RepeatModeler v2.042 was used for de novo identification of other repetitive sequences in the repeat-masked genome. The results showed that the R. hugonis genome comprises 48.8% repetitive sequences totaling 164.9 Mb, with LTRs and DNA transposons constituting 34.37% and 5.11%, respectively. Among LTRs, Gypsy and Copia are the two most abundant types, accounting for 11.17% and 10.67%, respectively.
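The repeat proportions above can be cross-checked with simple arithmetic. The sketch below assumes that all percentages are expressed relative to the final assembly size of 337.92 Mb; that denominator is an assumption, since the text does not state it explicitly, and the helper names are hypothetical. The per-class sizes in Mb are derived values, not figures reported in the text.

```python
ASSEMBLY_MB = 337.92  # final assembly size reported in the text


def percent_of_assembly(size_mb, assembly_mb=ASSEMBLY_MB):
    """Express a sequence size in Mb as a percentage of the assembly."""
    return size_mb / assembly_mb * 100


def size_from_percent(percent, assembly_mb=ASSEMBLY_MB):
    """Convert a percentage of the assembly back into Mb."""
    return percent / 100 * assembly_mb


repeat_pct = percent_of_assembly(164.9)  # total repeats reported as 164.9 Mb
ltr_mb = size_from_percent(34.37)        # derived, assumes 337.92 Mb denominator
dna_tp_mb = size_from_percent(5.11)      # derived, assumes 337.92 Mb denominator

print(f"Repeats: {repeat_pct:.1f}% of the assembly")
print(f"LTRs: ~{ltr_mb:.2f} Mb, DNA transposons: ~{dna_tp_mb:.2f} Mb")
```

Under this assumption, 164.9 Mb corresponds to 48.8% of the assembly, which agrees with the reported proportion.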

A comprehensive strategy combining ab initio prediction, protein-based homology searches, and RNA sequencing was used for gene structure annotation. AUGUSTUS v3.2.243, SNAP v6.044, Glimmerhmm v3.0.445 and GeneMark-ES Suite v4.5746 were used to predict gene structure in the repeat-masked genome. GeMoMa v1.7.147 was used to perform homology prediction and then obtain exon and intron boundary information based on the comparison between the transcript and the genome. HISAT2 v2.0.648 was used to align RNA-Seq reads to the genome sequence, and Cufflinks49 was used to assemble transcripts to obtain full-length transcript sequences. PASA vr2014041750 was used to predict open reading frames (ORFs) from the full-length transcript sequences. EVidenceModeler v1.1.151 was used to integrate the above predictions, and UTRs and alternative splicing variants were annotated with PASA. A total of 36,218 predicted gene models were obtained (Table 3). The Circos tool (http://www.circos.ca) was used to visualize gene density, GC content, and repeat content on each pseudochromosome (Fig. 4).

Table 3 Statistics on gene prediction results of R. hugonis.
Fig. 4

The Circos map of the genomic features of R. hugonis. (a) The seven pseudo-chromosomes; (b) Gene count; (c) Repeat content; (d) GC content; (e) Collinearity.

For functional annotation, all PCGs were aligned to three integrated protein sequence databases: NR (v202108, ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz) and SwissProt (v1.7.1, https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz) using BLAST v2.2.3152 with an e-value ≤ 1e-5, and eggNOG (http://eggnog5.embl.de/#/app/home) using eggNOG-mapper v2.0.153 with default settings. Protein domains were annotated with InterPro, and the Gene Ontology (GO) terms for each gene were obtained from the corresponding InterPro entry. The pathways in which the genes might be involved were assigned using BLAST v2.2.31 against the KEGG databases (https://www.genome.jp/kegg/brite.html). Among all the PCGs, 33,946 genes (93.73%) were functionally annotated in at least one database, and 4,959 genes (13.69%) were annotated in all five databases (Fig. 5).
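The two summary counts above correspond to a set union (annotated in at least one database) and a set intersection (annotated in all five databases). The following Python sketch illustrates this with hypothetical gene identifiers and toy hit sets; it is not the annotation pipeline used in this study, and `annotation_summary` is an illustrative helper. The final lines recompute the reported percentages from the reported counts.

```python
def annotation_summary(all_genes, hits_by_db):
    """Count genes annotated in at least one database (union)
    and genes annotated in every database (intersection)."""
    hit_sets = list(hits_by_db.values())
    annotated_any = set().union(*hit_sets) & all_genes
    annotated_all = set.intersection(*hit_sets) & all_genes
    return len(annotated_any), len(annotated_all)


# Hypothetical toy data: five genes and their hits in five databases
toy_genes = {"g1", "g2", "g3", "g4", "g5"}
toy_hits = {
    "NR": {"g1", "g2", "g3", "g4"},
    "SwissProt": {"g1", "g2"},
    "eggNOG": {"g1", "g3"},
    "GO": {"g1"},
    "KEGG": {"g1", "g2"},
}
n_any, n_all = annotation_summary(toy_genes, toy_hits)
print(n_any, n_all)

# Percentages recomputed from the counts reported in the text
TOTAL_PCGS = 36218
pct_any = round(33946 / TOTAL_PCGS * 100, 2)
pct_all = round(4959 / TOTAL_PCGS * 100, 2)
print(pct_any, pct_all)
```

In the toy data, four genes have a hit in at least one database and only `g1` has hits in all five. The recomputed percentages, 93.73 and 13.69, match the values given in the text.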

Fig. 5

Venn diagram of the annotation of R. hugonis PCGs in five databases: NR, eggNOG, GO, KEGG and SwissProt.

tRNAscan-SE v2.054 was used to predict tRNAs in the genome sequence. Other types of ncRNA were annotated against the Rfam database using BLAST v2.2.31. We identified 702 rRNAs, 669 tRNAs, 105 miRNAs, 194 snRNAs and 470 snoRNAs in the R. hugonis assembly.

Data Records

The data that support the findings of this study have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject PRJNA118110955 and in the CNGB Sequence Archive (CNSA)56 of the China National GeneBank DataBase (CNGBdb)57 under accession number CNP000555858,59,60,61,62,63. The chromosome-level genome assembly has been deposited at CNGB under accession number CNA013850864 and at GenBank under accession number JBJCIR00000000065.

Technical Validation

The quality of the R. hugonis genome assembly was evaluated using three methods. First, the interaction contact patterns in the Hi-C heatmap are organized around the main diagonal, directly supporting the accuracy of the chromosome assembly. Second, the BUSCO assessment of the genome assembly indicated a high level of completeness, with 98.6% (94.7% single-copy BUSCOs) complete matches to the embryophyta_odb10 dataset. Finally, CCS reads were mapped back to the assembly to assess its quality, and the mapping rate of 99.96% suggests that the genome is of high quality.

Genome repetitive sequences, gene structures, ncRNAs, and gene functions were annotated using transcript-based, de novo, and homology-based prediction methods. These methods resulted in the prediction of 36,218 gene models. Through alignment with public protein databases, functional annotations were obtained for 93.73% of the genes (33,946).