Background & Summary

Rhodiola kirilowii (Regel) Maxim. is a perennial herbaceous plant belonging to the Crassulaceae family. It is traditionally used in Tibetan medicine, primarily for its roots and rhizomes, which have been employed for centuries due to their reputed medicinal properties. This plant is native to the Qinghai-Tibet Plateau and is commonly found in alpine regions of China, including Tibet, Qinghai, Sichuan, Gansu, Yunnan, Xinjiang, Shaanxi, Shanxi, and Hebei, thriving at elevations ranging from 2000 m to 5600 m on rocky grasslands and slopes1.

Historically, R. kirilowii has been documented in classical texts such as the “Four Medical Tantras” for its benefits in balancing lung heat and preventing epidemics. The “Chinese Tibetan Materia Medica” describes its capabilities in detoxification and reducing swelling, indicating its traditional use in treating epidemic diseases, lung heat, intoxication, and limb swelling. Modern pharmacological studies have identified a range of active compounds in R. kirilowii, including salidroside, tyrosol, daucosterol, cyanogenic glycosides, bergenin, lotaustralin, and flavonoids2,3. These compounds contribute to the herb’s anti-hypoxic, anti-fatigue, anti-aging, and blood-activating effects, making it a valuable component in adaptogenic and anti-altitude sickness formulations.

Given its traditional and contemporary significance, studying the genome of R. kirilowii is crucial for several reasons. First, genome assembly and annotation can provide insights into the biosynthetic pathways responsible for its therapeutic compounds, potentially leading to enhanced cultivation practices and quality control in herbal medicine production. Second, understanding its genetic makeup can facilitate the development of more effective plant breeding strategies, aiming to increase the yield and potency of its active ingredients. Lastly, genomic research can uncover novel genes and pathways that may contribute to the plant’s adaptability to high-altitude environments, offering broader implications for plant biology and ecology. However, genomic resources for R. kirilowii are limited4, which restricts its utilization in traditional and modern medicine.

In this study, we assembled and annotated the genome of R. kirilowii at the chromosome level using MGI short-read sequencing, PacBio Revio long-read sequencing, Hi-C sequencing, and RNA sequencing (RNA-seq). We estimated genome size and heterozygosity from clean short reads, performed long-read sequencing on the PacBio Revio System, and combined the long reads with Hi-C reads to achieve a chromosome-level assembly. Furthermore, homoeologous chromosomes were identified in this tetraploid R. kirilowii. Genome annotation combined several sources of evidence: RNA-seq reads, published genomes of closely related species, and de novo prediction. Additionally, we assessed the quality of the genome assembly using several metrics. Our efforts culminated in the first high-quality reference genome of the genus Rhodiola, with 40 homoeologous chromosomes and one sex chromosome, providing essential genetic data for studying adaptive evolution, genetic diversity, and the genetics of biochemistry across the genus Rhodiola.

Methods

Sample collection

Rhodiola kirilowii (Regel) Maxim. (xh-4) was obtained from the Hongyuan Plateau Medicinal Plant Breeding Base of the Sichuan Grassland Science Research Institute (coordinates: 102.5442°, 32.7752°; elevation: 3495 m). Fresh leaves were collected and rinsed thoroughly with sterile water, and surface moisture was removed. The leaves were immediately preserved and transported in liquid nitrogen.

Library construction and sequencing

High-quality genomic DNA (gDNA) was extracted from the collected leaves following the manufacturer’s instructions. The integrity and purity of the gDNA samples were assessed using agarose gel electrophoresis. The high-quality gDNA was sent to Wuhan Frasergen Bioinformatics Co., Ltd. (Wuhan, China), for DNA extraction, library construction, and genomic sequencing. Libraries were prepared using the TruSeq DNA PCR-Free Library Prep Kit (Illumina, San Diego, CA, USA) and SMRTbell (Sage Science, MA, USA) following the manufacturers’ recommendations, with a 200 bp insert size for Illumina HiSeq 4000 sequencing and 20 kb fragments selected for PacBio Sequel II sequencing. In addition, an in situ Hi-C experiment was performed, in which chromatin was digested with the restriction enzyme MboI, and the Hi-C library was sequenced on the Illumina HiSeq 4000 platform (PE 150 bp) (Table 1).

Table 1 Library sequencing data and methods used in this study to assemble the R. kirilowii genome.

To improve the precision of genome annotation, RNA sequencing was conducted on two tissues: leaves and roots (three locations and three replicates). For each sample, RNA was extracted with TRIzol reagent (Invitrogen, USA), RNA purity and concentration were assessed using Nanodrop and Qubit, RNA-seq libraries were constructed with the MGIEasy RNA Sample Prep Kit (UW Genetics), and sequencing was performed on the Illumina HiSeq 4000 platform (PE 150 bp). In total, 446,936,037 pairs of raw reads were generated and filtered for high-quality data with Trimmomatic (v0.38; ref. 5), following the procedure described previously6. These high-quality transcriptome reads were used for genome annotation.

Genome size and heterozygosity estimation

The genome size and heterozygosity of R. kirilowii were estimated from the distribution of 17-mers using GenomeScope v2.0. The prediction indicated a genome size of 2.26 Gb, a heterozygosity of 0.39%, and a repeat content of 92.49% (Fig. 1a). Interestingly, the peaks in the 21-mer distribution at depths of about 140, 70, and 35 correspond to homozygous AAAA, heterozygous AABB, and heterozygous ABCD alleles, respectively, suggesting that R. kirilowii is tetraploid. Moreover, fluorescence microscopy was used to count the chromosomes within a nucleus of R. kirilowii. A total of 41 chromosomes (4n = 40 + 1) were observed (Fig. 1b).
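The k-mer reasoning above can be illustrated with a rough sketch: count k-mers, build a depth histogram, and divide the total number of k-mer occurrences by the depth of a chosen peak. This is a simplified calculation under stated assumptions, not the mixture model that GenomeScope fits; the function names are hypothetical, and the code counts forward-strand k-mers only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth to the number of distinct k-mers observed at that depth."""
    return Counter(kmer_counts.values())


def estimate_size(hist, peak_depth, min_depth=1):
    """Total k-mer occurrences divided by the depth of the chosen peak.

    min_depth lets the caller drop low-depth k-mers, which are mostly
    sequencing errors (an assumption of this sketch).
    """
    total = sum(depth * n for depth, n in hist.items() if depth >= min_depth)
    return total / peak_depth


def expected_peaks(single_copy_depth, ploidy):
    """Depths at which k-mers present in 1..ploidy haplotype copies should peak."""
    return [single_copy_depth * copies for copies in range(1, ploidy + 1)]
```

With a single-copy depth of 35 and a ploidy of 4, `expected_peaks` places the four-copy peak at 140 and the two-copy peak at 70, matching the peak positions described above.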

Fig. 1 Genome survey of R. kirilowii. (a) The distribution of k-mer (21-mer) frequency. (b) Fluorescence-microscope image showing the somatic chromosome number of R. kirilowii.

De novo assembly of R. kirilowii genome

The Flye mode of MaSuRCA (v4.0.7; ref. 7), a hybrid approach that combines PacBio and Illumina reads, was used for the initial assembly with default parameters, except that the estimated genome size was set accordingly. POLCA, part of the MaSuRCA package, was then used to polish the assembly. At this step, the draft genome had a total length of 1,926,367,165 bp and comprised 9,015 contigs, with a contig N50 of 474,563 bp and a scaffold N50 of 44,362,222 bp.
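For readers unfamiliar with the N50 statistic reported above, a minimal sketch of its calculation follows. The function name is hypothetical, and the input is simply a list of contig or scaffold lengths in bp.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input
```

Applied to contig lengths it gives the contig N50, and applied to scaffold lengths it gives the scaffold N50.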

The Hi-C reads, after quality control and trimming, were then used to anchor the initially assembled contigs onto chromosomes through sorting, orientation, and ordering, following the 3D-DNA pipeline (v180922; ref. 8). Chromosome boundaries were manually refined over multiple iterations of the 3D-DNA pipeline. This process allowed us to detect and correct apparent haplotype switches, ensuring precise haplotype assignment. Detailed procedures for Hi-C data processing and scaffolding have been described previously6 and at https://github.com/theaidenlab/Genome-Assembly-Cookbook. Juicebox Assembly Tools (v1.11.08; ref. 9) was used to visualize contact frequencies and to manually redefine chromosome boundaries. At this step, a total of 41 chromosomes were obtained (Fig. 2a), and the Hi-C interaction heatmap shows a clear diagonal pattern, indicative of strong intra-chromosomal interactions across all chromosomes. Interestingly, sets of four homoeologous chromosomes could be clearly visualized in the contact map.

To assign homoeologous chromosomes to haplotypes, fastANI (v1.34; ref. 10) was used to estimate average nucleotide identity (ANI) as a measure of genomic distance to the publicly available R. kirilowii chromosomes4. Briefly, each set of four homoeologous chromosomes was defined based on high average nucleotide identity, and haplotypes 1 to 4 were then assigned according to ANI value (Fig. 2b). We note that ChrIX had only one haplotype and was therefore designated as the sex chromosome. Finally, a chromosome-level, haplotype-resolved genome was generated for our tetraploid R. kirilowii, with a genome size of 1.922 Gb, which is over three times larger than the previously assembled diploid R. kirilowii genome4.
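The ANI-based haplotype assignment can be sketched as follows. This is an illustration under assumptions rather than the exact procedure used here: each assembled chromosome is grouped with the reference chromosome to which it has the highest ANI, and haplotype numbers are assigned within each group in descending order of ANI (the text states only that assignment followed the ANI value, so the ordering is an assumption). The function name, the input tuples, and the `hapN` labels are hypothetical.

```python
from collections import defaultdict


def assign_haplotypes(ani_hits):
    """Group query chromosomes by best reference hit and rank them by ANI.

    ani_hits: iterable of (query_chrom, ref_chrom, ani) tuples.
    Returns {query_chrom: (ref_chrom, "hapN")}.
    """
    # Keep only the highest-ANI reference chromosome for each query.
    best = {}
    for query, ref, ani in ani_hits:
        if query not in best or ani > best[query][1]:
            best[query] = (ref, ani)

    # Collect the queries that share a reference chromosome.
    groups = defaultdict(list)
    for query, (ref, ani) in best.items():
        groups[ref].append((ani, query))

    # Assumption: haplotype 1 is the member with the highest ANI.
    labels = {}
    for ref, members in groups.items():
        ranked = sorted(members, reverse=True)
        for rank, (_, query) in enumerate(ranked, start=1):
            labels[query] = (ref, f"hap{rank}")
    return labels
```

Under this scheme, a reference chromosome that attracts only one assembled chromosome ends up with a single haplotype label, which is the pattern reported for ChrIX.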

Fig. 2 Genome assembly of tetraploid R. kirilowii. (a) Genome-wide Hi-C contact matrix, at 1 Mb resolution, of the chromosome-level assembly of the xh-4 genome. Each blue rectangle represents a set of homoeologous chromosomes, whereas each green rectangle represents a single chromosome. (b) Average nucleotide identities between our tetraploid R. kirilowii chromosomes (vertical) and a publicly available haploid genome of R. kirilowii (horizontal).

Genome annotation

Transposable elements (TEs) in our assembled R. kirilowii genome were masked with RepeatMasker (v4.0.6; ref. 11) using both the Repbase library and a de novo repeat library generated by RepeatModeler (v2.0.5; ref. 12). Repeat masking was performed for the whole genome and for each sub-genome. Overall, 63.88% of the R. kirilowii genome was identified as repeats (Table 2). The TE-masked genome was used for gene model prediction.
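A repeat percentage such as the one above can be cross-checked directly from a masked assembly. The sketch below assumes a soft-masked FASTA in which repeats are written in lowercase; that is an assumption about the masking mode, not something stated here, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {sequence_name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID, drop the description
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def masked_fraction(sequences):
    """Fraction of bases written in lowercase (soft-masked repeats)."""
    total = 0
    masked = 0
    for seq in sequences:
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0
```

Passing the sequences of one sub-genome at a time would give the per-sub-genome fractions.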

Gene structure prediction combined three methods: homology prediction, transcriptome prediction, and de novo prediction. The results were integrated using BRAKER3 (v3.0.3; ref. 13) to derive the final gene structure annotation. For homology prediction, comparisons were made with the genomes of Vitis vinifera (GCA_030704545.1), Prunus persica L. (GCA_000346475.2), and Kalanchoe fedtschenkoi (GCA_002312865.1). Transcriptome prediction involved mapping quality-controlled RNA-seq reads (Table 1) to our assembled R. kirilowii genome using HISAT2 (v2.2.1; ref. 14). For de novo prediction, Augustus (v3.5.0; ref. 15) was used to predict gene structure based on hidden Markov models. Finally, a total of 122,035 protein-coding genes were predicted in our assembled R. kirilowii genome, with each sub-genome encoding 25,446 to 28,034 proteins (Table 2).

Table 2 Metrics of the R. kirilowii xh-4 genome assembly.

For gene function annotation, we ran InterProScan (v5.53-87.0; ref. 16) with default parameters to search the Gene Ontology (GO) and Pfam databases. Non-coding RNAs, including tRNA, rRNA, snRNA, and miRNA, were annotated against the Rfam database using the cmscan program in Infernal (v1.1.4; ref. 17). The annotated genome was visualized as a Circos plot (Fig. 3).

Fig. 3 Landscape of the tetraploid R. kirilowii xh-4 genome. (a) Chromosome-level sub-genome features in R. kirilowii. (b) Orthologues between the tetraploid R. kirilowii xh-4 sub-genomes and the previous haploid R. kirilowii genome from Zhang et al. (2023).

Data Records

Our assembled R. kirilowii xh-4 genome and annotation were deposited in the EBI European Nucleotide Archive under accession number GCA_965206585 (ref. 18), and in the Genome Warehouse of the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, under accession number GWHFGNY00000000.1 within Project PRJCA031461, which is publicly accessible at https://ngdc.cncb.ac.cn/gwh (ref. 19). Raw Illumina short-read, PacBio long-read, and Hi-C sequencing data used to generate the genome assembly, and the RNA-seq data used to annotate the xh-4 assembly, are available at NCBI SRA under accession number PRJNA1200924 (ref. 20).

Technical Validation

Three methods were used to validate the quality of the assembled genome. First, to assess the accuracy and completeness of our assembled R. kirilowii xh-4 genome, we ran BUSCO (v5.4.6; ref. 21) with the eudicots_odb10 lineage (2326 single-copy genes) on both the genome and the annotated proteins. For the assembled genome, no conserved single-copy genes were missing from the whole xh-4 genome, and only 4.7% to 10.2% were missing from each sub-genome. Similar results were obtained for the annotated proteins (Table 2). Second, Merqury (v1.3; ref. 22), a k-mer-based assembly evaluator, was run and yielded a recovery rate of 92.56% with a low error rate, indicating high completeness of our assembly. Third, we aligned the high-quality Illumina short reads back to the assembled R. kirilowii xh-4 genome and to each sub-genome using BWA-MEM2 (v2.0pre2; ref. 23). Between 94.93% and 95.75% of reads mapped back to each assembled sub-genome, and 99.15% mapped to all chromosomes (Table 2).
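BUSCO completeness figures like those above are usually read from the one-line summary in BUSCO's short-summary output. The sketch below parses such a line into numeric fields so that missing (M) and duplicated (D) percentages can be compared across sub-genomes. The function name is hypothetical, the expected line layout is an assumption about the BUSCO v5 output format, and no values from this study are reproduced.

```python
import re

# Assumed layout of the BUSCO one-line summary:
# C:<pct>%[S:<pct>%,D:<pct>%],F:<pct>%,M:<pct>%,n:<count>
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the C, S, D, F, M percentages and the BUSCO count n as a dict."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO one-line summary: %r" % line)
    result = {key: float(match.group(key)) for key in "CSDFM"}
    result["n"] = int(match.group("n"))
    return result
```

In a tetraploid assembly evaluated as a whole, a high D value is expected because each conserved gene can be present once per sub-genome, whereas each sub-genome evaluated alone should be dominated by single-copy hits.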