Introduction

Chrysosplenium sinicum Maxim (Chinese name “Zhonghuajinyao”, 2n = 24) is a perennial herb that belongs to the Chrysosplenium genus in the Saxifragaceae family1. The genus Chrysosplenium contains a large number of species (~83 species). These species generally prefer shade and dampness in growth environments and are widespread throughout Asia, America, Europe, and Africa, especially in temperate regions of Asia2,3. In particular, China contains diverse species from the genus Chrysosplenium. Chrysosplenium plants have been extensively used as a traditional Tibetan medicine in China and have been documented in various Tibetan pharmacopoeia monographs. The genus Chrysosplenium is described in “Tibetan Medicine Chronicles” to remove undesired heat and exert detoxification effects; in addition, the genus is used in medical recipes to treat fever, headache, acute icteric hepatitis, and acute liver necrosis. The flavonoids chrysosplenol and chrysosplenoside are the characteristic bioactive components within the genus Chrysosplenium4, which contains a large number of hydroxylated and methoxylated flavonoids. The demand for these traditional medicinal plants has increased in recent years, particularly for the species grown in China. However, despite the high medicinal value of the genus Chrysosplenium, molecular and genomic studies on this genus have rarely been carried out5.

In nature, plants are classified into sun or shade plants based on their ability to adapt to sunlight. In general, sun plants require high levels of light exposure during their life cycle to grow, whereas shade plants often require considerable shade to grow6. In agriculture, environmental light intensity is among the most critical factors that influence crop development. As an abiotic stress, low-light exposure results in compromised photosynthesis, thus reducing the potential crop yields7,8. For instance, persistent cloudy weather or rainfall can reduce rice yields in India9,10. For long-term survival under low-light conditions, the photosynthesis system in shade plants has evolved to function optimally and use sunlight more efficiently. Since the biomass accumulation of crops is highly correlated with the photosynthetic rate under low-light conditions11, it would be beneficial to study the mechanism by which shade plants tolerate low-light environments.

Photosynthesis in plants utilizes captured light energy to produce ATP and NADPH through two major pigment-protein complexes, namely photosystem I (PSI) and photosystem II (PSII). The nuclear-encoded light-harvesting complex (LHC) (also known as the antenna complex), which increases the optical absorption cross-section of PSI and PSII12, captures solar energy and delivers it to the reaction center P700 in PSI and P680 in PSII13. Light-harvesting complex I (LHCI), which is composed of four subunits (Lhca1-4), connects to the PSI core via the PsaF or PsaJ subunits14. The antenna complex of PSII includes the trimeric light-harvesting complex II (LHCII), which consists of Lhcb1, Lhcb2, and Lhcb3 and three monomeric antennae (Lhcb4, Lhcb5 and Lhcb6)15. In addition, the complex contains a set of trimeric LHCIIs that can move between PSI and PSII in a process called state transitions, which allows photosynthetic organisms to adapt to changing light conditions16. In state 1, LHCII is almost exclusively attached to PSII. In state 2, the phosphorylation of LHCII by STN7 dissociates LHCII from PSII and attaches it to PSI17,18. PSI contains more chlorophyll a and stable red-shifted pigments and therefore can effectively utilize enriched far-red light under low-light conditions19. Studies on long-term dark-adapted plants have demonstrated that LHCII in state 2 increases the size of the PSI antenna and transfers energy very efficiently to PSI20, potentially improving the efficiency by which plants utilize light under low-light conditions. Therefore, to achieve better photosynthetic efficiency under low-light conditions, it is worthwhile to investigate whether the LHCII within shade plants tends to occur in state 2.

The intracellular transfer of DNA from the organelle genome to the nuclear genome has been a driving force in the evolution of eukaryotic genomes. Eukaryotic plants have evolved in a highly variable growth environment in which abiotic stresses increase the frequency of DNA transfer from organelles to the nucleus21; therefore, nucleus-localized organelle DNA may be essential for plant adaptive evolution under environmental stress, such as low-light stress. The constant transfer of organelle DNA has contributed to the great diversity of new genetic material for plant adaptive evolution22,23, and it provides the raw material for the generation of new functional genes24,25. In most cases, the nuclear genes that originate from organelle DNA transfer are nonfunctional, and most of these genes are eliminated rapidly from the nuclear genome26,27. However, in some rare cases, the transferred genes can acquire a suitable nuclear promoter for transcriptional activation and thus subsequently become functional. Many proteins encoded by these genes can translocate back into their original organelle to perform specific physiological functions if they can also capture a suitable transit peptide28. The remaining proteins perform other new functions that are not related to their previous functions29. However, the contribution of nucleus-localized organelle DNA to the adaptability of shade plants to low-light environments remains unclear.

Flavonoids, which are a group of secondary metabolites that are abundant in plants, exhibit remarkable medicinal bioactivities, such as antioxidant, anti-inflammatory, and anti-infection effects. As a traditional Tibetan medicine, C. sinicum contains a high abundance of flavonoids. Chrysosplenol and chrysosplenoside are the characteristic bioactive components within C. sinicum, and belong to a special class of flavonoids. However, limited research has been conducted on the biosynthesis of flavonoids in Chrysosplenium plants, impeding the further use of this genus. On the other hand, flavonoids are important metabolites for overcoming physiological stress when plants are challenged by various biological and non-biological environmental factors. Therefore, the high accumulation of chrysosplenol and chrysosplenoside in C. sinicum may be attributed to its special characteristics in low-light environments.

In this study, we report a chromosome-level genome assembly of C. sinicum, a traditional Tibetan medicine plant that predominantly thrives in low-light and humid forests as well as alpine shrubs at elevations ranging from 500 to 3550 m. The whole-genome sequence of C. sinicum, including its nuclear, chloroplast, and mitochondrial sequences, was used to extend our knowledge of the basis of shade-plant adaptation to low-light environments; these results also revealed potential gene-editing sites for enhancing crop tolerance to low-light stress. Genes related to the synthesis of chrysosplenol and chrysosplenoside were identified by analyzing metabolomic and transcriptomic data generated from three different tissues (roots, stems, and leaves). Additionally, alternative splicing (AS) analysis was performed to investigate the regulatory role of AS events in tissue-specific gene expression. In conclusion, the present study investigated the adaptive evolution of shade plants in low-light environments based on the C. sinicum genome and investigated the metabolite composition, flavonoid biosynthesis, and tissue-specific expression regulation within the Chrysosplenium plant.

Results

Genome assembly and annotation

C. sinicum (2n = 24)30, a traditional Tibetan medicine plant, always grows in cold, moist shade beneath trees (Fig. 1a; Supplementary Fig. 1). Several technologies were integrated to sequence and assemble its genome (Supplementary Table 1). K-mer analysis (k = 17) revealed that the genome size was ~333.50 Mb, with a repeat sequence content of 47.09% and an estimated heterozygosity of 0.4% (Supplementary Fig. 2; Supplementary Tables 2 and 3). Flow cytometry also supported the genome size estimated from the K-mer analysis (Supplementary Fig. 3). Using the PacBio Sequel platform, we produced a total of 39.80 Gb (119× in depth) of high-quality long PacBio reads with an N50 length of 16,478 bp (Supplementary Table 4). In addition, 105.26 Gb (316× in depth) of short paired-end reads were obtained (Supplementary Table 5). A total of 145.06 Gb (435× in depth) of C. sinicum DNA sequence was de novo assembled into 633 contigs, and the contig N50 was 2.13 Mb (Supplementary Table 6). The Hi-C data were then used to construct pseudo-chromosomes, and a total of 41.2 Gb (124× in depth) of Hi-C clean data were produced (Fig. 1b; Supplementary Table 7). The final assembly of C. sinicum was 321.79 Mb, with a scaffold N50 of 25.0 Mb (Table 1; Supplementary Table 8).
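The depth and N50 figures above follow from simple arithmetic, which the minimal sketch below illustrates. It assumes that depth is total sequenced bases divided by the k-mer genome size estimate and that 1 Gb equals 1000 Mb; the function names and the toy contig lengths are illustrative assumptions and are not taken from the assembly pipeline.

```python
def sequencing_depth(total_bases_gb, genome_size_mb):
    """Fold coverage: total sequenced bases divided by genome size.

    Assumes 1 Gb = 1000 Mb.
    """
    return total_bases_gb * 1000.0 / genome_size_mb


def n50(lengths):
    """Length L such that sequences of length >= L together cover
    at least half of the summed length of all sequences."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must be non-empty")
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


GENOME_SIZE_MB = 333.50  # k-mer estimate reported in the text

# Data volumes reported in the text, in Gb
for label, gb in [("PacBio", 39.80), ("short reads", 105.26), ("Hi-C", 41.2)]:
    print(f"{label}: {sequencing_depth(gb, GENOME_SIZE_MB):.0f}x")

# Toy contig lengths (hypothetical, not from the C. sinicum assembly)
toy_contigs = [50, 40, 30, 20, 10]
print("toy N50:", n50(toy_contigs))
```

With the reported data volumes, the rounded depths reproduce the 119×, 316×, and 124× values given in the text.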

Fig. 1: Overview of C. sinicum genome assembly.

a Photographs of C. sinicum growing in low-light environments. b Genome-wide Hi-C map of C. sinicum. Post-clustering heatmap showing the density of Hi-C interactions between contigs. c Circos plot of the C. sinicum genome assembly. The density of genes in the track (A), GC content (B), simple sequence repeats (C), and long terminal repeat retrotransposons (D) were calculated using a 250 kb non-overlapping window. The innermost layer shows syntenic blocks. The bandwidth is proportional to the syntenic block size.

Table 1: Statistics of Chrysosplenium sinicum genome assembly and annotation

The functional completeness of the assembly was evaluated by performing Benchmarking Universal Single-Copy Orthologues (BUSCO) analysis using the compleasm tool31. In total, 93.42% (2173 of 2326) of the core eudicot genes were present in the C. sinicum genome, including 2117 (91.01%) completely aligned genes (Supplementary Table 9). The consensus quality value (QV) and genome completeness were evaluated using the Merqury tool32, which yielded a QV of 43.1 and genome completeness of 95.7%. Additionally, the alignment results showed that 98.25% of the short reads could be aligned to the genome (Supplementary Table 10). The results suggested that the C. sinicum assembly is a high-quality reference genome. Repetitive elements accounted for 48.22% (155.16 Mb) of the C. sinicum genome, and long terminal repeats (LTRs) made up 36.38% of these repetitive sequences (56.45 Mb) (Supplementary Fig. 4; Supplementary Table 11). An inverse correlation between transposable elements (TEs) and gene density was observed on the chromosomes (Fig. 1c).
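To put the consensus QV in context, the sketch below converts a Phred-scaled quality value into a per-base error probability using the standard relation QV = −10 log10(p). The function names are illustrative; this is only the conversion arithmetic and not a reimplementation of Merqury.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10.0)


def error_rate_to_qv(p):
    """Phred-scaled quality value for a per-base error probability."""
    return -10.0 * math.log10(p)


rate = qv_to_error_rate(43.1)  # QV reported in the text
print(f"error rate per base: {rate:.2e}")
print(f"about one error every {1 / rate:,.0f} bases")

# BUSCO fractions reported in the text, recomputed from the counts
print(f"present:  {2173 / 2326:.2%}")
print(f"complete: {2117 / 2326:.2%}")
```

Under this relation, a QV of 43.1 corresponds to roughly 5 × 10^-5 errors per base, that is, about one consensus error per 20 kb.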

Three strategies, ab initio prediction, homologous prediction, and transcription-based prediction, were combined to construct gene models for the C. sinicum genome. A total of 27,441 protein-coding genes were predicted, with an average gene length of 3931 bp and an average coding sequence length of 1188 bp (Supplementary Table 12). Overall, 27,064 (98.63%) of the predicted genes had functional descriptions, according to Gene Ontology (GO) terms (17,124 genes; ~62.40%), KEGG (11,195 genes; ~40.80%)33, and the NR database (24,425 genes; ~89.01%) (Supplementary Table 13). GO term analysis revealed that 35.56%, 14.25%, and 50.19% of the predicted genes were involved in the biological process, cellular component, and molecular function categories, respectively (Supplementary Fig. 5). In addition, 1422 transcription factor (TF) encoding genes were identified, covering 70 TF gene families. Among them, the AP2/ERF, bHLH, MYB, and NAC families included the most genes, with 120 (8.44%), 105 (7.38%), 97 (6.82%), and 83 (5.84%) genes, respectively (Supplementary Fig. 6; Supplementary Data 1). Notably, the number of genes in the FAR-RED-IMPAIRED RESPONSE1 (FAR1; 60, 4.22%) family in the C. sinicum genome was significantly greater than that in closely related plants (Supplementary Fig. 6).

Genome evolutionary analysis

Evolutionary analyses of plastid DNA sequences and the Kalanchoe fedtschenkoi nuclear genome placed Saxifragales as a sister group to rosids, and together, they constitute a large clade of superrosids1,34,35. However, there are inconsistencies in the nuclear gene tree based on a single Saxifragales species. To further validate the phylogenetic placement of Saxifragales and elucidate its divergence time from rosids, a phylogenomic tree of four Saxifragales plants, eight rosid plants, and eight other monocot and dicot plants was constructed using single-copy genes (Fig. 2a and Supplementary Fig. 7; Supplementary Data 2). Consistent with previous studies, the phylogenomic tree revealed that Saxifragales is a sister to the rosids (Fig. 2a). The divergence of Saxifragaceae (exemplified by C. sinicum) from Crassulaceae (exemplified by K. fedtschenkoi) occurred approximately 8–105 million years ago (MYA), while the separation between Saxifragales and rosids (exemplified by Rosales) took place approximately 104-123 MYA (Fig. 2a).

Fig. 2: Evolutionary analysis of the C. sinicum genome.

a Inferred phylogenetic tree reconstructed from single-copy orthologous genes shared by 20 species. Divergence timings were estimated using MCMCTree and are indicated by blue bars at the internodes with the 95% highest posterior density. Ea. Early, M. Middle, La. Late, Ju. Jurassic, Pa. Paleogene, P. Paleocene, Eo. Eocene, O. Oligocene, Ne. Neogene, Mi. Miocene. Green block: Superasterids; orange block: Saxifragales; blue block: Rosids. b Dot plot illustrating the comparative analysis of C. sinicum-grape plants; the black dots represent the syntenic gene blocks. The red boxes highlight several major duplication events between C. sinicum and grape. c Distribution of the synonymous substitution rates (Ks) for pairs of syntenic paralogues in C. sinicum and three other plant species. d Karyotype evolution of C. sinicum and related species.

Ancient whole-genome duplication (WGD) events are important driving forces of the evolution of plant lineages36,37. We used grape (Vitis vinifera) as a reference for studying genome duplication events in C. sinicum, because the grape genome did not undergo additional duplication after the ancestral gamma hexaploidization38,39. The analysis of collinearity and synteny depth indicated that grape genomic regions had one to two syntenic blocks in C. sinicum (Fig. 2b; Supplementary Fig. 8). Moreover, the distribution of synonymous substitution rates (Ks) in both reciprocal best hit (RBH) paralogous gene pairs and syntenic block gene pairs showed a single peak (Fig. 2c; Supplementary Fig. 9). These results provide strong evidence that an independent WGD event occurred in C. sinicum after the shared gamma whole-genome triplication (WGT) event.

Intragenomic collinearity analysis of C. sinicum revealed that the predominant types of gene duplication were dispersed duplication (40.29%, 9627) and WGD/segmental duplication (30.56%, 7303), in contrast to tandem duplication (5.78%, 1380) and proximal duplication (5.42%, 1294) (Supplementary Table 14; Supplementary Data 3). Interestingly, the significantly enriched GO terms of genes originating from WGD or segmental duplication were closely associated with plant photosynthesis (Supplementary Fig. 10; Supplementary Data 4). Orthologous genes exhibiting signs of positive selection usually undergo adaptive divergence40. To explore the biological function of positively selected genes (PSGs) in the C. sinicum genome, we conducted a positive selection analysis on single-copy genes from ten Superrosids plants (Supplementary Fig. 11). Our findings revealed 132 PSGs in the C. sinicum genome (Supplementary Data 5). Moreover, functional classification analysis indicated that these PSGs are actively involved in crucial biological processes such as DNA replication, DNA recombination, and protein phosphatases (Supplementary Fig. 12; Supplementary Data 6).

The association between chromosomal reorganization and karyotype evolution has been substantiated in previous studies41. The genome of Superrosids plants evolved from a pre-γ ancestral eudicot karyotype (AEK) with 7 protochromosomes to a post-γ AEK karyotype with 21 protochromosomes42. To investigate chromosomal reorganization in the C. sinicum genome, we mapped the AEK genes onto the chromosomes of seven Superrosids species (C. sinicum, V. vinifera, E. grandis, M. truncatula, F. vesca, P. trichocarpa, and A. thaliana) (Fig. 2d). The V. vinifera genome has the closest resemblance to the ancestral chromosomal composition, while other Superrosids species underwent varying degrees of chromosome fusion. The C. sinicum genome experienced complex chromosome fusions, with each chromosome originating from a minimum of four ancestral chromosomes; an exception was chromosome 4, which originated from only two ancestral chromosomes (AEK1 and AEK5) (Fig. 2d; Supplementary Table 15).

Comparative genomics of gene families

C. sinicum is a shade plant that can adapt well to low-light conditions. To elucidate the genomic basis underlying the distinctive phenotypic traits of C. sinicum, we characterized its gene families as unique or shared by comparing them with those of three other Saxifragales plants, namely, Rhodiola crenulata, Kalanchoe laxiflora, and Kalanchoe fedtschenkoi. The comparative analysis revealed a total of 9908 gene families distributed across all four genomes, with an additional 807 gene families exclusively found in C. sinicum (Fig. 3a). GO enrichment analysis of the 807 unique gene families, comprising 1649 genes, revealed eleven significant GO terms. Notably, the following terms were directly associated with photosynthesis: ‘photosynthesis, light reaction’, ‘photosynthetic electron transport chain’, and ‘photosynthesis’ (Fig. 3b; Supplementary Data 7). This suggests that the evolution of photosynthesis-related gene families may play a pivotal role in facilitating C. sinicum’s adaptability to low-light conditions.

Fig. 3: Comparative genomics of gene families in C. sinicum.

a Venn diagram of shared orthologous gene families among the genomes of four Saxifragales species. b GO enrichment results of 807 C. sinicum-specific gene families. c Statistics of expanded and contracted gene families in C. sinicum. d Venn diagram of C. sinicum-specific gene families and expanded gene families. e GO enrichment results of 609 rapidly evolving (expanded) gene families in C. sinicum. The relevant terms associated with photosynthesis are denoted in green, while those pertaining to respiration are indicated in blue. f Functional annotation of the expanded gene families in C. sinicum. The left panel shows the gene family size among 11 species and the right panel shows the functional annotation and classification of gene families. Cs C. sinicum, Rc Rhodiola crenulata, Kf Kalanchoe fedtschenkoi, Vv Vitis vinifera, Eg Eucalyptus grandis, Mt Medicago truncatula, Fv Fragaria vesca, Pt Populus trichocarpa, Csi Citrus sinensis, At Arabidopsis thaliana, Tc Theobroma cacao. RE rapidly expanded. NE normal expanded. U C. sinicum specific.

The analysis of gene family expansion and contraction in C. sinicum revealed that 1580 and 6204 gene families were classified as expanded and contracted, respectively. Among these gene families, 1021 were recognized as rapidly evolving (609 expanded and 412 contracted) (Fig. 3c), while a subset of 104 gene families exhibited rapid expansion and were unique families (Fig. 3d). The rapidly evolving gene families may provide potential insight into the adaptability of C. sinicum to low-light environments. Functional enrichment analysis of these 609 rapidly expanded gene families revealed that their functions were closely associated with the process of plant photosynthesis and respiration (Fig. 3e and Supplementary Fig. 13; Supplementary Data 8). We also performed functional enrichment analysis on two sets of families: 104 rapidly expanding and unique families, and 505 rapidly expanding but non-unique families. The results revealed that terms related to photosynthesis were enriched in both sets, while terms associated with protein phosphorylation were exclusively enriched in the unique families. Conversely, respiratory-related terms were solely enriched in the non-unique families (Supplementary Fig. 14; Supplementary Data 8). The evolution of the electron transport chain (such as electron transfer flavoprotein) used for cellular respiration in mitochondria allows plants to adapt to sucrose starvation induced by darkness43. Multiple GO terms associated with cellular respiration, such as ‘aerobic respiration’, ‘respiratory electron transport chain’, and ‘ATP synthesis coupled electron transport’, were significantly enriched in rapidly expanding gene families; thus, the rapidly evolving gene families involved in cellular respiration may also play a crucial role in enhancing C. sinicum’s adaptability to low-light conditions (Fig. 3e; Supplementary Data 8).
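The enrichment statements above rest on an over-representation test. The text does not specify which test was used, so the following is a minimal sketch under the assumption of a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test), with no multiple-testing correction. The function name and the toy counts are hypothetical and are not taken from the Supplementary Data.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided enrichment p-value P(X >= k).

    N: annotated genes in the background
    K: background genes carrying the GO term
    n: genes in the test set (e.g. a set of expanded families)
    k: test-set genes carrying the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the hypergeometric probability mass from k up to the maximum
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy counts (hypothetical): 20 background genes, 5 with the term,
# a test set of 4 genes of which 2 carry the term.
p = hypergeom_enrichment_p(k=2, n=4, K=5, N=20)
print(f"p = {p:.4f}")
```

In practice, one such p-value would be computed per GO term and then adjusted for multiple testing before terms are called significant.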

To further investigate the evolution of gene families related to photosynthesis and cellular respiration in C. sinicum, a functional classification was performed on expanded gene families. The initial steps of photosynthesis, also known as light reactions, involve two large membrane protein complexes referred to as photosystem I and photosystem II. Among the expanded gene families, two gene families comprising 14 genes were implicated in the function of photosystem I, while ten gene families consisting of 56 genes were related to photosystem II (Fig. 3f). Notably, the gene family known as the LHCII component protein Lhcb1, which was previously found at significantly greater levels in shade-tolerant plants than in light-resistant plants44, was expanded in C. sinicum. Three of the expanded gene families were involved in the electron transport chain of respiration, and among them, the electron transfer flavoprotein: ubiquinone oxidoreductase (ETFQO) family is crucial for the ability of plants to withstand extended darkness43,45 (Fig. 3f). Moreover, functional classification analysis revealed that other gene families associated with photosynthesis and respiration, such as ‘phytochrome’, ‘photosynthetic electron transport chain’, and ‘mitochondrial respiratory chain complex I’, were expanded (Fig. 3f).

Intracellular DNA transfer in Chrysosplenium sinicum

Among the genes associated with photosynthesis, we observed a significantly lower number of repetitive elements in the proximal region (upstream and downstream 50 kb) of rapidly expanded genes than in other genes (Fig. 4a). To determine the distribution of genes with fewer repetitive elements in the proximal region, we counted the number of repetitive elements near each gene at the whole-genome scale and arranged them based on their chromosomal order (Fig. 4b). The analysis revealed the presence of multiple large gene clusters, which exhibited a significantly lower number of repetitive elements and were distributed across multiple chromosomes (Fig. 4b; Supplementary Fig. 15). Among these gene clusters, we identified five gene clusters associated with photosynthesis and cellular respiration through functional annotation (Fig. 4b; Supplementary Data 9). The five gene clusters included 47.62% rapidly expanded photosynthesis-related genes and 20% rapidly expanded cell-respiration-related genes. Cluster 1 comprised 19 genes related to photosynthesis and was located on Chr03. Clusters 2, 3, and 4 were located on Chr04. Specifically, Cluster 2 consisted of 24 genes related to cellular respiration, while Clusters 3 and 4 comprised a total of 44 and 24 genes associated with photosynthesis, respectively. Cluster 5, located on Chr11, exhibited the greatest number of photosynthesis-related genes (49 genes) (Fig. 4b; Supplementary Data 9). The intracellular transfer of DNA from the organelle genome to the nuclear genome is widespread and ongoing in eukaryotes46, constituting a driving force in the evolution of eukaryotic genomes. Given the relatively lower number of repetitive sequences in organelle genomes, these gene clusters may have originated from the transfer of organelle DNA.
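The per-gene repeat count described above can be sketched as follows. This is an illustrative simplification: it assumes that each repeat is represented by its start coordinate on a single chromosome and that only the 50 kb immediately upstream and downstream of the gene body are counted. The function name, gene identifiers, and coordinates are hypothetical.

```python
from bisect import bisect_left, bisect_right

FLANK = 50_000  # 50 kb on each side, as stated in the text


def count_flanking_repeats(genes, repeat_starts, flank=FLANK):
    """Count repeats in the flanks of each gene on one chromosome.

    genes: iterable of (gene_id, start, end), with start <= end
    repeat_starts: sorted list of repeat start coordinates
    Upstream window is [start - flank, start); downstream window is
    (end, end + flank]. Repeats inside the gene body are not counted.
    """
    counts = {}
    for gene_id, start, end in genes:
        upstream = (bisect_left(repeat_starts, start)
                    - bisect_left(repeat_starts, start - flank))
        downstream = (bisect_right(repeat_starts, end + flank)
                      - bisect_right(repeat_starts, end))
        counts[gene_id] = upstream + downstream
    return counts


# Toy data (hypothetical coordinates, not from the assembly)
toy_genes = [("geneA", 100_000, 102_000), ("geneB", 300_000, 305_000)]
toy_repeats = sorted([40_000, 60_000, 99_000, 101_000,
                      150_000, 152_001, 330_000])
print(count_flanking_repeats(toy_genes, toy_repeats))
```

Ordering the resulting counts by gene position along each chromosome gives the kind of profile in which runs of genes with unusually few flanking repeats stand out as candidate clusters.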

Fig. 4: Intracellular DNA transfer in C. sinicum.

a Statistics of the number of repetitive elements in the proximal region of rapidly expanded genes and not rapidly expanded genes. Student’s t-test (two-sided, P = 1.41E-14): *P < 0.05; **P < 0.01; ***P < 0.001. b Overview of the abundance of repetitive elements in the proximal region (upstream and downstream 50 kb) of genes on Chr03, Chr04, and Chr11. The abscissa shows the arrangement of genes along a chromosome. c Gene map of the C. sinicum chloroplast genome. Genes are color-coded to indicate functional groups. d Distribution of NUPT and NUMT sequences on the 12 chromosomes of C. sinicum. e Collinearity between the chloroplast fragments and Cluster 5. The asterisk-marked genes represent novel recombination genes. The ordinate of coverage of long reads represents log10(number of long-reads coverage). f Schematic structure and homology between accD in chloroplasts and the genomic region on Chr03 containing Cs3G0076700 and Cs3G0076800. g Schematic structure and homology between Cs4G0091400 on Chr04 and the chloroplast region containing psaC and ndhD. The gray blocks represent regions that are homologous.

To verify the origin of these gene clusters, we assembled the complete genomes of the chloroplasts and mitochondria of C. sinicum. The length of the chloroplast genome was 153,427 bp (Fig. 4c). The mitogenome was assembled as three circular molecules (chromosome 1: 233,531 bp, chromosome 2: 116,502 bp, chromosome 3: 45,818 bp) (Supplementary Fig. 16). The distribution of nucleus-localized plastid or mitochondrial DNAs (NUPTs or NUMTs) suggested that Clusters 1, 3, 4, and 5 were composed of NUPT fragments derived from the chloroplast genome, while Cluster 2 consisted of NUMT fragments originating from mitochondrial genome transfers (Fig. 4d). These findings indicate that intracellular DNA transfer served as a driving force behind the rapid expansion of gene families associated with photosynthesis and cell respiration in C. sinicum, potentially facilitating its adaptive evolution to low-light environments.

Homology analysis between chloroplast genes and nuclear genes revealed that genes in four photosynthesis-related clusters originated from two distinct fragments of the chloroplast genome, namely fragment 1 and fragment 2 (Fig. 4c). In the nuclear genome of plants, NUPTs and NUMTs frequently exhibit nonrandom arrangements, forming loose or tight clusters based on their physical linkage degrees47. Collinearity analysis revealed significant synteny between Clusters 1 and 5 and the chloroplast genome, as the gene arrangement in these two clusters closely resembled that observed in chloroplast genes (Fig. 4e; Supplementary Fig. 17). Moreover, collinearity analysis of Clusters 3 and 4 with the chloroplast genome was conducted; however, no significant synteny was observed between them (Supplementary Fig. 17). Interestingly, Clusters 3 and 4 exhibited significant synteny with Cluster 5, and most genes within Clusters 3 and 4 were identified as WGD or segmental duplication types; thus, WGD or segmental duplication potentially contributed to the generation of Clusters 3 and 4 (Supplementary Fig. 18; Supplementary Data 3). These results indicated that the rapidly expanded gene families associated with photosynthesis in C. sinicum may be attributed to the combined effects of intracellular DNA transfer and segmental duplication or WGD.

During the early phase of eukaryote evolution, the nuclear transfer of organellar DNA resulted in the massive relocation of entire genes to nuclear chromosomes24. Researchers believe that the recent transfer of organelle DNA to the nucleus mainly yielded nonfunctional nuclear sequences. However, new studies have suggested that the exons of novel genes in the nuclear genome can be generated by the insertion of organellar DNA48. In C. sinicum, the recruitment of preexisting organellar protein domains to novel nucleus-encoded proteins is a prevalent phenomenon. The novel recombination genes, which integrate one or more organelle gene domains, accounted for 24.06% (32) of the genes in the four photosynthesis-related clusters (Supplementary Fig. 19). For example, Cs4G0094400 integrates the domains of four organelle genes, namely rpl14, rpl16, rps3, and rpl22, as exons, while the recombination of the organelle gene accD results in the generation of two novel genes, Cs3G0076700 and Cs3G0076800 (Fig. 4f; Supplementary Fig. 20). To confirm that the novel genes found in the nuclear genome were inserted fragments from organelle DNA rather than assembly errors, we examined the coverage of long reads for these clusters. The results revealed that these clusters had high long-read coverage and that the novel genes were completely spanned by long reads (Fig. 4e; Supplementary Fig. 17). Furthermore, we amplified the novel genes from nuclear DNA using PCR, and the sequencing data demonstrated that these genes were identical to the genome assembly results (Supplementary Fig. 17; Supplementary Data 10).

Approximately 26.11% of the genes in the five clusters were expressed in different tissues of C. sinicum (Supplementary Data 11). Notably, a chloroplast transfer gene (Cs11G0121300) called the large subunit of RUBISCO (RBCL) exhibited a high level of expression (Supplementary Fig. 21; Supplementary Data 11). Cluster 2, which originated from the mitochondrial genome, exhibited the greatest proportion of expressed genes, as 14 out of 24 (58.33%) genes were expressed (Supplementary Fig. 21; Supplementary Data 11). Furthermore, most of the novel recombination genes exhibited negligible expression levels, except for Cs4G0091400, which was produced from an inversion of a fragment between the organelle genes psaC and ndhD, and displayed a relatively high expression level (Fig. 4g and Supplementary Fig. 21; Supplementary Data 11). Based on the extensive expression of these genes derived from organellar DNA transfer, the genes may provide the foundation for the adaptation of C. sinicum to low-light environments.

Adaptive evolution of the CsLhcb1s subfamily in Chrysosplenium sinicum

The LHC super-gene family encodes light-harvesting chlorophyll a/b-binding proteins, which constitute the antenna systems of the photosynthetic apparatus. These antenna systems efficiently capture solar energy and transfer it to PSI and PSII, thereby providing electrons for subsequent dark reactions (Fig. 5a)49. The structure of the antenna complexes in different types of photosynthetic organisms is remarkably diverse, suggesting that early organisms adapted to specific light environments through the evolution of antenna complexes50. A total of 25 genes of the LHC family were identified in C. sinicum, including five Lhca genes and 20 Lhcb genes (Supplementary Data 12). A comparative genomic analysis between Saxifragales and Rosids revealed that the Lhcb1 subfamily expanded in three Saxifragales species, namely C. sinicum, R. crenulata, and K. laxiflora (Fig. 5b). Moreover, through gene structure analysis, we observed a significant increase in intron sequences within the CsLhcb1s of C. sinicum compared to those of closely related species (Supplementary Fig. 22). The gains and losses of introns are key factors in the evolution of genomes, and major evolutionary transitions may be accompanied by massive gains and losses of introns51,52. These results indicate that the Lhcb1 subfamily of C. sinicum possibly underwent adaptive evolution in response to low-light conditions.

Fig. 5: Evolution analysis of the LHC superfamily in C. sinicum.

a Photosynthetic light response pathway map in C. sinicum. b Statistics of gene numbers within each subfamily of the LHC superfamily were analyzed for four Saxifragales species and eight Rosids species. c Multiple sequence alignment of amino acid sequences of Lhcb1 genes from C. sinicum and three other Saxifragales species. The amino acid highlighted in red serves as the recognition site of STN7 kinase. d Expression and phylogenetic profiles of genes in the LHC superfamily in roots, stems, and leaves from C. sinicum. e Prediction of the phosphorylation potential of threonine residues surrounding the STN7 kinase recognition site in CsLhcb1s. PKC protein kinase C, unsp non-specific protein kinase.

Positive selection analysis using the program Codeml of PAML53 enables us to better understand the evolutionary process. A total of seven positively selected sites were detected, indicating that the CsLhcb1s of C. sinicum were subjected to positive selective pressure (Supplementary Fig. 23; Supplementary Data 13). The positive selection of specific amino acid residues in the CsLhcb1s may influence the adaptability of C. sinicum to low-light environments.

Photosynthetic organisms adapt to a variety of light conditions by redistributing light excitation energy between PSI and PSII through state transitions54. The reversible association of LHCII between PSI and PSII facilitates the adaptation of plants to diverse light conditions. The relocation of LHCII from PSII to PSI (state transition from state 1 to state 2) is initiated by the phosphorylation of specific threonine residues at the N-terminus of the Lhcb1 and Lhcb2 proteins by STN7 kinase18,55. The CsLhcb2 protein was strictly conserved at the phosphorylation site of STN7 kinase (Supplementary Fig. 24). Among the five CsLhcb1s proteins, a noticeable increase in the number of threonine residues adjacent to the phosphorylation site of STN7 kinase was observed compared to that in closely related species (Fig. 5c and Supplementary Fig. 25). Among them, CsLhcb1.5, CsLhcb1.8, CsLhcb1.9, and CsLhcb1.12 contain two threonine residues adjacent to the phosphorylation site of STN7 kinase, while CsLhcb1.10 contains three threonine residues near the phosphorylation site of STN7 kinase (Fig. 5c and Supplementary Fig. 25). Furthermore, a positively selected site is located near the phosphorylation site of STN7 kinase, implying that natural selection influences the expansion of threonine residues at the phosphorylation site of STN7 kinase in C. sinicum’s CsLhcb1s proteins (Fig. 5c, Supplementary Figs. 23 and 25; Supplementary Data 13).

Among the LHC family members of C. sinicum, CsLhca1, CsLhca4, CsLhca2.1, CsLhcb1.13, CsLhcb2, and CsLhcb5 exhibited high expression levels in leaf tissue (Fig. 5d). The CsLhcb1s, which contain multiple threonine residues at the phosphorylation site of STN7 kinase, are expressed in leaf tissue (Fig. 5d; Supplementary Data 12). We then predicted the phosphorylation potential of the threonine residues adjacent to the phosphorylation site of STN7 kinase in the Lhcb1 family using the NetPhos 3.1 server56. This tool identified the threonine residues as potential phosphorylation sites that may be susceptible to kinase-mediated phosphorylation (Fig. 5e). These results suggest that the adaptive evolution of CsLhcb1s potentially plays a crucial role in enhancing the ability of C. sinicum to adapt to low-light environments.

In addition to the LHC family of C. sinicum, we further identified the gene families involved in photosynthesis and analyzed their duplication and expansion. With respect to photosystem I, tandem and WGD/segmental duplications increased the gene numbers of PsaA (photosystem I P700 chlorophyll a apoprotein A1) and PsaB (photosystem I P700 chlorophyll a apoprotein A2). Within photosystem II, the genes of multiple protein subunits, including cp43, cp47, PsbE, PsbF, PsbO, and PsbQ, were expanded through WGD/segmental duplication. Furthermore, WGD/segmental duplication also expanded the gene families of cytochrome f (PetA), cytochrome b6 (PetB), and subunit IV (PetD) within the cytochrome b6f complex (Supplementary Data 14).

Synthetic pathway of chrysosplenol and chrysosplenoside

Flavonoids, which are widely distributed in nature and exhibit diverse pharmacological activities, are important active constituents for clinical applications of species in the genus Chrysosplenium5. A total of 594 metabolites were identified in the three tissues (roots, stems, and leaves) of C. sinicum, including 128 (21.6%) flavonoids, 108 (18.2%) lipids, 104 (17.5%) phenolic acids, 63 (10.6%) organic acids, and 59 (9.9%) amino acids and derivatives (Fig. 6a). Chrysosplenol and chrysosplenoside represent the predominant flavonoid metabolites in C. sinicum, constituting 54.07% of the total flavonoid content (Supplementary Data 15 and 16). Distinct differences were observed among samples from different tissues in the PCA score plots and metabolite heatmaps of the samples, indicating that metabolite accumulation was tissue-specific (Supplementary Fig. 26). Through differentially abundant metabolite analysis, we observed that chrysosplenol and chrysosplenoside levels were significantly increased in the leaf tissue (Fig. 6b; Supplementary Fig. 27).

Fig. 6: Flavonoid content and biosynthesis analysis in C. sinicum.

a Pie graph of metabolite classification in C. sinicum. b Top ten metabolites of flavonoids in C. sinicum. Heatmap showing the differential content of the top ten flavonoids in three tissues (root, stem, and leaf). c Presumptive biosynthesis pathway of chrysosplenol and chrysosplenoside in C. sinicum. Heatmap showing the expression levels of genes involved in the biosynthesis of chrysosplenol and chrysosplenoside in three tissues (roots, stems, and leaves). Dashed arrows represent the presumed biosynthesis pathway. PAL phenylalanine ammonia-lyase, C4H trans-cinnamate 4-monooxygenase, 4CL 4-coumarate-CoA ligase, CHI chalcone isomerase, F3H naringenin 3-dioxygenase, FLS flavonol synthase, F3′H flavonoid 3′-monooxygenase, F3′5′H flavonoid 3′,5′-hydroxylase. d Correlation analysis between genes and metabolites. Pale green circles denote the metabolites chrysosplenol and chrysosplenoside. The orange circles denote synthetases.

Combined with the chemical structure and the flavonoid biosynthesis pathway in KEGG, we predicted the synthetic pathway of chrysosplenol and chrysosplenoside in C. sinicum (Fig. 6c). In the presumptive synthetic pathway, chrysosplenol B, chrysosplenol C, and chrysosplenol D are formed through hydroxylation and methylation of kaempferol, while chrysosplenol E, chrysosplenol F, and chrysosplenol G are generated via hydroxylation and methylation of quercetin. Finally, O-glycosyltransferases (UGTs) serve as the glycosylation enzymes responsible for synthesizing chrysosplenoside from chrysosplenol (Fig. 6c). Chalcone synthase (CHS) is the first committed enzyme in the conserved flavonoid biosynthesis pathway among plants57. A total of nine CsCHS genes were identified in C. sinicum, and the CsCHS gene family expanded. A heatmap revealed that three CsCHSs (CsCHS.3, CsCHS.4, and CsCHS.7) were differentially expressed in leaf tissue, which was consistent with the observed accumulation pattern of chrysosplenol and chrysosplenoside (Fig. 6c; Supplementary Data 17). Additionally, a total of 57 CsUGTs were identified in C. sinicum, among which 14 CsUGTs exhibited differential upregulation in leaf tissue (Fig. 6c; Supplementary Data 17).

A correlation analysis was then performed using Pearson’s correlation coefficient (>0.95) to further explore the genes involved in chrysosplenol and chrysosplenoside synthesis and examine the correlation between gene expression and metabolite content. The results revealed significant correlations between the expression of CsCHS.3, Cs4CL.5, one flavonoid o-methyltransferase (AOMT), two isoflavone 2′-hydroxylases (CYP81Es), and seven CsUGTs and the content of chrysosplenol and chrysosplenoside, suggesting that these genes play roles in the biosynthesis of chrysosplenol and chrysosplenoside (Fig. 6d, Supplementary Data 18). The AOMT enzyme methylates hydroxyl groups at the 3′ and 5′ carbon positions in the B ring of flavonoids, while the CYP81E-encoded enzyme catalyzes the hydroxylation of isoflavones to produce 2′-hydroxyisoflavones (Supplementary Fig. 28). These two enzymes potentially participate in chrysosplenol synthesis through hydroxylation and methylation processes. The expression of flavonoid biosynthetic genes is regulated by the transcription factors MYB and bHLH58,59. Significant associations were detected between the contents of chrysosplenol and chrysosplenoside and multiple genes encoding bHLH and MYB transcription factors (Fig. 6d; Supplementary Data 18).

Gene regulation among different tissues in Chrysosplenium sinicum

Alternative splicing (AS) enriches the diversity of proteins and phenotypic traits by generating multiple transcripts from a single gene. To determine the function of AS in gene regulation in C. sinicum, we analyzed the different splice isoforms in our dataset and identified a total of 14,173 AS events, which were categorized into the following types across three tissues (root, stem, and leaf): retained intron (RI), skipped exon (SE), alternative 5ʹ/3ʹ splice sites (A5SS/A3SS), mutually exclusive exons (MX), and alternative first or last exons (AF/AL). RI events were the most frequent (36.94%), followed by A3SS, A5SS, SE, AF, AL, and MX (Fig. 7a).

Fig. 7: Gene regulation among different tissues in C. sinicum.

a Overview of the seven different types of alternative splicing (AS) events and their frequencies in the three tissues of C. sinicum. b Pie graph for the percentages of differentially alternatively spliced genes. c KEGG enrichment results of 5239 alternatively spliced genes. d Comparison of differentially expressed genes (DEGs) and differentially alternatively spliced genes (DASGs) among different tissues. e AS variants for genes involved in photosynthesis and flavonoid biosynthesis. For each gene, AS-covering and total long-read counts are shown for AS variants in different tissues, and differential expression of AS variants is indicated by ΔPSI. ‘Up’ and ‘down’ represent up-regulated and down-regulated, respectively, between tissues.

The 14,173 alternative splicing events corresponded to 5239 alternatively spliced genes (ASGs). KEGG enrichment analysis of the 5239 alternatively spliced genes revealed “protein kinases” as the most significant category (Fig. 7b). Among these alternatively spliced genes, 25.6% were classified as differentially alternatively spliced genes (DASGs) (Fig. 7c; Supplementary Data 19). Specifically, when comparing roots to stems, 447 differential AS events were identified for 339 DASGs; when comparing leaves to stems, 1394 differential AS events were detected for 1043 DASGs; and when comparing leaves to roots, 1155 differential AS events were observed for 893 DASGs (Fig. 7d; Supplementary Data 19).

We also identified genes with differential expression between different tissues. A total of 4097, 5199, and 6379 differentially expressed genes (DEGs) were identified in the roots vs. stems, leaves vs. stems, and leaves vs. roots, respectively. The number of DEGs was nearly twelve times greater than the number of DASGs (4097 versus 339) between the roots and stems, approximately five times greater than the number of DASGs (5199 versus 1043) between the leaves and stems, and approximately seven times greater than the number of DASGs (6379 versus 893) between the leaves and roots (Fig. 7d). The intersection of DEGs and DASGs accounted for 2.0%, 4.3% and 4.1% in the roots vs. stems, leaves vs. stems and leaves vs. roots, respectively (Fig. 7d; Supplementary Data 19-20).
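The fold differences quoted above follow directly from the reported counts. A minimal Python check, using only the numbers stated in this paragraph (the dictionary layout and function name are illustrative, not part of the original analysis):

```python
# DEG and DASG counts per tissue comparison, as reported in the text.
counts = {
    "roots vs. stems": (4097, 339),
    "leaves vs. stems": (5199, 1043),
    "leaves vs. roots": (6379, 893),
}


def fold_difference(n_deg, n_dasg):
    """Return how many times larger the DEG set is than the DASG set."""
    return n_deg / n_dasg


ratios = {name: fold_difference(*pair) for name, pair in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: DEGs/DASGs = {ratio:.1f}")
```

The three ratios come out at roughly 12, 5, and 7, in line with the wording used in the text.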

To elucidate the functions of genes within the DEGs and DASGs identified, a functional enrichment analysis was performed. The consistently enriched terms ‘flavonoid biosynthesis’ and ‘photosynthesis-antenna proteins’ were detected across all comparable groups of DEGs (Supplementary Figs. 29–31). The enrichment analysis revealed that the DASGs were associated with several GO terms, including ‘protein kinases’, ‘kinase activity’, ‘calmodulin binding’, ‘protein dephosphorylation’, and ‘protein serine/threonine phosphatase activity’ (Supplementary Figs. 29–31). Protein kinases play significant roles in regulating plant responses to various abiotic stresses60, indicating that AS potentially regulates protein phosphorylation and responds to abiotic stress. As a protein kinase, the A3SS-type AS variant of CsSTN7 exhibited decreased expression levels in stems vs. leaves, suggesting that AS helps regulate the LHCII state transition. Two CsFAR1s also exhibited alternative splicing events (Fig. 7e). In addition to DEGs, AS events for genes in the flavonoid biosynthesis pathway, such as CsUGT.35 and CsUGT.41, may represent another mechanism by which flavonoid biosynthesis is regulated (Fig. 7e).

Discussion

Low light intensity is widely recognized as a prevalent abiotic stress that negatively impacts photosynthesis and reduces crop yield potential. The growth of cucumber is impeded under low-light stress, resulting in compromised cucumber quality and reduced yields61,62. In the wet season in India and Southeast Asian countries, low light intensity often leads to significant yield loss in rice cultivation63. Therefore, breeders and researchers have focused on developing or generating new crop varieties with enhanced tolerance to low-light conditions64,65,66. In the present study, we selected a classic shade plant, C. sinicum, as a model and used its chromosome-scale genome to investigate the mechanism underlying its low-light tolerance. Our results demonstrated that the rapid expansion of photosynthesis-related gene families through intracellular DNA transfer and the adaptive evolution of CsLhcb1s could establish a crucial genomic foundation for the increased adaptability of the shade plant C. sinicum to low-light environments.

Our comparative genomic analysis of gene families revealed that numerous photosynthesis-related gene families with low repetitive element density in C. sinicum rapidly expanded, which was attributed to intracellular DNA transfer and segmental/WGD duplication (Fig. 4e and Supplementary Fig. 17). By analyzing the density of repetitive elements surrounding nuclear genes at the chromosome level, we successfully detected large-scale intracellular DNA transfer within chromosomes (Fig. 4b). In the C. sinicum genome, a total of five large gene clusters originating from the organelle genomes were identified (Fig. 4b, d). Interestingly, four gene clusters derived from the chloroplast genome are associated with the photosynthesis process, while a single cluster originating from the mitochondrial genome is involved in the cellular respiration process. The photosynthetic system is directly correlated with the efficiency by which plants utilize light energy, and cellular respiration is intricately connected to a plant’s capacity to survive in dark environments43,45. Moreover, these genes, which are located within the five clusters associated with photosynthesis and cellular respiration processes, are extensively expressed in leaf tissue (Supplementary Fig. 21). These results support our speculation that intracellular DNA transfer was a driving force in the adaptive evolution of C. sinicum towards long-term survival in low-light environments.

In this study, we observed a significant increase in the number of threonine residues adjacent to the STN7 kinase phosphorylation site in the CsLhcb1s from C. sinicum, compared to those from closely related species (Fig. 5c). The adaptive evolution of CsLhcb1s may contribute to the adaptability of C. sinicum to low-light environments. In land plants, 15–20% of LHCII exhibits transferability during state transitions. A previous investigation revealed that more LHCII is associated with PSI (in state 2) after long-term dark adaptation in plants20. Therefore, we speculated that the number of threonine residues is positively related to the susceptibility of LHCII to phosphorylation by STN7 kinase, resulting in an increased proportion of transferable LHCII associated with PSI (in state 2). As daylight passes through the vegetative canopy, the red-to-far-red light (R/Fr) ratio of daylight decreases due to the absorption of red light by photosynthetic pigments, resulting in a higher proportion of far-red light in low-light environments67,68. The enrichment of far-red light results in the preferential excitation of PSI19,69. An increased proportion of LHCII associated with PSI (in state 2) can significantly increase the size of the PSI antenna and transfer energy efficiently to PSI, thereby improving a plant’s ability to efficiently utilize far-red light. This may be an adaptive strategy that enables shade plants to persist in long-term low-light environments. However, the additional threonine near the STN7 phosphorylation site within CsLhcb1s may act as a competitor for the original threonine, therefore diminishing the susceptibility of this active site to STN7 kinase-mediated phosphorylation. Alternatively, the FAR1 gene family, known for its ability to inhibit the shade avoidance response by repressing hypocotyl elongation70, also underwent rapid expansion in the C. sinicum genome, suggesting that shade tolerance over shade avoidance is a strategy employed by shade plants to cope with competition for light.

C. sinicum, which belongs to the Chrysosplenium genus, is an ideal plant for studying adaptive evolution in low-light environments and has significant medicinal value as a traditional Chinese Tibetan medicine. However, the metabolic characteristics of Chrysosplenium in relation to the synthesis and accumulation of flavonoid compounds remain unclear. In this study, we performed metabolomic analysis of three tissues (roots, stems, and leaves) of C. sinicum, and our analysis revealed that the predominant flavonoids in this species are chrysosplenol and chrysosplenoside (Fig. 6b; Supplementary Data 15 and 16). Chrysosplenol and chrysosplenoside, which are rarely found in other plant sources, are widely distributed in species of the genus Chrysosplenium; thus, these flavonoids may be associated with the unique characteristics of this genus. Modern research has demonstrated that chrysosplenol and chrysosplenoside possess a broad range of pharmacological properties. For example, chrysosplenol B and chrysosplenol C exhibit a notable inhibitory effect against rhinoviruses71, while chrysosplenol D exhibits anticancer activity against oral squamous cell carcinoma72. In this study, we propose a putative pathway for chrysosplenol and chrysosplenoside synthesis based on their chemical structure and the KEGG flavonoid biosynthesis pathway (Fig. 6c). Additionally, based on the transcriptome and metabolome data, we identified key genes involved in synthesis via correlation analysis (Fig. 6c, d). Although the synthetic routes presented are incomplete, the preliminary results for the synthesis of characteristic flavonoid metabolites in the Chrysosplenium genus will contribute to the utilization of these biological resources within the genus.

AS events play a crucial regulatory role in plants by enhancing the diversity of the proteome while simultaneously governing the abundance of the transcriptome. In C. sinicum, we found that STN7 kinase and multiple UGT genes underwent AS events in three tissues (roots, stems, and leaves). The STN7 kinase phosphorylates LHCII, which then migrates from PSII to PSI (state 2) and thus rebalances the light excitation energy between the photosystems. Several studies have provided evidence for the pivotal involvement of UGTs in the biosynthesis of flavonoids in plants73,74. The differential alternative splicing of the STN7 and UGT genes indicated that AS events may play a regulatory role in photosynthesis and flavonoid accumulation across the three tissues.

In summary, the genome of C. sinicum, a classic shade plant, was assembled in this study and examined via comparative genomic analysis to investigate the adaptive evolution mechanism that allows shade plants to thrive in long-term low-light environments. Our results demonstrated that the expansion of photosynthesis-related gene families, caused by intracellular DNA transfer, and the adaptive evolution of the Lhcb1 family potentially established a crucial genomic foundation for the increased adaptability of C. sinicum to low-light environments. Additionally, we explored the biosynthesis and regulatory mechanisms of the flavonoid compounds chrysosplenol and chrysosplenoside in different tissues of C. sinicum using metabolome and transcriptome analyses, and predicted their biosynthetic pathways and key enzymes.

Methods

Plant material

The C. sinicum plant used in this study was obtained from Badagongshan National Natural Reserve in Hubei province, China. The root apex was utilized for karyotype and flow cytometry examination to determine the plant’s genome size and diploidy. A single C. sinicum plant was selected for a de novo genome assembly. Three different tissues (roots, stems, and leaves) were collected during the reproductive growth stage for transcriptome and metabolome analysis. The freshly collected samples were promptly frozen in liquid nitrogen and subsequently stored at −80 °C.

Whole-genome sequencing and Hi-C sequencing

Whole-genome DNA sequencing was conducted using the MGIseq and PacBio platforms. Genomic DNA was extracted from young leaves using a modified CTAB extraction method. Short-read sequencing libraries were constructed following the manufacturer’s standard protocol and sequenced on the MGIseq platform. The PacBio long-read library was prepared using the SMRTbell Express Template Prep kit 2.0 (Pacific Biosciences) according to the manufacturer’s instructions and then sequenced on the PacBio Sequel II platform.

The plant used for PacBio sequencing was also utilized in the construction of the Hi-C sequencing library. Hi-C data was used to assist scaffolding. The purified cross-linked DNAs were first digested, followed by blunt-end-repair and tagging with biotin. Subsequently, a Hi-C library was constructed by capturing and PCR-enriching DNA fragments containing biotin. Finally, the library sequencing was conducted utilizing the PE150 strategy on the Illumina HiSeq platform.

Genome assembly and quality evaluation

The genome size and heterozygosity of C. sinicum were evaluated using k-mer frequency analysis75 and DNA flow cytometry76. For PacBio assemblies, the Falcon software (v0.2.2)77 was used to assemble the long reads into contigs. To improve single-base accuracy, the contigs were initially polished using SMRTLink software (v5.1) with PacBio data, followed by a second round of polishing using Pilon software (v1.22)78 with short-read data. We used Juicer software (v1.5.6)79 to align the Hi-C sequencing reads to the contigs, followed by Juicebox software (v1.5.3)80 for visual correction of the interaction map, ultimately obtaining 12 pseudo-chromosomes of C. sinicum.

The BWA-MEM software (v0.7.12)81 was used to evaluate the quality of the genome assembly by mapping short-read DNA sequencing data to the assembly and calculating the mapping rates. The functional completeness of our assembly was assessed using the compleasm (v0.2.6) tool31 for BUSCO analysis, with eudicots_odb1082 serving as the reference database. The Merqury (v1.3) tool32 was employed to evaluate QV and genome completeness.

Repeat analysis

A repeat library was constructed from the assembled C. sinicum genome using RepeatModeler (open-1.0.11) (http://repeatmasker.org/RepeatModeler). The repeat sequence library obtained from RepeatModeler software was combined with RepBase (v20181026) (http://www.girinst.org/repbase) to create a comprehensive repeats library. The DeepTE83 and TEsorter (v1.4.6)84 tools were employed for the classification of repeats belonging to the “Unknown” category in the repeat library. Subsequently, this repeat library was utilized in RepeatMasker (OPEN-4.0.7) (http://www.repeatmasker.org) to identify repetitive elements within the C. sinicum genome.

Genome annotation

After masking the repetitive sequences using the RepeatMasker software, we used the MAKER pipeline (v2.31.10)85 to predict the gene set by incorporating three different approaches: ab initio prediction, homology-based prediction, and transcript-based prediction. The Augustus software (version 3.3.1)86 was used for ab initio prediction, with Solanum lycopersicum selected as the prediction model. Homology-based prediction utilized protein sequences from four sequenced plants (Arabidopsis thaliana, Vitis vinifera, Kalanchoe fedtschenkoi, and Rhodiola crenulata), as well as proteins from the UniProt database (from https://www.uniprot.org/uniprotkb). Protein sequences were mapped to the genome using tBlastn (v2.7.1+). For transcript-based prediction, RNA-seq reads were trimmed using Trimmomatic (v0.36)87 and then assembled using Trinity (v2.8.4)88.

Functional annotation of the predicted protein sequences was performed using InterProScan (version 5.39-77.0)89 and eggNOG-mapper (v2.0.1)90,91. InterProScan was run with the options ‘-iprlookup -goterms -pa -f tsv’ to generate the GO annotation for proteins. The iTAK (v1.7)92 software was used to predict transcription factors92.

Genome evolution and whole-genome duplication

The evolution of the C. sinicum genome was investigated by selecting 20 representative plant genomes, including species from Poales (Oryza sativa93), Ranales (Aquilegia coerulea94, Nelumbo nucifera), Caryophyllales (Beta vulgaris95), Asterales (Helianthus annuus96), Lamiales (Mimulus guttatus97), Solanales (Solanum lycopersicum98), Gentianales (Coffea canephora99), Saxifragales (C. sinicum, Kalanchoe fedtschenkoi35, Kalanchoe laxiflora, Rhodiola crenulata100), Vitales (Vitis vinifera39), Myrtales (Eucalyptus grandis101), Fabales (Medicago truncatula102), Rosales (Fragaria vesca103), Malpighiales (Populus trichocarpa104), Sapindales (Citrus sinensis105), Brassicales (Arabidopsis thaliana106), and Malvales (Theobroma cacao107). OrthoFinder (v2.3.3)108 was used to identify orthologous groups with default parameters. The single-copy orthologous genes across the analyzed genomes were extracted from the OrthoFinder output. We constructed the species tree based on these single-copy orthologous genes using the RAxML (v8.2.12)109 package. MCMCTREE software (from PAML v4.9i)53 was used to calculate the divergence times. Subsequently, we calibrated these divergence times using the Monocot and Dicot (>115 Mya and <308 Mya) as well as Rosids and Asterids (>111 Mya and <131 Mya) divergence times from the TIMETREE database (http://www.timetree.org/)110. The species tree was visualized using the R package MCMCtreeR (v1.1)111.
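The selection of single-copy orthologous genes can be illustrated with a short Python sketch. OrthoFinder reports single-copy orthogroups itself, so this only makes the criterion explicit: an orthogroup qualifies when every species contributes exactly one gene. The table layout (one row per orthogroup, one count column per species, plus a Total column), the species column names, and all counts below are assumptions for illustration.

```python
import csv
import io

# Toy gene-count table (hypothetical values): one row per orthogroup,
# one column per species, plus a Total column.
TOY_TABLE = """Orthogroup\tC_sinicum\tV_vinifera\tA_thaliana\tTotal
OG0000001\t1\t1\t1\t3
OG0000002\t5\t1\t2\t8
OG0000003\t1\t1\t0\t2
OG0000004\t1\t1\t1\t3
"""


def single_copy_orthogroups(handle):
    """Return IDs of orthogroups with exactly one gene in every species."""
    reader = csv.DictReader(handle, delimiter="\t")
    species = [c for c in reader.fieldnames if c not in ("Orthogroup", "Total")]
    return [
        row["Orthogroup"]
        for row in reader
        if all(int(row[sp]) == 1 for sp in species)
    ]


single_copy = single_copy_orthogroups(io.StringIO(TOY_TABLE))
print(single_copy)
```

In the toy table, the expanded orthogroup and the one missing from a species are both excluded, leaving only the two groups with a 1:1:1 pattern.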

JCVI112 was used to search for pairwise synteny regions between C. sinicum and grape genomes. The command ‘python -m jcvi.graphics.dotplot’ was utilized to generate dotplots for visualizing the pairwise synteny of genomes. The command ‘python -m jcvi.compara.synteny depth -histogram’ was utilized to analyze the synteny pattern between the genomes. ParaAT (https://github.com/wonaya/ParaAT) and KaKs_Calculator (version 2.0)113 were used to calculate Ks values for each gene pair. Based on the pairwise synteny region of C. sinicum versus AEK42, we generated the karyotype of C. sinicum genome.

Comparative genomics and visualizations

The C. sinicum-specific gene families were extracted from the Orthofinder (v2.3.3) results of four Saxifragales species. We conducted KEGG and GO enrichment analyses on gene sets using R package clusterProfiler (v4.6.2)114. The R package ggplot2 (v3.4.3) was utilized to visualize the results generated from the enrichment analysis. The duplicate_gene_classifier module from MCScanX115 was used to classify genes into distinct categories, including segmental/WGD duplicates, tandem duplicates, proximal duplicates, dispersed duplicates, and singletons. The expansion and contraction of gene families in each lineage were identified using CAFE (V4.2)116 software. Based on the repetitive elements identified in C. sinicum genome, we used bedtools (v2.27.1)117 program to count the number of repetitive elements in the proximal region (upstream and downstream 50 kb) of genes.
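The repeat-density count around genes can be sketched in plain Python. The source used bedtools but does not state the subcommand or options, so the interval logic below is an assumption: each gene is extended by 50 kb on both sides, the gene body is included in the window, and any repeat overlapping the window is counted. All coordinates and gene names are hypothetical.

```python
FLANK = 50_000  # flank size in bp, from the text


def count_repeats_near_genes(genes, repeats, flank=FLANK):
    """Count repeats overlapping each gene extended by `flank` bp on both sides.

    genes:   dict gene_id -> (chrom, start, end), 0-based half-open
    repeats: list of (chrom, start, end), 0-based half-open
    """
    counts = {}
    for gene_id, (chrom, start, end) in genes.items():
        win_start = max(0, start - flank)
        win_end = end + flank
        counts[gene_id] = sum(
            1
            for r_chrom, r_start, r_end in repeats
            if r_chrom == chrom and r_start < win_end and r_end > win_start
        )
    return counts


# Hypothetical coordinates for illustration only.
genes = {
    "geneA": ("chr1", 100_000, 102_000),
    "geneB": ("chr1", 400_000, 405_000),
}
repeats = [
    ("chr1", 60_000, 61_000),    # within 50 kb upstream of geneA
    ("chr1", 101_000, 101_500),  # inside geneA
    ("chr1", 160_000, 161_000),  # beyond 50 kb downstream of geneA
    ("chr2", 100_500, 101_000),  # different chromosome
    ("chr1", 449_000, 456_000),  # spans the downstream window edge of geneB
]

repeat_counts = count_repeats_near_genes(genes, repeats)
print(repeat_counts)
```

The nested loop is quadratic and only suitable for a small illustration; a genome-scale run would rely on sorted intervals or a dedicated tool such as bedtools.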

Mitochondria and chloroplast genome assembly

The chloroplast genome of C. sinicum was obtained from our previously submitted dataset (NC_051987.1)118. After removing fragments of chloroplast genome using Blastn (v2.7.1+) program, we used GetOrganelle (v.1.7.5)119 software to perform de novo assembly of a draft mitogenome with parameters ‘-R50 -k 21,45,65,85,105,127 -F embplant_mt’. Subsequently, we aligned the PacBio long reads to the draft mitogenome using minimap2 (v2.17)120 software and extracted these aligned long reads using samtools (v1.7)121. Based on the aligned PacBio long reads, we separately conducted mitogenome assembly using Canu (v2.2)122 and Flye (v2.9)123. The two results were combined to generate a complete mitogenome with three rings.

We annotated the mitogenome of C. sinicum by referencing the mitogenome of Heuchera parviflora (KR559021) using three software programs: GeSeq (v2017)124, Geneious Prime (2021.2.2), and IPMGA (v2022) (http://www.1kmpg.cn/mgavas). The mitogenome map of C. sinicum was generated using OGDRAW (v1.3.1)125 software. The NUPTs and NUMTs were identified by conducting Blastn (v2.7.1+) searches against nuclear genomic sequences, using the chloroplast and mitochondrial genomes as query sequences.

Positive selection analysis

The positive selection analysis of Lhcb1s was conducted for C. sinicum and seven other Superrosids species. The amino acid sequences of Lhcb1s were aligned using T-COFFEE (v13.45.33)126, and the alignment was then converted into the corresponding nucleotide sequences using ParaAT. The codeml program, an integral part of the PAML (v4.9) software, was employed to conduct positive selection analysis through the branch-site model, with the CsLhcb1s from C. sinicum designated as the foreground branch127.

Transcriptome and metabolome analysis

The samples for transcriptome sequencing were collected from multiple plants at the reproductive growth stage within the same population. After the collection and processing of root, stem, and leaf tissues, each with three biological replicates, total RNA was extracted using Trizol (Invitrogen, CA, USA) following the manufacturer’s instructions. The purified mRNA was fragmented into small pieces using a fragmentation buffer. First-strand cDNA was synthesized using random hexamer-primed reverse transcription, followed by the synthesis and purification of second-strand cDNA. The repaired (using A-Tailing Mix and RNA Index Adapters) cDNA fragments were amplified by PCR and the products were purified with Ampure XP beads to obtain the final RNA-seq library. Subsequently, all RNA-seq libraries were sequenced using the PE150 strategy on the Illumina HiSeq platform.

To quantify gene expression levels, the RNA-seq clean reads from three tissues were mapped to the C. sinicum genome using Hisat2 (v2.1.0)128 with default parameters. We then used StringTie (v1.3.4d)129 with default parameters to quantify the expression level of genes. Differentially expressed genes were identified using the R package DESeq2 (v1.38.3)130. Genes exhibiting at least a twofold change and a false discovery rate (q-value) ≤ 0.05 were considered to be significantly differentially expressed.
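The significance criteria can be written out as a small Python filter. This is a sketch under two assumptions: the twofold cutoff applies in either direction (up- or down-regulation), and each result row provides a log2 fold change and an adjusted q-value. Gene names and values are hypothetical.

```python
import math

FOLD_CHANGE_MIN = 2.0   # at least twofold, from the text
Q_VALUE_MAX = 0.05      # FDR threshold, from the text


def is_significant(log2_fold_change, q_value):
    """Apply the twofold-change and q-value cutoffs to one gene."""
    return (
        abs(log2_fold_change) >= math.log2(FOLD_CHANGE_MIN)
        and q_value <= Q_VALUE_MAX
    )


# Hypothetical result rows: (gene_id, log2 fold change, q-value).
results = [
    ("gene1", 2.3, 0.001),    # up-regulated, passes both cutoffs
    ("gene2", -1.0, 0.05),    # exactly at both cutoffs, kept
    ("gene3", 0.8, 0.0001),   # fold change too small
    ("gene4", -3.1, 0.20),    # q-value too large
]

degs = [gene for gene, lfc, q in results if is_significant(lfc, q)]
print(degs)
```

Working on the log2 scale makes the cutoff symmetric: a twofold increase and a twofold decrease both correspond to an absolute log2 fold change of 1.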

The plants used for metabolome analysis were obtained from the same batch as those used for transcriptome sequencing. Samples (roots, stems, and leaves) for the metabolome analysis were collected during the reproductive growth stage, with each tissue containing three biological replicates pooled from multiple plants. Metabolome analysis was conducted using a widely targeted metabolomics method by Wuhan Metware Biotechnology Co., Ltd (Wuhan, China) (http://www.metware.cn/). In brief, the freeze-dried C. sinicum tissues were crushed using a mixer mill (MM 400, Retsch) with a zirconia bead for 1.5 min at 30 Hz. Then, 100 mg of tissue powder was weighed and extracted overnight at 4 °C with 1.2 ml of 70% aqueous methanol. Following centrifugation at 12,000 rpm for 10 min, the extracts were filtered (SCAA-104, 0.22 μm pore size) before UPLC-MS/MS analysis (UPLC, SHIMADZU Nexera X2, www.shimadzu.com.cn/; MS, Applied Biosystems 4500 Q TRAP, www.appliedbiosystems.com.cn/). Metabolite characterization was performed using the self-built database MWDB (metware database) of Metware Biotechnology, and metabolite quantification was conducted using a scheduled multiple reaction monitoring (MRM) method131. The Python package SciPy (v1.11.2) was used to calculate Pearson’s correlation coefficient between gene expression and metabolite content. We screened the candidate gene set potentially involved in the synthesis of chrysosplenol and chrysosplenoside by retaining genes with a Pearson’s correlation coefficient greater than 0.95. Subsequently, we screened the candidate gene set for enzymes involved in flavonoid hydroxylation, methylation, and glycosylation using KEGG functional annotation. Finally, we identified the candidate genes by analyzing their specific enzyme catalytic sites within the flavonoid ring.
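
The correlation screen can be sketched as follows. To keep the example free of dependencies, the coefficient is computed with the standard library. `scipy.stats.pearsonr(x, y)` returns the same coefficient as the first element of its result. The sample layout (nine values per profile, i.e. three tissues with three replicates), the gene names, and all numbers are hypothetical assumptions, not data from this study.

```python
import math

R_THRESHOLD = 0.95  # threshold stated in the text

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return float("nan")  # a constant profile has no defined correlation
    return sum(a * b for a, b in zip(dx, dy)) / denom

def candidate_genes(expression, metabolite, threshold=R_THRESHOLD):
    """Return the sorted gene ids whose expression correlates with the metabolite above threshold."""
    return sorted(
        gene for gene, profile in expression.items()
        if pearson_r(profile, metabolite) > threshold
    )

# Hypothetical metabolite content across 9 samples (root x3, stem x3, leaf x3).
metabolite = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3, 20.0, 21.5, 19.2]

# Hypothetical expression profiles over the same 9 samples.
expression = {
    "gene_1": [2.0, 2.5, 1.9, 10.0, 9.7, 10.9, 41.0, 42.8, 38.1],  # tracks the metabolite
    "gene_2": [30.0, 28.0, 31.0, 12.0, 13.0, 11.0, 2.0, 1.0, 3.0],  # anti-correlated
    "gene_3": [5.0] * 9,                                            # constant, undefined r
}

print(candidate_genes(expression, metabolite))
```

Genes with an undefined coefficient are dropped automatically, because a comparison with NaN is always false.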

Full-length transcript analysis

High-quality full-length transcripts (RQ > 0.9) were generated from PacBio long reads using IsoSeq (v3.3.0) (https://github.com/PacificBiosciences/IsoSeq). These full-length transcripts were then corrected with short-read data using LoRDEC (v0.9)132 and subsequently mapped to the assembled C. sinicum genome with minimap2 (v2.17). Finally, the redundant transcripts were collapsed using the cDNA_Cupcake program (v8.7) (https://github.com/Magdoll/cDNA_Cupcake).

SUPPA2 (v2.3)133 was used to identify alternative splicing (AS) events from the non-redundant full-length transcripts, and the percentage spliced in (PSI) value was computed from the transcripts' TPM values to quantify the inclusion level of each AS event. An AS event was classified as differentially spliced between two tissues if the difference in PSI passed a stringent threshold (false discovery rate ≤ 0.05, |ΔPSI| > 0.1).
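
As an illustration of how PSI follows from transcript abundances, the sketch below computes PSI as the summed TPM of the transcripts that include an event divided by the summed TPM of all transcripts that define it (including plus excluding forms), and then applies the ΔPSI and FDR cut-offs stated above. It is a simplified sketch and not SUPPA2 itself. The function names and TPM values are hypothetical, and the FDR is taken as a given input rather than computed.

```python
DELTA_PSI_MIN = 0.1  # |delta PSI| must exceed this (from the text)
FDR_MAX = 0.05       # false discovery rate must not exceed this (from the text)

def psi(inclusion_tpm, exclusion_tpm):
    """PSI for one event in one tissue, or None if no defining transcript is expressed."""
    included = sum(inclusion_tpm)
    total = included + sum(exclusion_tpm)
    if total == 0:
        return None
    return included / total

def is_differential(psi_a, psi_b, fdr):
    """Apply the thresholds to one event compared between two tissues."""
    if psi_a is None or psi_b is None:
        return False  # PSI undefined in at least one tissue
    return abs(psi_a - psi_b) > DELTA_PSI_MIN and fdr <= FDR_MAX

# Hypothetical TPM values for one exon-skipping event.
psi_leaf = psi(inclusion_tpm=[30.0, 10.0], exclusion_tpm=[10.0])  # 40 / 50
psi_root = psi(inclusion_tpm=[5.0, 5.0], exclusion_tpm=[30.0])    # 10 / 40

print(psi_leaf, psi_root, is_differential(psi_leaf, psi_root, fdr=0.01))
```

Returning `None` for unexpressed events keeps them out of the comparison instead of treating them as PSI = 0.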

Statistics and reproducibility

The statistical significance of the difference in the number of repetitive elements in the proximal region of genes was determined using Student’s t-test (*P < 0.05; **P < 0.01; ***P < 0.001). The statistical significance of GO and KEGG terms was evaluated using Fisher’s exact test in combination with FDR correction for multiple testing (P < 0.05). Three biological replicates were prepared for each tissue under the same conditions. The mean FPKM value computed from the three replicates was taken as the final FPKM for each gene.
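
The enrichment test can be sketched as a one-sided Fisher’s exact (hypergeometric) test per term, followed by a multiple-testing correction. The text states only "FDR correction", so the Benjamini-Hochberg procedure used below is an assumption, as is the one-sided form of the test. The term names and gene counts are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact P value for over-representation of a term.

    k: genes in the test set annotated with the term
    n: size of the test set
    K: genes in the background annotated with the term
    N: size of the background
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical counts per term: (k, n, K, N).
terms = {
    "flavonoid biosynthesis": (12, 200, 40, 10000),
    "photosynthesis": (3, 200, 150, 10000),
    "ribosome": (5, 200, 250, 10000),
}

names = list(terms)
raw_p = [fisher_enrichment_p(*terms[name]) for name in names]
adj_p = benjamini_hochberg(raw_p)

for name, p, q in zip(names, raw_p, adj_p):
    status = "enriched" if q < 0.05 else "not enriched"
    print(f"{name}: P = {p:.3g}, adjusted P = {q:.3g} ({status})")
```

Exact integer arithmetic via `math.comb` avoids the rounding problems that factorial-based floating-point formulas can show for large backgrounds.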

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.