Background & Summary

Chrysosplenium, a genus of small perennial herbaceous plants, occupies a distinctive position within the Saxifragaceae family1. The genus currently comprises approximately 80 species worldwide, most of which occur in Asia, Europe, and North America in the Northern Hemisphere, with a few species in temperate regions of the Southern Hemisphere2,3,4. These species predominantly thrive in shady and humid habitats at altitudes ranging from 450 to 4800 meters, including alpine meadows, alpine shrubs, and high gravel gaps5. China is recognized as one of the centers of diversity for Chrysosplenium, harboring around 40 species, 24 of which are endemic to the country6. They are primarily distributed in the southwestern, northern, and central regions of China, with a significant concentration in the provinces of Shaanxi, Sichuan, Yunnan, and Xizang7. The genus Chrysosplenium, rich in compounds such as flavonoids and triterpenes, has high medicinal value due to its wide-ranging pharmacological properties, including anti-tumor, antibacterial, antiviral, hepatoprotective, and insecticidal activities8.

Chrysosplenium macrophyllum Oliv. is a perennial herbaceous plant belonging to the subgenus Alternifolia and is endemic to China9 (Fig. 1a). Based on specimen records, it is mainly distributed in subtropical regions of China10. C. macrophyllum is a widely used folk herbal medicine, traditionally employed in treating various ailments such as infantile convulsions, ecthyma, scalds, and lung and ear disorders6. While a pseudo-chromosome level genome for Tiarella polyphylla within the Saxifragaceae family has been published, limited genomic information is available for species of the genus Chrysosplenium11. Previous research has predominantly focused on the chloroplast genome of Chrysosplenium, with no studies addressing its nuclear genome12,13,14,15,16. Furthermore, C. macrophyllum has received limited research attention, which has significantly hindered the development and utilization of its medicinal potential.

Fig. 1

Genome assembly of Chrysosplenium macrophyllum. (a) Morphology of C. macrophyllum. (b) Flow cytometry histogram of Glycine max nuclei (FL2 signal, used as an internal reference). (c) Flow cytometry histogram of C. macrophyllum nuclei (FL2 signal). (d) Flow cytometry histogram of a mixed sample containing nuclei from both G. max and C. macrophyllum (FL2 signal), demonstrating clear peak separation for genome size estimation. (e) GenomeScope 2.0 profile of C. macrophyllum based on BGI short-read sequencing (k = 19), showing k-mer frequency distribution and estimated genome characteristics.

In this study, we report the first whole-genome sequence of C. macrophyllum, generated by integrating Oxford Nanopore Technology (ONT) long reads, Beijing Genomics Institute (BGI) short reads, and high-throughput chromatin conformation capture sequencing (Hi-C) reads. The assembled genome size is approximately 2.55 Gb, with a scaffold N50 length of 93.38 Mb. Of the assembled sequences, 83.70% (2.14 Gb) were anchored to 22 pseudo-chromosomes. The genome contains 62,921 protein-coding genes, with annotations available for 93.67% of them. Additionally, we identified 316 miRNAs, 2,768 tRNAs, 2,348 rRNAs, and 1,467 snRNAs. This newly assembled genome serves as a crucial resource for investigating the evolutionary history of Saxifragaceae, studying the biosynthesis of bioactive compounds, and exploring its potential medicinal value as a Chinese endemic plant.

Methods

Plant materials

We collected samples from Qizimei Mountain National Nature Reserve for BGI sequencing, ONT sequencing, Hi-C sequencing, and transcriptome sequencing, as well as for flow cytometry analysis. Materials for chromosome karyotype analysis came from Saiwudang National Nature Reserve. All voucher specimens are stored in the Herbarium of South-Central Minzu University (HSN).

Genome sequencing

To assemble and annotate the genome of C. macrophyllum, we combined short-read, long-read, Hi-C and transcriptome sequencing. Genomic DNA was extracted from young leaves of C. macrophyllum using a modified cetyltrimethylammonium bromide (CTAB) method17, and its concentration, purity, and integrity were assessed with a NanoDrop (NanoDrop Technologies, Wilmington, DE, USA), a Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA, USA), and 0.75% agarose gel electrophoresis. A short-read library was prepared using the VAHTS Universal Plus DNA Library Prep Kit for MGI V2 (Vazyme Biotech Co., Ltd., Nanjing, China), followed by sequencing on the DNBSEQ-T7 platform (BGI Inc., Shenzhen, China), generating approximately 457.28 Gb of raw data, with an estimated genome coverage of 143×.
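As a quick check on the reported depth, fold coverage can be approximated as total sequenced bases divided by genome size. The sketch below assumes the 3.19 Gb k-mer-based genome size from the genome survey in this study; the function name is illustrative and not part of any pipeline used here.

```python
def sequencing_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Approximate fold coverage as sequenced bases divided by genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# 457.28 Gb of short reads (from the text) over an assumed 3.19 Gb genome.
short_read_depth = sequencing_depth(457.28, 3.19)
print(f"short-read depth: ~{short_read_depth:.0f}x")
```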

To complement the short-read data, long-read sequencing was performed using Oxford Nanopore Technologies (ONT). High molecular weight DNA was fragmented using a Megaruptor, and DNA fragments were selected and ligated to adapters using the Nanopore SQK-LSK109 kit. Sequencing on the PromethION platform produced 269.58 Gb of long-read data from approximately 3 million reads (N50 = 29 kb, longest read = 834 kb).

Hi-C sequencing was applied to further improve the assembly by capturing chromatin interactions. Hi-C libraries were constructed with a modified Belton et al.18 workflow. Chromatin was cross-linked, digested, and labeled, with interacting DNA fragments captured using streptavidin magnetic beads. The Hi-C libraries were sequenced on the DNBSEQ-T7 platform (BGI Inc., Shenzhen, China), generating 355 Gb of data, which were used to assist in the subsequent construction of pseudo-chromosomes.

For transcriptome sequencing, RNA was extracted from roots, stems, and leaves of plants from the same population used for genomic sequencing. These RNA samples were pooled in equal proportions, followed by library preparation and sequencing on both the DNBSEQ-T7 (BGI Inc., Shenzhen, China) and PromethION platforms (Oxford Nanopore Technologies, USA). This generated 6.12 Gb and 12.39 Gb of raw data, respectively, providing valuable data for the subsequent genome annotation.

All library preparation and sequencing were conducted by Wuhan Benagen Technology Co. Ltd. (Wuhan, China).

Genome size estimation

Genome size was first measured by flow cytometry (Sysmex CyFlow® Cube6) at Jiyuan Biotech Co., Ltd (Guangzhou, China). A standard reference sample of Glycine max (Fig. 1b), with a known genome size, served as the benchmark. The analysis revealed that the genome size of C. macrophyllum is approximately 3.2 Gb (Fig. 1c,d). To further evaluate genome size, we carried out a k-mer–based genome survey. Raw BGI reads were quality-filtered with fastp v0.21.019, which removed adapters, short fragments and low-quality bases. We then counted 19-mer frequencies with Jellyfish v2.2.1020 and assessed genome characteristics using GenomeScope v2.021. The k-mer profile predicted a genome size of 3.19 Gb, a heterozygosity rate of 3.08%, and a duplication level of 89.26% (Fig. 1e).
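The idea behind a k-mer survey can be illustrated with a minimal sketch: count canonical k-mers, build a depth histogram, and divide the total number of k-mer instances by the depth of the main peak. This is a simplified peak-based estimate, not the mixture model that GenomeScope fits, and it is likely to be unreliable for a highly heterozygous polyploid genome. The function names and the `min_depth` error cutoff are assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k=19):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def depth_histogram(counts):
    """Map each depth to the number of distinct k-mers seen at that depth."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=5):
    """Peak-based estimate: k-mer instances at depth >= min_depth / peak depth.

    min_depth is an assumed cutoff that discards low-depth k-mers,
    which mostly arise from sequencing errors.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth
```

In practice the histogram would come from a dedicated counter such as Jellyfish rather than from pure Python, which is far too slow for hundreds of gigabases of reads.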

Karyotype analysis

Karyotype analysis of C. macrophyllum was conducted at OMIX Technologies Corporation (Chengdu, China) to identify chromosome number and ploidy. Active root tip meristematic tissues were obtained by culturing collected C. macrophyllum plants. Root tips, approximately 1.5–2 cm in length, were collected and exposed to a nitrous oxide environment to induce mitosis, thereby increasing the number of cells in the metaphase stage. These root tips were then diced, digested, and treated with a mixture of 1% pectolyase Y23 and 2% cellulase Onozuka R-10. Cells were subsequently gathered via centrifugation and resuspended in 90% acetic acid. A drop of the cell suspension was placed on a slide, which was kept in a box lined with moist paper. Chromosomes were stained with the fluorescent dye 4’,6-diamidino-2-phenylindole (DAPI). Metaphase cells with well-dispersed chromosomes were counted using an Olympus BX63 fluorescence microscope. Further confirmation of chromosome number and ploidy was achieved through fluorescence in situ hybridization (FISH), employing telomeric repeats Oligo-(TTTAGGG)6 as probes for chromosome counting and 5S rDNA repeats for ploidy determination. Observations were made using the Olympus BX63 fluorescence microscope.

The karyotype analysis revealed that C. macrophyllum has a total of 88 chromosomes (Fig. 2a). FISH analysis using telomeric repeat probes showed clear fluorescent signals at the telomeres of various chromosomes, confirming the observed chromosome count of 88 (Fig. 2b). Additionally, FISH analysis with 5S rDNA repeat probes revealed that all cells in the sample exhibited 8 hybridization signals (Fig. 2c). Based on these findings, C. macrophyllum is confirmed to be an octoploid species with a chromosomal configuration of 2n = 8x = 88.

Fig. 2

Chromosome counts and ploidy of C. macrophyllum. Fluorescent chromosome staining (a), telomere fluorescence in situ hybridization (b), and 5S rDNA fluorescence in situ hybridization (c) results of C. macrophyllum.

De novo genome assembly

The pipeline for the C. macrophyllum chromosome-level genome assembly and annotation is illustrated in Fig. 3. To assemble the contigs, low-quality Nanopore raw reads with a quality score below 7 were filtered out using Oxford Nanopore GUPPY v0.3.022. The remaining high-quality reads were then de novo assembled with NextDenovo v2.5.023. This initial assembly was corrected twice using Nanopore reads with Racon v1.4.1124, followed by two additional rounds of correction using BGI reads with Pilon v1.2325. Duplicates were removed from the corrected genome using purge_dups v1.426, resulting in a 2.55 Gb draft genome for subsequent scaffolding, annotation, and analysis (Table 1).

Fig. 3

Overview of the pipeline for C. macrophyllum chromosome-level genome assembly and annotation.

Table 1 Statistics of the assembly results.

Hi-C reads were first quality-filtered with fastp v0.21.019 to remove low-quality bases and other contaminants. The cleaned pairs were aligned to the draft genome with HICUP v0.8.027, and uniquely mapped reads were passed to ALLHiC v0.9.828 to cluster, order, and orient scaffolds into pseudo-chromosomes. Hi-C contact matrices were converted to binary (.hic) format with 3D-DNA v18041929 and Juicer v1.629; the resulting scaffolds were then visualised and manually curated in Juicebox v1.11.0830. In total, 1,298 contigs were anchored onto 22 pseudo-chromosomes, representing 83.70% of the assembled genome (Fig. 4a,b). The assembled chromosomes range from 60,630,240 bp to 135,182,039 bp in length (Table 2).
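Two of the summary statistics reported for this assembly, the scaffold N50 and the fraction of sequence anchored to pseudo-chromosomes, can be computed directly from sequence lengths. The sketch below uses the standard N50 definition; the function names are illustrative and the inputs would be lengths taken from the assembly FASTA index.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def anchored_fraction(chromosome_lengths, assembly_size):
    """Fraction of the assembly placed on pseudo-chromosomes."""
    if assembly_size <= 0:
        raise ValueError("assembly size must be positive")
    return sum(chromosome_lengths) / assembly_size


# Toy lengths only; these are not values from the C. macrophyllum assembly.
print(n50([10, 8, 6, 4, 2]))
print(anchored_fraction([40, 40], 100))
```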

Fig. 4

Interchromosomal Hi-C contact map (a) and circular chromosome diagram (b) of the C. macrophyllum genome. The circular diagram depicts the following from outer to inner layers: 22 chromosomes, gene density, GC content, repeat density, and genome collinearity.

Table 2 Chromosome lengths of C. macrophyllum.

Genome prediction and annotation

Repetitive elements in the C. macrophyllum genome were annotated with a pipeline that integrated homology-based and de novo strategies. An initial repeat library was constructed using LTR_FINDER v1.0.731, LTRharvest v1.6232, and RepeatModeler v2.0.433. Unidentified sequences were classified with TEclass v2.1.334 and merged with the Repbase v2018102635 database to yield the final library. This library was then used by RepeatMasker v4.1.536 to mask repetitive sequences within the genome and by RepeatProteinMask v4.1.5 (https://github.com/Dfam-consortium/RepeatMasker) to predict repeat sequences based on TE protein types. Tandem repeat sequences were identified using Tandem Repeats Finder v4.0937 and MISA v2.138. Comprehensive analysis revealed that repetitive elements comprised 2.13 Gb, or 83.23% of the total genome size (Table 3). Of this, interspersed repeats accounted for 1.78 Gb, or 69.69% of the genome, while tandem repeats occupied 345.84 Mb, representing 13.54% of the genome. Within the interspersed repeats category, DNA transposons constituted 13.41% of the genome, Long Interspersed Elements (LINEs) accounted for 4.67%, Short Interspersed Elements (SINEs) contributed 0.10%, and Long Terminal Repeat retrotransposons (LTRs) made up 64.76%. Among the LTR retrotransposons, LTR-Gypsy elements were the most prevalent, representing 30.14%, followed by LTR-Copia elements at 16.13% (Table 3).
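To illustrate what microsatellite detection involves, the sketch below finds perfect simple sequence repeats with regular expressions. It is not the MISA implementation. The minimum repeat counts per unit length are assumed thresholds chosen for illustration, since the text does not state the parameters used.

```python
import re

# Assumed thresholds: unit length -> minimum number of tandem copies.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a repetition of a shorter motif."""
    n = len(unit)
    return not any(
        n % p == 0 and unit == unit[:p] * (n // p) for p in range(1, n)
    )


def find_ssrs(seq: str, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, unit, copies) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_copies in sorted(min_repeats.items()):
        pattern = re.compile(
            r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1)
        )
        pos = 0
        while True:
            match = pattern.search(seq, pos)
            if match is None:
                break
            unit = match.group(1)
            if is_primitive(unit):
                copies = (match.end() - match.start()) // unit_len
                hits.append((match.start(), match.end(), unit, copies))
                pos = match.end()
            else:
                # e.g. "AA" as a dinucleotide unit: skip it and retry one base later
                pos = match.start() + 1
    return sorted(hits)


print(find_ssrs("GGC" + "AT" * 6 + "CCG"))
```

Coordinates are 0-based and end-exclusive. Imperfect and compound repeats, which dedicated tools also report, are outside the scope of this sketch.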

Table 3 Statistics of repeat sequences in C. macrophyllum genome.

Non-coding RNAs, which lack protein-coding potential, were predicted through several approaches. For tRNA prediction, tRNAscan-SE v2.0.1239 was utilized, while rRNA prediction was performed using RNAmmer v1.240. To identify other ncRNAs, including snRNA and miRNA, INFERNAL v1.1.441 was applied, referencing the Rfam database. As a result, our annotation process identified a total of 316 miRNAs, 2,768 tRNAs, 2,348 rRNAs, and 1,467 snRNAs in the C. macrophyllum genome (Table 4).

Table 4 Results of non-coding RNA annotation of the C. macrophyllum genome.

The gene structure of the C. macrophyllum genome was inferred through an integrative approach that combined ab initio, homology-based, and transcriptome-based predictions. For ab initio prediction, Augustus v3.5.042 and GlimmerHMM v3.0.443 were applied. For homology-based prediction, protein sequences from Chrysosplenium sinicum, Kalanchoe fedtschenkoi, Kalanchoe laxiflora, Rhodiola crenulata, and Arabidopsis thaliana were aligned to the C. macrophyllum genome using tblastn v2.13.044, after which transcript and protein-coding regions were refined with Exonerate v2.4.045. Transcriptome-based prediction combined BGI short reads and ONT full-length reads. Filtered BGI reads were mapped with HISAT2 v2.2.146 and assembled with StringTie v2.2.147. ONT reads were filtered using NanoFilt v2.8.048, and full-length reads were identified via Pychopper v2.7.5 (https://github.com/epi2me-labs/pychopper). The resulting sequences were aligned to the C. macrophyllum genome using minimap2 v2.26-r117549, and the resulting BAM files were reconstructed into transcripts using StringTie v2.2.147. The short-read and long-read assemblies were merged with TAMA v1.050, and open reading frames were identified using TransDecoder v5.7.0 (https://github.com/TransDecoder/TransDecoder). Finally, MAKER v3.01.0351 integrated the evidence from all three approaches to yield the consensus gene set. The resulting annotation comprises 62,921 protein-coding genes, with mean gene and CDS lengths of 4,086 bp and 1,123 bp, respectively; genes contain an average of 4.76 exons, with mean exon and intron lengths of 302 bp and 702 bp (Table 5).

Table 5 Statistics of protein-coding genes in C. macrophyllum genome.

Predicted protein sequences were compared with the UniProt and NCBI non-redundant (NR) databases using DIAMOND v2.1.852 to obtain high-confidence homologues. Conserved motifs and domains were identified with InterProScan v5.55–88.053, and complementary domain searches were carried out with HMMER v3.3.254. Gene Ontology (GO) terms were assigned by merging DIAMOND hits with InterPro-derived GO mappings in Blast2GO v4.155, and GO terms were subsequently linked to their corresponding Enzyme Commission (EC) numbers. Kyoto Encyclopedia of Genes and Genomes (KEGG) orthologues were predicted via the KAAS (https://www.genome.jp/kegg/kaas/) web server. Overall, 58,939 genes, representing 93.67% of the predicted protein-coding set, received functional annotation in at least one database, and 1,319 genes were annotated across all databases (Table 6).
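The two summary counts above correspond to a set union (genes annotated in at least one database) and a set intersection (genes annotated in every database). A minimal sketch follows, assuming per-database hit lists have already been parsed into sets of gene IDs; the gene IDs and the function name are hypothetical.

```python
def annotation_summary(hits_by_db, all_genes):
    """Summarise annotation coverage across databases.

    hits_by_db: mapping of database name -> iterable of annotated gene IDs
    all_genes:  iterable of every predicted protein-coding gene ID
    """
    all_genes = set(all_genes)
    if not all_genes:
        raise ValueError("gene set is empty")
    per_db = [set(genes) & all_genes for genes in hits_by_db.values()]
    in_any = set().union(*per_db) if per_db else set()
    in_all = set.intersection(*per_db) if per_db else set()
    return {
        "any": len(in_any),
        "all": len(in_all),
        "percent_any": round(100 * len(in_any) / len(all_genes), 2),
    }


# Toy example with hypothetical gene IDs.
genes = ["g1", "g2", "g3", "g4"]
hits = {"NR": {"g1", "g2", "g3"}, "GO": {"g1", "g2"}, "KEGG": {"g1"}}
print(annotation_summary(hits, genes))

# The reported proportion follows from the counts given in the text.
print(round(100 * 58939 / 62921, 2))
```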

Table 6 Results of functional annotation of the C. macrophyllum genome.

Data Records

The raw sequencing data used for genome assembly and annotation have been deposited in the National Genomics Data Center (NGDC)56,57, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under the BioProject accession number PRJCA025550, and are publicly accessible at https://ngdc.cncb.ac.cn/bioproject. BGI short reads, Oxford Nanopore reads, Hi-C reads, and RNA-seq data have been deposited in the Genome Sequence Archive58 in NGDC under the accession numbers CRR113620959, CRR113620860, CRR113621061/CRR113621162, and CRR113621263/CRR113621364, respectively. The chromosome-level genome assembly has been deposited in GenBank under the accession number JBISEJ00000000065. Additionally, the genome annotation file is available in Figshare66.

Technical Validation

To evaluate the accuracy and completeness of the C. macrophyllum genome, two complementary strategies were applied. Quality-filtered BGI short reads were first remapped to the assembly with BWA v5.3.067, and 97.94% of reads aligned, indicating high alignment efficiency. Genome completeness was then evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO v5.3.0)68 based on the embryophyta_odb10 reference set (1,614 conserved orthologues). The assembly contained 98.6% complete genes, of which 29.6% were single-copy and 69.1% were duplicated; only 0.6% were fragmented and 0.7% were missing (Table 7). An identical BUSCO analysis of the annotated gene set recovered 1,592 complete genes (98.6% completeness), comprising 656 single-copy (40.6%) and 936 duplicated (58.0%) genes (Table 7). Taken together, these results confirm that both the genome assembly and its annotation are highly complete and reliable.
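The BUSCO percentages for the annotated gene set can be reproduced from the reported counts. The sketch below only performs that arithmetic; the function name and dictionary keys are illustrative, and the counts are those given in the text for the embryophyta_odb10 set of 1,614 orthologues.

```python
def busco_summary(single_copy: int, duplicated: int, total: int) -> dict:
    """Convert BUSCO counts into the percentages reported in a summary line."""
    if total <= 0:
        raise ValueError("total must be positive")
    complete = single_copy + duplicated

    def percent(count: int) -> float:
        return round(100 * count / total, 1)

    return {
        "complete": complete,
        "complete_pct": percent(complete),
        "single_pct": percent(single_copy),
        "duplicated_pct": percent(duplicated),
    }


# Annotated gene set: 656 single-copy and 936 duplicated out of 1,614.
print(busco_summary(656, 936, 1614))
```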

Table 7 BUSCO analysis of C. macrophyllum genome.