Background & Summary

Sarcandra glabra (Thunb.) Nakai (Chloranthaceae; Fig. 1) is a traditional Chinese medicinal herb valued for both medicinal and edible uses. It is distributed mainly south of the Yangtze River and grows in damp sites in mountain valleys and under forest cover at altitudes of 420–1500 m. The species is widely used for its immunomodulatory1, anti-inflammatory2, antitumor3, and fracture-healing properties, which are attributed to chemical components including flavonoids, terpenes, coumarins, phenolic acids, and volatile oils4,5.

Fig. 1

Genome assembly of Sarcandra glabra. (A) S. glabra; (B) Genome-wide Hi-C heatmap of chromatin interaction counts in 100 kb bins. Only sequences anchored on chromosomes are shown. The labels CSHB01–15 denote the 15 chromosomes, and the color bar shows the log2 value of the interaction counts.

The family Chloranthaceae, comprising five genera and approximately 70 species, is primarily distributed in tropical and subtropical regions and is used for medicinal purposes and for the extraction of aromatic oils. Members of Chloranthaceae retain several primitive traits. For example, some genera have leaf veins containing both tracheids and vessels, a feature that reflects the family's distinctive evolutionary history. However, only one species currently has a genome published in NCBI (Chloranthus sessilifolius, GCA_021018995.1), so the genomics and genetic resources of Chloranthaceae remain largely unexplored. Within S. glabra, two subspecies are accepted, namely Sarcandra glabra subsp. glabra and Sarcandra glabra subsp. brachystachys.

Previous transcriptomic and metabolomic studies have investigated the tissue-specific distribution of terpenoid biosynthesis, the regulatory mechanism of the differential accumulation of flavonoids in leaves and roots, and the mechanism of accumulation of phenylpropanoid-derived compounds in S. glabra6,7,8,9,10; molecular markers have also been reported11. Although regulatory patterns for these compounds have been deduced, a whole-genome sequence is still needed to resolve the underlying molecular mechanisms.

In the current study, nanopore, short-read, and high-throughput chromosome conformation capture (Hi-C) sequencing were used to construct a highly contiguous assembly of the S. glabra genome. This high-quality genome assembly will facilitate the elucidation of the molecular mechanisms underlying the biosynthesis of medicinally valuable compounds in S. glabra and provides a reference for the development and utilization of the species.

Methods

Sample preparation and DNA extraction

Fresh leaves of S. glabra were collected from the Guangxi Botanical Garden of Medicinal Plants, China (http://www.gxyyzwy.com, Ying Hu, hying@gxyyzwy.com), under voucher number YY00902. The samples were stored at −80 °C. Genomic DNA was extracted from the frozen leaves using CTAB (cetyltrimethylammonium bromide) buffer (incubation at 65 °C for 60 min). The extracted DNA was then purified by phenol/chloroform/isopentyl alcohol (25:24:1) extraction, followed by precipitation with isopropyl alcohol and ethanol. The final DNA was resuspended in Tris-EDTA buffer for sequencing.

Library construction and sequencing

Size selection was carried out using the BluePippin system (Sage Science, Beverly, MA, USA) with a target insert size of 20 kb, and 1 μg of genomic DNA was processed for damage repair, end repair, and purification. Nanopore sequencing libraries were prepared using the SQK-LSK109 Ligation Sequencing Kit (ONT, Oxford, UK), following the manufacturer’s instructions. After quality control, the libraries were subjected to nanopore sequencing.

Two short-read libraries with insert sizes of 270 bp and 500 bp were constructed from high-quality DNA using fragmentation (Covaris, Woburn, MA, USA), end repair, and adaptor ligation, creating circular DNA molecules for rolling-circle amplification to generate DNA nanoballs (DNBs). For Hi-C library preparation, cells were cross-linked with formaldehyde to preserve DNA-protein and protein-protein interactions, followed by fragmentation, end repair, purification, and adaptor ligation. The short-read and Hi-C libraries were sequenced using the DNBSEQ platform (MGI, Shenzhen, China) in paired-end mode.

Genome assembly and quality evaluation

Short-read sequences were processed using SOAPnuke (v1.6.5; -n 0.01, -q 0.1, -l 20, -Q 2, -M 2, -A 0.5)12 to remove low-quality reads and adapter contamination. The genome size was estimated using k-mer analysis with K-mer Analysis Toolkit v2.4.213, followed by genome assessment and heterozygosity estimation using GenomeScope14.
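
The principle behind the k-mer-based size estimate can be sketched as follows. This is a minimal illustration, not the procedure implemented by the K-mer Analysis Toolkit or GenomeScope, which fit a statistical model to the full k-mer spectrum. The function name, the `min_depth` cutoff, and the toy histogram are assumptions made for this example only. In a highly heterozygous genome the tallest peak may be the heterozygous peak, in which case this simple rule would overestimate the genome size.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps a k-mer depth to the number of distinct k-mers
    observed at that depth. min_depth is a hypothetical cutoff used
    here to discard low-depth k-mers, which mostly arise from
    sequencing errors.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The depth with the most distinct k-mers approximates the
    # sequencing depth of single-copy sequence.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by the peak depth gives the
    # approximate number of k-mer positions in the genome.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (illustrative values, not data from this study).
toy_histogram = {1: 5000, 2: 800, 10: 50, 20: 100, 30: 50}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} k-mer positions")
```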

A draft assembly was generated from the ONT data with NECAT (GENOME_SIZE = 4455 Mb) and polished using Racon15,16. Short-read data were used to further refine the assembly with Pilon, and redundancy was reduced using Trimdup17. The HiC-Pro v2.5.0 pipeline was used to align the Hi-C data to the assembled contigs and obtain valid interaction pairs18. Juicer was used for sequence alignment, and 3D-DNA was employed to construct a chromosome-level assembly19,20. The final genome quality was assessed using BUSCO v3 with the “embryophyta_odb10” ortholog set21.

Genome annotation

Repeat elements were identified using RepeatMasker v4.0.7 and RepeatProteinMask v4.0.7 (http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest) with the Repbase v21.12 database22,23. A de novo repeat library was created using RepeatModeler, followed by identification of repetitive sequences with RepeatMasker22,23,24. The two predicted repeat sets were merged to generate nonredundant repeat sequences using TEclass2.1.325,26.

Protein-coding gene annotation was performed using Exonerate v2.2.025, Genewise v2.4.1 (https://www.ebi.ac.uk/Tools/psa/genewise/), and Funannotate v1.8.7 (https://github.com/nextgenusfs/funannotate), followed by integration into a comprehensive gene set using EVM (https://github.com/EVidenceModeler/EVidenceModeler/). Among the non-coding RNAs, tRNAs were identified with tRNAscan-SE27, rRNAs with BLASTN, and miRNAs and snRNAs with INFERNAL (http://infernal.janelia.org/) searches against Rfam28. Gene functional annotation was performed using BLAST v2.2.3129 against several databases, including GO30, KEGG (Kyoto Encyclopedia of Genes and Genomes)31, TrEMBL (Translation of European Molecular Biology Laboratory)32, InterPro33, SwissProt32, and NR (nonredundant protein sequences)34.

Data Records

This Whole Genome Shotgun project has been deposited at GenBank under the accession ASM4507178v1. The version described in this paper is version GCA_045071785.135. The raw genome sequencing data (ONT long reads and DNBSEQ short reads) have been deposited in the NCBI Sequence Read Archive under accession number SRP51833936, and the genome annotation is available at figshare at https://doi.org/10.6084/m9.figshare.28874543.v237.

Technical Validation

Genome assembly

Oxford Nanopore Technologies (ONT) sequencing and Hi-C-assisted assembly were used to generate a highly contiguous genome assembly of Sarcandra glabra (Thunb.) Nakai. The ONT reads totaled 124.95 Gb (~28 × coverage), with a mean read length and read N50 of 27.30 and 33.75 kb, respectively (Table S1). A total of 268.36 Gb of clean short-read sequencing data (~60 × coverage) were used for subsequent polishing (Table S1).
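
As a quick consistency check, sequencing depth is the total number of sequenced bases divided by the genome size. The sketch below assumes that the denominator is the k-mer-based estimate of 4.46 Gb; the text does not state which genome size was used, so this choice and the function name are assumptions of the example.

```python
def sequencing_depth(total_bases_gb, genome_size_gb):
    """Return mean sequencing depth (fold coverage)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# ONT yield from the text; genome size assumed to be the k-mer estimate.
ont_depth = sequencing_depth(124.95, 4.46)
print(f"ONT depth: ~{ont_depth:.0f}x")
```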

The total length of the final assembly was 4.78 Gb, with a GC content of 38.90% (Table 1), which was close to the genome size estimated by 17-mer analysis (genome size of 4.46 Gb and heterozygosity of 1.10%). The contig N50 and scaffold N50 were approximately 602 kb and 239.7 Mb, respectively, with maximum contig and scaffold sizes of 3.4 Mb and 424.4 Mb, respectively (Table 1). Based on the Hi-C reads, contigs with a total length of 3.75 Gb were anchored onto 15 chromosomes (Table 1).
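
For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows how it is defined: the length of the sequence at which the cumulative length, taken from the longest sequence downward, first reaches half of the total assembly length. The toy lengths are illustrative and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to at least
        # half of the total length defines the N50.
        if running >= half_total:
            return length


# Toy contig lengths (illustrative values only).
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))
```

The same function applies to contigs and scaffolds alike; only the input list changes.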

Table 1 Sequencing data for the genome sequencing and assembly.

In the genome-wide Hi-C heatmap, the interaction signal along the diagonal was stronger than the off-diagonal signal, which supports the quality of the chromosome-level assembly (Fig. 1B). BUSCO evaluation indicated that the final genome contained 89.00% complete genes in the “embryophyta_odb10” ortholog set (Table 1), indicating that the assembly is largely complete.
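
The diagonal-versus-off-diagonal comparison can be illustrated with a small sketch. It bins contact pairs on a single chromosome into 100 kb bins, as in the heatmap, and compares the mean log2 count of near-diagonal bins with that of distant bins. This is a simplified illustration and not the procedure used by HiC-Pro, Juicer, or 3D-DNA; the function names, the one-bin definition of "near", and the toy contact pairs are assumptions of this example. Only bins with at least one contact enter the means.

```python
import math
from collections import Counter

BIN_SIZE = 100_000  # 100 kb bins, matching the heatmap resolution


def bin_contacts(pairs, bin_size=BIN_SIZE):
    """Count contacts per (bin_i, bin_j) with bin_i <= bin_j."""
    counts = Counter()
    for pos_a, pos_b in pairs:
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(i, j)] += 1
    return counts


def mean_log2(counts, keep):
    """Mean log2 count over observed bins whose bin distance passes keep()."""
    values = [math.log2(n) for (i, j), n in counts.items() if keep(j - i)]
    return sum(values) / len(values) if values else float("nan")


# Toy intrachromosomal contact pairs (illustrative positions in bp).
toy_pairs = (
    [(10_000, 50_000)] * 8      # both ends in bin 0
    + [(120_000, 180_000)] * 8  # both ends in bin 1
    + [(90_000, 110_000)] * 4   # adjacent bins 0 and 1
    + [(20_000, 950_000)] * 2   # distant bins 0 and 9
)
counts = bin_contacts(toy_pairs)
near = mean_log2(counts, lambda d: d <= 1)
far = mean_log2(counts, lambda d: d > 1)
print(f"near-diagonal mean log2 = {near:.2f}, distant mean log2 = {far:.2f}")
```

In a correctly ordered assembly, the near-diagonal mean is expected to exceed the distant mean, as it does for the toy input.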

Genome annotation

The identified repetitive sequences (233.11 Mb) constituted 37.62% of the reference genome sequence (Table S2). The most abundant repeat types were long terminal repeat (LTR) retrotransposons (26.47%) and DNA elements (10.09%) (Table S2). Gene prediction yielded 41,423 protein-coding genes, with an average mRNA length of 3484.43 bp and an average coding sequence length of 1085.04 bp (Table 2). The average exon number per gene was 4.97, with average exon and intron lengths of 330.45 bp and 1133.84 bp, respectively (Table 2). Functional annotation assigned 33,223 genes (80.21%) to at least one database, such as GO and KEGG (Table 2 and Fig. 2). Non-coding RNAs included 4,354 ribosomal RNAs (rRNAs), 967 transfer RNAs (tRNAs), 639 small nucleolar RNAs, and 143 miRNAs (Table S3).

Table 2 Statistics of genome annotation.
Fig. 2

The functional annotation of the protein-coding genes of the Sarcandra glabra genome. (A) Venn diagram representing the functional annotation in InterPro, KEGG, SwissProt and NR; (B) GO classification statistics; (C) KEGG pathway classification statistics. Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), nonredundant protein sequences (NR).