Background & Summary

Pseudostellaria heterophylla (Miq.) Pax, a perennial herb from the Caryophyllaceae family, is a renowned medicinal plant in traditional Chinese medicine1,2. With a therapeutic history spanning centuries, it was first officially documented in a Qing Dynasty pharmacopeia, Ben Cao Cong Xin (1757)1. This species thrives in mountain valleys and moist shaded forests, predominantly across northeastern and eastern China, including the provinces of Liaoning, Shandong, Fujian, Guizhou, and Anhui (https://www.iplant.cn/info/Pseudostellaria%20heterophylla). Its dried tuberous root, termed Radix Pseudostellariae, serves as the primary medicinal material, exhibiting pharmacological properties including body fluid replenishment, enhancement of splenic and pulmonary functions, and maintenance of physiological homeostasis. In clinical practice, it has been traditionally prescribed to alleviate fatigue, anorexia, post-illness asthenia, and chronic dry cough3,4,5,6. Due to its mild therapeutic properties, it is commonly used in pediatric applications as a ginseng substitute, earning its Chinese vernacular name hai-er-shen (literally Child’s Ginseng)1.

Modern pharmacological studies have identified various bioactive compounds from P. heterophylla, including cyclic peptides, polysaccharides, saponins, and amino acids1,7. Among these, cyclic peptides, especially heterophyllin B (HB), are the characteristic constituents with significant pharmacological effects such as anti-inflammatory, antitumor, immunomodulatory, antioxidant, and anti-aging activities, as well as cognitive enhancement2,8,9,10,11. Recent studies have shown that these cyclic peptides are ribosomally synthesized and post-translationally modified peptides. The linear precursor peptide of HB is encoded by the PhPreHB gene and subsequently undergoes enzyme-catalyzed macrocyclization, primarily mediated by the peptide cyclase PhPEPTIDE CYCLASE3 (PhPCY3), to generate mature HB12,13. However, a comprehensive understanding of the cyclic peptide biosynthetic pathway in P. heterophylla and its regulatory mechanisms remains elusive.

In addition to its medicinal value, P. heterophylla is well known for its typical chasmogamous-cleistogamous (CH-CL) mixed breeding system, which is of significant evolutionary importance14,15,16. This dimorphic species produces both open (chasmogamous, CH) flowers and closed (cleistogamous, CL) flowers on the same individual (Fig. 1a). The CH flowers display a complete floral structure with five sepals, five petals, ten stamens, and three carpels, adapted for pollinator attraction and outcrossing (Fig. 1b)16. In contrast, CL flowers exhibit a reduced morphology, retaining only four sepals, two stamens, and two carpels and completely lacking petals, an adaptation that ensures reliable self-pollination under unfavorable conditions (Fig. 1c)16. Consequently, this species provides an ideal model for dissecting the gene regulatory networks that drive floral dimorphism and the associated developmental divergence between CH and CL flowers.

Fig. 1

Morphology of Pseudostellaria heterophylla. (a) P. heterophylla plant with chasmogamous (CH) and cleistogamous (CL) flowers. Scale bar, 2 cm. (b) Mature CH flower. Scale bar, 3 mm. (c) Mature CL flower tightly enclosed by sepals (left) and CL flower with sepals removed showing stamens and carpels (right). Scale bar, 2 mm.

In recent years, advances in third-generation sequencing and genome assembly technologies have established reference genomes as powerful and fundamental resources for elucidating the genetic mechanisms underlying important biological features of many plants17,18,19,20,21,22. The absence of a reference genome for P. heterophylla has impeded investigations into the genetic basis of both the biosynthesis of its medicinally valuable compounds and its unique dimorphic flowering system. To address this gap, we assembled and annotated a high-quality chromosome-level reference genome for this species using short reads, PacBio HiFi long reads, high-throughput chromosome conformation capture (Hi-C) data, and transcriptome data. The final assembly spans 2.19 Gb with a scaffold N50 of 144.78 Mb. Approximately 99.36% of the genome sequence was anchored to 16 pseudo-chromosomes. Quality assessment of the assembly with Benchmarking Universal Single-Copy Orthologs (BUSCO) indicated 97.83% completeness. Repetitive elements constituted 79.41% of the assembled genome, with long terminal repeats being predominant. A total of 37,158 protein-coding genes were identified through a combination of ab initio prediction, homology-based prediction, and transcriptome-based prediction, 87.19% (32,397) of which were functionally annotated. The chromosome-level genome assembly of P. heterophylla provides a valuable genetic resource not only for elucidating the molecular mechanisms underlying floral dimorphism between CH and CL flowers, but also for advancing our understanding of the biosynthesis of bioactive metabolites in P. heterophylla, thereby facilitating molecular breeding and genetic improvement for the efficient utilization of this medicinal plant.

Methods

Sample collection and sequencing

Wild-growing P. heterophylla individuals were collected from Kunyu Mountain in Yantai City, Shandong Province, China (37°16′ N, 121°45′ E). These plants were then cultivated under a 16 h/8 h (day/night) photoperiod at 24 °C with 60% humidity in the greenhouse of the Institute of Botany, Chinese Academy of Sciences. Fresh young leaves from a single individual were collected for genomic DNA extraction. Multiple tissues, including leaves, stems, flowers (CH and CL flowers), fruits, and roots, were sampled from multiple individuals for transcriptome sequencing. The harvested materials were immediately frozen in liquid nitrogen and subsequently stored at −80 °C until DNA and RNA extraction.

Genomic DNA was extracted following the modified CTAB method23. The quality of the extracted DNA was examined using 1% agarose gel electrophoresis and a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA), and DNA concentration was quantified using a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, USA). For genome survey sequencing, DNA libraries were constructed using the Hieff NGS® OnePot Pro DNA Library Prep Kit v4 (Yeasen, China), following the manufacturer’s instructions. DNA library quality was assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). The library that passed quality control was then sequenced on the DNBSEQ-T7 platform (MGI Tech, China) in 150-bp paired-end mode, producing 295.85 Gb of short-read data (Table 1). For PacBio HiFi sequencing, a SMRTbell (single-molecule real-time) library was prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) following the manufacturer’s instructions. The quality and concentration of the final library were examined using an Agilent 2100 Bioanalyzer. The qualified library was sequenced on the PacBio Sequel II system (Pacific Biosciences, USA), generating 68.11 Gb of HiFi long reads (Table 1).

Table 1 Summary of the sequencing data for Pseudostellaria heterophylla assembly and annotation.

For Hi-C sequencing, fresh leaves were collected from clonally propagated plants (derived from cuttings of a single mother plant) and fixed with formaldehyde to cross-link DNA and proteins. Following the standard protocol, Hi-C libraries were constructed and then assessed for concentration and insert size using Qubit 3.0 and Agilent 2100. The effective concentration of the libraries was accurately determined by qRT-PCR to ensure library quality. The Hi-C library was subsequently sequenced on the DNBSEQ-T7 platform, yielding a total of 238.97 Gb of Hi-C raw data (Table 1).

For transcriptome sequencing, total RNA was isolated independently from five tissue types (leaves, stems, flowers, fruits, and roots) to serve as input material. mRNA was purified from total RNA using poly-T oligo-attached magnetic beads, then fragmented and reverse-transcribed into cDNA using M-MuLV Reverse Transcriptase. The cDNA was processed through end repair, adenylation, and adaptor ligation. Fragments of 370–420 bp were selected using the AMPure XP system, followed by PCR amplification with Phusion High-Fidelity DNA polymerase. After purification, each library was quantified using Qubit 3.0, diluted to 1.5 ng/µL, and the insert size was verified. The effective concentration was accurately determined by qRT-PCR to ensure library quality. Qualified libraries were pooled and sequenced on the Illumina NovaSeq 6000 platform, yielding a total of 74.77 Gb of RNA-seq reads for the subsequent genome annotation analysis (Table 1).

Genome size estimation

To assess the genome size of P. heterophylla, we performed flow cytometry and k-mer analyses. Nuclei for flow cytometry were isolated from fresh P. heterophylla leaves according to a previously described protocol24. Nuclei were stained with 4’,6-diamidino-2-phenylindole (DAPI), and their DNA fluorescence was subsequently measured on a MoFlo XDP flow cytometer (Beckman Coulter, USA). Data analysis was conducted using Summit 5.2 software, and only results with a coefficient of variation below 5% were considered reliable. Physalis floridana25 was used as the internal standard. The P. heterophylla genome size was then calculated as follows: reference genome size of P. floridana × (mean fluorescence of the P. heterophylla G1 peak / mean fluorescence of the P. floridana G1 peak). This approach yielded an estimated genome size of 2.02 Gb for P. heterophylla. For k-mer analysis, a total of 295.85 Gb of raw short reads (135.09 × coverage) were first filtered using fastp v0.23.426 with default parameters to obtain clean reads. Clean reads were then processed with Jellyfish v2.2.1027 to generate a 21-mer frequency distribution, followed by evaluation of genome characteristics with GenomeScope v1.028. This analysis estimated the P. heterophylla genome size as approximately 2.08 Gb, consistent with the flow cytometry estimate, and revealed a heterozygosity rate of 0.289% (Fig. 2).
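The flow cytometry formula and the coverage figures above reduce to simple ratios. The following is a minimal sketch rather than the authors' analysis code: the function names are hypothetical, and the fluorescence values in the usage example are placeholders, since the measured peak means and the P. floridana reference size are not given in the text. The reported coverage values are consistent with dividing the data yield by the 2.19 Gb assembly size.

```python
def genome_size_from_flow(ref_size_gb, sample_g1_mean, ref_g1_mean):
    """Scale the internal standard's genome size by the ratio of G1 peak means.

    ref_size_gb    : genome size of the internal standard (Gb)
    sample_g1_mean : mean fluorescence of the sample G1 peak
    ref_g1_mean    : mean fluorescence of the standard's G1 peak
    """
    if ref_g1_mean <= 0:
        raise ValueError("reference G1 mean fluorescence must be positive")
    return ref_size_gb * sample_g1_mean / ref_g1_mean


def sequencing_depth(total_bases_gb, genome_size_gb):
    """Average fold coverage: sequenced bases divided by genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


if __name__ == "__main__":
    # Placeholder inputs (assumed, not from the paper): a 1.0 Gb standard
    # whose G1 peak is half as bright as the sample's G1 peak.
    print(genome_size_from_flow(1.0, 200.0, 100.0))

    # Data yields from Table 1 divided by the 2.19 Gb assembly size.
    print(round(sequencing_depth(295.85, 2.19), 2))  # short reads
    print(round(sequencing_depth(68.11, 2.19), 2))   # HiFi reads
```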

Fig. 2

Genome survey of Pseudostellaria heterophylla based on the 21-mer distribution analysis.
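The idea behind a k-mer survey can be illustrated with a toy implementation. This is a sketch under stated assumptions, not a substitute for Jellyfish or GenomeScope: it counts canonical k-mers in memory and applies the simple estimator "total k-mers divided by the depth of the main peak", whereas GenomeScope fits a mixture model that also accounts for heterozygosity and repeats. All function names and the `min_depth` cutoff for discarding likely error k-mers are assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k=21):
    """Count canonical k-mers across reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" in kmer:
                continue
            counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Map each depth to the number of distinct k-mers observed at that depth."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    Depths below min_depth are ignored as probable sequencing errors.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Toy data: the same 6-bp "genome" sequenced four times, surveyed with k=3.
    reads = ["AAACCC"] * 4
    hist = kmer_histogram(count_kmers(reads, k=3))
    print(estimate_genome_size(hist))
```

In the toy example every distinct 3-mer is seen four times, so the estimate equals the number of distinct k-mers in the sequence.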

Genome assembly

De novo assembly of the P. heterophylla genome was performed using Hifiasm v0.16.129 with 68.11 Gb of PacBio HiFi long reads (31.10 × coverage, Table 1). The primary assembly, a longer and more continuous set of contigs, was extracted from the initial output generated by Hifiasm. To obtain a non-redundant, haplotype-purged assembly, the HiFi reads were realigned to the primary assembly using Minimap2 v2.2430. The resulting alignments were filtered and sorted with SAMtools v1.1331. Purge_haplotigs v1.1.232 was then employed to analyze the read-depth profile of the sorted alignments and to identify and remove redundant regions from the assembly. Assembly quality was further assessed using Inspector v1.233, and iterative corrections were implemented based on its error profiles. The resulting draft genome spanned 2.19 Gb, comprising 211 contigs (longest: 166.22 Mb) with a contig N50 of 69.71 Mb.
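The contig N50 quoted above is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; N50 is the length of the
    sequence at which the running total first reaches half the overall length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths (arbitrary units): total 20, half is 10.
    # Cumulative sums are 6, 11, ... so the N50 is the second contig.
    print(n50([2, 3, 4, 5, 6]))
```

The same function applies to scaffold lengths to obtain a scaffold N50.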

To further improve the genome assembly continuity and accuracy, Hi-C data were aligned to the draft genome using Juicer v1.634. Subsequent optimization was performed with the 3D-DNA pipeline v18092235 to correct misassemblies and refine contig topology. Manual curation of the raw scaffolds was then conducted in Juicebox v2.20.0036 by examining chromatin interaction patterns to resolve ambiguous contig orientations and placements. Ultimately, 16 pseudochromosomes were unambiguously assembled based on distinct Hi-C interaction signals, covering 99.36% of the genome sequences (Fig. 3). The final chromosome-level assembly of P. heterophylla spans 2.19 Gb with a scaffold N50 of 144.78 Mb (Table 2).

Fig. 3

Hi-C heatmap for the genome assembly of Pseudostellaria heterophylla.

Table 2 Summary of Pseudostellaria heterophylla genome assembly.

Genome annotation

A comprehensive multi-step strategy was employed to annotate the P. heterophylla genome, including repeat element identification, protein-coding gene prediction, and non-coding RNA prediction. Repeat elements in the genome were annotated using the EDTA v2.1.237 pipeline with default parameters, which combines de novo, homology-based, and structure-based methods for comprehensive identification. The “LTR-unknown” sequences from the initial EDTA output were further classified using DeepTE38. In total, 79.41% of the P. heterophylla genome was identified as repetitive sequences. Among these, long terminal repeats (LTRs) were predominant (67.30%), followed by terminal inverted repeats (TIRs, 8.55%) (Table 3).

Table 3 Summary of repeat sequences in Pseudostellaria heterophylla genome.

Five types of non-coding RNA, namely microRNA (miRNA), transfer RNA (tRNA), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), and ribosomal RNA (rRNA), were also predicted in the P. heterophylla genome. tRNA prediction was performed using tRNAscan-SE v2.0.1239 with default parameters, while rRNA identification was conducted with barrnap v0.9 (https://github.com/tseemann/barrnap). The remaining non-coding RNAs were annotated using INFERNAL v1.1.540 with the Rfam database41 as reference. This analysis identified a total of 22,646 non-coding RNA loci, comprising 102 miRNAs, 7,456 tRNAs, 529 snRNAs, 9,150 snoRNAs, and 5,409 rRNAs (Table 4).

Table 4 Summary of non-coding RNAs in Pseudostellaria heterophylla genome.

Protein-coding gene prediction was performed by a combination of ab initio prediction, homology-based prediction, and transcriptome-based prediction. For ab initio prediction, Augustus v3.3.342 and SNAP43 were employed with default parameters. For homology-based prediction, the P. heterophylla genome assembly was aligned against the protein sequences of eight well-annotated species: Arabidopsis thaliana44, Heliosperma pusillum45, Silene latifolia46, Gypsophila paniculata47, Glycine max48, Vitis vinifera49, Oryza sativa50, and Amborella trichopoda51. For transcriptome-based prediction, transcriptome sequencing data were trimmed using TRIMMOMATIC v0.3652. Clean data were then mapped to the P. heterophylla genome and assembled into transcripts with HISAT2 v2.2.153 and StringTie v2.2.154, followed by the prediction of open reading frames with TransDecoder v5.7.1 (https://github.com/TransDecoder/TransDecoder). Maker3 v2.31.1155 was used to integrate the gene models predicted by all methods, resulting in the final gene set. Ultimately, 37,158 protein-coding genes were predicted for the P. heterophylla genome (Table 5). The genomic features were then visualized with Circos v0.69-856 (Fig. 4).

Table 5 Summary of protein-coding gene annotation in Pseudostellaria heterophylla genome.

Fig. 4

The genomic features of Pseudostellaria heterophylla. From outside to inside, the tracks show chromosomes, GC content, gene density, repeat density, LTR/Copia density, LTR/Gypsy density, and syntenic blocks across the 16 pseudochromosomes. Inter-chromosomal syntenic blocks were identified by MCScanX70 with default parameters.

Functional annotation of the predicted genes was conducted via a two-step approach. First, eggNOG-mapper v2.1.1357 was used to align the gene sequences to the eggNOG v5.0 database58, which annotated 30,851 (83.03%) genes in the gene set. Among these, 28,630 were assigned Clusters of Orthologous Groups (COG) categories, 14,271 were annotated with Gene Ontology (GO) terms, and 9,598 were linked to pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG). Second, motifs and domains were identified using InterProScan v5.75-106.059 by comparison against the InterPro member databases. This analysis revealed that 30,533 proteins (82.17%) contained conserved domains, with 26,819, 23,456, 18,562, and 17,487 proteins annotated in the PANTHER60, Pfam61, Gene3D62, and SUPERFAMILY63 databases, respectively. Overall, 32,395 (87.18%) of the predicted protein-coding genes were functionally annotated in at least one of these databases (Table 5).

Data Records

The short reads, PacBio HiFi reads, Hi-C reads, and RNA-seq reads have been deposited in the Genome Sequence Archive (GSA) of the National Genomics Data Center (NGDC) under the accession number CRA02847764. The final P. heterophylla genome assembly has been deposited in the European Nucleotide Archive (ENA) under the accession number GCA_977035685.165. The genome annotation files are available in Figshare66.

Technical Validation

Assembly quality and completeness were evaluated through three complementary approaches. First, the filtered short reads were mapped back to the final P. heterophylla genome with Bowtie2 v2.3.4.167, achieving a 96.74% mapping rate and 99.90% genome coverage. Second, the base-level accuracy of the genome assembly was assessed with Merqury v1.368, based on 21-mers derived from the DNBSEQ-T7 short reads. This analysis yielded a consensus quality value (QV) of 53.67, corresponding to an extremely low per-base error rate of 4.29 × 10⁻⁶. Third, BUSCO v5.7.169 analysis against the embryophyta_odb10 dataset revealed 97.83% complete BUSCOs (90.71% single-copy and 7.12% duplicated), along with 0.93% fragmented and 1.24% missing BUSCOs. These results collectively confirm a high-quality genome assembly for P. heterophylla.
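The consensus quality value is on the Phred scale, where QV = −10 × log10(error rate). The short sketch below converts between the two under that standard convention; the function names are illustrative and this is not Merqury's own code. For a QV of 53.67 the conversion gives a per-base error rate of roughly 4.3 × 10⁻⁶.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    rate = qv_to_error_rate(53.67)
    print(f"{rate:.3e}")                      # per-base error probability
    print(round(error_rate_to_qv(rate), 2))   # round trip back to the QV
```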

The predicted proteins were evaluated by BUSCO v5.7.169 with the embryophyta_odb10 dataset. Among a total of 1,614 BUSCOs, 1,555 (96.34%) BUSCOs were complete (1,452 single-copy BUSCOs and 103 duplicated BUSCOs), 15 (0.93%) BUSCOs were fragmented and 44 (2.73%) BUSCOs were missing, which indicated high-quality annotation of the predicted gene models.
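The BUSCO percentages for the predicted proteins follow directly from the reported category counts. The sketch below recomputes them as a consistency check; the function name and the returned dictionary layout are assumptions made for illustration and do not reproduce BUSCO's own output format.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the total set."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("BUSCO counts must not all be zero")

    def pct(count):
        return round(100 * count / total, 2)

    return {
        "complete": pct(single + duplicated),
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "n": total,
    }


if __name__ == "__main__":
    # Counts reported for the predicted protein set (embryophyta_odb10).
    summary = busco_summary(single=1452, duplicated=103, fragmented=15, missing=44)
    for category, value in summary.items():
        print(f"{category}: {value}")
```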