Background & Summary

Gypsophila vaccaria Sm. is an annual herbaceous plant in the Caryophyllaceae family1. Its dried mature seeds, known in traditional Chinese medicine as “Wang Bu Liu Xing”, are used to treat amenorrhea and urinary tract infections and to stop bleeding2,3 (Fig. 1). The bioactive constituents of G. vaccaria seeds include triterpenoid saponins, cyclic peptides, flavonoids, and crude polysaccharides4,5,6. According to the 2010 edition of the Pharmacopoeia of China, vaccarin is the primary bioactive ingredient in the seeds and is used as the defining marker of Wang Bu Liu Xing2.

Fig. 1

Gypsophila vaccaria. (a) living habitat, (b) flowers, and (c) mature seeds.

Most medical research on G. vaccaria has focused on the extraction, isolation, and characterization of its main medicinal ingredient, vaccarin, along with studies on its pharmacology, toxicology, and the development and quality evaluation of new drugs7,8. Genome sequences, and the gene-level information derived from them, allow gene functions to be characterized, which can help resolve long-standing problems in research on traditional Chinese medicine9,10,11. However, the lack of high-quality genomic data for G. vaccaria has hindered research into the biosynthesis and accumulation of its essential medicinal components, as well as the identification of the pathways and genes involved in these processes. This gap limits the production and broader application of this important traditional Chinese medicine. Therefore, investigating the genome of G. vaccaria is crucial for analyzing its genetic background and understanding the biological mechanisms behind its medicinal properties. This knowledge can subsequently inform the selection of superior varieties of Chinese herbal medicines and facilitate genetic enhancement.

We assembled a high-quality, chromosome-level genome of G. vaccaria (2n = 30) from approximately 26.79 Gb of PacBio HiFi data and 120.6 Gb of high-throughput chromosome conformation capture (Hi-C) sequencing data, and annotated it with the support of 14.07 Gb of Illumina transcriptome data. This well-annotated genome offers valuable insights into the biosynthesis of the medicinal components of G. vaccaria and lays the groundwork for future studies on genetic improvement.

Methods

Sample collection and genome sequencing

Fresh leaves of G. vaccaria were collected from a healthy plant cultivated at Hexi Corridor Medicinal Plant Plantation (Fig. 1a,b). The leaves were washed with distilled water, flash-frozen in liquid nitrogen, and stored at −80 °C until sequencing. Genomic DNA was extracted to construct a library using the SMRTbell Express Template Prep Kit, following the manufacturer’s protocol. Genome sequencing was conducted on the PacBio Revio platform (BerryGenomics, Beijing, China), yielding 26.79 Gb of data with an average read length of 15.3 kb and an N50 of 15.1 kb (Table 1). Additionally, whole-genome sequencing was performed on the Illumina HiSeq2500 platform, generating 21.77 Gb of short reads for genome survey analysis (Table 1).

Table 1 Samples and sequencing statistics.

To accurately annotate the G. vaccaria genome, transcriptome sequencing was conducted on the fresh leaves using the Illumina NovaSeq 6000 platform. Low-quality reads were filtered out using fastp12, yielding a total of 14.07 Gb of clean data. Additionally, fresh leaves were used to construct a library for Hi-C sequencing13. The Hi-C library was prepared according to standard procedures and sequenced on the Illumina NovaSeq 6000 platform, generating 120.6 Gb of raw data with approximately 110× genome coverage (Table 1).

Genomic characteristics estimation

The genome size and heterozygosity of G. vaccaria were estimated using two complementary methods. First, root tips of G. vaccaria, along with those of reference species Solanum lycopersicum and Zea mays, were stained, and the nuclear suspensions were analyzed using a flow cytometer14. The genome size of G. vaccaria was calculated based on fluorescence intensity, resulting in estimates of 1.11 Gb and 1.16 Gb when using the two reference species (Table 2). The second estimation method involved k-mer analysis. The k-mer distribution of both short reads and HiFi reads was calculated using Jellyfish with the parameter “-k 19”15. GenomeScope (v2.0)16 was then used to analyze the k-mer distribution, providing estimates for genome size and heterozygosity rate. This analysis yielded estimated genome sizes of 962 Mb and 874 Mb, with heterozygosity rates of 0.0971% and 0.74%, respectively (Fig. 2).

Table 2 Genome size estimated based on flow cytometry.
Fig. 2

Genome survey of G. vaccaria based on distribution of 19-mer. (a) Genome survey based on next generation sequencing data. (b) Genome survey based on third generation sequencing data.
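The principle behind the k-mer survey can be illustrated with a minimal Python sketch. This is not the Jellyfish/GenomeScope procedure used above: it ignores canonical (strand-collapsed) k-mers, heterozygous peaks, and repeat modelling, and simply divides the total number of non-error k-mers by the depth of the main peak. The function names and the `min_depth` error cutoff are hypothetical choices made for illustration.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {depth: number of distinct k-mers seen at that depth}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    k-mers below min_depth are treated as sequencing errors and ignored
    (an assumption of this sketch, not a GenomeScope parameter).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=lambda d: solid[d])
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


# Toy example: a 10-bp "genome" sequenced four times without errors.
toy_reads = ["ACGTTGCAAG"] * 4
toy_hist = kmer_histogram(toy_reads, k=3)
toy_size = estimate_genome_size(toy_hist)
```

In the toy example the estimate equals the number of k-mer positions in the sequence (length minus k plus 1), which approaches the genome length for real genomes where k is negligible relative to genome size.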

Genome assembly and scaffolding

The PacBio HiFi reads were assembled using Hifiasm with default parameters17. The HiFi reads were then aligned to the initial contigs using Minimap2 (ref. 18) with the parameter “-x asm20”, and the results were used to calculate the sequencing depth of each contig. Simultaneously, the initial contigs were compared against organelle genomes using BLASTn with the parameters “-evalue 1e-5 -perc_identity 0.8 -task megablast”19. Contigs with high sequencing depth (>150×) and high organelle genome coverage (>85%) were removed. The remaining contigs were then aligned both to the HiFi reads and to themselves to eliminate haplotigs using Purge_Dups v1.2.6 with default parameters20. This step yielded an assembly size of 1,089.9 Mb and a contig N50 length of 9.73 Mb (Table 3). Following the removal of redundancy, Hi-C paired-end reads were used to construct pseudo-chromosomes using HapHiC21. The chromosome-scale pseudomolecules were subsequently gap-filled and polished using TGS-GapCloser2 (ref. 22) and NextPolish2 (ref. 23). The final genome assembly was organized into 15 pseudo-chromosomes, with a total genome size of 1.09 Gb and an N50 length of 73.3 Mb (Table 3, Fig. 3).

Table 3 Statistics of G. vaccaria genome assembly.
Fig. 3

Heatmap of the strength of DNA-DNA interactions detected in the Hi-C dataset. Blue frames denote the 15 assembled pseudo-chromosomes of G. vaccaria.
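The organelle-contig filter described in the assembly step can be sketched in Python as follows. The sketch assumes that a contig is discarded only when it exceeds both thresholds (depth >150× and organelle coverage >85%); the text does not state whether the two criteria were combined with "and" or "or". The `ContigStats` record and its field names are hypothetical, standing in for per-contig values derived from the Minimap2 and BLASTn results.

```python
from dataclasses import dataclass


@dataclass
class ContigStats:
    name: str
    mean_depth: float          # mean HiFi read depth over the contig
    organelle_coverage: float  # fraction (0-1) of the contig covered by organelle hits


def is_organelle_like(contig, depth_cutoff=150.0, coverage_cutoff=0.85):
    """True when the contig exceeds BOTH cutoffs (assumed combination rule)."""
    return (contig.mean_depth > depth_cutoff
            and contig.organelle_coverage > coverage_cutoff)


def filter_contigs(contigs, depth_cutoff=150.0, coverage_cutoff=0.85):
    """Keep contigs that do not look like organelle sequence."""
    return [c for c in contigs
            if not is_organelle_like(c, depth_cutoff, coverage_cutoff)]


# Illustrative values only; these are not contigs from the assembly.
example = [
    ContigStats("ctg1", mean_depth=24.0, organelle_coverage=0.00),
    ContigStats("ctg2", mean_depth=900.0, organelle_coverage=0.97),  # plastid-like
    ContigStats("ctg3", mean_depth=310.0, organelle_coverage=0.10),  # nuclear repeat
]
kept = [c.name for c in filter_contigs(example)]
```

Requiring both conditions retains high-depth nuclear repeats such as `ctg3`, which a depth-only filter would discard.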

To evaluate the completeness of the genome assembly, several strategies were employed. Short reads and HiFi reads were mapped to the assembled genome using BWA and Minimap2 to estimate mapping rates and genome coverage18,24. The results indicate that over 99% of the short reads and HiFi reads could be mapped back to the genome (Table 4). Merqury and LAI analyses were conducted to assess the quality of the genome assembly, revealing that each pseudo-chromosome has a high QV value (Table 5) and that the overall LAI value for the entire genome is 17.33 (refs. 25,26). These assessment results suggest a high quality for the G. vaccaria genome. Moreover, Benchmarking Universal Single-Copy Orthologs (BUSCO; embryophyta_odb10) were used to calculate the completeness of the assembled genome and the subsequent annotations of protein-coding genes27. The completeness values for the genome and protein modes were 95.9% and 94.9%, respectively, indicating high completeness of both the genome assembly and the gene prediction (Fig. 4).

Table 4 Summary of read mapping rate and mean depth of G. vaccaria.
Table 5 Evaluation of the assembly quality of the G. vaccaria genome by Merqury analysis.
Fig. 4

Assembly quality of the G. vaccaria genome estimated by BUSCO analysis.
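For readers unfamiliar with consensus quality values, the following Python sketch shows how a Phred-scaled QV relates to a per-base error probability. It assumes that the QV values in Table 5 follow the usual Phred convention, QV = −10·log10(error rate); the example values are illustrative and are not taken from Table 5.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# Illustrative only: QV 40 corresponds to about one error per 10,000 bases.
illustrative_error = qv_to_error_rate(40)
```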

Annotation of repetitive elements

Transposable elements in the genome were annotated using a combination of ab initio prediction and homology searching. A consensus sequence library was built using LTR_FINDER28, RepeatScout29, and RepeatModeler30. Repeat regions were annotated using RepeatMasker31 based on this library and the RiTE database32. Repeat sequences at the protein level were identified using RepeatProteinMask, while tandemly repeated sequences were identified using Tandem Repeats Finder (TRF)33. Overall, 80.43% of the whole genome was annotated as repetitive elements by the combined homology-based and de novo prediction methods. This included 54.5% retrotransposons (594.0 Mb) and 6.0% DNA transposons (65.4 Mb) (Fig. 5, Table 6).

Fig. 5

Illustration of genomic features of the 15 pseudo-chromosomes of the G. vaccaria genome. The tracks from outside to inside represent Gypsy density, Copia density, repeat element density, GC content, gene density, and collinear blocks.

Table 6 Statistics of repeat elements in the G. vaccaria genome.
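The repeat percentages quoted above can be cross-checked against the reported lengths with a short calculation. The sketch below assumes that the percentages are expressed relative to the 1,089.9 Mb assembly size given in the assembly section; that choice of denominator is an inference, not something stated in Table 6.

```python
ASSEMBLY_SIZE_MB = 1089.9  # assembly size reported in the assembly section


def percent_of_assembly(length_mb, assembly_mb=ASSEMBLY_SIZE_MB):
    """Express a sequence length as a percentage of the assembly size."""
    return 100.0 * length_mb / assembly_mb


retrotransposon_pct = round(percent_of_assembly(594.0), 1)  # reported: 54.5%
dna_transposon_pct = round(percent_of_assembly(65.4), 1)    # reported: 6.0%

# Length implied by the 80.43% total repeat content, in Mb.
total_repeat_mb = 0.8043 * ASSEMBLY_SIZE_MB
```

Under this assumption the reported lengths and percentages agree to one decimal place.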

Gene prediction and functional annotation

Protein-coding gene prediction was performed using a combined approach of homology-based, ab initio, and RNA-Seq-assisted predictions. First, two transcriptomes derived from G. vaccaria leaves were mapped to the genome using HISAT2 (ref. 34), and the aligned RNA-Seq reads were processed with Samtools35 to generate BAM files. These BAM files were used in a genome-guided transcriptome assembly with StringTie2 (ref. 36). The transcriptome assemblies provided RNA evidence for BRAKER37, which used both AUGUSTUS38 and GeneMark-ET39 to refine gene predictions by training models on RNA-Seq hints, improving gene structure accuracy. For homology-based evidence, a comprehensive protein database was used to enhance the annotation through protein sequence alignment and homology-informed model training in AUGUSTUS38. This database comprised odb10_plants (3,510,742 proteins) and proteins from three Caryophyllaceae species: Saponaria officinalis (GCA_040167595.1)40, Silene conica (GCA_029255685.2)41, and Gypsophila paniculata (GCA_032274805.1)42. Ab initio prediction was carried out using models trained on the RNA-Seq and protein evidence, allowing BRAKER37 to perform gene structure inference with minimal manual intervention. Finally, combining the results from these strategies produced a comprehensive annotation comprising 21,795 protein-coding genes and 24,568 transcripts for G. vaccaria.
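Gene and transcript totals such as those reported above are typically obtained by counting feature types in the annotation file. The following Python sketch assumes that the annotation is distributed in GFF3 format, with tab-separated columns and the feature type in the third column; the format of the deposited files is not specified here, and the two-gene example is invented for illustration.

```python
from collections import Counter


def count_feature_types(gff3_text):
    """Count GFF3 records by feature type (column 3), skipping comments."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:  # a valid GFF3 record has exactly nine columns
            continue
        counts[fields[2]] += 1
    return counts


# Invented example: two genes, one of which has two transcripts.
example_gff3 = "\n".join([
    "##gff-version 3",
    "chr1\tBRAKER\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tBRAKER\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\tBRAKER\tmRNA\t100\t700\t.\t+\t.\tID=g1.t2;Parent=g1",
    "chr1\tBRAKER\tgene\t1500\t2400\t.\t-\t.\tID=g2",
    "chr1\tBRAKER\tmRNA\t1500\t2400\t.\t-\t.\tID=g2.t1;Parent=g2",
])
feature_counts = count_feature_types(example_gff3)
```

A transcript count larger than the gene count, as in the reported annotation, indicates that some genes carry more than one predicted isoform.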

For functional annotation, protein-coding genes were compared with homologs in the SwissProt and NR databases using BLASTP (e-value = 1e-5)43. Domains and gene ontology terms were annotated through sequence comparisons using HMMER44 with the Pfam45 and GO databases46. KEGG pathway annotations were derived from comparisons with homologs in the KEGG databases47. Over 97% of the predicted proteins were found to have homologs with functional annotations (Table 7).

Table 7 Summary of gene function annotations of G. vaccaria genome.

For non-coding genes, rRNA genes were predicted using Barrnap v0.9 (https://github.com/tseemann/barrnap), identifying 15,477 rRNA features. Additionally, 1,287 candidate tRNA genes were predicted using tRNAscan-SE48. Other non-coding RNA genes were identified by searching the Rfam.cm database49 with Cmscan, using the clan information in Rfam.clanin to group related families, which resulted in the identification of 5,724 candidate genes.

Data Records

The raw Illumina, PacBio, Hi-C and RNA-seq sequencing data have been deposited in the NCBI SRA database under accession number SRP536556 (ref. 50). The final chromosome assembly has been deposited in NCBI GenBank under accession number JBHZIJ000000000 (ref. 51). The genome annotation files have been deposited in the Figshare database52.

Technical Validation

To ensure the quality and validity of our data, we implemented rigorous quality control measures during sequencing, assembly, and annotation.

Quality control for sequencing data

For Illumina DNA sequencing, raw reads were filtered to remove adaptors and low-quality sequences (quality score < 20), retaining 95.90% of the original reads as high-quality reads. This resulted in 21.67 Gb of clean data, corresponding to approximately 19.2× genome coverage. Similarly, for Hi-C DNA sequencing, after quality filtering (quality score < 20), 97.34% of reads were retained, generating 116.18 Gb of clean data (~101.9× genome coverage). For HiFi sequencing, the N50 read length reached 15.12 kb, with the longest read spanning 40.9 kb and an average read length of 15.27 kb. Additionally, for transcriptome sequencing, two datasets were obtained, with Q20 values of 97.48% (6.54 Gb) and 98.30% (6.84 Gb), respectively. These high-quality sequencing data ensure the accuracy and reliability of the subsequent genome assembly and analyses.
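Fold coverage is the number of sequenced bases divided by the genome size, so the coverage figures above imply the genome size that was used as the denominator. The Python sketch below back-calculates that value from the reported clean data; the function names are hypothetical, and the interpretation of the result is an inference rather than a statement from the methods.

```python
def fold_coverage(bases_gb, genome_size_gb):
    """Sequencing depth as total bases divided by genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome_size_gb must be positive")
    return bases_gb / genome_size_gb


def implied_genome_size_gb(bases_gb, coverage):
    """Genome size implied by a reported data volume and fold coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return bases_gb / coverage


# Reported clean data volumes and coverage values from this section.
illumina_implied = implied_genome_size_gb(21.67, 19.2)
hic_implied = implied_genome_size_gb(116.18, 101.9)
```

Both values come out at roughly 1.13 to 1.14 Gb, which lies between the two flow cytometry estimates (1.11 Gb and 1.16 Gb) and is slightly larger than the 1.09 Gb assembly.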

Genome assembly quality assessment

We assessed the genome assembly based on contiguity, completeness, and correctness. The assembled genome exhibited a scaffold N50 of 73.29 Mb, with 15 chromosome-level scaffolds successfully obtained. Completeness was evaluated using BUSCO analysis with the ‘embryophyta_odb10’ database, revealing a high completeness score of 97.96%, with 95.79% of BUSCOs classified as complete and 2.17% as fragmented. To verify the accuracy of the genome assembly, we mapped both Illumina and HiFi sequencing reads to the assembled genome, achieving mapping rates of 99.03% and 99.99%, respectively. Furthermore, the two transcriptome datasets were mapped to the genome, yielding mapping rates of 96.74% and 95.38%. These results, along with phylogenomic analysis, collectively demonstrate the high quality and accuracy of the assembled genome.
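The contiguity metric used above, N50, is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal Python sketch of the standard definition follows; the lengths in the example are illustrative and are not taken from the G. vaccaria assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Illustrative lengths (arbitrary units): total 20, half reached at length 5.
example_n50 = n50([2, 3, 4, 5, 6])
```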