Background & Summary

While the decline in global biodiversity has been a long-standing concern, the compounding effects of habitat destruction and climate change have now made it a pressing crisis1. According to the State of the World’s Plants and Fungi 2020 report released by the Royal Botanic Gardens, Kew in the United Kingdom, over 30% of known plant species worldwide are currently threatened with extinction. In China, this trend is reflected in the nearly 5,000 plant species that have been classified as endangered2. To provide effective and scientifically grounded protection for these endangered species, researchers have proposed the concept of Plant Species with Extremely Small Populations (PSESP)3. PSESP are characterized by narrow geographic distribution, persistent intrinsic and extrinsic constraints, visible population degradation, and ongoing demographic decline. The concept has attracted considerable attention and has rapidly become a central focus of biodiversity conservation in China4,5,6. Moreover, advances in whole-genome sequencing technologies have greatly facilitated related conservation work, enabling more in-depth investigations into the mechanisms underlying endangerment in many PSESP7,8,9,10.

As a species endemic to the Qinghai-Tibetan Plateau (QTP), Swertia przewalskii is primarily found in Qilian County, Qinghai Province11. It belongs to the genus Swertia (subtribe Swertiinae, Gentianaceae), is a perennial herb, and is one of the original source plants of the Tibetan medicine “Zangyinchen”11. It contains a rich array of chemical constituents (such as loganic acid, sweroside, and gentiopicroside) and exhibits significant anti-inflammatory effects12. The species was included in the initial list of plant species with extremely small populations in Qinghai Province because of its narrow distribution range and the extremely limited number of wild populations13. In recent years, research on the taxonomy, phylogeny, and population genetics of S. przewalskii has been conducted; however, studies focusing on its conservation genetics remain scarce14,15,16. Therefore, it is crucial to investigate the mechanisms underlying the endangerment of S. przewalskii.

Herein, we report the first chromosome-level genome assembly of S. przewalskii, achieved through a combination of three sequencing strategies: next-generation sequencing, third-generation SMRT (Single Molecule Real-Time) sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing. This high-quality genome provides a valuable resource for analyzing the genetic factors underlying the endangerment of S. przewalskii. It also deepens the understanding of the species’ genetic background and offers scientific evidence to support the conservation of this endangered species.

Methods

Plant materials

Specimens of the study species were collected and their morphological traits were examined in detail. The stem was erect and yellow-green, with black-brown withered leaf petioles remaining at the base. Each plant bore 1–2 pairs of basal leaves, with blades ranging from oval to ovate-elliptic or spoon-shaped. The inflorescence formed a narrow umbel comprising 3–9 flowers. Detailed observations of the flowers revealed that the calyx extended to approximately two-thirds the length of the corolla, which was yellow-green with a blue center that gradually turned brown with age. The anthers were blue and either oval or narrow rectangular, while the ovary surface often displayed transverse wrinkles17 (Fig. 1A).

Fig. 1

Ecological map of S. przewalskii and overview of its genome assembly and annotation. (A) Photograph of S. przewalskii showing its natural habitat and morphological features. (B) Pipeline for S. przewalskii genome assembly and annotation.

Fresh root, stem, leaf, and flower tissues were harvested from a mature individual of S. przewalskii (37° 36′ 0.00″N, 100° 42′ 0.00″E, at an elevation of 3486 m). To ensure optimal preservation of the collected samples and to prevent RNA degradation, the plant tissues were treated in the field by rinsing twice with ultrapure water and once with 75% ethanol, transferred into Eppendorf (EP) tubes, and immediately frozen in liquid nitrogen. Upon return to the laboratory, the samples were stored at −80 °C until further use.

Library preparation and sequencing

For next-generation sequencing, genomic DNA (gDNA) was extracted from S. przewalskii using the CTAB (hexadecyltrimethylammonium bromide) protocol, which is widely used for plant DNA isolation because it efficiently removes polysaccharides and polyphenols18. gDNA quality and quantity were carefully assessed prior to library construction. DNA integrity and degree of fragmentation were evaluated using agarose gel electrophoresis, while DNA purity and concentration were measured using a NanoDrop™ One UV-Vis19 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and a Qubit 3.0 fluorometer20 (Life Technologies, Carlsbad, CA, USA), respectively. High-quality gDNA fragments of approximately 150 bp were then selected to construct sequencing libraries, ensuring optimal fragment size for downstream analyses. The DNA libraries were initially quantified using the Qubit 2.0 fluorometer21, and after appropriate dilution, fragment size was evaluated with an Agilent 2100 Bioanalyzer22. Finally, the effective library concentration was determined by quantitative PCR (Q-PCR)23 to ensure suitability for downstream sequencing. After passing these quality control steps, the libraries were sequenced on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA).

For third-generation SMRT (Single Molecule Real-Time) sequencing, genomic DNA was first extracted using the same procedure as for next-generation sequencing. The high-quality genomic DNA was then fragmented to appropriate sizes, followed by damage repair and adapter ligation to construct the sequencing library. After selecting the desired fragments and quantifying the library, a PCR-free SMRTbell library was prepared. Finally, the library templates were combined with the sequencing enzyme complexes and loaded onto the PacBio Revio platform for high-throughput sequencing.

For Hi-C sequencing, genomic DNA was extracted from fresh leaves and fixed with formaldehyde, followed by digestion with the restriction endonuclease DpnII. The quality of the DNA was assessed according to standard next-generation sequencing protocols, including evaluation of DNA fragment size, degree of degradation, purity, and concentration. Hi-C libraries were then constructed through a series of steps, including cell crosslinking, endonuclease digestion, end repair, cyclization, DNA purification, and capture. The resulting libraries were subjected to high-throughput sequencing on the Illumina platform (Fig. 1B).
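
As an illustration of the digestion step, the following minimal Python sketch locates DpnII recognition sites in a sequence and splits it into fragments. DpnII recognizes GATC and cuts immediately before the G. The function names and the toy sequence are hypothetical; this is not part of the authors' workflow.

```python
from typing import List

DPNII_SITE = "GATC"  # DpnII cuts immediately 5' of the G: ^GATC


def dpnii_cut_sites(seq: str) -> List[int]:
    """Return 0-based positions where DpnII would cut (start of each GATC)."""
    seq = seq.upper()
    sites = []
    pos = seq.find(DPNII_SITE)
    while pos != -1:
        sites.append(pos)
        pos = seq.find(DPNII_SITE, pos + 1)
    return sites


def digest(seq: str) -> List[str]:
    """Split a sequence into the fragments produced by a complete digest."""
    seq = seq.upper()
    bounds = [0] + dpnii_cut_sites(seq) + [len(seq)]
    # Skip empty fragments (for example when the sequence starts with GATC).
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


toy_sequence = "AAGATCTTGATCGG"  # hypothetical example sequence
fragments = digest(toy_sequence)
```

In a Hi-C experiment the fragment ends created this way are the points at which spatially close DNA segments are ligated together.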

Data filtering and genomic assembly

Next-generation sequencing data were first subjected to quality control to remove low-quality reads and adapter sequences using fastp v 0.21.024 with default parameters, yielding high-quality clean reads for downstream analyses. For third-generation SMRT sequencing, raw polymerase reads were processed to retain only the subreads corresponding to the insert sequences. Adapter sequences and low-quality reads were subsequently removed using SMRT Link v12.0 (https://www.pacb.com/support/software-downloads), producing high-quality reads suitable for accurate genome assembly. For Hi-C sequencing, raw data were subjected to rigorous quality control, during which low-quality sequences were removed to obtain clean reads. To obtain an initial insight into the genomic characteristics, genome size and heterozygosity were estimated using a K-mer-based strategy. Specifically, genome size was assessed with GCE v1.0.025. In parallel, Jellyfish v2.2.1026 was used to calculate the K-mer frequency-depth distribution, which provided an additional basis for genome size estimation. Following quality control and filtering of raw sequencing data, next-generation short reads and HiFi reads were employed for genome assembly of S. przewalskii. Short reads provided high base accuracy, facilitating the correction of sequencing errors, while HiFi reads spanned repetitive and complex regions, enabling continuous contig construction. The genome was assembled with hifiasm v0.16.1 (https://github.com/chhylp123/hifiasm) using an overlap-layout-consensus (OLC) algorithm. HiFi reads were first aligned using an all-versus-all method to detect overlaps, and three rounds of self-correction were performed to reduce sequencing errors. This process was iteratively repeated, allowing overlaps to be refined and a string graph to be constructed for contig-level genome assembly. The resulting contigs were then scaffolded into a chromosome-level assembly using Hi-C interaction data.
Hi-C valid interaction pairs were employed to cluster contigs into chromosome groups and determine their linear order and orientation using ALLHIC v0.9.827, with manual refinement performed based on Hi-C contact maps. Interaction data were converted into binary format using 3D-DNA v 18041928 and Juicer v1.629. Manual refinement of contig order and orientation was performed with Juicebox v 1.11.0830. Redundant sequences were identified and removed with Purge_dups v 1.2.5 (https://github.com/dfguan/purge_dups), further optimizing the genome assembly. To generate a chromosome-level assembly, gaps were filled using placeholder sequences of 100 Ns. Finally, HiCExplorer v 3.631 was used to visualize chromosomal interactions via Hi-C heatmaps. To comprehensively evaluate the quality of the genome assembly, we assessed its continuity, consistency, and completeness. The contig N50 was used to quantify assembly continuity. Clean reads from next-generation sequencing were aligned to the reference genome using BWA-MEM v0.7.1732, providing mapping rate and coverage as indicators of assembly consistency. Finally, genome completeness was evaluated with BUSCO v5.3.033 (parameter setting: Busco -m prot -c 16).
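
The idea behind K-mer-based genome size estimation can be sketched as follows: the total number of K-mer occurrences divided by the depth of the main peak approximates the genome size. This minimal Python sketch assumes a two-column "depth count" histogram such as the one written by `jellyfish histo`; the error cutoff `min_depth` and the toy histogram are hypothetical, and the simple peak rule shown here is not the statistical model implemented in GCE.

```python
from typing import Dict


def parse_histogram(text: str) -> Dict[int, int]:
    """Parse 'depth count' lines into a {depth: number_of_distinct_kmers} map."""
    hist = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        hist[int(depth)] = int(count)
    return hist


def estimate_genome_size(hist: Dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size as total k-mer occurrences / main peak depth.

    K-mers seen fewer than `min_depth` times are treated as sequencing
    errors and ignored (hypothetical cutoff; tune per dataset).
    """
    usable = {d: c for d, c in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above the error cutoff")
    peak_depth = max(usable, key=usable.get)          # depth with most k-mers
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth


# Toy histogram: low-depth error k-mers plus a main peak at depth 10.
toy = """1 900
2 300
9 10
10 40
11 10"""
size = estimate_genome_size(parse_histogram(toy))
```

With real data, the histogram file produced by the K-mer counter would be read from disk and passed to `parse_histogram` in the same way.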

A comprehensive sequencing effort using three sequencing strategies yielded a total of 136.54 GB of Illumina short-read data, 81.71 GB of PacBio HiFi long-read data, and 268 GB of Hi-C data. Following quality control, 909,483,996 clean reads were retained from the Illumina dataset out of 910,278,082 raw reads, with Q20 and Q30 scores reaching 98.38% and 95.42%, respectively, reflecting high sequencing quality. The PacBio HiFi dataset comprised 5,057,384 reads with an average length of 16,157 bp. For the Hi-C data, 1,791,494,328 reads were retained after filtering, and both Q20 and Q30 scores exceeded 90%, demonstrating high sequencing quality (Table 1). Collectively, these datasets provide a robust foundation for high-quality genome assembly. The 19-mer frequency analysis yielded a genome size estimate of ~2027.49 Mb for S. przewalskii, with the distribution peak suggesting low heterozygosity (1.14%, ~66.71 X) and a largely homozygous genome (Fig. 2A). The genome assembly exhibited high continuity, with a contig N50 of 2,569,777 bp. Mapping Illumina clean reads to the reference genome revealed coverage and alignment rates exceeding 90%, further supporting assembly accuracy. Assessment of gene content using BUSCO identified X complete orthologs, corresponding to over 95% completeness (Table 2), indicating that the assembled genome is both consistent and complete. The sequences were successfully anchored to twelve chromosomes, and Hi-C interaction heatmaps confirmed the integrity and high quality of the chromosome-level assembly (Fig. 2B,C).
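
For readers unfamiliar with the continuity metric, the contig N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal Python sketch follows; the function name and toy lengths are illustrative and are not taken from the authors' pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths supplied")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half defines N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_contigs = [10, 8, 6, 4, 2]  # hypothetical contig lengths
toy_n50 = n50(toy_contigs)
```

Here the total length is 30, so the running sum reaches 15 at the second contig and the N50 is 8.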

Table 1 Sequencing platform yield and statistics.
Fig. 2

The assembly and characteristics of the chromosome-level genome of S. przewalskii. (A) Depth and frequency distribution for K-mer = 19. (B) Hi-C heatmap of interactions among the twelve chromosomes, indicating good genome anchoring. (C) Synteny and feature distribution of the chromosome-level genome of S. przewalskii: (I) gene synteny within the S. przewalskii genome; (II) GC content; (III) density of repeat sequences; (IV) gene density; (V) the twelve chromosomes.

Table 2 Statistical analysis of BUSCO in S. przewalskii genome.

Genome annotations

The genome annotation of S. przewalskii encompassed the identification of repetitive sequences, protein-coding genes and their models, and non-coding RNAs (ncRNAs). Repetitive sequences were annotated using a combination of ab initio and Repbase-based approaches to maximize sensitivity and accuracy. For ab initio prediction, RepeatModeler v open-1.0.1134 and RepeatMasker v open-4.0.935 were employed with default parameters to construct a de novo repeat library and identify repeat elements. In parallel, Repbase-based prediction was conducted using RepeatModeler v 2.0.434 with the parameters -database mydb -threads 16, referencing the Repbase database to identify both DNA- and protein-level repeat sequences. To eliminate redundancy, LTR_retriever v 2.9.036 was used to obtain non-redundant long terminal repeat sequences from the combined results of the ab initio and Repbase-based searches. These two de novo sequence sets were subsequently integrated to form a comprehensive de novo repeat library. Repeat sequences were then aligned and predicted by integrating the de novo and Repbase databases, generating a consolidated annotation of repetitive elements. Transposable element sequences were further predicted using RepeatProteinMask v 4.1.5 (https://github.com/Dfam-consortium/RepeatMasker), and all repeat annotation results were merged with redundancies removed to produce the final high-confidence set of transposable elements. Additionally, tandem repeats were identified using TRF v 4.0937 and MISA v 2.138, completing the characterization of repetitive sequences within the genome. This combined strategy allowed comprehensive identification of the various repeat types. Long terminal repeats (LTRs) were the most abundant, accounting for 74.73% of the genome, followed by DNA elements (6.93%) and LINEs (4.77%), with SINEs (0.26%) and unknown elements (2.13%) making up smaller fractions (Fig. 3A,B). Tandem repeats comprised a total of 4.963% of the genome, including minisatellites (repeat units of 10–99 bp, 2.886%), satellites (repeat units ≥ 100 bp, 1.823%), and microsatellites (repeat units of 1–9 bp, 0.823%) (Table S1). These results provide a detailed overview of the repetitive landscape of the S. przewalskii genome, highlighting the predominance of LTRs and the contribution of tandem repeats to genome structure.
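
The tandem repeat classes above are defined purely by repeat unit length, which can be expressed as a small classifier. The Python sketch below uses the thresholds stated in the text (1–9 bp, 10–99 bp, ≥ 100 bp); the function name and the example unit lengths are hypothetical.

```python
from collections import Counter


def classify_tandem_repeat(unit_length: int) -> str:
    """Assign a tandem repeat to a class based on its repeat unit length (bp)."""
    if unit_length < 1:
        raise ValueError("repeat unit length must be at least 1 bp")
    if unit_length <= 9:
        return "microsatellite"   # 1-9 bp
    if unit_length <= 99:
        return "minisatellite"    # 10-99 bp
    return "satellite"            # >= 100 bp


# Hypothetical unit lengths, e.g. parsed from a tandem repeat finder's output.
unit_lengths = [2, 3, 9, 10, 45, 99, 100, 350]
class_counts = Counter(classify_tandem_repeat(n) for n in unit_lengths)
```

Tallying classes this way gives counts per class; genome percentages such as those reported above additionally require the total length of each repeat array.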

Fig. 3

Repeat sequence analysis of the S. przewalskii genome. (A) Divergence rate and genome percentage of the four transposable element classes. (B) Statistical analysis of all transposable elements in the S. przewalskii genome. X-axis: sequence divergence rate of each TE; Y-axis: percentage of the S. przewalskii genome occupied by each TE.

To facilitate comprehensive genome annotation of S. przewalskii, it was essential to obtain detailed information on gene distribution and structure. We employed an integrative strategy combining transcriptome-based prediction, homology-based prediction, and de novo prediction to maximize accuracy and completeness. For transcriptome-based prediction, both the Illumina and PacBio datasets were used. Raw Illumina reads were quality-filtered using fastp v 0.23.224, Hi-C reads with fastp v 0.21.024 (https://github.com/wdecoster/nanofilt), and PacBio reads with isoseq 3 v 3.9.0 (https://github.com/ylipacbio/IsoSeq3). Reads were aligned to the genome with Hisat2 v 2.2.139 for Illumina data, HICUP v 0.8.040 for Hi-C data, and Pbmm2 v 1.10.0 (https://github.com/PacificBiosciences/pbmm2) for PacBio data, enabling accurate mapping of reads to the reference genome. Coding sequences were predicted using TransDecoder v 5.7.0 (https://github.com/TransDecoder/TransDecoder), while homologous protein sequences were aligned to the genome with tblastn v 2.13.041, and transcript structures were further refined using Exonerate v 2.4.042. De novo gene prediction was conducted with Augustus v 3.5.043, trained on a specific parametric model. The gene sets derived from the three strategies were then integrated using Maker v 3.01.0344, generating a high-quality, non-redundant annotation. Through this integrative approach, we predicted a total of 35,701 protein-coding genes. The average coding sequence (CDS) length per gene was 1,089.01 bp, with each gene containing an average of 4.97 exons (Table 3). In total, 177,330 exons were identified with an average length of 306.64 bp, alongside 141,629 introns (average length: 1267.94 bp, total length: 179,577,541 bp) (Table 4). This comprehensive annotation provides a robust foundation for downstream functional and comparative genomic analyses.
We further annotated multiple non-coding RNAs, including 289 miRNAs (0.0016%), 1,039 tRNAs (0.0036%), 3,681 rRNAs (0.2243%), and 9,244 snRNAs (0.0473%), to achieve a more comprehensive genome annotation (Table 5). Additionally, the genome characteristics of S. przewalskii were visualized using the R package “circlize” with a 50 kb sliding window, providing an intuitive overview of gene density, repeat content, and other genomic elements across chromosomes. Assessment of completeness revealed that over 95% of BUSCOs were complete, reflecting a high-quality and complete genome annotation.
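
To make the windowed summary concrete, the following Python sketch counts genes per 50 kb window along one chromosome. It is an illustration only: the visualization in this study was produced with the R package “circlize”, and the step size is not stated in the text, so non-overlapping windows (step equal to window size) are assumed here. The function name and toy coordinates are hypothetical.

```python
from typing import Iterable, List

WINDOW_SIZE = 50_000  # 50 kb, as stated in the text


def gene_density(gene_starts: Iterable[int], chrom_length: int,
                 window: int = WINDOW_SIZE) -> List[int]:
    """Count genes per window using 1-based gene start coordinates.

    Assumes non-overlapping windows; a gene is assigned to the window
    that contains its start position.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for start in gene_starts:
        if not 1 <= start <= chrom_length:
            raise ValueError(f"gene start {start} lies outside the chromosome")
        counts[(start - 1) // window] += 1
    return counts


# Hypothetical gene start positions on a 130 kb chromosome.
toy_density = gene_density([1, 50_000, 50_001, 120_000], chrom_length=130_000)
```

The same binning applies to other tracks such as GC content or repeat density, with the per-window count replaced by the relevant summary statistic.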

Table 3 Statistical analysis of eight types of transposable elements (TEs) in S. przewalskii genome.
Table 4 Statistical analysis of gene prediction in S. przewalskii genome.
Table 5 Statistical analysis of non-coding RNA in S. przewalskii genome.

To comprehensively characterize the functional landscape of protein-coding genes in S. przewalskii, we employed a dual annotation strategy, integrating both sequence similarity and motif/domain-based approaches. First, predicted protein sequences were compared against multiple databases, including the Universal Protein Resource (UniProt)45, the Non-Redundant Protein Database (NR)46, Clusters of Orthologous Groups of proteins (COG)47, and Eukaryotic Orthologous Groups (KOG), using DIAMOND v 2.1.8, allowing rapid and sensitive identification of homologous proteins. Second, motif and domain-based annotations were conducted with InterProScan v 5.55-88.048, querying a suite of databases, including CDD, Gene3D, Hamap, Phobius, Pirsf, Prosite, Sfld, Superfamily, Tigrfam, Tmhmm and others. Conserved domains and sequence motifs were further identified using hmmscan v 3.3.249. Functional insights were then expanded through metabolic pathway assignment based on KOfam50 profiles within the Kyoto Encyclopedia of Genes and Genomes (KEGG)51 database. In parallel, non-coding RNA annotations were conducted to capture regulatory and structural elements. Transfer RNAs (tRNAs) were identified with tRNAscan-SE v 2.0.1252, ribosomal RNAs (rRNAs) with RNAmmer v 1.253, and other ncRNA sequences with INFERNAL v 1.1.454 against the RNA family (Rfam, version: 14)55 database. Collectively, a total of 32,775 genes (91.80%) were functionally annotated across multiple databases, including NR (31,587; 88.48%), GO (23,883; 66.90%), KOG (663; 1.86%), Pfam (22,995; 64.41%), and KEGG (11,964; 33.51%) (Fig. 4, Table 6), providing a robust and multidimensional view of gene functions, conserved domains, and metabolic pathways in the S. przewalskii genome.
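
The annotation rates above are simple proportions of the 35,701 predicted protein-coding genes. The short Python sketch below recomputes them from the reported counts as a consistency check; the variable and function names are illustrative.

```python
TOTAL_GENES = 35_701  # predicted protein-coding genes reported in the text

# Annotated gene counts per database, as reported in the text.
annotated_counts = {
    "NR": 31_587,
    "GO": 23_883,
    "KOG": 663,
    "Pfam": 22_995,
    "KEGG": 11_964,
    "Any database": 32_775,
}


def percent_annotated(count: int, total: int = TOTAL_GENES) -> float:
    """Return the share of genes annotated, as a percentage rounded to 2 decimals."""
    return round(100 * count / total, 2)


annotation_rates = {db: percent_annotated(n) for db, n in annotated_counts.items()}

for database, rate in annotation_rates.items():
    print(f"{database}: {rate:.2f}%")
```

The recomputed values agree with the percentages reported for each database.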

Fig. 4

Functional annotation of genes in the S. przewalskii genome. The gray circles on the vertical bars indicate overlapping annotations, that is, genes identified by one or more databases. The horizontal bars show the total number of genes annotated in each database.

Table 6 Analysis of gene function annotation in S. przewalskii genome.

Data Records

All sequencing data have been deposited in the National Genomics Data Center (NGDC), Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation56,57 (Project accession number: PRJCA032197). The Illumina short reads, PacBio HiFi long reads, and Hi-C reads have been deposited in the Genome Sequence Archive (GSA) in NGDC under the accession numbers CRR137065758, CRR137065859, and CRR137065960, which are publicly accessible at https://download.cncb.ac.cn/gsa2/CRA020335. The S. przewalskii chromosome-level genome assembly and gene annotation files have been deposited in the Genome Warehouse (GWH) in NGDC under the accession number GWHFIGX00000000.161, which is publicly accessible at https://ngdc.cncb.ac.cn/gwh. The genomic Illumina, PacBio, and Hi-C raw sequencing data were also deposited in the European Nucleotide Archive (ENA) at EMBL-EBI62 under the accession numbers ERR1569609163, ERR1569632164, and ERR1572902465 (Study accession number: PRJEB100542; Sample accession number: ERS26986347). The S. przewalskii chromosome-level genome assembly and annotation files were deposited in ENA under the accession number GCA_97701221566.

Technical Validation

In the present study, we used a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and a Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA, USA) to measure DNA purity and concentration, respectively. After library preparation, sequencing was performed using three sequencing strategies. From these datasets, we filtered the raw data and retained high-quality clean data. We then used 19-mer frequency analysis to estimate the genome size of S. przewalskii. Additionally, we used hifiasm for assembly and Purge_dups to remove redundant sequences, producing a draft genome. We further anchored the genome sequences to twelve chromosomes, covering 95.02% of the genome, and the Hi-C interaction heatmap confirmed the accuracy and continuity of chromosome-level scaffolding. This comprehensive genome assembly provides a valuable resource, enhancing our understanding of S. przewalskii and establishing a solid foundation for future functional and evolutionary studies.

Data overview

Sequencing with the three strategies yielded a total of 136.54 GB of Illumina data, 81.71 GB of PacBio HiFi data, and 268 GB of Hi-C data. From these datasets, we obtained 909,483,996 clean reads from the Illumina dataset out of 910,278,082 raw reads. The Q20 and Q30 scores were 98.38% and 95.42%, respectively, indicating high sequencing quality. Additionally, 5,057,384 reads were generated for the PacBio HiFi data, with an average read length of 16,157 bp. For the Hi-C data, 1,791,494,328 reads were retained after filtering, and both Q20 and Q30 scores exceeded 90%, demonstrating high sequencing quality for the Hi-C strategy.

We further evaluated the quality of the S. przewalskii genome assembly in terms of contiguity, completeness, and consistency. The assembly exhibited a contig N50 of 2,569,777 bp, with 96.9% of BUSCOs complete, a mapping rate of 98.87%, and a coverage rate of 99.29%, indicating a well-assembled genome. The completeness of the genome annotation was also assessed and exceeded 90%, reflecting a high-quality annotation. The S. przewalskii chromosome-level genome comprised 35,701 protein-coding genes, 1,039 tRNAs, 3,681 rRNAs, and 289 miRNAs. A total of 32,775 genes (91.80%) were functionally annotated across multiple databases, including NR (31,587; 88.48%), GO (23,883; 66.90%), KOG (663; 1.86%), Pfam (22,995; 64.41%), and KEGG (11,964; 33.51%), providing a comprehensive view of gene content, structure, and functional potential in S. przewalskii.