A High-quality chromosome-level genome assembly of Swertia przewalskii Pissjauk

Shuang, Han; Niu, Yu; Xu, Hao; Yu, Jingya; Han, Yun; Zheng, Qingqing; Guan, Rongrong; Zhang, Faqi; Chen, Shilong

doi:10.1038/s41597-025-06292-y

Data Descriptor
Open access
Published: 12 December 2025

Han Shuang¹,
Yu Niu¹,
Hao Xu¹,
Jingya Yu¹,
Yun Han¹,
Qingqing Zheng³,
Rongrong Guan³,
Faqi Zhang^1,2,3 &
Shilong Chen¹

Scientific Data volume 12, Article number: 2026 (2025)

Abstract

Swertia przewalskii Pissjauk., a member of the genus Swertia (subtribe Swertiinae, Gentianaceae), is one of the original plant sources of the Tibetan medicine “Zangyinchen” and has been included in the initial list of plant species with extremely small populations in Qinghai Province. To gain deeper insights into the threats to the survival of S. przewalskii, we assembled a high-quality, chromosome-level genome using a combination of Illumina short-read, PacBio HiFi long-read, and Hi-C scaffolding technologies. The assembled genome is ~2027.49 Mb in size and shows excellent continuity, consistency, and completeness (contig N50 = 2,569,777 bp, mapping rate = 98.87%, coverage = 99.29%, BUSCO = 96.9%). The assembled sequences were anchored to twelve chromosomes. Genome annotation revealed 35,701 protein-coding genes, 289 microRNAs (miRNAs), 1,039 transfer RNAs (tRNAs), and 3,681 ribosomal RNAs (rRNAs). In total, 32,775 genes were functionally annotated using several databases, including NR (31,587, 88.48%), GO (23,883, 66.90%), KOG (663, 1.86%), Pfam (22,995, 64.41%), and KEGG (11,964, 33.51%). This study reports a high-quality, chromosome-level genome assembly of S. przewalskii, providing a foundational resource for its conservation biology.

Background & Summary

While the decline in global biodiversity has been a long-standing concern, the compounding effects of habitat destruction and climate change have now made it a pressing crisis¹. According to the State of the World’s Plants and Fungi 2020 report released by the Royal Botanic Gardens, Kew in the United Kingdom, over 30% of known plant species worldwide are currently threatened with extinction. In China, this trend is reflected in the nearly 5,000 plant species that have been classified as endangered². To provide effective and scientifically grounded protection for these endangered species, researchers have proposed the concept of Plant Species with Extremely Small Populations (PSESP)³. PSESP are characterized by narrow geographic distribution, persistent intrinsic and extrinsic constraints, visible population degradation, and ongoing demographic decline. The concept has attracted considerable attention and has rapidly become a central focus of biodiversity conservation in China^4,5,6. Moreover, advances in whole-genome sequencing technologies have greatly facilitated related conservation work, enabling more in-depth investigations into the mechanisms underlying endangerment in many PSESP species^7,8,9,10.

S. przewalskii is endemic to the Qinghai-Tibetan Plateau (QTP) and is primarily found in Qilian County, Qinghai Province¹¹. It belongs to the genus Swertia (subtribe Swertiinae, Gentianaceae), is a perennial herb, and is one of the original source plants of the Tibetan medicine “Zangyinchen”¹¹. It contains a rich array of chemical constituents (such as loganic acid, sweroside, and gentiopicroside) and exhibits significant anti-inflammatory effects¹². The species was included in the initial list of plant species with extremely small populations in Qinghai Province because of its narrow distribution range and the extremely limited number of wild populations¹³. In recent years, research on the taxonomy, phylogeny, and population genetics of S. przewalskii has been conducted. However, studies focusing on its conservation genetics remain scarce^14,15,16. Therefore, it is crucial to investigate the mechanisms underlying the endangerment of S. przewalskii.

Herein, we report the first chromosome-level genome assembly of S. przewalskii, achieved through a combination of three sequencing strategies: next-generation sequencing, third-generation SMRT (Single Molecule Real-Time) sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing. This high-quality genome provides a valuable resource for analyzing the genetic factors underlying the endangerment of S. przewalskii. It also deepens the understanding of the species' genetic background and offers scientific evidence to support the conservation of this endangered species.

Methods

Plant materials

Specimens of the study species were collected and their morphological traits were examined in detail. The stem was erect and yellow-green, with black-brown withered leaf petioles remaining at the base. Each plant bore 1–2 pairs of basal leaves, with blades ranging from oval to ovate-elliptic or spoon-shaped. The inflorescence formed a narrow umbel comprising 3–9 flowers. Detailed observations of the flowers revealed that the calyx extended to approximately two-thirds the length of the corolla, which was yellow-green with a blue center that gradually turned brown with age. The anthers were blue and either oval or narrowly rectangular, while the ovary surface often displayed transverse wrinkles¹⁷ (Fig. 1A).

Fresh root, stem, leaf, and flower tissues were harvested from a mature individual of S. przewalskii (37° 36′ 0.00″N, 100° 42′ 0.00″E, at an elevation of 3486 m). To preserve the collected samples and prevent RNA degradation, the plant tissues were treated in the field by rinsing twice with ultrapure water and once with 75% ethanol, transferred into Eppendorf (EP) tubes, and immediately frozen in liquid nitrogen. Upon return to the laboratory, the samples were stored at −80 °C until further use.

Library preparation and sequencing

For next-generation sequencing, genomic DNA was extracted from S. przewalskii using the CTAB (hexadecyltrimethylammonium bromide) protocol, which is widely used for plant DNA isolation because it efficiently removes polysaccharides and polyphenols¹⁸. Genomic DNA (gDNA) quality and quantity were carefully assessed prior to library construction. The integrity and degree of fragmentation of the DNA were evaluated using agarose gel electrophoresis, while DNA purity and concentration were measured using a NanoDrop™ One UV-Vis spectrophotometer¹⁹ (Thermo Fisher Scientific, Waltham, MA, USA) and a Qubit 3.0 fluorometer²⁰ (Life Technologies, Carlsbad, CA, USA), respectively. High-quality gDNA fragments of approximately 150 bp were then selected to construct sequencing libraries, ensuring a suitable fragment size for downstream analyses. The DNA libraries were initially quantified using the Qubit 2.0 fluorometer²¹, and after appropriate dilution, the fragment size distribution was evaluated with an Agilent 2100 Bioanalyzer²². Finally, the effective library concentration was determined by quantitative PCR (Q-PCR)²³ to ensure suitability for sequencing. After passing these quality control steps, next-generation sequencing was performed on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA).

For third-generation SMRT (Single Molecule Real-Time) sequencing, genomic DNA was first extracted using the same procedure as for next-generation sequencing. The high-quality genomic DNA was then fragmented to appropriate sizes, followed by damage repair and adapter ligation to construct the sequencing library. After selecting the desired fragments and quantifying the library, a PCR-free SMRTbell library was prepared. Finally, the library templates were combined with the sequencing enzyme complexes and loaded onto the PacBio Revio platform for high-throughput sequencing.

For Hi-C sequencing, genomic DNA was extracted from fresh leaves and fixed with formaldehyde, followed by digestion with the restriction endonuclease DpnII. The quality of the DNA was assessed according to standard next-generation sequencing protocols, including evaluation of DNA fragment size, degree of degradation, and measurement of purity and concentration. Hi-C libraries were then constructed through a series of steps, including cell crosslinking, endonuclease digestion, end repair, cyclization, DNA purification, and capture. The resulting libraries were subjected to high-throughput sequencing on the Illumina platform (Fig. 1B).
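As a minimal illustration of the digestion step, the sketch below locates DpnII recognition sites (GATC, cut immediately 5′ of the G) in a sequence and reports the resulting fragment lengths. This is not part of the authors' pipeline; the function names and the toy sequence are hypothetical, and Hi-C tools perform their own site indexing.

```python
def find_dpnii_sites(sequence: str) -> list[int]:
    """Return 0-based start positions of every GATC site in `sequence`.

    GATC is palindromic, so scanning one strand is sufficient.
    """
    seq = sequence.upper()
    sites = []
    pos = seq.find("GATC")
    while pos != -1:
        sites.append(pos)
        pos = seq.find("GATC", pos + 1)
    return sites


def fragment_lengths(sequence: str) -> list[int]:
    """Lengths of fragments after a complete in silico DpnII digest.

    DpnII cuts just before the G of GATC, so each site start is a boundary.
    """
    cuts = [s for s in find_dpnii_sites(sequence) if s > 0]
    boundaries = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


# Toy example (hypothetical sequence, not from the S. przewalskii genome).
toy = "AAGATCTTTGATCGG"
print(find_dpnii_sites(toy), fragment_lengths(toy))
```

The density of such sites along a contig sets the finest resolution at which Hi-C contacts can be assigned.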

Data filtering and genomic assembly

Next-generation sequencing data were first subjected to quality control to remove low-quality reads and adapter sequences using fastp v0.21.0²⁴ with default parameters, yielding high-quality clean reads for downstream analyses. For third-generation SMRT sequencing, raw polymerase reads were processed to retain only the subreads corresponding to the insert sequences. Adapter sequences and low-quality reads were subsequently removed using SMRTlink v12.0 (https://www.pacb.com/support/software-downloads), producing high-quality reads suitable for accurate genome assembly. For Hi-C sequencing, raw data were subjected to rigorous quality control, during which low-quality sequences were removed to obtain clean reads.

To obtain an initial insight into the genomic characteristics, genome size and heterozygosity were estimated using a K-mer-based strategy. Specifically, genome size was assessed with GCE v1.0.0²⁵. In parallel, Jellyfish v2.2.10²⁶ was used to calculate the K-mer frequency-depth distribution, which provided an additional basis for genome size estimation.

Following quality control and filtering of the raw sequencing data, next-generation short reads and HiFi reads were employed for genome assembly of S. przewalskii. Short reads provided high base accuracy, facilitating the correction of sequencing errors, while HiFi reads spanned repetitive and complex regions, enabling continuous contig construction. hifiasm v0.16.1 (https://github.com/chhylp123/hifiasm) assembled the genome using an overlap-layout-consensus (OLC) algorithm. HiFi reads were first aligned with an all-versus-all method to detect overlaps, and three rounds of self-correction were performed to reduce sequencing errors. This process was repeated iteratively, allowing overlaps to be refined and a string graph to be constructed for contig-level genome assembly. The resulting contigs were then scaffolded into a chromosome-level assembly using Hi-C interaction data.

Valid Hi-C interaction pairs were used to cluster contigs into chromosome groups and to determine their linear order and orientation using ALLHiC v0.9.8²⁷. Interaction data were converted into binary format using 3D-DNA v180419²⁸ and Juicer v1.6²⁹, and contig order and orientation were manually refined with Juicebox v1.11.08³⁰ based on the Hi-C contact maps. Redundant sequences were identified and removed with Purge_dups v1.2.5 (https://github.com/dfguan/purge_dups), further optimizing the genome assembly. To generate the chromosome-level assembly, gaps were filled with placeholder sequences of 100 Ns. Finally, HiCExplorer v3.6³¹ was used to visualize chromosomal interactions as Hi-C heatmaps.

To comprehensively evaluate the quality of the genome assembly, we assessed its continuity, consistency, and completeness. The contig N50 was used to quantify assembly continuity. Clean reads from next-generation sequencing were aligned to the reference genome using BWA-MEM v0.7.17³², providing the mapping rate and coverage as indicators of assembly consistency. Finally, genome completeness was evaluated with BUSCO v5.3.0³³ (parameter setting: busco -m prot -c 16).
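For readers unfamiliar with the continuity metric, the contig N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch follows; the function name and the toy lengths are hypothetical and unrelated to the actual assembly.

```python
def contig_n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the total length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: total length is 20, and 6 + 5 = 11 crosses the halfway mark.
print(contig_n50([2, 3, 4, 5, 6]))
```

Because N50 weights long contigs heavily, it should be read together with completeness measures such as BUSCO rather than on its own.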

A comprehensive sequencing effort was undertaken using three sequencing strategies, yielding a total of 136.54 GB of Illumina short-read data, 81.71 GB of PacBio HiFi long-read data, and 268 GB of Hi-C data. Following quality control, 909,483,996 clean reads were retained from the Illumina dataset out of 910,278,082 raw reads, with Q20 and Q30 scores reaching 98.38% and 95.42%, respectively, reflecting high sequencing quality. The PacBio HiFi dataset comprised 5,057,384 reads with an average length of 16,157 bp. For the Hi-C data, 1,791,494,328 reads were retained after filtering, and both Q20 and Q30 scores exceeded 90% (Table 1). Collectively, these datasets provide a robust foundation for high-quality genome assembly. The 19-mer frequency analysis yielded a genome size estimate of ~2027.49 Mb for S. przewalskii, with the distribution peak suggesting low heterozygosity (1.14%, ~66.71 X) and a largely homozygous genome (Fig. 2A). The genome assembly exhibited high continuity, with a contig N50 of 2,569,777 bp. Mapping Illumina clean reads to the reference genome gave a mapping rate of 98.87% and a coverage of 99.29%, further supporting assembly accuracy. Assessment of gene content using BUSCO classified 96.9% of orthologs as complete (Table 2), indicating that the assembled genome is both consistent and complete. The sequences were successfully anchored to twelve chromosomes, and Hi-C interaction heatmaps confirmed the integrity and high quality of the chromosome-level assembly (Fig. 2B,C).
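The idea behind K-mer-based genome size estimation can be sketched as dividing the total number of trusted K-mers by the depth of the main peak in the K-mer frequency distribution. The code below is a deliberately simplified illustration, not the model implemented in GCE; the function name, the `min_depth` error cutoff, and the toy histogram are assumptions made for this example.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Estimate genome size from a k-mer depth histogram.

    `histogram` maps a k-mer depth to the number of distinct k-mers observed
    at that depth. Depths below `min_depth` are treated as sequencing errors
    and ignored (a hypothetical cutoff chosen for illustration).
    Returns (peak_depth, estimated_genome_size).
    """
    trusted = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not trusted:
        raise ValueError("no k-mers at or above min_depth")
    # Depth at which the most distinct k-mers occur (the main peak).
    peak_depth = max(trusted, key=lambda depth: trusted[depth])
    # Total k-mer occurrences = sum over depths of depth * distinct count.
    total_kmers = sum(depth * count for depth, count in trusted.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram: error k-mers at depths 1-2, main peak at depth 10.
toy_hist = {1: 1000, 2: 200, 9: 10, 10: 50, 11: 10}
print(estimate_genome_size(toy_hist))
```

Dedicated tools additionally model heterozygous and repeat peaks, which this sketch ignores.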

Table 1 Sequencing platform yield and statistics.

Table 2 Statistical analysis of BUSCO in S. przewalskii genome.

Genome annotations

The genome annotation of S. przewalskii encompassed the identification of repetitive sequences, protein-coding genes and their models, and non-coding RNAs (ncRNAs). Repetitive sequences were annotated using a combination of ab initio and Repbase-based approaches to maximize sensitivity and accuracy. For ab initio prediction, RepeatModeler v open-1.0.11³⁴ and RepeatMasker v open-4.0.9³⁵ were employed with default parameters to construct a de novo repeat library and identify repeat elements. In parallel, Repbase-based prediction was conducted using RepeatModeler v2.0.4³⁴ with the parameters -database mydb -threads 16, referencing the Repbase database to identify both DNA- and protein-level repeat sequences. To eliminate redundancy, LTR retriever v2.9.0³⁶ was used to obtain non-redundant long terminal repeat sequences from the combined results of the ab initio and Repbase-based searches. These two de novo sequence sets were subsequently integrated to form a comprehensive de novo repeat library. Repeated sequences were then aligned and predicted by integrating the de novo and Repbase databases, generating a consolidated annotation of repetitive elements. Transposable element sequences were further predicted using RepeatProteinMask v4.1.5 (https://github.com/Dfam-consortium/RepeatMasker), and all repeat annotation results were merged with redundancies removed to produce the final high-confidence set of transposable elements. Additionally, tandem repeats were identified using TRF v4.09³⁷ and MISA v2.1³⁸, completing the characterization of repetitive sequences within the genome.

This combined strategy allowed comprehensive identification of the various repeat types. Long terminal repeats (LTRs) were the most abundant, accounting for 74.73% of the genome, followed by DNA elements (6.93%), LINEs (4.77%), Unknown (2.13%), and SINEs (0.26%) (Fig. 3A,B). Tandem repeats comprised a total of 4.963% of the genome, including minisatellites (repeat units of 10–99 bp, 2.886%), satellites (repeat units ≥ 100 bp, 1.823%), and microsatellites (repeat units of 1–9 bp, 0.823%) (Table S1). These results provide a detailed overview of the repetitive landscape of the S. przewalskii genome, highlighting the predominance of LTRs and the contribution of tandem repeats to genome structure.
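The size classes used for tandem repeats above (1–9 bp, 10–99 bp, and ≥ 100 bp repeat units) can be expressed as a small classifier. This is only an illustrative sketch; the function name and the example unit lengths are hypothetical and do not correspond to TRF or MISA output fields.

```python
from collections import Counter


def classify_tandem_repeat(unit_length: int) -> str:
    """Assign a tandem repeat to a class based on its repeat-unit length (bp)."""
    if unit_length < 1:
        raise ValueError("repeat-unit length must be a positive integer")
    if unit_length <= 9:
        return "microsatellite"   # 1-9 bp units
    if unit_length <= 99:
        return "minisatellite"    # 10-99 bp units
    return "satellite"            # units of 100 bp or longer


# Hypothetical repeat-unit lengths, tallied by class.
unit_lengths = [2, 3, 9, 10, 45, 99, 100, 350]
print(Counter(classify_tandem_repeat(n) for n in unit_lengths))
```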

To facilitate comprehensive genome annotation of S. przewalskii, it was essential to obtain detailed information on gene distribution and structure. We employed an integrative strategy combining transcriptome-based prediction, homology-based prediction, and de novo prediction to maximize accuracy and completeness. For transcriptome-based prediction, both Illumina and PacBio datasets were used. Raw Illumina reads were quality-filtered using fastp v0.23.2²⁴, Hi-C reads with fastp v0.21.0²⁴ (https://github.com/wdecoster/nanofilt), and PacBio reads with isoseq 3 v3.9.0 (https://github.com/ylipacbio/IsoSeq3). Reads were aligned to the genome with Hisat2 v2.2.1³⁹ for Illumina data, HiCUP v0.8.0⁴⁰ for Hi-C data, and Pbmm2 v1.10.0 (https://github.com/PacificBiosciences/pbmm2) for PacBio data. Coding sequences were predicted using TransDecoder v5.7.0 (https://github.com/TransDecoder/TransDecoder), homologous protein sequences were aligned to the genome with tblastn v2.13.0⁴¹, and transcript structures were further refined using Exonerate v2.4.0⁴². De novo gene prediction was conducted with Augustus v3.5.0⁴³, trained on a species-specific parameter model. The gene sets derived from the three strategies were then integrated using Maker v3.01.03⁴⁴, generating a high-quality, non-redundant annotation. Through this integrative approach, we predicted a total of 35,701 protein-coding genes. The average coding sequence (CDS) length per gene was 1,089.01 bp, with each gene containing an average of 4.97 exons (Table 3). In total, 177,330 exons were identified with an average length of 306.64 bp, alongside 141,629 introns (average length: 1267.94 bp, total length: 179,577,541 bp) (Table 4). This comprehensive annotation provides a robust foundation for downstream functional and comparative genomic analyses.

We further annotated multiple classes of non-coding RNAs, including 289 miRNAs (0.0016%), 1,039 tRNAs (0.0036%), 3,681 rRNAs (0.2243%), and 9,244 snRNAs (0.0473%), to achieve a more comprehensive genome annotation (Table 5). Additionally, the genome characteristics of S. przewalskii were visualized using the R package “circlize” with a 50 kb sliding window, providing an intuitive overview of gene density, repeat content, and other genomic elements across chromosomes. Assessment of annotation completeness showed that over 95% of BUSCOs were complete, reflecting a high-quality and complete genome annotation.
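The gene-structure figures reported in this section are internally consistent, which can be checked with simple arithmetic. The sketch below recomputes the derived values from the reported counts; it assumes one transcript model per gene, so that each gene has one fewer intron than exons. That assumption is ours and is not stated in the text.

```python
# Counts reported in the text.
GENES = 35_701
EXONS = 177_330
INTRONS = 141_629
TOTAL_INTRON_BP = 179_577_541

# Mean number of exons per gene.
exons_per_gene = EXONS / GENES

# With one transcript per gene (assumption), introns = exons - genes.
expected_introns = EXONS - GENES

# Mean intron length in bp.
mean_intron_length = TOTAL_INTRON_BP / INTRONS

print(f"exons per gene:     {exons_per_gene:.2f}")
print(f"expected introns:   {expected_introns:,}")
print(f"mean intron length: {mean_intron_length:.2f} bp")
```

All three derived values agree with the figures given above (4.97 exons per gene, 141,629 introns, and a mean intron length of 1267.94 bp).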

Table 3 Statistical analysis of eight types of transposable elements (TEs) in the S. przewalskii genome.

Table 4 Statistical analysis of gene prediction in S. przewalskii genome.

Table 5 Statistical analysis of non-coding RNA in S. przewalskii genome.

To comprehensively characterize the functional landscape of protein-coding genes in S. przewalskii, we employed a dual annotation strategy, integrating both sequence similarity and motif/domain-based approaches. First, predicted protein sequences were compared against multiple databases, including the Universal Protein Resource (UniProt)⁴⁵, the non-redundant protein database (NR)⁴⁶, Clusters of Orthologous Groups of proteins (COG)⁴⁷, and eukaryotic orthologous groups (KOG), using DIAMOND v2.1.8, allowing rapid and sensitive identification of homologous proteins. Second, motif and domain-based annotations were conducted with InterProScan v5.55-88.0⁴⁸, querying a suite of databases including CDD, Gene3D, Hamap, Phobius, Pirsf, Prosite, Sfld, Superfamily, Tigrfam, Tmhmm, and others. Conserved domains and sequence motifs were further identified using hmmscan v3.3.2⁴⁹. Functional insights were then expanded through metabolic pathway assignment based on KOfam⁵⁰ profiles within the Kyoto Encyclopedia of Genes and Genomes (KEGG)⁵¹ database. In parallel, non-coding RNA annotations were conducted to capture regulatory and structural elements. Transfer RNAs (tRNAs) were identified with tRNAscan-SE v2.0.12⁵², ribosomal RNAs (rRNAs) with RNAmmer v1.2⁵³, and other ncRNA sequences with INFERNAL v1.1.4⁵⁴ against the RNA family (Rfam, version 14)⁵⁵ database. Collectively, a total of 32,775 genes (91.80%) were functionally annotated across multiple databases, including NR (31,587; 88.48%), GO (23,883; 66.90%), KOG (663; 1.86%), Pfam (22,995; 64.41%), and KEGG (11,964; 33.51%) (Fig. 4, Table 6), providing a robust and multidimensional view of gene functions, conserved domains, and metabolic pathways in the S. przewalskii genome.
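Similarity-based annotation typically assigns each gene the description of its best database hit. As a hedged sketch of that step, the code below keeps the highest-bit-score hit per query from BLAST-style 12-column tabular output (the format DIAMOND writes by default). Whether the authors used this format or this selection rule is an assumption, and the gene and protein identifiers are invented for illustration.

```python
def best_hits(tabular_text: str) -> dict[str, tuple[str, float]]:
    """Return {query_id: (subject_id, bit_score)} keeping the top hit per query.

    Expects BLAST tabular columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best: dict[str, tuple[str, float]] = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject, bit_score = fields[0], fields[1], float(fields[11])
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return best


# Hypothetical hits for two genes; gene1 has two candidate subjects.
sample = (
    "gene1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0\n"
    "gene1\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250.0\n"
    "gene2\tP3\t70.0\t80\t24\t0\t1\t80\t5\t84\t1e-20\t80.5\n"
)
print(best_hits(sample))
```

Genes with no entry in the resulting mapping would count as unannotated for that database.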

Table 6 Analysis of gene function annotation in S. przewalskii genome.

Data Records

All sequencing data have been deposited in the National Genomics Data Center (NGDC), Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation^56,57 (Project accession number: PRJCA032197). The Illumina short reads, PacBio HiFi long reads, and Hi-C reads have been deposited in the Genome Sequence Archive (GSA) in NGDC under the accession numbers CRR1370657⁵⁸, CRR1370658⁵⁹, and CRR1370659⁶⁰, and are publicly accessible at https://download.cncb.ac.cn/gsa2/CRA020335. The S. przewalskii chromosome-level genome assembly and gene annotation files have been deposited in the Genome Warehouse (GWH) in NGDC under the accession number GWHFIGX00000000.1⁶¹ and are publicly accessible at https://ngdc.cncb.ac.cn/gwh. The raw genomic Illumina, PacBio, and Hi-C sequencing data were also deposited in the European Nucleotide Archive (ENA) at EMBL-EBI⁶² under the accession numbers ERR15696091⁶³, ERR15696321⁶⁴, and ERR15729024⁶⁵ (Study accession number: PRJEB100542; Sample accession number: ERS26986347). The S. przewalskii chromosome-level genome assembly and annotation files were deposited in the ENA under the accession number GCA_977012215⁶⁶.

Technical Validation

In the present study, we used a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and a Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA, USA) to measure DNA purity and concentration, respectively. After library preparation, sequencing was performed using three sequencing strategies. We filtered the raw data from these datasets and retained high-quality clean data. We then used 19-mer frequency analysis to estimate the genome size of S. przewalskii. Additionally, we used hifiasm for assembly and Purge_dups to remove redundant sequences, producing a draft genome. We further anchored the genome sequences to twelve chromosomes, covering 95.02% of the genome, and the Hi-C interaction heatmap confirmed the accuracy and continuity of the chromosome-level scaffolding. This comprehensive genome assembly provides a valuable resource, enhancing our understanding of S. przewalskii and establishing a solid foundation for future functional and evolutionary studies.

Data overview

Sequencing with the three strategies yielded a total of 136.54 GB of Illumina data, 81.71 GB of PacBio HiFi data, and 268 GB of Hi-C data. From the Illumina dataset, we obtained 909,483,996 clean reads out of 910,278,082 raw reads. The Q20 and Q30 scores were 98.38% and 95.42%, respectively, indicating high sequencing quality. Additionally, 5,057,384 reads were generated for the PacBio HiFi data, with an average read length of 16,157 bp. For the Hi-C data, 1,791,494,328 reads were retained after filtering, and both Q20 and Q30 scores exceeded 90%, demonstrating high sequencing quality for the Hi-C strategy.
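Q20 and Q30 denote the percentage of bases whose Phred quality score is at least 20 or 30, corresponding to base-call error probabilities of at most 1% and 0.1%. The sketch below computes these fractions from FASTQ quality strings, assuming the common Phred+33 encoding; the function name and the toy quality strings are hypothetical, and in practice fastp reports these values directly.

```python
def q_fraction(quality_strings: list[str], threshold: int, offset: int = 33) -> float:
    """Fraction of bases with Phred quality >= `threshold`.

    Each character encodes one base quality as chr(score + offset);
    offset 33 (Phred+33) is assumed here.
    """
    total = 0
    passed = 0
    for quality in quality_strings:
        for char in quality:
            total += 1
            if ord(char) - offset >= threshold:
                passed += 1
    if total == 0:
        raise ValueError("no bases to evaluate")
    return passed / total


# Toy reads: 'I' = Q40, '5' = Q20, '+' = Q10, '?' = Q30, '#' = Q2.
toy_qualities = ["II5+", "?#"]
print(f"Q20: {q_fraction(toy_qualities, 20):.2%}")
print(f"Q30: {q_fraction(toy_qualities, 30):.2%}")
```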

We further evaluated the quality of the S. przewalskii genome assembly in terms of contiguity, completeness, and consistency. The assembly exhibited a contig N50 of 2,569,777 bp, with 96.9% of BUSCOs classified as complete, a mapping rate of 98.87%, and a coverage rate of 99.29%, indicating a well-assembled genome. The completeness of the genome annotation was also assessed and exceeded 90%, reflecting a high-quality annotation. The S. przewalskii chromosome-level genome comprised 35,701 protein-coding genes, 1,039 tRNAs, 3,681 rRNAs, and 289 miRNAs. A total of 32,775 genes (91.80%) were functionally annotated across multiple databases, including NR (31,587; 88.48%), GO (23,883; 66.90%), KOG (663; 1.86%), Pfam (22,995; 64.41%), and KEGG (11,964; 33.51%), providing a comprehensive view of gene content, structure, and functional potential in S. przewalskii.
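The annotation rates quoted above follow directly from the gene counts, taking the 35,701 predicted protein-coding genes as the denominator. A short sketch reproducing the percentages (the dictionary layout and names are ours, for illustration only):

```python
TOTAL_GENES = 35_701

# Number of genes annotated in each database, as reported in the text.
annotated = {
    "NR": 31_587,
    "GO": 23_883,
    "KOG": 663,
    "Pfam": 22_995,
    "KEGG": 11_964,
    "Any database": 32_775,
}


def percent(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, rounded to two decimals."""
    return round(100 * count / total, 2)


rates = {database: percent(count, TOTAL_GENES) for database, count in annotated.items()}
for database, rate in rates.items():
    print(f"{database:<13}{rate:6.2f}%")
```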

Data availability

The sequencing raw data generated in this study are available at ENA https://www.ebi.ac.uk/ena/browser/view/PRJEB100542 and CNCB-GSA https://ngdc.cncb.ac.cn/gsa/browse/CRA020335. The genome assembly files generated in this study are available at ENA http://www.ebi.ac.uk/ena/browser/view/GCA_977012215 and Genome Warehouse https://ngdc.cncb.ac.cn/gwh/Assembly/88049/show.

Code availability

All raw sequencing data were processed and quality-checked to ensure high fidelity for downstream analyses. Illumina raw data and PacBio raw data were filtered by FastQC v0.20.1 and Pbmm2 v1.10.0, respectively. All software and scripts used in this study are publicly accessible and freely available, with parameters set according to the respective manuals (including the filters --low_complexity_filter and --adapter_sequence, and seqkit rmdup). ALLHiC version 0.9.8, used for genome assembly, is freely available on GitHub: https://github.com/tangerzhang/ALLHiC. Hi-C-based scaffolding and contact map generation were performed using 3D-DNA version 180922 (https://github.com/aidenlab/3d-dna) and Juicer version 1.6 (https://github.com/aidenlab/juicer), respectively. De novo repeat sequence prediction was performed using RepeatModeler version 2.0.4 (https://www.repeatmasker.org/RepeatModeler/, parameters setting: -database mydb -threads 16). LTR sequence prediction was carried out using LTR_FINDER (https://github.com/oushujun/LTR_FINDER_parallel, parameters setting: -threads 16 -harvest_out -size 1000000 -time 300) and LTRharvest version 1.62 (http://genometools.org/pub/binary_distributions, parameters setting: -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes). Transposable element prediction was performed using RepeatProteinMask version 4.1.5 (https://www.repeatmasker.org/RepeatProteinMask.html, parameters setting: -noLowSimple -pvalue 0.0001). Tandem repeat prediction was carried out using TRF version 4.09 (https://tandem.bu.edu/trf/trf.html, parameters setting: 2 7 7 80 10 50 2000 -d -h) and MISA version 2.1 (https://webblast.ipk-gatersleben.de/misa/).

Transcriptome-based gene prediction was performed using Tama version 1.0 (https://github.com/GenomeRIK/tama, parameters setting: -f filelist.txt -p merge), stringtie version 2.2.1 (https://ccb.jhu.edu/software/stringtie/, parameters setting: -p 16 -R -L), and TransDecoder version 5.7.0 (https://github.com/TransDecoder/TransDecoder, parameters setting: TransDecoder.Predict -retain_long_orfs_length 150, TransDecoder.LongOrfs -m 50). Homology-based gene prediction was carried out using Exonerate version 2.4.0 (https://github.com/nathanweeks/exonerate, parameters setting: -model protein2genome -showtargetgff 1) and tblastn version 2.13.0 (default parameters). De novo gene prediction was performed using Augustus version 3.5.0 (https://github.com/Gaius-Augustus/Augustus, parameters setting: -uniqueGeneId=true -noInFrameStop=true -gff3=on -strand=both) and glimmerhmm version 3.0.4 (http://ccb.jhu.edu/software/glimmerhmm/, parameters setting: -f -g). Genome completeness was evaluated using BUSCO version 5.4.7 (https://busco.ezlab.org/, parameters setting: busco -m prot -c 16). Gene function annotation based on sequence similarity searching was carried out using diamond version 2.1.8 (https://github.com/bbuchfink/diamond, default parameters). Gene function annotation based on motif similarity searching was performed using InterProScan version 5.55–88.0 (https://github.com/ebi-pf-team/interproscan, default parameters) and hmmscan version 3.3.2 (http://hmmer.org/download.html, parameters setting: -cpu 16 -E 1e-5).

Non-coding RNAs (ncRNAs) were predicted using tRNAscan-SE version 2.0.12 (https://github.com/UCSC-LoweLab/tRNAscan-SE, parameters setting: -E -j tRNA.gff -o tRNA.result -f tRNA.struct -thread 16), RNAmmer version 1.2 (https://services.healthtech.dtu.dk/services/RNAmmer-1.2/, parameters setting: -S euk -m tsu,lsu,ssu), and INFERNAL version 1.1.4 (http://eddylab.org/infernal/, parameters setting: -Z (genome_length2CMmumber/10) -cut_ga -rfam -nohmmonly -fmt 2).

References

Huang, J. H. et al. Identifying hotspots of endemic woody seed plant diversity in China. Divers. Distrib. 18, 673–688 (2011).
Article Google Scholar
Xu, Y. & Zang, R. G. Conservation of rare and endangered plant species in China. IScience 26, 106008 (2023).
Article ADS PubMed PubMed Central Google Scholar
Sun, W. B. Conservation of plant species with extremely small populations in Yunnan - practice and exploration. Yunnan Science and Technology Press, Kunming, China (2013).
Sun, W. B. Words from the Guest Editor-in-Chief. Plant Divers 38, 207–208 (2016).
Article PubMed PubMed Central Google Scholar
Sun, W. B., Ma, Y. P. & Blackmore, S. How a new conservation action concept has accelerated plant conservation in China. Trends Plant Sci 24, 4–6 (2019a).
Article CAS PubMed Google Scholar
Sun, W. B., Yang, J., Dao, Z. L. Study and conservation of plant species with extremely small populations (PSESP) in Yunnan province, China. Science Press, Beijing, China (2019b).
Yang, T. et al. The chromosome-level genome assembly and genes involved in biosynthesis of nervonic acid of Malania oleifera. Sci. Data. 10, 298 (2023).
Article CAS PubMed PubMed Central Google Scholar
Shi, C. Y. et al. A high-quality chromosome-level genome assembly of the endangered tree Kmeria septentrionalis. Sci. Data. 11, 775 (2024).
Article CAS PubMed PubMed Central Google Scholar
Song, Y. et al. Chromosome level genome assembly of endangered medicinal plant Anisodus tanguticus. Sci. Data. 11, 161 (2024).
Article CAS PubMed PubMed Central Google Scholar
Tan, W., Zhou, P. Y. & Huang, X. Haplotype-resolved genome of Prunus zhengheensis provides insight into its evolution and low temperature adaptation in apricot. Hort. Res. 11, 2662–6810 (2024).
Google Scholar
Liu, J. Z., Chen, Z. D. & Lu, A. M. A preliminary analysis of the phylogeny of the Swertiinae (Gentianaceae) based on ITS data. Isr. J. Plant Sci. 49, 301–308 (2001).
Article Google Scholar
Liu, H. G. et al. Genetic relationship and resources evaluation of medicinal plants in Swertia L. Chin. Herb. Med. 42, 1646–1650 (2011).
ADS CAS Google Scholar
Ji, L. J. et al. Study on chemical compositions of four Gentiana plantsfrom qinghai-xizang plateau. Acta Biol. Plateau Sin. 11, 113–118 (1992).
Google Scholar
Zhang, D. F. et al. Patterns of genetic variation in Swertia przewalskii, an endangered endemic species of the Qinghai-Tibet Plateau. Biochem. Genet. 45, 33–50 (2007).
Article CAS PubMed Google Scholar
Duan, Y. W., Liu, J. Q. Pollinator shift and reproductive performance of the Qinghai-Tibetan Plateau endemic and endangered Swertia przewalskii (Gentianaceae). In: Hawksworth, D.L., Bull, A.T. (eds) Plant Conservation and Biodiversity. Topics in Biodivers. Conserv. 6. S. przewalskiiinger, Dordrecht (2006).
Cao, Q. et al. Plastome structure, phylogenomics and evolution of plastid genes in Swertia (Gentianaceae) in the Qinghai-Tibetan Plateau. BMC Plant Biol. 22, 195 (2022).
Article CAS PubMed PubMed Central Google Scholar
Ho, T. N., & Pringle, J. S. Gentianaceae. In Z. Y. Wu & P. H. Raven (Eds.), Flora of China (Vol. 16, pp. 1–139). Science Press & Missouri Botanical Garden Press. (1995).
Doyle, J. J. & Doyle, J. L. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem. Bull. 19, 11–15 (1987).
Thermo Fisher Scientific. (n.d.). NanoDrop One UV-Vis Spectrophotometer. Thermo Fisher Scientific. Retrieved from https://www.thermofisher.com.
Thermo Fisher Scientific. (n.d.). Qubit 3.0 Fluorometer. Thermo Fisher Scientific. Retrieved from https://www.thermofisher.com.
Thermo Fisher Scientific. (n.d.). Qubit 2.0 Fluorometer. Thermo Fisher Scientific. Retrieved from https://www.thermofisher.com.
Agilent Technologies. (n.d.). Agilent 2100 Bioanalyzer. Agilent Technologies. Retrieved from https://www.agilent.com.
Heid, C. A. et al. Real-time quantitative PCR. Genome Res. 6, 986–994, https://doi.org/10.1101/gr.6.10.986 (1996).
Chen, S. F. et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 34, 884–890 (2018).
Liu, B. H. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. Quant. Biol. 35, 62–67 (2013).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27, 764–770 (2011).
Zhang, X. et al. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants. 5, 833–845 (2019).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 356, 92–95 (2017).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016b).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016a).
Wolff, J. et al. Galaxy HiCExplorer 3: A web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization. Nucleic Acids Res. 48, W177–W184 (2020).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv (2013).
Seppey, M., Manni, M. & Zdobnov, E. M. BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods Mol. Biol. 1962, 227–245 (2019).
Smit, A., Hubley, R. & Green, P. RepeatModeler open-1.0 (2008–2015) Seattle, USA: Institute for Systems Biology. Available from: https://www.repeatmasker.org/, Last Accessed May, 1, 2018 (2015).
Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0 (2013).
Ou, S. & Jiang, N. LTR_retriever: A highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422 (2018).
Benson, G. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Beier, S. et al. MISA-web: A web server for microsatellite prediction. Bioinformatics. 33, 2583–2585 (2017).
Kim, D. et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Wingett, S. et al. HiCUP: Pipeline for mapping and processing Hi-C data. F1000Research 4 (2015).
Camacho, C. et al. BLAST+: Architecture and applications. BMC Bioinform. 10, 1–9 (2009).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinform. 6, 1–11 (2005).
Stanke, M. et al. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 24, 637–644 (2008).
Holt, C. & Yandell, M. MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 12, 1–14 (2011).
Apweiler, R. et al. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 32, D115–D119 (2004).
Deng, Y. et al. Integrated nr database in protein annotation system and its localization. Comput. Eng. 32, 71–72 (2006).
Tatusov, R. L. et al. The COG database: An updated version includes eukaryotes. BMC Bioinform. 4, 1–14 (2003).
Blum, M. et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 49, D344–D354 (2021).
Mistry, J. et al. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 41, e121–e121 (2013).
Aramaki, T. et al. KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 36, 2251–2252 (2020).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Chan, P. P. et al. tRNAscan-SE 2.0: Improved detection and functional classification of transfer RNA genes. BioRxiv. 614032 (2021).
Lagesen, K. et al. RNAmmer: Consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 29, 2933–2935 (2013).
Kalvari, I. et al. Rfam 14: Expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 49, D192–D200 (2021).
Chen, M. et al. Genome Warehouse: A Public Repository Housing Genome-scale Data. Genom. Proteom. Bioinfo. 19, 584–589 (2021).
Database Resources of the National Genomics Data Center. China National Center for Bioinformation in 2022. Nucleic Acids Res. 50, D27–D38 (2022).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA020335/CRR1370657 (2024).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA020335/CRR1370658 (2024).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA020335/CRR1370659 (2024).
NGDC Genome WareHouse https://ngdc.cncb.ac.cn/gwh/Assembly/88049/show (2024).
European Nucleotide Archive (ENA) https://www.ebi.ac.uk/ena.
European Nucleotide Archive (ENA); ERR15696091 https://identifiers.org/insdc.sra:ERR15696091 (2025).
European Nucleotide Archive (ENA); ERR15696321 https://identifiers.org/insdc.sra:ERR15696321 (2025).
European Nucleotide Archive (ENA); ERR15729024 https://identifiers.org/insdc.sra:ERR15729024 (2025).
European Nucleotide Archive (ENA); GCA_977012215 https://identifiers.org/insdc.gca:GCA_977012215 (2025).

Acknowledgements

This research was partially supported by the Qinghai Provincial Science and Technology Major Project (2023-SF-A5), the Xining Science and Technology Major Project (2023-Z-13), the CAS-Qinghai on Sanjiangyuan National Park (LHZX-2022-03), the CAS “Light of West China” Program (2024), and the leading talents of the Kunlun talents in Qinghai Province (2024).

Author information

Authors and Affiliations

Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology, Chinese Academy of Sciences & Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, Xining, 810008, China
Han Shuang, Yu Niu, Hao Xu, Jingya Yu, Yun Han, Faqi Zhang & Shilong Chen
Qinghai Provincial Key Laboratory of Crop Molecular Breeding, Xining, 810008, China
Faqi Zhang
Xining Botanical Garden, Xining, 810001, China
Qingqing Zheng, Rongrong Guan & Faqi Zhang

Contributions

Conceptualization: Faqi Zhang, Shilong Chen; Experimental design and methodology: Shuang Han, Yu Niu; Sample collection and laboratory work: Yu Niu, Yun Han; Data analysis: Shuang Han (genome assembly, gene annotation, SNP calling, population genetics analyses), Jingya Yu (functional annotation, enrichment analysis), Hao Xu (statistical analysis, data visualization); Visualization and figure preparation: Shuang Han, Hao Xu; Writing-original draft: Shuang Han; Writing-review & editing: All authors; Supervision and project administration: Faqi Zhang, Shilong Chen; Funding acquisition: Faqi Zhang, Shilong Chen. All authors have read and approved the final manuscript. The authors declare that they have no competing financial interests or personal relationships that could have influenced the work reported.

Corresponding authors

Correspondence to Faqi Zhang or Shilong Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Shuang, H., Niu, Y., Xu, H. et al. A High-quality chromosome-level genome assembly of Swertia przewalskii Pissjauk. Sci Data 12, 2026 (2025). https://doi.org/10.1038/s41597-025-06292-y

Received: 25 December 2024
Accepted: 06 November 2025
Published: 12 December 2025
Version of record: 29 December 2025
DOI: https://doi.org/10.1038/s41597-025-06292-y