Background & Summary

Approximately 1% of angiosperms are parasitic plants, either fully or partially dependent on their host plants for carbon, nutrients, and water through specialized structures known as haustoria1. These plants exhibit diverse morphological and physiological adaptations and have evolved multiple times independently in at least 16 angiosperm families1,2. Parasitic strategies range from hemiparasitism, in which plants retain photosynthetic ability, to holoparasitism, in which they rely entirely on their hosts2. This diversity suggests complex and lineage-specific genomic adaptations3,4.

Recent advances in genome sequencing have enabled studies on several parasitic plant species. Published genomes include hemiparasites such as Santalum album5, Malania oleifera6, Striga asiatica7, Phtheirospermum japonicum8, and Pedicularis cranolopha9, as well as holoparasites like Cuscuta campestris10, C. australis11, Orobanche cumana, Phelipanche aegyptiaca12, and Sapria himalayana13,14. Comparative genomic analyses have revealed shared features such as extensive gene loss, plastome and mitogenome reduction, and horizontal gene transfer from host plants10,11,13. However, due to the limited number of high-quality genomes, broader evolutionary patterns across parasitic lineages remain poorly understood.

Mistletoes represent a major clade of hemiparasitic plants in Santalales, where aerial parasitism evolved independently multiple times from root-parasitic ancestors15,16. Scurrula parasitica (Loranthaceae) is a widespread mistletoe species in southwest China. Seeds dispersed by birds and mammals germinate on host branches and form haustoria to establish parasitic relationships17. In contrast to root-parasitic Santalales such as Santalum album and Malania oleifera, S. parasitica parasitizes woody branches and has a broad host range, including Osmanthus, Citrus, and Camellia, making it a valuable model for investigating the genomic basis of aerial hemiparasitism.

Here, we present a high-quality, chromosome-level genome assembly of S. parasitica using PacBio high-fidelity (HiFi) and Hi-C sequencing technologies. We comprehensively annotated the genome, including repetitive sequences, protein-coding genes (PCGs), transcription factor (TF) genes, and non-coding RNAs (ncRNAs). This genome provides a foundational resource for future comparative genomic studies to explore the genetic mechanisms underlying the evolution of hemiparasitism in Santalales and to understand both convergent and divergent genomic adaptations across parasitic angiosperms.

Methods

Plant sample preparation

We collected plant material from an S. parasitica individual parasitizing Osmanthus fragrans grown at the Wangjiang Campus of Sichuan University in Chengdu, Sichuan Province, southwest China (Fig. 1a). Freshly harvested leaves were washed in distilled water, immediately frozen in liquid nitrogen, and stored at −80 °C until DNA extraction. Additionally, fresh flower, stem, leaf, and fruit tissues were collected from the same individual and frozen in liquid nitrogen for RNA sequencing (RNA-seq).

Fig. 1

Genome survey of S. parasitica. (a) Photograph of an S. parasitica individual parasitizing Osmanthus fragrans. (b) K-mer frequency distribution derived from Illumina short-read sequencing data.

Genome survey

For the genome survey, we performed whole-genome sequencing on an Illumina NovaSeq 6000 platform (Illumina Inc., San Diego, CA, USA). Following total DNA extraction via the CTAB method18, paired-end ReSeq libraries were prepared, with an average insert size of approximately 400 bp. A total of 39.00 Gb of Illumina reads was generated (Table 1). A 19-mer frequency distribution of these reads was generated using Jellyfish v2.2.919. This analysis identified 31,259,563,762 k-mers, with a primary peak observed at a k-mer depth of 57 (Fig. 1b). The haploid genome size of S. parasitica was estimated to be 548.41 Mb, with a high repeat content of 64.11% and a notably low heterozygosity rate of 0.07%.
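The size estimate is consistent with dividing the total k-mer count by the depth of the primary peak. The sketch below reproduces that arithmetic from the reported figures; the function name is illustrative, and the published estimate may have included corrections (for example, for error k-mers) that are not described in the text.

```python
def estimate_genome_size(total_kmers: int, peak_depth: int) -> float:
    """Return the estimated haploid genome size in bp (total k-mers / peak depth)."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Values reported in the genome survey above.
TOTAL_KMERS = 31_259_563_762
PEAK_DEPTH = 57

size_bp = estimate_genome_size(TOTAL_KMERS, PEAK_DEPTH)
size_mb = round(size_bp / 1e6, 2)
print(f"Estimated haploid genome size: {size_mb} Mb")
```

Under this simple model the result rounds to the reported 548.41 Mb.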

Table 1 Summary of genome and transcriptome sequencing data for S. parasitica.

Genome assembly

For PacBio HiFi sequencing, we isolated high-molecular-weight DNA using a modified CTAB method and prepared SMRTbell libraries following the PacBio 15-kb protocol. Subsequently, circular consensus sequencing (CCS) was performed on a PacBio Sequel II sequencing platform (Pacific Biosciences, Menlo Park, CA, USA), resulting in 20.44 Gb of HiFi reads (37.3× coverage) with an N50 length of 13,486 bp (Table 1). The HiFi long reads were processed using the CCS workflow in SMRT Link v8.0 (PacBio) and assembled into contigs using hifiasm v0.1420 with default parameters, resulting in 878 contigs totaling 552.42 Mb. To improve assembly accuracy, Illumina sequencing reads were aligned to the contigs using BWA v0.7.1721, and contigs with anomalous GC content (>50%) or insufficient coverage (<5×) were identified and removed based on the alignments. This filtering step yielded 731 contigs spanning 547.29 Mb, which were used for downstream Hi-C scaffolding analyses.
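The contig filter can be sketched as follows. Only the two thresholds (GC content above 50% and coverage below 5×) come from the text; the `Contig` record, the function names, and the assumption that a mean Illumina depth per contig has already been computed from the BWA alignments are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    mean_depth: float  # assumed: mean Illumina read depth derived from the alignments


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def keep_contig(contig: Contig, max_gc: float = 0.50, min_depth: float = 5.0) -> bool:
    """Keep a contig unless its GC content exceeds 50% or its depth is below 5x."""
    return gc_fraction(contig.sequence) <= max_gc and contig.mean_depth >= min_depth


# Toy contigs, not real data.
contigs = [
    Contig("ctg1", "ATATGCATTA", 36.0),  # GC 20%, well covered: kept
    Contig("ctg2", "GCGCGCGCAT", 40.0),  # GC 80%: removed
    Contig("ctg3", "ATATGCATTA", 2.5),   # depth below 5x: removed
]
retained = [c.name for c in contigs if keep_contig(c)]
print(retained)
```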

Hi-C sequencing was then performed to generate a chromosome-level genome assembly. Hi-C libraries were prepared from more than 2 g of young leaves from the same S. parasitica plant, following standard protocols for chromatin extraction, digestion, ligation, and DNA purification. Paired-end sequencing was performed on a NovaSeq 6000 sequencing platform, resulting in 63.76 Gb of Hi-C reads (116.3× coverage) (Table 1). The Hi-C reads were mapped to the contig-level assembly using Juicebox v1.8.822. Uniquely mapped reads were subsequently used to anchor contigs into pseudochromosomes with the 3D-DNA pipeline23. Hi-C contact maps were visualized and manually curated in Juicebox to correct misassemblies (Fig. 2), yielding a final chromosome-level assembly of 547.41 Mb (Fig. 3; Table 2). In total, 97.54% (533.93 Mb) of the genome was anchored to nine pseudochromosomes, ranging from 55.41 Mb to 64.98 Mb (Fig. 3; Table 3). The contig and scaffold N50 values were 8.32 Mb and 59.61 Mb, respectively (Table 2).
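For readers unfamiliar with the N50 statistic reported above, a minimal sketch of its standard definition follows: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name and the toy lengths are illustrative, not taken from the assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy lengths (arbitrary units): total 30, half 15; 10 + 8 = 18 reaches it.
toy_lengths = [2, 10, 4, 8, 6]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```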

Fig. 2

Hi-C contact heatmaps for nine pseudochromosomes of the S. parasitica genome.

Fig. 3

Circos plot illustrating the genomic architecture of S. parasitica. Tracks display (a) GC content, (b) repeat density, (c) LTR/Gypsy density, (d) LTR/Copia density, (e) protein-coding gene density, and (f) syntenic regions within the genome.

Table 2 Global statistics of S. parasitica genome assembly and annotation.
Table 3 Summary of nine pseudochromosomes of the final S. parasitica assembly.

Genome annotation

Genome annotation began with the identification of repetitive sequences. A de novo repeat library was constructed using RepeatModeler v2.0.124 based on the genome assembly. This library was subsequently merged with the green plant repeat dataset from the Repbase database v22.1125. We then used RepeatMasker v4.1.026 to identify repetitive elements based on sequence homology. In total, we identified 353.26 Mb of repetitive sequences, accounting for 64.53% of the S. parasitica genome (Table 4). Among the identified repetitive elements, long terminal repeat retrotransposons (LTR-RTs) were the most abundant, comprising 251.45 Mb (45.93%) of the genome. Within the LTR-RT class, Gypsy and Copia elements were the most prominent, totaling 250.14 Mb. Additionally, 74.40 Mb (13.59%) of repeats could not be assigned to a known class, suggesting the presence of species-specific or novel repeat types. DNA transposons accounted for 4.02% (22.01 Mb) of the genome, while long interspersed nuclear elements (LINEs) comprised 4.88 Mb, short interspersed nuclear elements (SINEs) 1.16 Mb, and other repeat types 0.42 Mb (Table 4).

Table 4 Classifications of repetitive elements in the S. parasitica genome.

After masking all repetitive elements in the S. parasitica genome, we employed three complementary approaches to predict the PCGs. For transcriptome-based annotation, total RNA was extracted from all fresh tissues using the TRIzol reagent. The NEBNext Ultra II RNA Library Prep Kit was used to generate RNA-seq libraries after removing residual DNA. These libraries were then sequenced on an Illumina NovaSeq 6000 platform, generating 32.28 Gb of RNA-seq data (Table 1). The RNA-seq reads were de novo assembled into transcripts using Trinity v2.8.427. The resulting transcripts were aligned to the repeat-masked genome using PASA v2.3.328, and the alignment results were used to generate gene structure predictions. For homologous protein annotation, we aligned protein sequences from several representative species (Santalum album5, Malania oleifera6, Arabidopsis thaliana29, Populus trichocarpa30, Vitis vinifera31, and Theobroma cacao32) to the S. parasitica genome using TBLASTN v2.2.3133. Gene models were then predicted based on these alignments using GeneWise v2.4.134. For ab initio gene prediction, high-confidence transcripts from PASA exceeding 1,500 bp in length and containing more than two exons were selected solely to train species-specific parameters for AUGUSTUS v3.2.335. The trained AUGUSTUS model was then applied to predict genes across the entire genome without any length or exon-number filters, so that all potential PCGs were considered. Finally, we used EVidenceModeler v1.1.136 to integrate gene models from the three approaches into a consensus, non-redundant gene set.
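The selection of the AUGUSTUS training set can be sketched as a simple filter. The two cutoffs (longer than 1,500 bp, more than two exons) come from the text; the `Transcript` record and its field names are hypothetical, and the sketch assumes transcript length and exon count have already been extracted from the PASA output.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    transcript_id: str  # hypothetical identifier
    length_bp: int
    exon_count: int


def is_training_candidate(
    transcript: Transcript, length_cutoff_bp: int = 1500, exon_cutoff: int = 2
) -> bool:
    """Both comparisons are strict: 'exceeding 1,500 bp' and 'more than two exons'."""
    return (
        transcript.length_bp > length_cutoff_bp
        and transcript.exon_count > exon_cutoff
    )


# Toy transcripts, not real data.
transcripts = [
    Transcript("t1", 2400, 5),  # passes both cutoffs
    Transcript("t2", 1500, 4),  # exactly 1,500 bp: excluded
    Transcript("t3", 3100, 2),  # exactly two exons: excluded
]
training_ids = [t.transcript_id for t in transcripts if is_training_candidate(t)]
print(training_ids)
```

The filter applies only to the training set; as stated above, genome-wide prediction with the trained model used no such cutoffs.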

We predicted 21,837 PCGs in the S. parasitica genome, with 21,450 (98.23%) located on the nine pseudochromosomes at a density of 40.2 genes per Mb (Table 3). The average lengths of the predicted transcripts, coding sequences (CDSs), exons, and introns were 4,561 bp, 1,283 bp, 211 bp, and 644 bp, respectively (Table 2). To investigate potential whole-genome duplication (WGD) events, we conducted an all-against-all BLASTP search using protein sequences from S. parasitica and Santalum album. Syntenic blocks were identified using MCScanX v1.137 with default parameters, and non-synonymous (Ka) and synonymous (Ks) substitution rates were calculated for syntenic gene pairs using the ‘add_ka_and_ks_to_collinearity.pl’ script from MCScanX. The Ks distribution of orthologs between S. parasitica and Santalum album showed a major peak at approximately 0.73, lower than the Ks peak of paralogs within S. parasitica (0.78). Because the paralog peak is therefore older than the species divergence, the most recent duplication predates the split, indicating that no independent WGD event occurred in the S. parasitica genome after its divergence from Santalum album (Fig. 4). The inter-chromosomal synteny shown in Fig. 3 therefore likely reflects ancient WGD events and more recent segmental duplications. Functional annotation, performed by aligning the protein sequences against the Swiss-Prot, TrEMBL38, InterPro39, and KEGG40 databases, successfully annotated 96.05% of the genes (Table 5). We identified 1,271 TF genes (5.82% of PCGs) using PlantTFDB v5.041 (Fig. 5). Additionally, 8,407 ncRNAs with a total size of 0.98 Mb were identified using tRNAscan-SE v2.042 for tRNAs, Infernal v1.1.243 for miRNAs and snRNAs, and BLASTN v2.2.31 against the Rfam database44 for rRNAs, comprising 3,821 tRNAs, 3,076 snRNAs, 1,447 rRNAs, and 63 miRNAs (Table 6).
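The comparison of Ks peaks can be illustrated with a small sketch. The text does not state how the peaks were located, so the histogram binning below (bin width and upper Ks limit included) is an assumption rather than the method used here, and the Ks values are toy numbers chosen only to mimic the reported ordering.

```python
from collections import Counter


def ks_peak(ks_values: list[float], bin_width: float = 0.05, max_ks: float = 3.0) -> float:
    """Return the midpoint of the most populated Ks histogram bin."""
    counts = Counter(
        int(ks / bin_width) for ks in ks_values if 0 < ks <= max_ks
    )
    if not counts:
        raise ValueError("no Ks values in the accepted range")
    best_bin = max(counts, key=counts.get)
    return (best_bin + 0.5) * bin_width


# Toy Ks values, not real data.
ortholog_ks = [0.71, 0.72, 0.73, 0.74, 0.31, 1.22]
paralog_ks = [0.76, 0.77, 0.78, 0.79, 1.51]

ortholog_peak = ks_peak(ortholog_ks)
paralog_peak = ks_peak(paralog_ks)

# A paralog peak at higher Ks than the ortholog peak means the duplication
# is older than the species split, i.e. it is shared rather than lineage-specific.
duplication_predates_split = paralog_peak > ortholog_peak
print(duplication_predates_split)
```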

Fig. 4

Synonymous substitution rate distributions of paralogous and orthologous gene pairs in S. parasitica and Santalum album.

Table 5 Functional annotation of protein-coding genes in the S. parasitica genome.
Fig. 5

Distribution of the top 30 transcription factor families identified in the S. parasitica genome.

Table 6 Summary of non-coding RNAs in the S. parasitica genome.

Data Records

The genome assembly of S. parasitica and the associated raw sequence data are publicly available through the NCBI database under BioProject PRJNA1266877 (ref. 45). The genome assembly is available in GenBank under accession number JBPAPV000000000 (ref. 46). Raw sequencing data, including Illumina, PacBio HiFi, and Hi-C reads, are available in the Sequence Read Archive (SRA) under accession numbers SRR33755972 (ref. 47), SRR33776327 (ref. 48), and SRR33676195 (ref. 49), respectively. RNA-seq reads were deposited under SRA accession numbers SRR33745685–SRR33745688 (refs. 50–53). The genome assembly and annotations of repetitive elements, gene structures, and functional features have also been archived in Figshare54.

Technical Validation

We used several approaches and metrics to assess the completeness and accuracy of the final S. parasitica genome assembly. First, using BUSCO v3.0.2 software55, we evaluated the presence of 1,614 conserved genes from the Embryophyta odb10 dataset. In total, 93.87% of BUSCO genes were identified as complete at the assembly level and 89.59% at the protein level (Table 2). Second, we assessed assembly continuity by calculating the long terminal repeat (LTR) Assembly Index (LAI) using LTR_retriever v2.856. The assembly achieved an overall LAI score of 15.93, indicating reference-level genome quality (Table 2). Third, Illumina short-read data were mapped to the final assembly using BWA. The mapping rate was 99.69%; the reads covered 99.96% of the genome, and 99.58% of the assembly was covered at a depth of at least 20×. Finally, we searched for Arabidopsis-type telomeric repeats (TTTAGGG)n57 at the ends of each pseudochromosome, requiring at least five tandem copies. Seven of the nine pseudochromosomes contained telomeric sequences at both ends (Table 7). Taken together, this evidence indicates that the S. parasitica genome assembly is of high quality and utility.
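The telomere check can be sketched as a search for at least five tandem copies of TTTAGGG, or its reverse complement CCCTAAA, near each end of a pseudochromosome. The motif and the five-copy minimum come from the text; the 10-kb search window, the function names, and the toy sequence are assumptions for illustration.

```python
import re

MOTIF = "TTTAGGG"    # Arabidopsis-type telomeric repeat
REVCOMP = "CCCTAAA"  # reverse complement, as seen at the opposite chromosome end


def has_telomere(end_sequence: str, min_copies: int = 5) -> bool:
    """True if the sequence contains at least min_copies tandem copies of the motif."""
    pattern = re.compile(
        rf"(?:{MOTIF}){{{min_copies},}}|(?:{REVCOMP}){{{min_copies},}}"
    )
    return pattern.search(end_sequence.upper()) is not None


def telomere_status(
    chromosome: str, window: int = 10_000, min_copies: int = 5
) -> tuple[bool, bool]:
    """Return (start_has_telomere, end_has_telomere) for one pseudochromosome."""
    # Cap the window (an assumed parameter) so the two ends never overlap.
    window = max(1, min(window, len(chromosome) // 2))
    return (
        has_telomere(chromosome[:window], min_copies),
        has_telomere(chromosome[-window:], min_copies),
    )


# Toy sequence: six copies at the start, only three at the end.
toy_chromosome = REVCOMP * 6 + "ACGT" * 50 + MOTIF * 3
toy_status = telomere_status(toy_chromosome)
print(toy_status)
```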

Table 7 Summary of Arabidopsis-type telomeres at both ends of S. parasitica pseudochromosomes.