Background & Summary

Parasitic Hymenoptera represent the most species-rich group of parasitic organisms. They parasitize a broad range of insect groups and other arthropods, including caterpillars, leafhoppers, aphids, flies, spiders, and ticks—many of which are agricultural pests1. As a result, parasitic Hymenoptera play a vital role in sustainable and environmentally friendly agricultural pest management. However, our understanding of the genetic mechanisms behind parasitism in this group, particularly in terms of host specificity and immune evasion, remains limited due to the scarcity of high-quality genomic resources. Access to genomic data holds significant potential for identifying key parasitic effectors and advancing biopesticide development for large-scale use in agriculture.

The genus Leptopilina includes a few parasitoid wasps that have been studied in detail, particularly in the context of host-parasite interactions2,3,4,5. They have recently garnered attention for their ability to parasitize the invasive pest Drosophila suzukii, a growing threat, and thus their potential to help control it6,7. These studies have highlighted intriguing evolutionary dynamics and practical implications for the use of parasitoid wasps, while also emphasizing the need for further genomic research. Despite this progress, much of the genetic basis underlying parasitism in this genus remains unexplored.

In this study, we captured a Leptopilina species in Taizhou, Zhejiang Province, China, using fruit traps in a Myrica rubra plantation. We assembled a chromosome-level genome of this species, referred to as Leptopilina myrica, by utilizing a combination of PacBio long-read sequencing, Illumina short-read sequencing, and Hi-C chromosome conformation capture technologies. We then compared the protein-coding genes of L. myrica with those of other Hymenoptera species to gain insights into the evolutionary history of parasitoid wasps. The high-quality genome assembly obtained in this study will provide a valuable resource for future investigations into the genetic mechanisms underlying parasitoid traits.

Methods

Sample collection and preparation

We collected L. myrica samples using fruit traps in Taizhou, Zhejiang Province, China. The samples were maintained in the laboratory using Drosophila melanogaster (w1118) as the regular host, under controlled conditions: 25 °C temperature, ~50% relative humidity, and a 16:8 light-dark cycle.

Genomic DNA Sequencing and de novo Assembly

Genomic DNA was extracted from a pool of approximately 1,000 male specimens for PacBio sequencing, using the DNeasy Blood and Tissue Kit (Qiagen). A 20-Kb genomic library was constructed and sequenced by Berry Genomics Co. Ltd. (Beijing, China) on a PacBio Sequel platform, following standard protocols. This sequencing yielded 112.14 Gb of long-read data, with an N50 read length of ~25,337 bp and an average read length of ~22,506 bp.

For genome error correction, we used the Illumina platform to generate short reads. The Illumina sequencing produced 21.57 Gb of raw data, which was filtered to remove adapters and low-quality reads using fastp v0.20.08, resulting in 20.06 Gb of clean reads (Table 1).

Table 1 Statistical characteristics of the sequencing reads.

To assemble contigs, NextDenovo v2.4.0469 was used with the following parameters: “read_type = clr read_cutoff = 2k genome_size = 900 m seed_depth = 60 nextgraph_options = -a 1 -A”. The contigs were then corrected and polished using Illumina paired-end reads via NextPolish v1.3.14710, with parameters: “task = best rerun = 3 sgs_options = -max_depth 100 -bwa lgs_options = -min_read_len 1k -max_depth 100 lgs_minimap2_options = -x map-pb”. Finally, the polished contigs underwent two rounds of redundancy removal using purge_dups v1.2.34811 with default parameters. The resulting contig-level assembly had a total size of 462.09 Mb, with a contig N50 of 4.07 Mb across 274 contigs (Table 2).
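The N50 statistic reported for reads, contigs, and scaffolds can be illustrated with a minimal sketch. The function name `n50` and the toy lengths are hypothetical and are not taken from the assembly; the sketch only shows the standard definition (the length L such that sequences of length at least L contain at least half of all bases).

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    Sequences are sorted from longest to shortest; N50 is the length of the
    sequence at which the running total first reaches half of the grand total.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy lengths for illustration only (not real contig data).
toy_contigs = [8, 6, 4, 3, 2, 1]
print(n50(toy_contigs))
```

In practice the lengths would be read from the assembly FASTA index or a comparable per-sequence length table.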

Table 2 The chromosomal-level genome assembly statistics.

Chromosome staining

For karyological analysis, we prepared chromosomes from L. myrica following a modified version of the protocol outlined by Imai et al.12. Cerebral ganglia were dissected from early pupae of male wasps in Ringer’s saline solution and rinsed in Ringer’s buffer for 3–5 minutes. Ganglia were then incubated in a 0.005% colchicine-hypotonic solution (diluted in 1% sodium citrate) for 30 minutes. After incubation, the ganglia were transferred onto clean slides, fixed in Fixative Solution A (ethanol, glacial acetic acid, and distilled water in a 3:3:4 ratio by volume), and gently disaggregated using forceps for even chromosome distribution. The samples were then fixed in Fixative Solution B (ethanol and glacial acetic acid in a 1:1 ratio by volume) and mounted in ProLong Gold Antifade Mountant with DAPI (Invitrogen). Fluorescence images were acquired using a Zeiss LSM 800 confocal microscope (Fig. 1a,b).

Fig. 1

Characteristics of the L. myrica genome. (a,b) Chromosome staining showing the karyotypes of L. myrica: (a) n = 10 and (b) 2n = 20; (c) Hi-C chromosomal interaction heatmap (bin size = 1 Mb).

Hi-C sequencing and scaffolding

Hi-C libraries for L. myrica were prepared from a pool of 20 newly emerged males following the protocol outlined by Lieberman-Aiden et al. (2009). The samples were initially fixed with 2% formaldehyde for 10 minutes at room temperature, and glycine was added to a final concentration of 100 mM to stop the cross-linking reaction. Cross-linked DNA was extracted and digested overnight with HindIII (NEB). During the sticky-end repair process, Biotin-14-dCTP17 was incorporated. The interacting DNA fragments were then ligated using T4 DNA ligase to form chimeric junctions. Hi-C libraries were subsequently sequenced by GrandOmics Co. Ltd (Wuhan, China) on an Illumina HiSeq X Ten platform, generating 56.04 Gb of paired-end reads, corresponding to approximately 121× genome coverage (Table 1).
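The coverage figure follows from dividing the sequencing yield by the assembly size. The sketch below reproduces that arithmetic, assuming 1 Gb = 1000 Mb and using the final assembly size of 462.30 Mb as the denominator; the function name `sequencing_depth` is hypothetical.

```python
def sequencing_depth(total_bases_gb, genome_size_mb):
    """Approximate fold coverage: sequenced bases divided by genome size.

    Assumes 1 Gb = 1000 Mb; both inputs are rounded values from the text.
    """
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return (total_bases_gb * 1000) / genome_size_mb


# Hi-C yield (56.04 Gb) over the final assembly size (462.30 Mb).
hic_depth = sequencing_depth(56.04, 462.30)
print(f"Hi-C depth: ~{hic_depth:.0f}x")
```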

The assembled contigs were scaffolded using contact information from the Hi-C sequencing reads. Juicer v1.5.713 was used to process the contact signals, which were then provided to 3d-dna v19071614 for chromosome grouping, using the parameters “--q 1 --editor-repeat-coverage 2”. The final chromosome interaction matrix was visualized as a heatmap using Juicebox15; it shows diagonal blocks of strong linkage, reflecting the density of validly mapped read pairs between genomic bins (Fig. 1c).
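To make the idea of a binned contact matrix concrete, the following is a simplified sketch of how read-pair positions on a single sequence can be counted into fixed-size bins (1 Mb, as in the heatmap). It is not the Juicer or Juicebox implementation; the function name `bin_contacts` and the toy coordinates are hypothetical.

```python
from collections import Counter


def bin_contacts(pairs, bin_size=1_000_000):
    """Count Hi-C contacts per pair of genomic bins on one sequence.

    `pairs` is an iterable of (position_a, position_b) in bp. Each contact is
    stored under (bin_i, bin_j) with bin_i <= bin_j, so the result is the
    upper triangle of a symmetric contact matrix.
    """
    counts = Counter()
    for pos_a, pos_b in pairs:
        bin_i, bin_j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(bin_i, bin_j)] += 1
    return counts


# Hypothetical toy read pairs (bp coordinates), for illustration only.
toy_pairs = [
    (100_000, 900_000),
    (1_500_000, 200_000),
    (2_100_000, 2_900_000),
    (500_000, 600_000),
]
matrix = bin_contacts(toy_pairs)
print(dict(matrix))
```

Contacts falling in the same or neighbouring bins accumulate near the diagonal, which is why correctly ordered chromosomes appear as strong diagonal blocks.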

Finally, we obtained a high-quality chromosome-level genome assembly of L. myrica with a total size of 462.30 Mb, a scaffold N50 of 47.32 Mb, and a GC content of 27.27% (Table 2).

RNA sequencing and analysis

Critical developmental stages were sampled for high-coverage RNA sequencing to capture a comprehensive transcriptome profile (Table 3). A total of 11 developmental stages of L. myrica were included: Egg, L1 (days 1–3 larvae; early larval stage), L2 (days 4–6 larvae; early to middle larval stage), L3 (days 7–9 larvae; middle to late larval stage), P1 (days 1–3 pupae; early pupal stage), P2 (days 4–7 pupae; early to middle pupal stage), P3 (days 8–10 pupae; middle to late pupal stage), adult females (AF), and adult males (AM). The VGs of 3-day-old AF wasps were dissected in Ringer’s saline solution on an ice plate under a stereoscope (Nikon). Total RNA was independently extracted from each sample using the RNeasy Mini Kit (QIAGEN) and stored at −80 °C until further use.

Table 3 Statistics of RNAseq data in this study.

Construction of cDNA libraries and paired-end RNA sequencing (Illumina) were performed by Berry Genomics Co. Ltd. Transcriptome sequencing data statistics are provided in Table 3. Full-length transcripts were generated using the PacBio sequencing system (Pacific Biosciences), yielding 49.25 Gb of transcriptome sequencing data from libraries with insert sizes of 1–10 kb for the mRNA pool across all stages. Raw reads were processed using IsoSeq v3.2.216 and mapped to the reference genome using Minimap2 v2.1717 with the parameters “-ax splice -uf --secondary=no -C5”.

Repeat elements prediction

Repeat elements were annotated using the RepeatMasker pipeline. A species-specific repeat library was first generated using RepeatModeler v2.0.2 (www.repeatmasker.org), and RepeatMasker v4.1.1 was then used to mask repetitive content across the genome using both the species-specific library and Dfam v3.2. A total of 242,390,138 bp of repetitive sequence was identified, accounting for 52.43% of the genome (Fig. 2, Table 2). This proportion is similar to that observed in related species within the same superfamily: Leptopilina boulardi has a repeat content of 54.40%, and Belonocnema treatae has 56.10% (Table 4).

Fig. 2

Genome characteristics of L. myrica (window size 1000 kb). From the outer ring to the inner ring: chromosome length, GC content, gene density, and TE content (DNA, SINE, LINE, LTR, and simple repeat).

Table 4 Statistics on the proportion of repeat sequences in 11 Hymenoptera species.

Genome annotation

Protein-coding genes were predicted based on the repeat-masked genome using multiple approaches: (1) BRAKER v2.1.518,19,20,21,22,23,24 was used to generate two gene sets, one based on transcriptome-based hints and the other on related protein-based hints; (2) Maker v2.31.1025 generated an integrated gene set by calling SNAP v2006-07-2826 and Augustus v3.3.227, incorporating evidence from related proteins and full-length transcripts; (3) StringTie v2.022 combined Illumina transcriptome data to generate a merged transcript set using default parameters; (4) TOFU’s28 Python module “collapse_isoforms_by_sam” processed long-read transcriptome data to produce a full-length transcript set, using parameters “--dun-merge-5-shorter -c 0.9 -i 0.9”. These independent gene sets were compared pairwise at both transcript and exon levels. Genes with consistent support from multiple sets were prioritized, while those supported by only a single set were excluded. Predicted genes were annotated based on BLASTP searches against the NR database and domain searches using InterProScan v5.38-76.029. Gene expression across developmental stages and tissues was quantified as TPM using salmon v0.12.030, with the parameters “quant -l A”. Integrating these gene sets yielded 13,832 predicted protein-coding genes distributed across the genome, with a mean gene length of 1,589.8 bp (Table 2).
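For readers unfamiliar with TPM (transcripts per million), the normalization can be sketched as follows. This is only the textbook formula applied to hypothetical toy counts; it does not reproduce salmon's statistical model for estimating read counts and effective lengths, and the function name `tpm` is an assumption.

```python
def tpm(read_counts, effective_lengths):
    """Convert per-transcript read counts to transcripts per million (TPM).

    Each count is divided by the transcript's effective length to give a rate;
    rates are then scaled so that they sum to one million.
    """
    if len(read_counts) != len(effective_lengths):
        raise ValueError("counts and lengths must have the same size")
    rates = [count / length for count, length in zip(read_counts, effective_lengths)]
    total_rate = sum(rates)
    if total_rate == 0:
        return [0.0] * len(rates)
    return [rate / total_rate * 1_000_000 for rate in rates]


# Hypothetical toy values: three transcripts with counts and lengths in bp.
toy_tpm = tpm([100, 200, 300], [1000, 2000, 1000])
print([round(value) for value in toy_tpm])
```

Because TPM values always sum to one million within a sample, they are comparable across the developmental stages as relative abundances rather than absolute read counts.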

Phylogenetic analysis

OrthoFinder v2.5.131 was used to analyze orthologous and paralogous genes across 11 Hymenopteran genomes with parameter “-M msa”. These genomes are of high completeness and represent major clades within the order Hymenoptera, including Apoidea (Apis mellifera, Bombus consobrinus), Tenthredinoidea (Athalia rosae), Chalcidoidea (Nasonia vitripennis, Pteromalus puparum, Trichomalopsis sarcophagae), Ichneumonoidea (Diachasma alloeum, Microplitis demolitor), and Cynipoidea (Belonocnema treatae, Leptopilina boulardi, and Leptopilina myrica), as detailed in Table 4.

Single-copy orthologous sequences from these species were aligned using MAFFT v7.50532. Subsequently, a species tree was constructed based on orthologs using STAG v1.033. Calibration points for the divergence within Hymenoptera (221–283 million years ago, mya), Apis mellifera + Athalia rosae (224–304 mya), and Belonocnema treatae + Pteromalus puparum (108–242 mya) were obtained from TimeTree (timetree.org).

Gene family contraction and expansion were analyzed using CAFE v5.134, incorporating results from OrthoFinder, the phylogenetic tree, and divergence time estimates. The phylogenetic tree was visualized and enhanced using iTOL (https://itol.embl.de/#) (Fig. 3).

Fig. 3

Phylogenetic tree. The maximum likelihood phylogenetic tree is based on 2749 concatenated single-copy orthologous genes from 11 Hymenopterans. All nodes have bootstrap support of 100/100. Gene counts are shown for the different types of orthologous groups (OGs), and the expansion and contraction of OGs are shown on the nodes and tips. “1:1:1” indicates universal single-copy genes present in all species; “N:N:N” indicates multicopy genes, although absence in a single genome is tolerated; “Leptopilina” indicates genes shared uniquely by the Leptopilina species; “Species-specific” represents species-specific genes in the genome; “Unassigned” indicates genes that cannot be assigned to any gene family (orthogroup); “Others” indicates the remaining genes.

Data Records

PacBio, Illumina and Hi-C sequencing data have been deposited in the NCBI Sequence Read Archive under accession numbers SRR3132050635, SRR3132050536 and SRR3132050437, respectively. Additionally, RNA-Seq data are available in the NCBI database under accession numbers SRR3097868638, SRR3097868739, SRR3097868840, SRR3097868941, SRR3097869042, SRR3097869143, SRR3097869244, SRR3097869345, SRR3097869446 and SRR3097869547. The assembled genome is available in NCBI GenBank under accession number GCA_032872475.148. The genome annotations are openly available from figshare49.

Technical Validation

We evaluated the quality of the L. myrica genome assembly by calculating mapping rates and assessing completeness. Using BWA v0.7.1750, we aligned PacBio and Illumina reads to the final assembly, achieving mapping rates of 99.35% and 99.37%, respectively. Genome completeness was assessed with the BUSCO pipeline v5.2.151, using the insecta_odb10 database as a reference. The BUSCO analysis showed 98.1% completeness for the predicted protein-coding gene sequences, including 97.4% single-copy, 0.7% duplicated, 0.3% fragmented, and 1.6% missing BUSCOs.
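The BUSCO categories are related by simple sums: complete = single-copy + duplicated, and complete + fragmented + missing = 100%. The sketch below checks the reported percentages against these identities; the function name `busco_is_consistent` and the rounding tolerance are assumptions chosen for illustration.

```python
def busco_is_consistent(complete, single, duplicated, fragmented, missing, tol=0.05):
    """Check that BUSCO percentages obey the expected sum relationships.

    `tol` allows for rounding of each value to one decimal place.
    """
    complete_ok = abs((single + duplicated) - complete) <= tol
    total_ok = abs((complete + fragmented + missing) - 100.0) <= tol
    return complete_ok and total_ok


# Percentages as reported in the text for the predicted protein-coding genes.
reported_ok = busco_is_consistent(
    complete=98.1, single=97.4, duplicated=0.7, fragmented=0.3, missing=1.6
)
print(reported_ok)
```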