Background & Summary

The Hymenoptera, one of the four largest orders in the class Insecta, is among the most species-rich groups of insects. With the advancement of sequencing technologies, this order has become a hotspot in insect genomics research1,2. As of April 2024, the number of sequenced Hymenoptera genomes had reached 557 according to NCBI statistics, with 388 species sequenced in the past three years and annotation information submitted for 125 species. Of the sequenced Hymenoptera species, 258 are parasitoids: 36 species of Cynipoidea, 75 of Chalcidoidea, 98 of Ichneumonoidea, 42 of Proctotrupoidea, 6 of Chrysidoidea, and 1 of Orussoidea.

Encarsia sophia (Hymenoptera: Aphelinidae) is a dominant parasitoid of the “super pest” Bemisia tabaci (Hemiptera: Aleyrodidae) and a crucial biological control agent against whitefly populations worldwide, owing to its remarkable capacity to parasitize and kill the host3,4,5. Its reproductive strategy is unusual: it is a typical heteronomous hyperparasitoid, in which males and females develop on different host insects. Females, the primary parasitoids, arise from fertilized eggs and develop directly within the target host, feeding on its larvae or nymphs. Males, in contrast, arise from unfertilized eggs and act as hyperparasitoids: they can develop only in secondary hosts, i.e., hosts already parasitized by a primary parasitoid, and feed on the larvae of that primary parasitoid6,7,8,9. In E. sophia, mated females oviposit directly in B. tabaci nymphs, laying fertilized eggs that develop into female offspring (primary parasitoids). Unmated females can parasitize only secondary hosts: they lay unfertilized eggs in the larvae of conspecific or heterospecific parasitoids that are already developing inside whitefly nymphs, producing male offspring (hyperparasitoids)10,11. So far, no genome of a heteronomous parasitoid has been reported. To gain deeper insight into the characteristics of such parasitoids, we performed whole-genome sequencing and chromosome-level assembly of E. sophia using Illumina, PacBio, and Hi-C technologies.

Methods

Parasitoid wasp collection and sequencing

The Encarsia sophia population was introduced in 2008 from the Vegetable Pest Integrated Management Laboratory at Texas A&M University, USA. It has been reared in the insectarium of the Laboratory of Biological Invasion Research at the Langfang Research and Development Base of the Chinese Academy of Agricultural Sciences (CAAS), using B. tabaci nymphs on cotton plants as hosts (26 ± 1 °C, 65 ± 5% RH, 14 L:10 D photoperiod). The B. tabaci laboratory population originates from the MEAM1 population maintained by the Institute of Vegetables and Flowers, CAAS, and is maintained in a greenhouse at the Institute of Plant Protection, CAAS, with no history of pesticide use. The cotton variety used is CCRI 49. E. sophia is a typical heteronomous hyperparasitoid with a unique reproductive strategy: mated females act as primary parasitoids, parasitizing first- to fourth-instar B. tabaci nymphs (primary hosts), whereas unmated females produce male offspring, which develop as secondary parasitoids on conspecific or heterospecific parasitoid larvae inside B. tabaci nymphs (secondary hosts). Because males develop as secondary parasitoids, we collected newly emerged females for sequencing. To obtain newly emerged parasitoids, we used insect pins to transfer females at the black-pupa stage into 1.5 mL centrifuge tubes. We checked daily for newly emerged adults and collected a total of 4,000 females for DNA extraction with the QIAamp DNA Mini Kit (QIAGEN). After extraction, the purity, concentration, and integrity of the DNA were evaluated with a NanoDrop 2000/8000, a Qubit Fluorometer, and an Agilent 4200 Bioanalyzer, respectively.

Genome size estimation and assembly

High-quality DNA from E. sophia was randomly sheared using a Covaris ultrasonic disruptor. The library was then constructed through end repair, A-tailing, adapter ligation, purification, and PCR amplification, and subjected to paired-end sequencing on the Illumina HiSeq platform. Clean reads were obtained by removing reads containing adapter sequences, reads with more than 10% uncertain bases (N), and reads in which low-quality bases (quality score below 5) made up more than 20% of a single-end read. A k-mer frequency histogram was then generated using Jellyfish 2.2.7 with the parameters “-G 2 -m 17 -C -o kmercount.”12, yielding an estimated genome size of 412.21 Mbp (404.2 Mbp after correction), a heterozygosity rate of 0.52%, and a repeat content of 52.84% (Fig. 1). To obtain a preliminary genome assembly of E. sophia, we assembled 49,702,845,900 bp of second-generation sequencing data using SOAPdenovo and scaffolded the result using a k-mer size of 41 (kmer41). The initial assembly had a contig N50 of 1,272 bp with a total length of 318,591,742 bp, and a scaffold N50 of 2,192 bp with a total length of 328,391,604 bp (Table 1).
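
The read-filtering criteria described above can be sketched in Python. This is only an illustration of the three stated rules, not the pipeline that was actually run: the function names, the Phred+33 quality encoding, the adapter sequence, and the rule that a pair is dropped when either mate fails are all assumptions made for the example.

```python
# Hypothetical adapter sequence, used only for illustration
# (the adapters actually removed are not stated in the text).
ADAPTER = "AGATCGGAAGAGC"


def passes_filter(seq, qual, adapter=ADAPTER, max_n_frac=0.10,
                  min_q=5, max_lowq_frac=0.20, phred_offset=33):
    """Return True if one read survives the three criteria in the text."""
    if not seq:
        return False
    # Rule 1: discard reads that contain adapter sequence.
    if adapter in seq:
        return False
    # Rule 2: discard reads with more than 10% uncertain bases (N).
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # Rule 3: discard reads in which bases with quality below 5
    # make up more than 20% of the read (Phred+33 assumed).
    low_q = sum(1 for c in qual if ord(c) - phred_offset < min_q)
    if low_q / len(seq) > max_lowq_frac:
        return False
    return True


def filter_pairs(pairs):
    """Keep a read pair only when both mates pass (assumed pairing rule)."""
    return [(r1, r2) for r1, r2 in pairs
            if passes_filter(*r1) and passes_filter(*r2)]


# Toy reads: (sequence, quality string)
good = ("ACGTACGTAC", "IIIIIIIIII")      # high quality, no N
many_n = ("ANNTACGTAC", "IIIIIIIIII")    # 20% N
low_qual = ("ACGTACGTAC", "#####IIIII")  # 50% of bases at Q2
kept = filter_pairs([(good, good), (good, many_n), (low_qual, good)])
```

With these toy reads, only the first pair is retained.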

Fig. 1 Encarsia sophia genome feature statistics obtained by k-mer analysis.
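
For orientation, a k-mer survey of this kind typically estimates genome size as the total number of k-mers divided by the depth of the main peak in the histogram. The sketch below illustrates that commonly used calculation on invented toy data; the function name and the error cutoff are hypothetical, and the text does not state that exactly this formula was applied or how the corrected size was derived.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: list of (depth, number_of_distinct_kmers) pairs, as in the
    two-column histogram written by a k-mer counter.
    min_depth: depths below this value are treated as sequencing-error
    k-mers and ignored (an assumed cutoff for this toy example).
    Returns (estimated_size, peak_depth).
    """
    solid = [(d, n) for d, n in histogram if d >= min_depth]
    # Depth with the largest number of distinct k-mers = main peak.
    peak_depth = max(solid, key=lambda dn: dn[1])[0]
    # Total k-mer occurrences among the retained depths.
    total_kmers = sum(d * n for d, n in solid)
    return total_kmers / peak_depth, peak_depth


# Invented toy histogram, not data from this study.
toy_hist = [(1, 900), (2, 300), (5, 10), (9, 40), (10, 100), (11, 40), (20, 15)]
size, peak = estimate_genome_size(toy_hist)
```

For the toy histogram, the main peak lies at depth 10 and the estimate is 215 k-mer positions.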

Table 1 Encarsia sophia genome assembly to scaffold results.
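
The contig and scaffold N50 values reported for the assemblies follow the standard definition: the length L such that sequences of length at least L contain half of the total assembly length. A minimal Python sketch of that definition, with a hypothetical function name and toy lengths rather than the actual contigs, is:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First length at which the cumulative sum reaches half the total.
        if running >= half:
            return length
    raise ValueError("empty length list")


# Toy contig lengths (bp), not data from this study.
toy_contigs = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_contigs)
```

Here the total is 1,500 bp, and the two longest contigs (500 + 400 bp) already exceed half of it, so the N50 is 400 bp.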

PacBio sequencing generated a total of 148 Gb of data, corresponding to a coverage depth of 366.16X (calculated from the survey-estimated genome size of 404.20 Mb). Additionally, a short-insert library was prepared and sequenced on the Illumina platform (Table 2). De novo assembly of the E. sophia genome was performed from these sequencing data with hifiasm13. The hifiasm assembly has a length of 398.19 Mbp and a contig N50 of 1.33 Mbp (only sequences longer than 100 bp were included in the assembly statistics) (Table 3).
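
The stated coverage depth can be reproduced from the two figures given in the text, as a quick arithmetic check (variable names are illustrative):

```python
pacbio_bases = 148e9     # 148 Gb of PacBio data, as reported
genome_size = 404.20e6   # survey-estimated genome size (bp), as reported

# Coverage depth = total sequenced bases / genome size.
depth = pacbio_bases / genome_size
print(f"{depth:.2f}X")   # prints 366.16X
```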

Table 2 Summary of DNA/RNA sequencing data utilized for the genome assembly of Encarsia sophia.
Table 3 Statistics of the Encarsia sophia de novo genome assembly.

To obtain a chromosome-level genome of E. sophia, a Hi-C sequencing library was constructed14 using DNA from 20,000 female adults. The assembled contigs/scaffolds were anchored to near chromosome level with the Hi-C data using ALLHiC15. Juicebox (https://github.com/aidenlab/Juicebox) was then used for manual correction based on chromosomal interaction intensity, yielding the final chromosome-level genome of E. sophia (Table 4). After Hi-C-assisted assembly, the E. sophia genome comprises 5 chromosome-level sequences, with an additional 189 sequences not placed on chromosomes. The total genome length is 398,274,414 bp, of which 378,887,893 bp are assembled onto chromosomes (Fig. 2). The genome mapping rate is 95.1% (Tables 5, 6). Assembly statistics are based on contigs longer than 100 bp.
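
The reported 95.1% appears consistent with the fraction of the total assembly length that is placed on chromosomes. A short check using only the two lengths given above (variable names are illustrative):

```python
total_bp = 398_274_414      # total genome length, as reported
anchored_bp = 378_887_893   # length assembled onto chromosomes, as reported

anchored_pct = anchored_bp / total_bp * 100
unplaced_bp = total_bp - anchored_bp

print(f"{anchored_pct:.1f}% anchored, {unplaced_bp:,} bp unplaced")
# prints 95.1% anchored, 19,386,521 bp unplaced
```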

Table 4 Statistical results of the Encarsia sophia genome assembly, both from the initial de novo assembly and after Hi-C scaffolding.
Fig. 2 Genome-wide Hi-C interaction map of Encarsia sophia (5 chromosomes, 100 kb resolution). The color gradient on the right indicates interaction strength. Intrachromosomal interactions (red squares along the diagonal) are markedly more intense than interchromosomal interactions (light yellow squares).

Table 5 Encarsia sophia per-chromosome cluster number and length statistics of the Hi-C assembly.
Table 6 Encarsia sophia genome mapping rate of the de novo assembly and after Hi-C scaffolding.

Genome quality assessment

We used several methods to assess the completeness, consistency, and accuracy of the genome assembly. First, completeness was assessed using BUSCO with the insecta_odb10 database16, which relies on MetaEuk and HMMER. The assembly contained 97.1% complete BUSCO genes (92.1% complete single-copy and 5.0% complete duplicated). In addition, a core gene set of 248 conserved genes present in six eukaryotic model organisms was used for CEGMA assessment17 with tblastn, genewise, and geneid. The assembly contained 233 of the 248 core eukaryotic genes, a completeness of 93.9%. Second, sequence consistency was assessed by aligning short-insert library reads to the assembly using BWA (http://bio-bwa.sourceforge.net/)18. The analysis showed a HiFi read alignment rate of approximately 97.6% and a genome coverage of around 99.1%, demonstrating strong consistency between the reads and the assembled genome. SNP calling was performed with samtools (http://samtools.sourceforge.net/) on the BWA alignments; after filtering and statistical analysis19, the genome showed a heterozygous SNP rate of 0.317095% and a homozygous SNP rate of 0.000943%, and the low homozygous SNP rate indicates high single-base accuracy. Third, sequence accuracy was assessed using Merqury (https://github.com/marbl/merqury) with Illumina sequencing data. The k-mer-based quality value (QV) of the genome, calculated with the Merqury-mash module20,21, was 33.6653, indicating a base accuracy exceeding 99.9%. In conclusion, the E. sophia genome assembly exhibits good consistency, completeness, and accuracy (Table 7).
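
The link between the QV and the stated base accuracy follows from the Phred scale, on which QV = -10 log10(error probability). A short check of the reported value (the function name is illustrative):

```python
def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


qv = 33.6653                      # QV reported for the assembly
error_rate = qv_to_error_rate(qv)
accuracy_pct = (1 - error_rate) * 100

print(f"error rate = {error_rate:.2e}, accuracy = {accuracy_pct:.3f}%")
```

A QV of 33.6653 corresponds to roughly 4.3 errors per 10,000 bases, i.e. an accuracy of about 99.957%, in line with the statement that accuracy exceeds 99.9%.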

Table 7 Encarsia sophia genome assembly quality assessment results.

Genome annotation

Repeats were annotated with a strategy that combines homology alignment with de novo search to detect repetitive sequences across the entire genome. Tandem repeats were extracted by ab initio prediction with TRF (http://tandem.bu.edu/trf/trf.html)22. For homology-based prediction, we used the standard Repbase database (http://www.girinst.org/repbase)23 with RepeatMasker (http://www.repeatmasker.org/)24 and its internal script, RepeatProteinMask, under default settings. For de novo prediction, we applied LTR_FINDER (http://tlife.fudan.edu.cn/ltr_finder/)25, RepeatScout (http://www.repeatmasker.org/), and RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html)26 to create a de novo repetitive element database. All repetitive sequences longer than 100 bp with an ‘N’ content below 5% were included in the initial transposable element (TE) library. A custom library, created by merging Repbase with our de novo TE library and made non-redundant with uclust, was then supplied to RepeatMasker to identify repetitive sequences at the DNA level. The E. sophia genome contains 214.7 Mb of repetitive sequences, constituting 53.92% of the genome. Long terminal repeats (LTRs) are the most abundant, accounting for 34.59% of the total, followed by unknown elements (12.17%), DNA elements (7.18%), long interspersed nuclear elements (LINEs; 3.96%), and short interspersed nuclear elements (SINEs; 0.02%) (Table 8).
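
The inclusion rule for the initial TE library (length above 100 bp, ‘N’ content below 5%) can be expressed as a simple filter. The sketch below is an illustration under assumed names and toy sequences; it is not the script used in the annotation pipeline.

```python
def keep_repeat(seq, min_len=100, max_n_frac=0.05):
    """Apply the two stated criteria to one candidate repeat sequence."""
    if len(seq) <= min_len:          # must be longer than 100 bp
        return False
    n_frac = seq.upper().count("N") / len(seq)
    return n_frac < max_n_frac       # 'N' content must be below 5%


def build_initial_library(records):
    """records: dict mapping a sequence name to its sequence string."""
    return {name: seq for name, seq in records.items() if keep_repeat(seq)}


# Toy candidates, not sequences from this study.
candidates = {
    "rep_ok": "ACGT" * 30,                 # 120 bp, no N
    "rep_short": "ACGT" * 20,              # 80 bp, too short
    "rep_gappy": "ACGT" * 25 + "N" * 20,   # 120 bp, about 17% N
}
library = build_initial_library(candidates)
```

Only `rep_ok` passes both criteria in this toy set.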

Table 8 Encarsia sophia genome repeat sequence classification result statistics.

Protein-coding genes in the E. sophia genome were annotated by integrating de novo prediction, homology-based prediction, and RNA-Seq-supported modeling27. For de novo gene prediction, our automated pipeline used Augustus (v3.2.3) (http://bioinf.uni-greifswald.de/augustus/)28, Geneid (v1.4), Genscan (v1.0), GlimmerHMM (v3.04) (http://ccb.jhu.edu/software/glimmerhmm/)29, and SNAP (http://homepage.mac.com/iankorf/)30. Homologous protein sequences of Nasonia vitripennis (Nvit), Ceratosolen solmsi (Csol), Copidosoma floridanum (Cflo), Trichogramma brassicae (Tbra), Trichomalopsis sarcophagae (Tsar), and Trichogramma pretiosum (Tpre) were downloaded from NCBI. Protein sequences were aligned to the E. sophia genome using TblastN (v2.2.26; E-value ≤ 1e−5)31, and GeneWise (v2.4.1)32 was used to align matching proteins to the homologous genomic sequences for accurate spliced alignment and prediction of the gene structure within each protein region. We constructed seven RNA-seq libraries covering the developmental stages of female E. sophia: 600 eggs (B. tabaci nymphs parasitized for <24 hours, sampled by dissecting the host); 200 first-instar larvae (B. tabaci nymphs parasitized for 48–60 hours, sampled by dissecting the host); 200 second-instar larvae (B. tabaci nymphs parasitized for 72–84 hours, sampled by dissecting the host); 80 third-instar larvae (B. tabaci nymphs parasitized for 120–132 hours, sampled by dissecting the host); 40 prepupae (B. tabaci nymphs parasitized for 168–178 hours, sampled after removing the host shell); 30 pupae (B. tabaci nymphs parasitized for 216–228 hours, sampled after removing the host shell); and 50 adults (eclosed within <24 hours). Total RNA extracted from these samples was used for library preparation, and sequencing was performed on the Illumina NovaSeq 6000 platform33.
Sequencing generated a total of 60.51 Gb of raw data; after filtering, 59.88 Gb of clean data were used for genome annotation. The transcriptome was assembled using Trinity (v2.1.1)34. To refine the annotation, RNA-Seq reads were aligned to the genome with HISAT (v2.0.4)35 under default settings to identify exonic regions and splice sites. The alignments were then used as input for StringTie (v1.3.3)36 with default parameters for genome-guided transcriptome assembly. A comprehensive, non-redundant reference gene set was created by merging the predictions from all three approaches using EvidenceModeler (EVM, v1.1.1)37, with transposable elements masked. A total of 14,914 protein-coding genes were predicted in the E. sophia genome. The average gene length was 11,273.01 bp, with an average coding sequence length of 1,451.53 bp. The average exon and intron lengths were 275.58 and 2,301.66 bp, respectively, and each gene contained 5.27 exons on average (Table 9, Fig. 3).
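
Summary statistics of this kind (exons per gene, mean exon length, mean intron length) are derived from the exon coordinates of the final gene models. The following sketch shows one way to compute them; the data layout, function names, and toy coordinates are assumptions for illustration and do not reproduce the annotation files or the software actually used.

```python
def mean(values):
    """Arithmetic mean; returns 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def gene_structure_stats(genes):
    """genes: dict mapping a gene ID to a list of (start, end) exon
    coordinates (1-based, inclusive), one transcript per gene (assumed)."""
    exon_counts, exon_lengths, intron_lengths = [], [], []
    for exons in genes.values():
        exons = sorted(exons)
        exon_counts.append(len(exons))
        exon_lengths.extend(end - start + 1 for start, end in exons)
        # An intron spans the gap between two consecutive exons.
        intron_lengths.extend(nxt[0] - cur[1] - 1
                              for cur, nxt in zip(exons, exons[1:]))
    return {
        "exons_per_gene": mean(exon_counts),
        "mean_exon_len": mean(exon_lengths),
        "mean_intron_len": mean(intron_lengths),
    }


# Toy gene models, not taken from the E. sophia annotation.
toy_genes = {
    "g1": [(1, 100), (201, 300), (501, 600)],
    "g2": [(1000, 1199)],
}
stats = gene_structure_stats(toy_genes)
```

For the toy models this gives 2.0 exons per gene, a mean exon length of 125 bp, and a mean intron length of 150 bp.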

Table 9 Statistics of gene structure prediction for the Encarsia sophia genome.
Fig. 3 Circular plot illustrating the chromosome-level genome assembly results for Encarsia sophia. A: chromosome information, B: gene density, C: GC content, D: ncRNA density, E: repeat density.

Gene functions were assigned by aligning the E. sophia protein sequences to the Swiss-Prot database using Blastp with an E-value threshold of ≤1e−5 and taking the best matches. InterProScan (v5.31)38 was used to annotate protein motifs and domains by searching public databases including ProDom, PRINTS, Pfam, PANTHER, PROSITE and SMART. Gene Ontology (GO) IDs were then assigned from the corresponding InterPro entries. Genes were also annotated against the Swiss-Prot39 and NR databases using the closest BLAST hits (E-value < 1e−5) and DIAMOND (v0.8.22)/BLAST hits (E-value < 1e−5). Furthermore, the gene set was mapped to KEGG pathways40 to determine the best match for each gene. Ultimately, 14,245 genes (95.5% of the total) in the E. sophia genome were annotated in at least one database41 (Table 10).
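
The figure for genes annotated in at least one database is the size of the union of the per-database hit sets. A minimal sketch, with hypothetical gene IDs and hit sets and only the two reported totals taken from the text:

```python
def annotated_in_any(hits_by_db):
    """Return the set of genes with a hit in at least one database."""
    annotated = set()
    for genes in hits_by_db.values():
        annotated |= set(genes)
    return annotated


# Toy hit sets, not the actual annotation results.
toy_hits = {
    "Swiss-Prot": {"g1", "g2"},
    "InterPro": {"g2", "g3"},
    "KEGG": set(),
}
toy_annotated = annotated_in_any(toy_hits)   # {"g1", "g2", "g3"}

# Check of the reported proportion: 14,245 of 14,914 genes.
reported_pct = 14_245 / 14_914 * 100
print(f"{reported_pct:.1f}%")                # prints 95.5%
```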

Table 10 Functional annotation of Encarsia sophia proteins.

To annotate non-coding RNAs (ncRNAs) in the E. sophia genome, tRNA genes were predicted using tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/)42. Because rRNA sequences are highly conserved, we used rRNA sequences from closely related species as references and identified rRNAs with BLAST. Other ncRNAs, such as snRNAs and miRNAs, were detected by searching the Rfam database43 with Infernal (http://infernal.janelia.org/)44 under default parameters. In total, 1,457 non-coding RNAs were predicted, comprising 513 microRNAs (miRNAs), 514 transfer RNAs (tRNAs), 328 ribosomal RNAs (rRNAs), and 102 small nuclear RNAs (snRNAs) (Table 11).

Table 11 Encarsia sophia genome non-coding RNA statistical results.

Data Records

The sequencing data for the E. sophia genome, including Illumina, PacBio, and Hi-C datasets, have been deposited in the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) under accession numbers SRR29702816, SRR29702817, and SRR29702818 (refs. 45,46,47) within BioProject PRJNA1131600, and in the Genome Sequence Archive (GSA) of the National Genomics Data Center (NGDC) under accession number CRA017569 (ref. 48). The transcriptome data used for annotation, covering the developmental stages of female E. sophia, have been deposited in the SRA of NCBI and the GSA of NGDC: egg (SRR29702811, ref. 49; CRR1218365), 1st instar larva (SRR29702815, ref. 50; CRR1218361), 2nd instar larva (SRR29702814, ref. 51; CRR1218362), 3rd instar larva (SRR29702813, ref. 52; CRR1218363), prepupa (SRR29702810, ref. 53; CRR1218366), pupa (SRR29702809, ref. 54; CRR1218367), and adult (SRR29702812, ref. 55; CRR1218364). This Whole Genome Shotgun project has been deposited at GenBank under the accession JBFBOU000000000 (ref. 56). The genome annotation files of Encarsia sophia are available on figshare at https://doi.org/10.6084/m9.figshare.26426752 (ref. 41).

Technical Validation

The quality, concentration, and integrity of the DNA samples were evaluated using a NanoDrop 2000/8000, a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, USA), and an Agilent 4200 Bioanalyzer (Agilent Technologies, CA, USA), respectively. RNA integrity was assessed using the RNA Nano 6000 kit on the Bioanalyzer 2100 system (Agilent Technologies, CA, USA). Only high-quality DNA and RNA were selected for library preparation and sequencing. Genome assembly completeness was verified using BUSCO (Benchmarking Universal Single-Copy Orthologs: http://busco.ezlab.org/) and CEGMA (Core Eukaryotic Genes Mapping Approach: http://korflab.ucdavis.edu/datasets/cegma/). Short reads from the fragment library were mapped to the assembled genome using BWA (http://bio-bwa.sourceforge.net/), and alignment rates, genome coverage, and depth distribution were analyzed to evaluate the completeness and uniformity of the assembly. Additionally, the quality value (QV) of the genome was determined using the Merqury-mash module (https://github.com/marbl/merqury) to assess the sequence accuracy of the assembled genome.