Background & Summary

In agroecosystems, predatory stinkbugs of the subfamily Asopinae (Hemiptera: Pentatomidae) play a vital role in pest management1. Eocanthecona furcellata (Wolff, 1811), formerly Canthecona furcellata, is a key species within this group, renowned for its high fecundity and formidable predatory abilities. These traits enable E. furcellata to effectively suppress a broad range of pests, including Lepidoptera, Coleoptera, Diptera, and phytophagous Pentatomidae species2,3. The insect is widely distributed across Southeast Asia, inhabiting southern China, India, Malaysia, the Philippines, Myanmar, Indonesia, Thailand, and Sri Lanka4. As a paurometabolous insect, E. furcellata undergoes 3 developmental stages (egg, nymph, and adult), with the nymphal stage comprising 5 distinct instars (Fig. 1)4. In addition, E. furcellata employs extraoral digestion to process large prey, extracting nutrients through refluxing and non-refluxing mechanisms while injecting hydrolytic enzymes into the prey, a process analogous to that observed in Arma chinensis Fallou5,6. To optimize E. furcellata’s role as a biocontrol agent, researchers have expanded studies beyond morphological trait observation, focusing on phylogenetic relationships4,7, functional characterization of pheromones and pheromone receptors8,9, and diapause/dormancy mechanisms10,11. Studies have also highlighted its potential for combined application with nucleopolyhedrovirus (NPV)12,13,14. Notably, E. furcellata exhibits high tolerance to sub-lethal concentrations LC30 of insecticides such as chlorpyrifos15 and λ-cyhalothrin16,17. Intriguingly, sub-lethal LC30 concentrations of chlorpyrifos can enhance its predatory efficiency and maximum predation capacity15. Despite these insights, the absence of a high-quality genome for E. furcellata has severely limited deeper investigations into its gene functions. Thus, assembling a high-quality genome for this species is critical not only for advancing E. furcellata researches but also for understanding its closely related species.

Fig. 1
figure 1

Seven stages of Eocanthecona furcellata life cycle.

Thus, we employed a combined strategy of PacBio long-read HiFi and Hi-C sequencing, to assembled a high-quality chromosome-level reference genome for E. furcellata. Based on the previously reported karyotype of E. furcellata2, we employed a hybrid sequencing strategy combining PacBio HiFi long-reads and Hi-C scaffolding to assemble a high-quality, chromosome-level genome for E. furcellata. The final assembly spans 1.03 gigabases (Gb), with a contig N50 of 2.35 megabases (Mb) and a scaffold N50 of 135.69 Mb. Hi-C scaffolding resolved the genome into seven pseudochromosomes, consistent with prior karyotype studies of E. furcellata2. Repetitive elements constituted 52.07% (520.07 Mb) of the genome, and 15,802 protein-coding genes were predicted evidenced by transcripts from different organ or developmental stages of the stinkbugs, in which 90.72% functionally annotated based on different database. This high-quality genome provides a critical resource for advancing evolutionary, ecological, and biocontrol research within the Asopinae subfamily.

Methods

Insect rearing and sample collection

Male and female individuals of E. furcellata were collected from a wild population in Nanhua, Chuxiong City, Yunnan Province, China (25° 17′ N, 101° 23′ E), and reared in our laboratory for over 30 generations. The insects were fed Tenebrio molitor larvae under controlled environmental conditions: 29 ± 1 °C, 65 ± 5% relative humidity, and a 16-hour light/8-hour dark photoperiod. To minimize genetic variation, we sequenced the genome of female offspring from four consecutive generations of inbreeding. Prior to sequencing, insect surfaces were sterilized with 75% ethanol. Whole genomic DNA extracted from female adults was used for genome sequencing and assembly.

DNA and RNA sequencing

The genomic DNA, which was prepared using the Insect DNA Extraction Kit (Genenode, Beijing, China), was assessed for integrity and contamination via agarose gel electrophoresis and quantified using a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA). A short-read DNA sequencing library was constructed with the VAHTS™ Universal DNA Library Prep Kit for MGI (Vazyme, China) following the manufacturer’s protocol. Three distinct sequencing data types, DNBSEQ short reads, PacBio HiFi reads, and Hi-C short reads, were generated and used for genome assembly. For short-read sequencing, paired-end libraries (2 × 150 bp) with ~ 350 bp insert sizes were prepared and sequenced on the DNBSEQ-T7 platform (BGI, Shenzhen, China). For PacBio long-read sequencing, genomic DNA (~8 μg) was sheared using g-TUBEs (Covaris, Woburn, MA, USA) and concentrated with AMPure PB magnetic beads (Beckman Coulter, Brea, CA, USA). SMRTbell libraries were prepared using the Single Molecule Real-Time (SMRT) Bell Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA), size-selected on a BluePippin™ system (Sage Science, Beverly, MA, USA) with a ≥ 15 kb cutoff, and sequenced on the PacBio Sequel II platform in circular consensus sequencing (CCS) mode. For Hi-C library preparation, whole adult insects were crosslinked with 2% formaldehyde, and chromatin was digested with the MboI restriction enzyme (New England Biolabs, Ipswich, MA, USA). Hi-C libraries, with ~ 350 bp inserting size, were sequenced on the DNBSEQ platform to generate 150 bp paired-end reads. For transcriptomic analysis, RNA was extracted from whole insects or tissues with Insect RNA column Extraction Kit. Transcriptome libraries were prepared using the TruSeq RNA Library Prep Kit v2 (Illumina, San Diego, CA, USA) with polyA selection and insert sizes of 200–400 bp. Sequencing was performed on an Illumina NovaSeq6000 platform (Illumina, San Diego, CA, USA), yielding 150 bp paired-end reads.

K-mer analysis for genome size estimation

A total of 63.91 Gb of short read data, underwent quality control and filtering using fastQC (v0.11.3) and fastp (v0.23.2)18, respectively, to obtain clean data. Subsequently, 10,000 pairs of reads are randomly selected and aligned against the NCBI nucleotide database (NT database) using Blastn (v2.11.0+, -evalue 0.00001)19, which resulted in the top 80% of matched species being displayed (Supplementary Table 1). Based on the NT alignment results, the sequenced data shows no significant foreign contamination. The software Jellyfish (v2.3.0, -m 21 -s 1000000000)20 was used to generate K-mer frequency distribution information. GenomeScope (v2.0, -k 21 -p 2)21 was then employed to estimate genome heterozygosity, repeat content, and genome size based on the resulted K-mer frequency distribution. The results indicated the estimated maximum genome size is 1,015 Mb, with a heterozygosity rate of 1.49% and a repeat sequence proportion of 32.46% (Fig. 2A). Additionally, Smudgeplot (v0.2.3dev, -k 21 -n 24 –m 160 –ci 1 –cs 10000)21 was used to estimate the ploidy of E. furcellata, and the results showed that the AA type accounted for the highest proportion, suggesting that the species is most likely diploid (Fig. 2B, Supplementary Table 2).

Fig. 2
figure 2

Result of Survey. (A) K-mer (k = 21) depth and K-mer (k = 21) type frequency distribution plot. The horizontal axis represents the Kmer depth, while the vertical axis represents the corresponding number of Kmers at each depth. (B) Smudgeplots K-mer statistical distribution plot. The x-axis represents relative coverage (CovB / (CovA + CovB)), the y-axis shows total coverage (CovA + CovB), and the color indicates the frequency of k-mer pairs. Each haplotype structure appears as a “smudge” on the plot, where the heat intensity of the smudge reflects the occurrence frequency of that haplotype structure in the genome.

Genome assembly and contigs anchoring

Single Molecule Real-Time sequencing (SMRT) technology was used to generate a total of 33.5 Gb of HiFi data. The genome was assembled using HiFiasm22 was used for assembly, and the purge_haplotigs23 was employed to remove redundancy from the initially assembled and error-corrected genome. The resulting draft genome was further polished using short-read data with Pilon (v1.24)24 and long-read HiFi data with Racon (v1.4.21)25, respectively, resulting in a polished genome length of 1,126,600,320 bp, with an N50 of 2,349,459 bp, a GC content of 32.04% and a longest contig of 12,258,061 bp. The overall quality of the polished genome was evaluated using BUSCO (v5.4.3; -l hemiptera_odb10)26 revealed 97.0% completeness of conserved hemipteran genes. Alignment of long-read data to the assembled genome yielded a 99.86% mapping rate and 99.92% coverage, while short-read data alignment achieved a 98.60% mapping rate and 99.36% coverage (Supplementary Fig. 1 and Supplementary Table 3).

For scaffolding the contigs into chromosomes, a total of 128.5 Gb of Hi-C reads were generated and aligned to the genome using Bowtie2 (v2.5.1)27 to determine the orientation of primary contigs on the chromosomes. Based on the principle that cis-interactions (interactions within the same chromosome) are significantly stronger than trans-interactions (interactions between different chromosomes), and that interaction strength decreases with increasing linear distance within cis-interactions, the contigs were segmented, anchored, ordered, oriented, and merged to assemble a chromosome-level genome with ALLHiC (v0.9.14)28. Subsequently, JuiceBox (v1.11.08)29 was employed for visualization and manual error correction. During genome scaffolding, the original 934 contigs were split based on the interaction map, reordered, and ultimately assembled into 7 chromosomes (consistent with prior karyotype analysis findings2) and 122 unanchored scaffolds, yielding a total genome length of 1.03 Gb. The contig N50 was 2.19 Mb, and the scaffold N50 reached 135.69 Mb, with a chromosome anchoring rate of 95.38%. In addition, BUSCO results showed that it had 98.1% completeness. (Fig. 3, Supplementary Table 3).

Fig. 3
figure 3

Genome Hi-C interaction map at 500 kb resolution. The color bar indicates the interaction intensity of Hi-C contacts.

Genome annotation and assessment

Both homology-based and de novo methods were applied for repetitive sequence annotation. Homology-based prediction was conducted using RepeatMasker (v4.0.9) and its internal RepeatProteinMask tool30, leveraging the RepBase database (v21.12, http://www.girinst.org/repbase) of known repetitive sequences. A custom repeat library was then constructed using RepeatModeler (v1.0.11)31 and LTR_FINDER_parallel (v1.0.7)32, which was then integrated with RepeatMasker for de novo prediction of repetitive sequences. For the de novo approach, repetitive sequences were first identified by de novo prediction based on sequence alignment features and transposable element structures, followed by refinement and validation using RepeatMasker. And tandem repeats were predicted using Tandem Repeats Finder (TRF, v4.09)33. Additionally, Extensive de novo TE Annotator (EDTA, v2.2.2) was further employed to evaluate the repetitive sequences.

Gene structure prediction integrated three approaches included (1) RNA-seq-guided annotation: RNA-seq data from multiple developmental stages were aligned to the genome using HISAT2 (v2.1.0)34, and transcripts were assembled with StringTie (v1.3.5)35; (2) Homology-based prediction: Protein sequences from related species (e.g., Nezara viridula36, and Nesidiocoris tenuis37,38) were mapped to the genome using Liftoff (v1.6.3)39 and structurally intact genes were retained to train the AUGUSTUS (v3.5.0) gene model40; and (3) De novo prediction was performed using AUGUSTUS in self-training optimization mode combined with a phylogenetically-informed conserved gene database. The final gene set was generated by integrating repetitive sequences, transcriptomes, and homology-based predictions using Maker (v3.01.03)41 with default parameters.

Two approaches were implemented for functional annotation of the protein-coding gene set: (1) sequence similarity-based analysis using diamond (v2.0.14, -- evalue 1e-05)42 to align protein sequences against databased including SwissPro (https://www.expasy.org/resources/uniprotkb-swiss-prot), NCBI nucleotide sequence database (NT) (https://www.ncbi.nlm.nih.gov/nucleotide/), NCBI non-redundant databases (NR) (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), Pfam (http://xfam.org/), EggNOG (http://eggnogdb.embl.de/), Gene Ontology (GO) (http://geneontology.org/page/go-database), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/) databases for functional and pathway annotation, and (2) domain/motif-based analysis employing InterProScan (v5.61–93.0) to query InterPro sub-databases for conserved sequences/motifs/domains, supplemented by HMMER3 (v3.3.1)43 for HMM-based annotation of transcription factors and Pfam domains. A total of 14,336 genes (accounting for 90.72% of all protein-coding genes) were functionally annotated (including 1,168 unnamed, uncharacterised, and hypothetical genes, accounting for 8.14%), while 1,466 genes (9.28%) remained unannotated (Fig. 4, Supplementary Table 46, Supplementary Figs. 24).

Fig. 4
figure 4

Genome mapping of the E. furcellata. In this figure, circle (a) means GC content distribution; circle (b) means genomic information; circle (c) means repeat sequence distribution; circle (d) means LTR distribution; circle (e) means LINE distribution; and circle (f) means DNA-TE distribution. And the darker colors within the rings indicate more concentrated distribution of the elements.

Data Records

The chromosomal-level genome assembly data have been deposited in GenBank with the accession number JBPZNH00000000044. Additionally, the genome annotation files have been deposited Figshare45. The NGS, HiFi and HiC sequencing data have been deposited in the National Genomics Data Center (NGDC) under accession number CRA01391246, as well as the transcriptome data showed in the NGDC under accession number CRA02556847.

Technical Validation

Genome assembly accuracy was assessed by aligning clean short reads to the assembled genome using Burrows-Wheeler Aligner (BWA, v0.7.18)48, achieving a 98.30% mapping rate and 99.82% genome coverage, confirming high assembly continuity. Genome completeness was evaluated using two complementary approaches: BUSCO analysis (v5.4.3) with the Hemiptera_odb10 lineage dataset identified 97.0% complete genes (94.5% single-copy, 2.5% duplicated) and only 0.3% fragmented genes based on the protein-coding genes (Supplementary Table 6). LTR Assembly Index (LAI) assessment using LTR_Retriever (v2.9.0)49 yielded a score of 8.46, reflecting relative repetitive region integrity. These results collectively demonstrate the genome’s high continuity, accuracy, and completeness, establishing it as a reference-grade genome.