Background & Summary

Omiodes indicata (Fabricius) is an important pest of leguminous crops, and its incidence has become increasingly severe in recent years in major legume-producing regions of tropical and subtropical Asia, Africa, and the Americas1. This polyphagous species, a member of the family Crambidae, subfamily Spilomelinae (Lepidoptera), damages a wide range of legumes, including soybean (Glycine max), black gram (Vigna mungo), common bean (Phaseolus vulgaris), mung bean (Vigna radiata), cowpea (Vigna unguiculata), and lablab bean (Lablab purpureus)2,3. The larvae roll and web leaves and feed on them, leaving the leaves skeletonized. Severe infestations not only reduce the photosynthetic capacity of the crop but also adversely affect pod development and yield, making O. indicata one of the key constraints on legume production3,4.

The larvae of O. indicata are adept at using silk to bind leaves together, constructing protective webbed shelters inside which they feed4. This behavior not only exacerbates crop losses but also increases the difficulty of effective pest management. The entire larval stage is spent concealed within leaf folds; pupation also occurs inside the rolled leaves, and adults subsequently emerge5. In tropical and subtropical regions, O. indicata is multivoltine, exhibiting overlapping generations and causing damage throughout the year, with particularly severe outbreaks during the vegetative and reproductive stages of host crops. Economic threshold investigations have indicated that when 8–9 rolled leaves per plant are observed, chemical intervention is warranted6,7,8.

Currently, field management relies mainly on chemical insecticides. However, the cryptic feeding habit of the larvae within leaf rolls renders chemical control less effective, and improper or untimely application can result in unsatisfactory outcomes, increased risk of resistance, and food safety concerns. Moreover, the lack of high-quality genomic resources has greatly hampered in-depth understanding of the biology and ecology of O. indicata. In this study, we integrated data from three sequencing platforms to obtain a high-quality chromosome-level genome assembly of O. indicata. Repetitive elements, non-coding RNAs, and protein-coding genes were comprehensively annotated, providing a valuable genomic resource for future ecological and functional genomics research.

Methods

Sample collection and sequencing

The O. indicata population used in this study was originally collected on May 27, 2024, from a soybean test field at the Teaching Experimental Farm of Guizhou University in Guiyang, China (26°23′49.538″N, 106°40′31.616″E). The colony has since been maintained for more than five consecutive generations in an artificial climate chamber at the Natural Enemy Propagation Center of Guizhou University under controlled conditions: temperature of 26 ± 1 °C, photoperiod of 14 L:10 D, and relative humidity of 75 ± 5%. Larvae were reared on fresh soybean plants, while adults were supplied with a 15% (w/v) honey solution (Fig. 1). For genome sequencing, a female adult was gently transferred with sterile forceps into a pre-prepared centrifuge tube containing sterile PBS buffer. The tube was gently inverted or shaken for 10 minutes to wash the insect’s surface and remove adhering debris and microorganisms. After washing, excess liquid was blotted from the insect using sterile filter paper. The sample was then immediately flash-frozen in liquid nitrogen for 20 minutes and subsequently transferred to a –80 °C ultra-low temperature freezer for storage.

Fig. 1

Life cycle of Omiodes indicata and its damage to soybean. (a) Developmental stages of O. indicata. (b) Symptoms on soybean leaves damaged by O. indicata.

Genomic DNA and RNA were isolated from the specimen using the DNeasy Blood & Tissue Kit (Qiagen) and TRIzol Reagent (Thermo Fisher Scientific), respectively, according to the manufacturers’ instructions. Short-read libraries were prepared without PCR amplification using the Illumina TruSeq DNA PCR-Free Kit, generating 150 bp paired-end reads with 350 bp inserts. For Hi-C sequencing, we followed a standard protocol9, including DNA crosslinking, MboI digestion, end repair, and DNA purification. All short-read sequencing was conducted on an Illumina NovaSeq X Plus system. For long-read sequencing, we constructed a 20 kb SMRTbell library (PacBio SMRTbell Express Template Prep Kit 2.0) and sequenced it on the PacBio Revio system in HiFi mode. Library construction and sequencing were conducted at Berry Genomics (Beijing, China). A total of 110.04 Gb of high-quality sequencing data was generated, including 15.11 Gb of PacBio HiFi reads (30.65 × coverage), 34.73 Gb of Illumina short reads (70.44 × coverage), and 52.38 Gb of Hi-C data (106.23 × coverage) (Table 1).
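The coverage values quoted above can be approximated as total sequenced bases divided by genome size. The sketch below is illustrative only: it assumes that coverage was computed against the 493.08 Mb final assembly reported later in this paper and that 1 Gb equals 1,000 Mb, and the function name `sequencing_depth` is hypothetical. Under those assumptions the results agree with the reported depths to within rounding.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return fold coverage: total sequenced bases divided by genome size."""
    if genome_size_mb <= 0:
        raise ValueError("genome_size_mb must be positive")
    # Convert Gb to Mb (assumption: 1 Gb = 1,000 Mb) before dividing.
    return total_bases_gb * 1000.0 / genome_size_mb


# Assumption: depth is expressed relative to the final assembly size.
ASSEMBLY_SIZE_MB = 493.08

# Data yields as stated in the text.
yields_gb = {"PacBio HiFi": 15.11, "Illumina": 34.73, "Hi-C": 52.38}

for platform, gb in yields_gb.items():
    print(f"{platform}: {sequencing_depth(gb, ASSEMBLY_SIZE_MB):.2f}x")
```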

Table 1 Sequencing data generated for the Omiodes indicata genome assembly and annotation.

Genome survey

Raw Illumina reads were quality-controlled using BBTools v38.8210. Duplicate reads were first removed using “clumpify.sh”. Subsequently, “bbduk.sh” was used to trim adapter sequences and filter the reads: bases with quality scores below 20 were removed, reads containing more than five Ns were discarded, poly-A/G/C tails longer than 10 bp were trimmed, and overlapping paired reads were corrected. To estimate genome size, heterozygosity, and repetitive sequence content, k-mer frequency analysis was performed using khist.sh (BBTools) with a k-mer length of 21, and a genome survey was conducted using GenomeScope v2.011. Based on the coverage and frequency distribution of the k-mers, the genome size of O. indicata was estimated to be approximately 477.29 Mb, with a heterozygosity rate of 1.33% (Fig. S1).
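The idea behind a k-mer-based genome survey can be illustrated with a deliberately simplified sketch. It is not the method used by GenomeScope, which fits a statistical mixture model that accounts for heterozygosity and repeats; here genome size is approximated as the total number of k-mer occurrences divided by the depth of the main peak. All function names and the `min_depth` cutoff are assumptions introduced for illustration, and canonical (strand-collapsed) k-mers are not handled.

```python
from collections import Counter


def count_kmers(reads, k=21):
    """Count every k-mer in the reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_histogram(counts):
    """Map each k-mer depth to the number of distinct k-mers at that depth."""
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=5):
    """Rough size estimate: total k-mer occurrences / main peak depth.

    Depths below min_depth (an assumed cutoff) are treated as sequencing
    errors and ignored. In a highly heterozygous genome the tallest peak
    may be the heterozygous one, which this sketch does not correct for.
    """
    usable = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth
```

With real data, the histogram would be built from the filtered Illumina reads, and the estimate would be compared against the model-based value rather than used in its place.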

Genome assembly

The initial assembly was generated from PacBio HiFi long reads using Hifiasm v0.19.812 with default parameters. The primary assembly was then polished twice with Illumina reads using NextPolish v1.3.113. For chromosome-scale scaffolding, Hi-C reads were first quality-filtered and then aligned to the assembly using Juicer v1.6.214. Contigs were subsequently anchored and ordered into chromosomes using 3D-DNA v.18092215. The scaffolded assembly was manually inspected and corrected in Juicebox v.1.11.014 to resolve potential misjoins and orientation errors. To remove contamination, we screened the assembly using MMseqs2 v1.116 against the NCBI nucleotide (nt) and UniVec databases. Potential vector contaminants were identified using v2.11.017 against the UniVec database, with sequences showing >90% similarity flagged as contaminants. Sequences exhibiting >80% similarity were further validated through BLASTN searches against the NCBI nt database. All identified bacterial and fungal contaminants were removed from the assembly scaffolds. The final chromosome-scale assembly of O. indicata spans 493.08 Mb and consists of 59 scaffolds and 100 contigs, which is close to the 477.29 Mb estimated in the genome survey. The assembly exhibited high continuity, with scaffold and contig N50 values of 17.25 Mb and 15.72 Mb, respectively (Table 2). Notably, 99.80% of the assembled sequences (492.12 Mb) were anchored to 31 chromosomes (Figs. 2, 3). Furthermore, BUSCO analysis indicated a genome assembly completeness of 99.1% (Table 2). Collectively, these results indicate that the assembly has high continuity and completeness.
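For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up lengths (not real assembly data).
print(n50([10, 8, 6, 4, 2]))
```

Applied to the lengths of all scaffolds or all contigs in a FASTA file, the same function would give the scaffold N50 and contig N50, respectively.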

Fig. 2

Chromosomal heatmap of the Omiodes indicata genome assembly. Complete chromosomes are outlined in blue, and individual contigs are demarcated by green borders.

Fig. 3

The genomic features of Omiodes indicata are displayed in a circular layout. Moving inward from the outermost ring, the visualization depicts (1) chromosome length, (2) GC content, (3) gene density, and (4) various repetitive elements, including transposable elements (DNA, SINEs, LINEs, and LTRs), along with simple repeat sequences.

Table 2 Genome assembly statistics of Omiodes indicata.

Genome annotation

A species-specific repeat library for O. indicata was generated using RepeatModeler v2.0.418 and combined with known repeats from RepBase-2013090919 and Dfam 3.520 to construct a comprehensive repeat database. This custom database was used as input for RepeatMasker v4.1.421 to identify repetitive elements throughout the genome and soft-mask them. Repetitive sequences account for 38.13% of the O. indicata genome assembly. The major categories were unclassified elements (17.92%), LINE transposons (6.71%), LTR transposons (2.77%), and DNA transposons (2.60%), with other repeat types making up the remainder (Table 3).
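In a soft-masked assembly, repeats are written in lowercase, so the overall repeat fraction can be cross-checked by counting lowercase bases. The sketch below is an illustrative check under stated assumptions, not part of the authors' pipeline: the function name `softmasked_fraction` and the toy FASTA record are hypothetical, and N bases are counted toward the total length.

```python
import io


def softmasked_fraction(fasta_handle):
    """Return the fraction of bases written in lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    if total == 0:
        raise ValueError("no sequence data found")
    return masked / total


# Toy record with made-up sequence (not real genome data).
toy_fasta = io.StringIO(">chr_demo\nACGTacgtAC\nggTTAACCGG\n")
print(f"{softmasked_fraction(toy_fasta):.2%}")
```

For a real assembly, the handle would come from `open()` on the soft-masked FASTA file, and the result would be compared with the percentage in the RepeatMasker summary.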

Table 3 Genome annotation statistics of Omiodes indicata.

Non-coding RNAs (ncRNAs) in O. indicata were identified using Infernal v1.1.422 with the Rfam v14.10 database23, while tRNA detection was performed with tRNAscan-SE v2.0.924. The analysis revealed a diverse repertoire of 822 ncRNAs, including 490 tRNAs, 104 rRNAs, 75 microRNAs, and 91 small nuclear RNAs (Table 3).

Protein-coding genes in the O. indicata genome were annotated using MAKER v3.01.0325, which integrated transcriptomic evidence, ab initio predictions, and protein homology evidence. Transcriptome sequences were aligned to the genome using HISAT2 v2.2.126, followed by genome-guided assembly with StringTie v2.1.627. For ab initio gene prediction, BRAKER v2.1.628 was employed, incorporating GeneMark-ES/ET/EP 4.68_lic29 and Augustus v3.4.030, both of which were trained using transcriptomic sequences and protein data from OrthoDB v1131. Additionally, homology-based gene prediction was conducted using GeMoMa v1.932 with protein sequences from the following reference species: Drosophila melanogaster (GCF_000001215.4)33, Apis mellifera (GCA_003254395.2)34, Ostrinia nubilalis (GCF_963855985.1)35, Bombyx mori (GCF_014905235.1)36, and Tribolium castaneum (GCA_031307605.1)37. The annotation pipeline identified 14,713 protein-coding genes in the O. indicata genome, with an average gene length of 13,357.6 bp (Table 3). On average, each gene contained 7.6 exons, 6.6 introns, and 7.4 coding sequences (CDS). The mean exon, intron, and CDS lengths were 304.7 bp, 1,735.3 bp, and 223.2 bp, respectively. Gene set completeness was assessed using BUSCO with the Insecta dataset (n = 1,367), yielding a completeness score of 99.6% (Table 3).

Functional annotation was performed by aligning protein sequences against the UniProtKB database using DIAMOND v2.0.1138. Additionally, Gene Ontology (GO) terms, KEGG/Reactome pathways, and protein domains were annotated using eggNOG-mapper v2.0.1439 and InterProScan 5.53–87.040. The InterProScan analysis integrated data from five databases: Pfam41, SMART42, Superfamily43, Gene3D44, and CDD45. Based on the combined InterProScan and eggNOG annotations, functional annotation identified 12,194 COG categories, 8,653 GO terms, 4,967 enzyme codes, and 4,967 KEGG pathways in O. indicata (Table 4). Chromosomal features, including repeat elements, gene density, and GC content, were visualized using TBtools v2.30546.

Table 4 Functional annotation statistics of the Omiodes indicata genome.

Data Records

The sequencing data generated in this study are available from the National Center for Biotechnology Information (NCBI) under BioProject PRJNA1193224 and BioSample SAMN45134265. The raw sequencing data have been deposited in the Sequence Read Archive (SRA) under the following accession numbers: transcriptome reads (SRR33699163)47, Hi-C data (SRR33699162)48, Illumina short reads (SRR33699164)49, and PacBio HiFi long reads (SRR33699165)50. The final genome assembly is available under NCBI accession GCA_050947735.151. The annotation results for repetitive sequences, gene structures, and functional predictions have been deposited in Figshare52.

Technical Validation

Genome assembly quality was evaluated using two complementary approaches. First, assembly completeness was assessed with BUSCO v5.0.453 against the Insecta reference dataset, which comprises 1,367 conserved single-copy orthologs. The assembly exhibited a BUSCO completeness of 99.1%, with 96.6% of genes present as single copies, 2.5% duplicated, 0.2% fragmented, and 0.7% missing (Table 2). Second, assembly accuracy was evaluated by calculating mapping rates through the alignment of PacBio, Illumina, and RNA-seq reads to the final assembly using Minimap2 v2.2354 and SAMtools v1.955. The assembly demonstrated high mapping rates for PacBio (99.90%), Illumina (95.57%), and RNA-seq (89.87%) reads (Table 2). The genome annotation completeness of O. indicata was confirmed to be 99.6% by BUSCO (Table 2). These comprehensive analyses confirm the high quality of our genome assembly and annotation.
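The BUSCO percentages reported above can be checked for internal consistency: complete BUSCOs should equal single-copy plus duplicated, and complete, fragmented, and missing should sum to 100%. The short sketch below performs that check on the values stated in the text. The function name `busco_is_consistent` and the 0.1 percentage-point rounding tolerance are assumptions made for illustration.

```python
def busco_is_consistent(single, duplicated, fragmented, missing, complete, tol=0.1):
    """Check that BUSCO percentages add up within a rounding tolerance."""
    # A small epsilon guards against floating-point noise at the boundary.
    complete_ok = abs((single + duplicated) - complete) <= tol + 1e-9
    total_ok = abs((complete + fragmented + missing) - 100.0) <= tol + 1e-9
    return complete_ok and total_ok


# Percentages as reported for the genome assembly.
reported = dict(single=96.6, duplicated=2.5, fragmented=0.2,
                missing=0.7, complete=99.1)
print(busco_is_consistent(**reported))
```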