Background & Summary

The beet webworm, Loxostege sticticalis Linnaeus (Lepidoptera, Pyralidae), is a major agricultural pest in North America, Europe, and Asia1,2. Outbreaks of this pest occur over a wide region (108-118° E, 37-35° N) of China, with three periods of outbreaks reported over the last 70 years. Since 2018, L. sticticalis has entered the fourth outbreak period in China, resulting in serious ecological and economic losses, especially in the northern part of China3,4.

In 2020, the Ministry of Agriculture and Rural Affairs listed L. sticticalis in the List of National Class I of Crop Diseases and Insect Pests. In China, the number of L. sticticalis generations increases from 1 to 4 with decreasing latitude and altitude. However, the first and second generations cause most of the damage5. L. sticticalis is classified as a facultative migrant pest and shows a significant ability to crawl and migrate. The fourth- and fifth-instar larvae are the most destructive, as they usually consume all available leaves within an area6. L. sticticalis can damage more than 200 host plant species belonging to 35 families, including Glycine max, Helianthus annuus, Chenopodium album, Beta vulgaris, and Zea mays, thus posing a major threat to the production of crops such as grains, oilseeds, and pastures (Fig. 1)3.

Fig. 1

Life cycle of Loxostege sticticalis and its damage on maize. (a) Different developmental stages of L. sticticalis. (b) Symptoms on maize leaves damaged by L. sticticalis.

Currently, a variety of chemical pesticides are widely used to control L. sticticalis; however, this management strategy can increase insecticide resistance and negatively impact the agro-ecosystem. Genomic analysis can support the development of integrated pest-management approaches, as has been demonstrated7,8. To date, the genomes of several typical migrant pests have been sequenced and published, such as Mythimna separata9, Helicoverpa armigera10, and Spodoptera frugiperda11. This information has improved the understanding of the molecular and genetic mechanisms relevant to pest control. However, the genome of L. sticticalis, a member of the Class I of Crop Diseases and Insect Pests, has not yet been sequenced. Therefore, to promote the development of innovative management strategies for controlling L. sticticalis, a chromosome-level genome is necessary, and in this study we present the chromosome-level genome of L. sticticalis. This provides a resource for studies on L. sticticalis, as well as novel insights into the evolution and ecology of migrant pests.

In this study, three male pupae were used for the whole-genome assembly. First, a genome survey was performed before sequencing, which estimated the genome size at 1.9 Gb, with a repetitive sequence content of 67.87% and a heterozygosity rate of 0.31% (Fig. 2a). Then, we generated 31.4 Gb of CCS data using the PacBio Sequel II platform and assembled the genome, with a total length of 485.9 Mb and a repetitive sequence content of 41.71%, consisting of 118 contigs with a scaffold N50 length of 16.4 Mb, anchored to 31 chromosomes (Table 1, Fig. 2b,c). The genome size estimated by the survey is inconsistent with that obtained by third-generation sequencing, which is due to the high heterozygosity of the insect12. In addition, the Benchmarking Universal Single-Copy Orthologs (BUSCO) (v5.2.1, odb10) tool was used to assess the completeness of the L. sticticalis genome, yielding a completeness of 98.7%, with a single-copy rate of 97.7% and a duplicated rate of 1.0% (Table 2), indicating the good quality of the genome assembly and annotation13.

Fig. 2

Overview of the genomic landscape of the beet webworm, Loxostege sticticalis. (a) Characteristics of the Illumina short-read sequencing of the L. sticticalis genome. (b) Circle genomic landscape of L. sticticalis. The circles from the outside to the inside represent chromosome sequence, gene density, GC content, and repeat sequence content, respectively. The middle lines of the circles indicate genes showing collinearity. (c) Hi-C interactive heatmap of L. sticticalis.

Table 1 Features of the L. sticticalis genome assembly.
Table 2 BUSCO statistics of L. sticticalis genome assembly and annotation.

In total, 41.71% of the genome was annotated as repeat sequences (Supplementary Table 1), of which DNA transposons, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and long terminal repeats (LTRs) accounted for 4.4, 9.69, 0.79, and 6.44%, respectively, of the whole L. sticticalis genome. In addition, repeat sequences covering 11.52% of the whole genome were not classified. In terms of protein-coding genes, a total of 17 431 genes were predicted and annotated in the assembled genome (Supplementary Table 2). Of these, the encoded proteins of 17 113 genes (98.17%) were annotated using different databases, including NR, Swiss-Prot, Gene Ontology (GO), Clusters of Orthologous Genes (COG), Eukaryotic Orthologous Groups (KOG), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Pfam (Supplementary Table 3). Moreover, different types of non-coding RNA were identified, including 15 930 tRNAs (predicted with tRNAscan-SE), 90 miRNAs, 130 rRNAs, and 97 snRNAs (Supplementary Tables 4, 5, 6).

Orthologous genes were identified by comparing L. sticticalis with 15 other lepidopteran insects. In total, 11 719 gene families were identified, within which 1 867 single-copy and 4 065 multiple-copy orthologs were found (Supplementary Table 7). The phylogenetic relationships and estimated times of species divergence were analyzed using the protein sequences of the single-copy orthologous genes. The results indicated that L. sticticalis and Ostrinia furnacalis shared their last common ancestor approximately 42.8 million years ago (Fig. 3).

Fig. 3

Phylogenetic tree and gene orthology of 16 lepidopteran insect genomes. The maximum-likelihood phylogenetic tree was constructed with 1000 bootstrap replicates. The blue node indicates the calibration point, while light blue indicates the 95% confidence interval. The number adjacent to each node represents the divergence time in millions of years ago. The numbers on the branches represent expansions (red) or contractions (blue) of the gene families.

The expansion and contraction of L. sticticalis gene families, as well as those of related species, were analyzed using CAFE (v4.2.1) software. The results showed expansion of 785 gene families and contraction of 688 gene families in L. sticticalis (Fig. 3). GO enrichment analysis showed that the 785 expanded gene families in the L. sticticalis genome were mainly involved in “integral component of membrane (GO:0016021)”, “RNA-directed DNA polymerase activity (GO:0003964)”, and “DNA integration (GO:0015074)”. The 688 contracted families were mainly involved in “extracellular region (GO:0005576)”, “RNA-directed DNA polymerase activity (GO:0003964)”, and “DNA integration (GO:0015074)” (Fig. 4, Supplementary Fig. 1, 2).

Fig. 4

GO enrichment analysis of Loxostege sticticalis gene families showing expansion (a) and contraction (b).

Four lepidopteran insects were selected for syntenic analysis. In general, the Lepidoptera shared high chromosomal synteny, although several fusion events were detected between L. sticticalis and other lepidopteran species (Fig. 5). Specifically, fusion events were detected between Cydia pomonella chromosome 1 and chromosomes 1 and 3 of L. sticticalis, and between C. pomonella chromosome 2 and chromosomes 11 and 25 of L. sticticalis. A number of intrachromosomal inversions and other local rearrangements were also detected.

Fig. 5

Synteny analysis between four species, Loxostege sticticalis, Cydia pomonella, Chilo suppressalis and Plutella xylostella. Each rounded rectangle represents a chromosome, and the line in the middle indicates a collinear block.

For phytophagous insects, detoxification and chemosensation abilities are both essential for selecting and locating host plants14,15,16. These feeding preferences can be reflected at the level of the genome, as previously described13,15,17. In this study, associations were also observed between gene numbers and host ranges (Fig. 6). For instance, in terms of detoxification-related genes, the numbers of genes in the cytochrome P450 (P450) families were observed to increase sequentially in monophagous, oligophagous, and polyphagous insects. Furthermore, greater numbers of chemosensory-related genes have been observed in insects with broader host ranges13. In L. sticticalis, 81 gustatory receptors (GRs) were identified, which is higher than the number in monophagous and polyphagous insects. The results indicate that the feeding preference of the insect is correlated with the number of genes involved in detoxification and chemosensing. Taken together, a high-quality chromosome-level genome of L. sticticalis was assembled, providing a valuable genomic resource for the further understanding of insect olfaction, evolution, and feeding preferences. Moreover, this genomic resource provides a reference for the implementation of integrated pest-management strategies for migratory insects.

Fig. 6

Distribution of detoxification and chemosensory genes in 18 Lepidopteran insects. The numbers in the cells indicate the size of the corresponding gene family for each species. A darker background color in the cells indicates that more genes were encoded in the corresponding species.

Methods

Sample collection and DNA extraction

The first generation of L. sticticalis was collected in Huhhot, Inner Mongolia Province, China (40.82° N, 111.71° E). The larvae were reared under laboratory conditions at a constant temperature of 22 ± 1°C, a photoperiod of 16:8 (L:D), and a relative humidity of 75 ± 5%, and were fed fresh Chenopodium album. The fifth (last) instar larvae were transferred for pupation to a box containing clean sandy soil with 15% humidity. The adults were supplied with a 5% honey solution. One male and one female collected in the field were reared in the laboratory, and the pupae of the third generation were used for genome sequencing. Three male pupae were used for high-quality genomic DNA extraction using the sodium dodecyl sulfate (SDS) method18. RNA contaminants were removed with RNase A. The DNA quality, concentration, and integrity were assessed using 0.5% agarose gel electrophoresis (AGE), a NanoDrop 2000 spectrophotometer (Thermo Fisher, Waltham, MA, USA), and a Qubit 3 Fluorometer (Invitrogen).

Library construction and genome sequencing

For short-read sequencing, a paired-end library was constructed with an insert size of 350 bp, after which 150 bp paired-end reads (PE 150) were generated for whole-genome sequencing on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA), following the manufacturer's instructions. For long-read sequencing, the genomic DNA was sheared into 15 kb fragments using a g-TUBE (Covaris), and a SMRTbell library was constructed using a SMRTbell Express Template Prep Kit 2.0 (PacBio, 100-938-900). The size distribution and concentration of the library were assessed using a FEMTO Pulse automated pulsed-field capillary electrophoresis instrument (Agilent Technologies, Wilmington, DE, USA) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). The genomic DNA library was purified with 1X AMPure PB beads (PacBio, Menlo Park, CA, USA) and sequenced on both the Illumina and PacBio Sequel II platforms, following the accompanying instructions. A Sage ELF system (Sage Science, Beverly, MA, USA) was used to select SMRTbell templates in the range of 15 to 18 kb, and the library was then purified using 1X AMPure PB beads. The final SMRTbell library size and quantity were assessed using the FEMTO Pulse and the Qubit dsDNA HS reagents assay kit. The sequencing primer was annealed and the Sequel II DNA polymerase was bound to the library. SMRT sequencing was performed using a single 8M SMRT Cell on the Sequel II System with the Sequel II Sequencing Kit and 1800-minute movies by Biomarker Technologies Co., Ltd. (Qingdao, China).

Estimation of genome features

The fastp tool (version 0.20.0, https://github.com/OpenGene/fastp) (Chen et al., 2018) was used for quality filtering of the Illumina short reads using the parameters -q 10 -u 50 -y -g -Y 10 -e 20 -l 100 -b 150 -B 150. First, the adaptors were removed from the sequenced reads. Second, read pairs were excluded if either end had an average quality of less than 20. Third, the ends of the reads were trimmed if the average quality was lower than 20 in a sliding window of 5 bp. Finally, read pairs with either end shorter than 75 bp were removed. The quality-filtered reads were then used for genome size estimation. The genome size, heterozygosity, and repeat content were estimated from 21-mers counted in the filtered reads with Jellyfish (-h 1000000000)19, and the genome characteristics were determined using GenomeScope 2.0 (-k 21 -p 6 -m 100000)20. The completeness of the genomic annotation was evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO, version: 5.2.1, odb10) software.
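The idea behind k-mer-based genome size estimation can be illustrated with a minimal sketch. It is not the GenomeScope model used in this study; it assumes a two-column k-mer histogram (depth, count) of the kind Jellyfish writes, and applies the simple rule that genome size is roughly the number of genomic k-mers divided by the depth of the main peak. The function names and the toy histogram are hypothetical.

```python
def parse_histo(text):
    """Parse a two-column k-mer histogram: 'depth count' per line."""
    hist = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        hist[int(depth)] = int(count)
    return hist


def estimate_genome_size(hist):
    """Return (peak_depth, estimated_size) from a k-mer histogram.

    Assumption: k-mers below the first local minimum are sequencing
    errors and are ignored; the tallest remaining bin is the main peak.
    """
    depths = sorted(hist)
    valley = depths[0]
    for prev, cur in zip(depths, depths[1:]):
        if hist[cur] > hist[prev]:  # counts start rising again
            valley = prev
            break
    genomic = [d for d in depths if d >= valley]
    peak = max(genomic, key=lambda d: hist[d])
    total_kmers = sum(d * hist[d] for d in genomic)
    return peak, total_kmers // peak


# Toy histogram (hypothetical numbers, for illustration only)
toy = """1 1000
2 200
3 50
4 100
5 300
6 100
7 20"""
peak_depth, size = estimate_genome_size(parse_histo(toy))
print(peak_depth, size)
```

Model-based tools additionally fit heterozygous and repeat peaks, which is why their estimates can differ substantially from this simple ratio.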

Genome assembly from CCS data and anchor contigs

High-accuracy CCS data were assembled using hifiasm (v0.16.1, https://github.com/chhylp123/hifiasm) to obtain the genome sequences. Paired reads with mates mapped to different contigs were used for the Hi-C-associated scaffolding. Hi-C technology was used to help anchor the contigs. The contigs were clustered using ALLHiC (v0.9.8)21 to determine the closeness of associations between contigs. The interactions between pairs of contigs were converted into the specified binary files (Hi-C files) using JuicerTools (v3.0, https://github.com/aidenlab/JuicerTools). The ordered and oriented contigs were then manually corrected using Juicebox (v2.15.07)22 to obtain the chromosome-level assembly. The Hi-C reads were mapped to the polished L. sticticalis genome using BWA (v0.7.17) with default parameters. Self-ligation, non-ligation, and other invalid reads, such as StartNearRsite, PCR amplification, random break, LargeSmallFragments, and ExtremeFragments, were filtered out.

Genome annotation

De novo and homology-based methods were used to identify repetitive sequences in the L. sticticalis genome. First, RepeatModeler (v2.0.1, http://www.repeatmasker.org/RepeatModeler/)23 was used to construct a de novo repeat library, and the predicted results were merged with the RepBase database (http://www.girinst.org/repbase). Second, RepeatMasker (v4.1.0, http://www.repeatmasker.org) was used to predict the repetitive sequences of the L. sticticalis genome, and the RepeatProteinMask tool in RepeatMasker was also used to predict repetitive sequences. The two results were then integrated.

Three methods were used to predict mRNA: de novo prediction (Augustus v3.3.3, https://github.com/Gaius-Augustus/Augustus, and GlimmerHMM v3.0.4)24, homology searches (exonerate v2.4.0, https://github.com/nathanweeks/exonerate/, and GeMoMa v1.6.4)25, and transcript-based prediction, in which the RNA-seq transcripts were reconstructed using StringTie (v2.1.3)26, followed by the use of TransDecoder v5.1.0 (https://github.com/TransDecoder/TransDecoder) to predict protein-coding genes. The resulting datasets were integrated using EVidenceModeler (v1.1.1)27, and finally the integrated data were updated, UTR regions were added, and new transcripts were identified using PASA (v2.5.2, https://github.com/PASApipeline/PASApipeline). BLAST (v2.10.1+)28 was used to compare the longest nucleic acid transcript sequences against six databases, namely, NR, Swiss-Prot, GO, COG, KOG, and KEGG. The protein sequences were searched against the Pfam database using HMMER (v3.2.1)29.

For annotation of non-coding RNA, barrnap 0.9 (https://github.com/tseemann/barrnap) and tRNAscan-SE (v2.0.0)30 were used to predict ribosomal RNA (rRNA) and transfer RNA (tRNA), respectively. Other non-coding RNAs were predicted against the Rfam database using Infernal 1.1.3 (https://github.com/EddyRivasLab/infernal).

Gene family orthology and phylogenetic analyses

For analysis of orthologous genes in the gene families, 16 lepidopteran species, including L. sticticalis, Bombyx mori31, Ostrinia furnacalis32, Helicoverpa armigera10, Spodoptera frugiperda11, Cydia pomonella7, Mythimna separata9, Mythimna loreyi9, Manduca sexta33, Heliothis virescens34, Spodoptera exigua35, Spodoptera litura36, Chilo suppressalis37, Cnaphalocrocis exigua8, Trichoplusia ni38, and Plutella xylostella39 were selected. The longest amino acid sequences were used for ortholog identification by OrthoFinder (2.3.5) with clustering by DIAMOND (v2.0.6.144). The single-copy orthologous genes from the OrthoFinder results were aligned using MAFFT (v7.427), and phylogenetic trees were constructed using RAxML (8.2.12) with 1000 bootstrap replicates and the PROTGAMMAJTT model. The phylogenetic tree with divergence times was constructed using MCMCTree (v4.9h).
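How single-copy orthologs are separated from other gene families can be sketched as follows. This is an illustrative sketch rather than the OrthoFinder implementation: it assumes a per-family table of gene counts per species, and it assumes that "multiple-copy orthologs" means families present in every species with more than one copy in at least one of them. The function name, family identifiers, and species codes are hypothetical.

```python
def classify_orthogroups(counts, species):
    """Split gene families into single-copy and multi-copy orthologs.

    counts: {family_id: {species_name: gene_count}}
    species: list of all species that must be represented.
    """
    single_copy, multi_copy = [], []
    for family, per_species in counts.items():
        values = [per_species.get(sp, 0) for sp in species]
        if all(v == 1 for v in values):
            single_copy.append(family)   # exactly one gene in every species
        elif all(v >= 1 for v in values):
            multi_copy.append(family)    # present everywhere, >1 copy somewhere
    return single_copy, multi_copy


# Toy input with three species (hypothetical identifiers)
species = ["Lsti", "Ofur", "Bmor"]
counts = {
    "OG1": {"Lsti": 1, "Ofur": 1, "Bmor": 1},
    "OG2": {"Lsti": 3, "Ofur": 1, "Bmor": 2},
    "OG3": {"Lsti": 2, "Ofur": 0, "Bmor": 1},
}
single, multi = classify_orthogroups(counts, species)
print(single, multi)
```

Only the single-copy set is suitable for concatenated alignment and tree inference, because each species contributes exactly one sequence per family.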

Analysis of gene family expansion and contraction

The results of the gene clustering were filtered to remove families containing >100 genes in one species. Analysis of gene family expansion and contraction was performed with CAFE (v4.2.1) using a P-value threshold of <0.05 as the cut-off. In addition, genes in families showing expansion or contraction were compared against the GO database using BLAST (v2.10.1+) for functional analysis.
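The size filter applied before the expansion and contraction analysis can be sketched in a few lines. This is a minimal illustration that assumes the same hypothetical per-family count table as a plain dictionary; it does not reproduce the input format of any particular tool.

```python
def filter_large_families(counts, max_genes=100):
    """Drop families with more than max_genes genes in any single species.

    counts: {family_id: {species_name: gene_count}}
    """
    return {
        family: per_species
        for family, per_species in counts.items()
        if max(per_species.values(), default=0) <= max_genes
    }


# Toy input (hypothetical identifiers and counts)
counts = {
    "OG1": {"Lsti": 12, "Ofur": 9},
    "OG2": {"Lsti": 150, "Ofur": 4},   # removed: >100 genes in one species
    "OG3": {"Lsti": 100, "Ofur": 100},  # kept: exactly at the threshold
}
kept = filter_large_families(counts)
print(sorted(kept))
```

Very large families are commonly excluded at this step because they can distort the estimation of gain and loss rates.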

Chromosome synteny analysis

For the chromosome synteny analysis, four lepidopteran species, namely, L. sticticalis, Cydia pomonella7, Chilo suppressalis37, and Plutella xylostella39, were selected. The Multiple Collinearity Scan toolkit (MCScanX) was used to identify collinearity.

Gene family identification

To investigate the reasons underlying the feeding habits of L. sticticalis at the genomic level, detoxification- and chemosensing-associated genes from 18 lepidopteran insects (the 16 species listed in the section “Gene family orthology and phylogenetic analyses”, plus Leguminivora glycinivorella40 and Danaus plexippus41) were compared using BLASTP (E < 1e-5) and TBLASTN (E < 1e-5). Specifically, detoxification-associated genes, including cytochrome P450 (P450), ATP-binding cassette (ABC), carboxyl/cholinesterases (CCE), UDP-glycosyltransferases (UGT), and glutathione-S-transferase (GST), as well as chemosensing-associated genes, including odorant binding proteins (OBPs), chemosensory proteins (CSPs), odorant receptors (ORs), ionotropic receptors (IRs), and gustatory receptors (GRs), were analyzed.
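The E-value cut-off used in these searches can be illustrated with a short sketch. It assumes BLAST results in the standard 12-column tabular format (-outfmt 6), in which the subject identifier is the second column and the E-value is the eleventh; the function name and the example records are hypothetical, and the sketch only collects candidate hits, without the manual curation that gene family annotation usually requires.

```python
def candidate_members(blast_tsv, max_evalue=1e-5):
    """Collect subject IDs whose hits pass the E-value threshold.

    blast_tsv: text in BLAST tabular format (-outfmt 6), whose columns are
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    members = set()
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) < max_evalue:
            members.add(fields[1])
    return members


# Toy records (hypothetical gene names)
rows = [
    ["refP450_1", "geneA", "62.1", "480", "170", "5", "1", "478", "3", "480", "1e-150", "520"],
    ["refP450_2", "geneA", "55.0", "470", "200", "6", "1", "468", "5", "470", "3e-90", "310"],
    ["refP450_1", "geneB", "31.0", "90", "60", "2", "10", "99", "20", "108", "0.002", "35"],
]
text = "\n".join("\t".join(r) for r in rows)
print(sorted(candidate_members(text)))
```

Using a set means a gene hit by several reference queries is counted once, which matters when the result is reported as a family size.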

Data Records

The NCBI BioProject number for the data reported in this paper is PRJNA1118492 (ref. 42). The clean sequencing data for short reads, Hi-C, and HiFi have been deposited in the NCBI Sequence Read Archive under accession numbers SRR29366232 (ref. 43), SRR29366231 (ref. 44), and SRR29366230 (ref. 45), respectively. The chromosome-level genome assembly file was deposited in NCBI GenBank under accession number JBEDNZ000000000 (ref. 46).

Technical Validation

The chromosome-level genome assembly of L. sticticalis was 485.9 Mb with a scaffold N50 length of 16.6 Mb. Assessment of the completeness of the genome was performed using BUSCO (version: 5.2.1, odb10). The results showed that more than 98% of BUSCO genes were identified in the genome assembly (Table 2), indicating a high level of completeness of the L. sticticalis genome assembly. Moreover, the Hi-C heatmap showed clear grouping: the interaction intensity along the diagonal was higher than that at off-diagonal positions, indicating strong interactions between neighboring sequences in the Hi-C-assisted chromosome assembly, whereas the interaction signals between non-adjacent sequences were weak. This pattern is consistent with the principle of Hi-C-assisted assembly and indicates effective anchoring of the genome. The genome of L. sticticalis was divided into 31 groups, corresponding to 31 chromosomes (Fig. 2c). Thus, the chromosome-level genome of L. sticticalis is of high quality.
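For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is computed: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the running total first reaches half of the assembly length. The function name and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the total length
            return length


# Toy scaffold lengths (arbitrary units)
print(n50([10, 8, 6, 4, 2]))  # prints 8
```

Because N50 depends on the set of sequences included, values computed at the contig and scaffold levels of the same assembly generally differ.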