Background & Summary

Lepidoptera, the order encompassing butterflies and moths, is the second most diverse group of pest insects, with 180,000 described species. They commonly possess 31 chromosomes and constitute one-tenth of Earth’s described species1. Moreover, in both natural and agricultural settings, there is hardly any plant or crop that is not attacked by at least one lepidopteran pest2,3,4. Indeed, the larval stages (caterpillars) are major pests in forests, stored grains, and fiber and food crops. In addition, insecticide resistance is an increasing problem, and moths are among the most feared invasive species.

Within the moth family Noctuidae, stem borers are notorious pests; their caterpillars damage crops by boring or tunneling inside plant stems. The pink stem borer, Sesamia inferens Walker (Lepidoptera: Noctuidae), is highly destructive to rice worldwide5,6,7, and this polyphagous species is also a major pest of a broad spectrum of other crops, including economically important graminaceous crops such as maize, sorghum, wheat, oats and sugarcane8,9,10. The adults can fly long distances, and the females release sex pheromones to attract males for copulation; the sex pheromones and the pheromone binding protein (PBP) gene family are relatively conserved in the Noctuidae and act according to a lock-and-key principle11,12,13,14. After eclosion, adult moths engage in courtship and mating as early as day 0, with a mating rate reaching as high as 83.3%15; one female moth can produce 300–600 eggs. The females maximize their fitness by laying eggs preferentially on plants that maximize their offspring performance16,17,18,19,20,21. Our experimental results22, based on indoor experiments analyzed with the age-stage, two-sex life table theory, together with statistical analyses of oviposition, hatching and other parameters of S. inferens, have revealed its potential for widespread damage. Larvae tunnel through stalk internodes, weakening them and making them susceptible to breakage by strong winds, while also exposing plants to infection by the red rot fungus, leading to a significant decrease in sucrose content23. S. inferens successfully completed its entire developmental cycle on different gramineous crop hosts22. Symptoms known as “dead hearts” or “white heads”24 are accompanied by plant lodging and unfilled grains, leading to high yield losses25,26,27.

Because high levels of insecticide resistance and the concealed feeding of the larvae inside plant stems reduce the efficacy of chemical control and of biological control with parasitoids, the best options today for pest population suppression include field trapping using sex pheromones22,28 and the cultivation of trap plants29,30.

In this study, we present the first chromosome-level genome assembly and sex chromosome identification of S. inferens, providing valuable genomic resources for further research and development. The resulting assembly is of high quality, with a scaffold N50 of 33.39 Mb. The completeness of the assembly was assessed using BUSCO analysis, which revealed a high completeness of 98.90%. Repetitive elements constitute a significant portion of the S. inferens genome, accounting for 58.59% of the total genome size. A total of 26628 protein-coding genes were identified in the genome assembly. In conclusion, this chromosome-level assembly of the S. inferens genome not only provides valuable genomic resources for understanding the biology and genetic basis of Lepidoptera but also supports the development of effective pest control strategies based on sex pheromone traps, without the use of chemical pesticides.

Methods

Sample collection and sequencing

Insect materials

Specimens of S. inferens were collected from Shibanzhen, Bozhou District, Zunyi City, Guizhou Province, China. The larvae were collected from sorghum plants (Sorghum bicolor) (Fig. 1). The samples included 3rd instar larvae, pupae, and adult males and females. The 3rd instar larvae were starved for 24 hours. To ensure thorough removal of microbial contaminants from the surface of the samples, both larvae and pupae were surface sterilized by immersion in 75% ethanol for 1 min followed by three rinses with sterile water. The detailed protocols are as follows.

Fig. 1

Circular plot summarizing the chromosome-level genome assembly. From outer to inner layers: chromosome information; gene density; repeat sequence content; GC content; photographs of Sesamia inferens larva, male adult (a, b) and female adult (c, d).

We extracted genomic DNA from S. inferens 3rd instar larvae using the Genome DNA extraction Kit (TIANGEN) according to the product manual. After extraction, we used the NanoDrop One to measure the purity, concentration, and nucleic acid absorption peaks of the genomic DNA, focusing on the OD260/280 and OD260/230 ratios. For precise concentration determination, we used the Qubit 3.0 fluorescence photometer, and the Qubit 3.0 and NanoDrop One measurements were compared to assess sample purity. Additionally, we performed agarose gel electrophoresis to ascertain the integrity of the genomic DNA. The qualified genomic DNA was then used to generate a single-molecule real-time (SMRT) SMRTbell sequencing library with the SMRTbell Template Prep Kit 2.0 (Pacific Biosciences)31. The library was quantified with the Qubit 3.0, and its size distribution was checked on an Agilent 2100 Bioanalyzer to ensure that it matched the expected dimensions. After successful library validation, sequencing was performed on the PacBio Sequel II (Pacific Biosciences) to the predefined target data volume. As a result, we obtained a total of 80.0 Gb of Illumina short reads and 504.2 Gb of PacBio reads. In addition, 64.70 million raw reads (approximately 97.05 Gb) were obtained for scaffolding in genome assembly.

Genome assembly

To achieve a high-quality assembly, we began with rigorous quality control of the raw reads. First, reads containing adapter sequences were removed. Next, bases with consecutive quality scores below 20 at both ends of each read were trimmed, and reads shorter than 50 bp after trimming were discarded. Finally, only paired-end reads were retained for subsequent analysis. We used HiFiasm (v 0.15.1) to preliminarily assemble the S. inferens genome; this assembler can resolve near-identical repeats and segmental duplications to generate better haplotype assemblies32. HiFiasm outputs a primary assembly after performing all-versus-all read overlap alignment and correcting sequencing errors. After initial assembly and error correction, Purge_Haplotigs was used to remove redundancy from the genome: redundant heterozygous contigs were identified and removed according to read depth distribution and sequence similarity33. The total lengths of the HiFiasm, HiFiasm + purge haplotigs and HiFiasm + purge haplotigs + contamination removal assemblies were 997.12 Mb, 976.10 Mb and 973.20 Mb, respectively (Table S1). This hybrid strategy yielded a de novo genome assembly for S. inferens with a total length of 973.20 Mb and a contig N50 of 30.57 Mb (Table S2). The genome assembly quality was comprehensively evaluated through BUSCO alignment against the Lepidoptera_odb10 orthologue database, assessing the overall integrity of the assembled genome. After aligning the second-generation reads to the genome, variants were identified using samtools, picard and GATK. Homozygosity and heterozygosity rates for SNPs and InDels were then calculated separately. The homozygous SNP rate was <0.001%, and the homozygous InDel rate was 0.001%. In contrast, the heterozygous SNP rate was 1.070%, and the heterozygous InDel rate was 0.247%.
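The read-filtering rules described above can be illustrated with a minimal Python sketch. It is not the software used in this study; the function names `trim_read` and `filter_pair` are hypothetical, and reads are assumed to be given as a sequence string with a list of Phred quality scores. Only the two thresholds (quality 20, minimum length 50 bp) come from the text, and adapter removal is not shown.

```python
from typing import List, Optional, Tuple

MIN_QUALITY = 20  # Phred threshold stated in the text
MIN_LENGTH = 50   # minimum read length (bp) stated in the text

Read = Tuple[str, List[int]]  # (sequence, per-base Phred scores); assumed layout


def trim_read(seq: str, quals: List[int]) -> Optional[Read]:
    """Trim runs of low-quality bases from both ends of a read.

    Returns None when the trimmed read is shorter than MIN_LENGTH.
    """
    start, end = 0, len(seq)
    # Walk inwards from the 5' end while bases are below the threshold.
    while start < end and quals[start] < MIN_QUALITY:
        start += 1
    # Walk inwards from the 3' end while bases are below the threshold.
    while end > start and quals[end - 1] < MIN_QUALITY:
        end -= 1
    if end - start < MIN_LENGTH:
        return None
    return seq[start:end], quals[start:end]


def filter_pair(read1: Read, read2: Read) -> Optional[Tuple[Read, Read]]:
    """Keep a read pair only if both mates survive trimming."""
    trimmed1 = trim_read(*read1)
    trimmed2 = trim_read(*read2)
    if trimmed1 is None or trimmed2 is None:
        return None
    return trimmed1, trimmed2
```

Dropping the whole pair when one mate fails corresponds to the statement that only paired-end reads are retained.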

Chromosomal-level genome scaffolding with Hi-C data

To obtain the genome at the chromosomal level, Hi-C technology (High-throughput/resolution chromosome conformation capture) was applied34,35. The Hi-C library was prepared using a modified version of the standard protocol36. The samples were 3rd instar larvae.

Cells were treated with paraformaldehyde for 10 min to fix the DNA conformation, and crosslinking was stopped with 2.5 M glycine for 20 min. After cell lysis, the crosslinked DNA was cut with a restriction enzyme, the resulting ends were filled in and labeled with biotin, and the DNA fragments were ligated using DNA ligase. To reverse the crosslinks, proteinase digestion was applied, followed by DNA purification; the DNA was then randomly sheared into fragments of 300–500 bp. Biotin-labeled DNA was selectively captured using streptavidin magnetic beads and used to build the library, which was sequenced on the Illumina platform. We used Bowtie 2 (v 2.2.3)37 to map the paired-end reads to the preliminary assembly. Then, HiC-Pro (v 2.7.8)38 was used to detect the ligation sites of unmapped reads, which were mainly composed of chimeric regions spanning the ligation junction. After sequencing and filtering, 94.998 Gb of high-quality clean data (read length: 150 bp) were generated and used for preliminary assembly with the 3D-DNA pipeline35 and LACHESIS39 using default parameters. After this assisted assembly, a comprehensive genome-wide chromosome interaction map was constructed using Juicer40. Inspection of the Hi-C data revealed errors from the 3D-DNA assembly process in contig order, orientation, and internal arrangement, which we corrected manually and visually in JuiceBox (v 2.13.07)40,41 to ensure the accuracy and reliability of the genome assembly. The corrected genome-wide interaction map exhibits enhanced intra-chromosomal interactions, with stronger interactions between contigs in closer linear proximity.

The chromatin contact matrix was manually curated in JuiceBox; the 31 scaffolds are clearly distinguishable in the heatmap in Fig. 2a, and the interaction signal around the diagonal is strongly apparent. The contig distribution on the chromosomes is shown in Fig. 2b. The 88 contigs were divided, anchored, sorted, oriented, and merged into 31 chromosomes using LACHESIS and corrected in JuiceBox. The chromosomal heatmap showed good collinearity along the diagonal, which confirms the high quality of the scaffolding. The final genome assembly was 973.18 Mb with a scaffold N50 of 33.39 Mb (Table 1, Fig. 1).
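For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows the standard definition: the length L such that sequences of length at least L together cover at least half of the total assembly length. The function name `n50` and the toy lengths are illustrative assumptions, not values from this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that pushes the cumulative
        # length to at least half of the assembly size.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy example with made-up lengths (not from the S. inferens assembly).
    print(n50([10, 8, 6, 4, 2]))
```

Applied to contig lengths this gives the contig N50; applied to scaffold lengths after Hi-C anchoring it gives the scaffold N50.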

Fig. 2

Genome assembly of Sesamia inferens. (a) Heat map of chromosome interactions from the Hi-C assembly. The abscissa and ordinate represent the order of each bin on the corresponding chromosome group. Color indicates the intensity of interaction from white (low) to red (high). (b) Contig distribution on genome chromosomes. The grey color represents the length of the corresponding chromosome, while other colors represent contigs of different length ranges. (c) Association analysis of GC content and coverage depth of second-generation reads. (d) Association analysis of GC content and coverage depth of third-generation reads.

Table 1 Hi-C assisted assembly statistics for Sesamia inferens.

Sex chromosomes identification

In this study, we performed whole-genome resequencing of 10 male and 10 female adults of S. inferens on the Illumina platform, producing a total of 294.63 Gb of clean data. Quality-controlled sequencing reads were aligned to the reference genome scaffolds using BWA (v 0.7.17)42. The resulting BAM files were used for coverage analysis, and coverage was calculated separately for males and females using Samtools (v 1.10)43. The analysis exploited the inherent copy number differences between the sexes for sex chromosomes: the Z chromosome has a higher copy number in males, while the W chromosome is present only in females44,45,46. The log2 ratio of male to female coverage (log2(M:F)) was computed, and changepoint analysis was performed using the R package “changepoint” (https://CRAN.R-project.org/package=changepoint) to detect points of variation. Chromosomes were categorized based on their log2(M:F) values: chromosomes with values within ±0.1 of 0 were considered autosomes; those with values less than −0.25 were designated as W chromosomes; and those with values greater than or equal to 0.25 were identified as Z chromosomes (ZZ: ♂; ZW: ♀). Based on the log2(M:F) ratio, chromosome 1 was identified as the Z chromosome, and chromosome 31 as the W chromosome (Fig. 3).
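The categorization rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the analysis code of this study: the function names are hypothetical, coverage is assumed to be already summarized as one positive mean value per chromosome and sex, the changepoint step performed in R is not reproduced, and the "unassigned" label for values between the thresholds is our own addition.

```python
import math


def log2_male_female(male_cov: float, female_cov: float) -> float:
    """log2 ratio of male to female coverage; both values must be positive."""
    if male_cov <= 0 or female_cov <= 0:
        raise ValueError("coverage values must be positive")
    return math.log2(male_cov / female_cov)


def classify_chromosome(log2_ratio: float,
                        autosome_tol: float = 0.1,
                        sex_cutoff: float = 0.25) -> str:
    """Apply the thresholds given in the text to a single log2(M:F) value."""
    if log2_ratio >= sex_cutoff:
        return "Z"          # higher copy number in males (ZZ)
    if log2_ratio < -sex_cutoff:
        return "W"          # present only in females (ZW)
    if abs(log2_ratio) <= autosome_tol:
        return "autosome"
    return "unassigned"     # falls between the thresholds (assumed label)


if __name__ == "__main__":
    # Made-up coverage values, for illustration only.
    for name, male, female in [("chrA", 30.0, 30.0),
                               ("chrB", 30.0, 15.0),
                               ("chrC", 1.0, 16.0)]:
        ratio = log2_male_female(male, female)
        print(name, round(ratio, 2), classify_chromosome(ratio))
```

A chromosome with twice the coverage in males gives log2(M:F) = 1 and is classified as Z, while one covered almost only in females gives a strongly negative value and is classified as W.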

Fig. 3

Identification of sex chromosomes in the Sesamia inferens genome assembly. Ten male and 10 female adults were resequenced, and the obtained reads were analyzed for coverage comparison. Chromosomes with a log2 (M:F read counts) value of 0 were regarded as autosomes (black dots), the one with a value less than or equal to −0.25 was considered the W chromosome (red dot), and the one with a value greater than or equal to 0.25 was considered the Z chromosome (blue dot).

Transcriptome sequencing

To assist in the annotation of genome structure, transcriptomic libraries were prepared from the 3rd instar larvae, pupae, adult males and adult females of S. inferens. An individual library was constructed for each sample designated for sequencing. Total RNA was isolated from individual S. inferens samples using the TRIzol (Invitrogen, Carlsbad, CA, USA) reagent method. Following homogenization, samples were allowed to stand at ambient conditions before chloroform was added. The mixture was centrifuged at 12,000 g at 4 °C to allow phase separation. The aqueous phase was subsequently subjected to isopropanol precipitation and centrifugation. The resulting RNA pellet was rinsed in 75% ethanol (prepared in RNase-free water) and centrifuged twice to ensure purity. The air-dried pellet was reconstituted in DEPC-treated water, and its integrity and concentration were quantified using a NanoDrop-2000 spectrophotometer at 260 nm. RNA samples of good quality were then used for cDNA library construction. Sequencing was carried out on the Illumina NovaSeq 6000 platform47. The assembled transcripts were used for genome structure annotation to provide transcript-level evidence.

Genome quality assessment

The best five hits of BLASTN against the NCBI NT database were from Atethmia, Cosmia, Mythimna, Amphipyra and Xestia (Table 2). Moreover, we assessed the assembly against the Lepidoptera_odb10 database using BUSCO. The assessment showed that 98.9% of BUSCO genes were successfully detected, of which 98.9% were single copy and 1.1% duplicated (Table 3). The results of these evaluations indicate that the genome assembly has a high level of completeness and accuracy.

Table 2 Blast search of interrupted contig sequences in NCBI NT database.
Table 3 Statistical result of BUSCO evaluation results of genome assembly.

The assembled S. inferens genome size is 973.18 Mb with a scaffold N50 of 33.39 Mb (Fig. 1, Table S2), close to the estimated size in other Lepidoptera48. Using blobtools (v. 1.1.160), we created a blobplot to evaluate possible contamination of the contigs used for genome assembly (Fig. 2c,d). Taken together, these results support the accuracy of the chromosome scaffolding.

Repeat sequence annotation

We identified repeat sequences and transposable elements (TEs) using a combination of de novo prediction35 and homology-based prediction. First, we used RepeatModeler (v 2.0.2) (https://github.com/Dfam-consortium/RepeatModeler) to predict repeat sequences with default parameters. Then, the RepBase database49 and RepeatMasker (v 4.1.2) (https://github.com/rmhubley/RepeatMasker) were used to annotate homologous sequences. The results showed that 564.58 Mb are repeat sequences, accounting for 58.59% of the S. inferens genome. Among these repeat sequences, the largest class (24.51%) is long interspersed nuclear elements (LINEs), followed by 12.92% of unclassified elements, 10.75% of long terminal repeats (LTRs)50,51,52, 5.55% of short interspersed nuclear elements (SINEs), 5.38% of rolling-circles and only 4.82% of DNA elements (Table 4).

Table 4 Statistics of repetitive elements in the Sesamia inferens genome.

Gene prediction and function assignment

We annotated protein-coding genes in the S. inferens genome using a pipeline that combines de novo prediction, homology searching and transcriptome evidence48. The repeat-masked genome was analyzed according to the MAKER (v 3.01.03) genome annotation pipeline53,54. First, we used BRAKER (v 2) to construct the species-specific parameter model for the S. inferens genome55,56,57. Next, we used Trinity (v 2.14.0) with default parameters to assemble transcripts for genome structure annotation58,59. The assembled transcripts were used for genome structure annotation to provide transcript-level evidence. Finally, we ran MAKER with the transcriptome, the genome, the species parameter model, and the protein sequences of 10 other well-annotated Lepidoptera (Abrostola tripartita, Bombyx mori, Cnaphalocrocis medinalis, Habrosyne pyritoides, Helicoverpa armigera, Hyphantria cunea, Plutella xylostella, Spodoptera exigua, Spodoptera frugiperda and Spodoptera litura) downloaded from InsectBase 2.0 (http://v2.insect-genome.com) as input data to predict genes48. A total of 26628 protein-coding genes were annotated by this pipeline combining the three methods mentioned above. A comparison between our genome assembly and the previously published chromosome-level assembly of S. inferens60 highlighted several key differences. Specifically, our genome assembly exhibited a larger genome size of 973.18 Mb compared to the previously published size of 865.04 Mb. Additionally, while the previous assembly consisted of 1135 contigs and 69 scaffolds, our assembly comprised 88 scaffolds. Notably, our assembly featured a higher Contig N50 value of 30.17 Mb and a slightly lower Scaffold N50 value of 33.39 Mb. Furthermore, our analysis included the identification of the sex chromosomes of S. inferens, providing further elucidation of its karyotype, as detailed in Table S3.

Phylogeny

OrthoFinder61,62,63 (v 2.5.1) was used to analyze the orthologous and paralogous genes of S. inferens and 11 other insect genomes: Drosophila melanogaster (assembly accession: GCF_000001215.4), P. xylostella (assembly accession: GCA_019096205.1), A. tripartita (assembly accession: GCA_905340225.1), B. mori (assembly accession: GCF_014905235.1), C. medinalis (IBG_00192), H. pyritoides (assembly accession: GCA_907165245.1), H. cunea (assembly accession: GCA_003709505.1), S. frugiperda (assembly accession: GCF_011064685.1), S. exigua (assembly accession: GCA_011316535.1), S. litura (assembly accession: GCA_002706865.1) and H. armigera (assembly accession: GCF_002156985.1). D. melanogaster was selected as the outgroup (Fig. 4).

Fig. 4

Phylogenetic analysis of Sesamia inferens and 10 other Lepidoptera species.

Phylogenetic trees were constructed based on single-copy orthologous gene families. The phylogenetic tree was constructed by maximum likelihood (ML) using IQ-TREE (v 2.1.2) with the best model (JTT + F + R5) and 1000 rapid bootstrap replicates to assess the robustness of the tree64. Additionally, we used Astral-III65 to merge all gene trees obtained through OrthoFinder into a unified species tree. Congruence between the two trees generated by these methods is required to validate the consistency and accuracy of the phylogenetic analysis. Divergence time was estimated with the MCMCtree66 program in the PAML package (v 4.8) based on the multiple sequence alignment of protein sequences. The calibration time points of D. melanogaster (99.15 MYA), P. xylostella (81.66 MYA), S. inferens (13.26 MYA), B. mori (49.25 MYA) and H. pyritoides (40.98 MYA) were obtained from TimeTree67 (http://timetree.org/) (Fig. 4). Gene family contraction and expansion were analyzed using CAFE (v 4.2), incorporating the results from OrthoFinder and the phylogenetic tree with divergence time information68. Finally, iTOL (https://itol.embl.de/#) was used to visualize and enhance the appearance of the phylogenetic tree. S. inferens exhibits an expansion of 825 gene families, which is about half of that observed in S. frugiperda, lower than in P. xylostella, and on par with that in D. melanogaster. The expansion of gene families is considered a pivotal factor contributing to biodiversity and evolution. The gene family expansion data for S. inferens reveal a relatively rapid rate of renewal and iteration. This accelerated gene family evolution may enable the organism to adapt to the diverse and continually changing challenges presented by its environment. This result is consistent with the phenomenon observed in the field, where infestation by S. inferens has transitioned gradually from localized edge infestations to widespread field infestations69.

Data Records

The raw sequencing data and genome assembly of S. inferens have been deposited at the National Center for Biotechnology Information (NCBI). Illumina, PacBio and Hi-C data for S. inferens genome sequencing have been deposited in the NCBI Sequence Read Archive with accession numbers SRR26501366, SRR27137600 and SRR27032946, respectively, under BioProject accession number PRJNA101423470.

Illumina transcriptome data for the 3rd instar larva (SRR26056362), pupa (SRR26056882), female adult (SRR26050603) and male adult (SRR26056479) are available under BioProject PRJNA101423470.

Genome resequencing data for female adults (SRR28744322, SRR28744323, SRR28744324, SRR28744325, SRR28744326, SRR28744327, SRR28744328, SRR28744328, SRR28744330, SRR28744331) and male adults (SRR28778051, SRR28778052, SRR28778053, SRR28778054, SRR28778055, SRR28778056, SRR28778057, SRR28778058, SRR28778059, SRR28778060) are available under Bioproject PRJNA101423470.

This Whole Genome Shotgun project has been deposited at GenBank under the accession JAYKGN00000000071. The version described in this paper is version JAYKGN010000000.

The annotation file is available in figshare72.

Technical Validation

After extraction, the DNA purity, concentration and integrity were detected using NanoDrop One, Qubit 3.0 fluorescence photometer and Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA), respectively. RNA integrity and concentration were quantified using a NanoDrop One spectrophotometer (Thermo Fisher Scientific, Waltham, MA, United States). High-quality DNA and RNA were used for sequencing.

We used three methods to assess the completeness and quality of the assembly. First, a data accuracy assessment was conducted to confirm that the assembly belongs to the target species. The genome sequence was fragmented at 10 kb intervals, and the resulting sequences were aligned to the NCBI nucleotide database (NT library) using Blast software73. Second, a sequence consistency evaluation was performed by aligning second- and third-generation data to the assembled genome using BWA (v 0.7.17)42 and Minimap2 (v 2.24)74. As shown in Table 5, the alignment statistics for the second-generation reads show a mapping rate of 99.67%, a paired mapping rate of 92.40%, an average sequencing depth of 69.38 X, and 99.98% coverage. For third-generation reads, the mapping rate was 99.98%, the average sequencing depth was 26.77 X, and the coverage was 100.00%. Higher mapping and coverage rates indicate a higher consistency between the assembly results and the reads, reflecting better assembly performance. Third, the quality of the genome sequence was evaluated by BUSCO (v 4)75,76,77,78 with Lepidoptera_odb10 and default parameters. In addition, after aligning second-generation reads to the genome, variants were identified using samtools, picard, and GATK (v 4.4.0.0)79. The rates of homozygous and heterozygous SNPs and InDels were calculated. The homozygous SNP rate was <0.01%, the homozygous InDel rate was 0.001%, the heterozygous SNP rate was 1.070%, and the heterozygous InDel rate was 0.247%.
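The fragmentation step of the first assessment can be illustrated with a short Python sketch. It only shows how a sequence might be cut into consecutive 10 kb windows before a BLAST search; the function name `fragment_sequence`, the 0-based half-open coordinates and the handling of the final shorter window are our assumptions, not details given in the text.

```python
from typing import Iterator, Tuple

WINDOW_SIZE = 10_000  # 10 kb interval stated in the text


def fragment_sequence(seq: str,
                      size: int = WINDOW_SIZE) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, subsequence) for consecutive non-overlapping windows.

    Coordinates are 0-based and half-open; the last window may be shorter
    than `size` when the sequence length is not a multiple of it.
    """
    if size <= 0:
        raise ValueError("window size must be positive")
    for start in range(0, len(seq), size):
        end = min(start + size, len(seq))
        yield start, end, seq[start:end]


if __name__ == "__main__":
    # Toy 25 kb sequence; real input would be read from the assembly FASTA.
    toy = "ACGT" * 6250
    for start, end, piece in fragment_sequence(toy):
        print(f">fragment_{start}_{end} length={len(piece)}")
```

Each fragment would then be written in FASTA format and used as a query against the NT database.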

Table 5 Statistical results of reads alignment.