Background & Summary

Herbivorous insects are one of the most abundant groups in both natural and managed ecosystems, accounting for half of the metazoan species in the world1. As the most diverse superfamily of Lepidoptera, Gelechioidea comprises more than 18,400 described species2, which attack plants from Solanaceae, Juglandaceae, Rosaceae, Fabaceae and other families3,4,5,6. Yield losses in their host plants can reach 80–100%7, causing economic losses of up to 1.1 billion US dollars per year in China alone and seriously impacting the development of agroforestry in the country8,9. Even so, Gelechioidea species have received only limited attention. For example, as of October 2024, only 30 species of Gelechioidea had assembled genomes in NCBI, far fewer than Noctuoidea (212). The main reason for this lack of research is that these moths are small and easily overlooked, making them difficult to collect on a large scale in the field, despite their value as evolutionary models10,11.

The specialized moth Atrijuglans aristata, also known as Atrijuglans hetaohei, belongs to the family Stathmopodidae (Lepidoptera: Gelechioidea)2, a family whose larvae exhibit varied lifestyles, including feeding on fruits, buds, fern spores, and galls11,12,13,14. It is mainly distributed in walnut-producing areas in China, and has been sporadically recorded in Japan and Korea according to the GBIF (https://www.gbif.org/) and BOLD Systems v4 (ref. 15) databases. The larvae of A. aristata mainly attack the green husk of Persian walnut (Juglans regia), and a small number of studies have also reported damage to Manchurian walnut (Juglans mandshurica)16,17. According to FAO (http://www.fao.org/, accessed 2005), China’s annual production of walnuts reaches 420,000 t, far surpassing the 322,000 t of the United States and making China the world’s largest producer of walnuts18. However, A. aristata continues to damage walnut fruit during the growing season, with average infestation rates as high as 80%, resulting in substantial early fruit drop (Fig. 1e,f) and reducing walnut yields by as much as 40–50% in China4,19,20. Unfortunately, the lack of a high-quality genome for A. aristata has severely hindered efforts to control and manage it.

Fig. 1

Morphology of Atrijuglans aristata and the damage it causes. (a) Egg. (b) Larva. (c) Pupa. (d) Adult. (e) Cavity formed in a walnut fruit by A. aristata. (f) Early drop of walnut fruit damaged by A. aristata.

In this study, we assembled and annotated a chromosome-level genome for the specialist A. aristata based on PacBio CLR and HiFi reads, Nanopore reads, Hi-C reads, and Illumina sequencing reads, and further identified its Z chromosome and a portion of its W chromosome. This is also the first reference genome for the family Stathmopodidae. The genome provides a research foundation for the development of novel nano-pesticides against A. aristata, as well as basic data for studying the adaptive evolution of specialists to their host plants.

Methods

Sampling and sequencing

Adults of A. aristata were collected in Jinan, Shandong Province, China (117°26′46″E, 36°32′6″N) from 2021 to 2022. After sex identification, the surface of the fresh samples was washed with PBS buffer, and the samples were flash-frozen in liquid nitrogen and stored at −80 °C. Genomic DNA was extracted from the heads and thoraxes of female A. aristata according to the corresponding protocols, separately for each type of library construction and sequencing, with different sets of individuals used for most downstream procedures. Extraction quality and contamination were checked on 1% agarose gels. DNA purity was assessed from the OD 260/280 ratio using a Nanodrop, and DNA concentration was measured with a Qubit® 3.0 Fluorometer (Invitrogen, USA). A 350 bp paired-end short-fragment library was constructed with the NEBNext® Ultra™ DNA Library Prep Kit (NEB, USA) and sequenced on the Illumina HiSeq platform (DNA from six individuals). A 20 kb PacBio CLR (continuous long reads) library was constructed using the SMRTbell Express Template Preparation Kit 2.0 and sequenced on the PacBio Sequel platform (using the remaining DNA from the same six individuals). PacBio HiFi (high-fidelity) reads from a 15 kb library were generated on the PacBio Sequel IIe platform (DNA from one individual). A 20 kb Nanopore library was prepared with the Ligation Sequencing gDNA Kit (SQK-LSK109) and sequenced on the PromethION platform (DNA from 30 individuals). The Hi-C library was sequenced with 150 bp reads on the Illumina HiSeq platform (DNA from 204 individuals). The raw Illumina sequencing reads were processed with pk_qc.v2, a non-open-source program from Novogene, to remove reads containing adapters and reads in which N bases made up more than 10% of the sequence. Read pairs were also removed when the number of low-quality bases (sequencing quality values less than 5) in a single read exceeded 20% of the read length. Other raw sequencing reads were filtered with default parameters. In total, the clean data comprised 57 Gb of Illumina paired-end reads, 18 Gb of PacBio HiFi reads, 164 Gb of PacBio CLR reads, 10 Gb of Nanopore reads, and 58 Gb of Hi-C reads (Table 1).
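The two read-level filtering rules stated above can be illustrated with a minimal Python sketch. This is not the pk_qc.v2 implementation, which is not open source; the function names and the representation of a read as a (sequence, Phred scores) pair are assumptions made for illustration, and adapter removal is omitted.

```python
# Illustrative sketch only: NOT the closed-source pk_qc.v2 program.
# A read is assumed to be a (sequence, list of integer Phred scores) pair.

def n_fraction(seq: str) -> float:
    """Fraction of bases in a read that are uncalled (N)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def low_quality_fraction(quals: list[int], min_q: int = 5) -> float:
    """Fraction of bases whose quality value is below min_q."""
    return sum(q < min_q for q in quals) / len(quals) if quals else 0.0


def keep_pair(read1, read2, max_n: float = 0.10, max_low_q: float = 0.20) -> bool:
    """Keep a read pair only if neither mate exceeds either threshold."""
    for seq, quals in (read1, read2):
        if n_fraction(seq) > max_n:
            return False  # more than 10% N bases
        if low_quality_fraction(quals) > max_low_q:
            return False  # low-quality bases exceed 20% of read length
    return True
```

Because the thresholds are written as "greater than" and "exceeded" in the text, a read sitting exactly at 10% N or 20% low-quality bases is retained in this sketch.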

Table 1 Sequencing strategies for genome assembly.

To acquire transcriptome data, walnut fruits damaged by A. aristata were collected at the sampling sites mentioned above, and the larvae within were reared in the laboratory at 25 ± 1 °C and 75 ± 10% RH (relative humidity) with a photoperiod of 14 L:10 D. The heads, guts and Malpighian tubules of the larvae and one cocooned pupa were dissected and separately flash-frozen in liquid nitrogen. RNA was extracted according to the TRIzol protocol (Life Technologies, USA); RNA integrity was checked by 1% agarose gel electrophoresis, while concentration and purity were examined using a Nanodrop2000. Library construction was performed with the TruSeq RNA v2 Kit (Illumina), and 150 bp paired-end sequencing was performed on the Illumina NovaSeq 6000 platform. After filtering the raw reads with the default parameters of fastp v0.23.2 (ref. 21), 6 Gb of clean reads were generated.

Genome assembly

We used the paired-end reads from Illumina sequencing to estimate the genome size of A. aristata with the k-mer method. In brief, Jellyfish v2.2.7 (ref. 22) was used to generate the k-mer frequency distribution table (k = 21), from which genome size was estimated as the total number of k-mers divided by the peak depth of the k-mer distribution. The output of this analysis was further visualized using GenomeScope 2.0 (ref. 23). The estimated genome size of A. aristata was 436.67 Mb, with a heterozygosity of 0.804% (Fig. 2).
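The k-mer estimate described above (total k-mers divided by the peak depth) can be sketched in a few lines of Python. The input is assumed to be a depth-to-count histogram of the kind a k-mer counter produces; the function name and the `min_depth` cutoff used to skip low-depth, error-derived k-mers are illustrative assumptions, not parameters reported in this study.

```python
# Illustrative sketch of the "total k-mers / peak depth" estimate.
# `min_depth` is a hypothetical cutoff for error k-mers, not a value from the study.

def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """histogram maps k-mer depth -> number of distinct k-mers seen at that depth.

    Returns (peak_depth, estimated_genome_size_in_bases).
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth at which the most distinct k-mers occur (the main peak).
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return peak_depth, total_kmers / peak_depth
```

For a heterozygous genome the histogram can show more than one peak, so which peak is taken as the main one affects the estimate; model-based tools such as GenomeScope fit the full distribution rather than relying on a single maximum.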

Fig. 2

Genome size estimation based on k-mer frequency distribution (k = 21).

Guided by this genome size estimate, we assembled the draft genome from the PacBio CLR reads with wtdbg2 v2.5 (ref. 24). A first round of correction was performed with NextPolish v1.3.1 (ref. 25) using the Illumina sequencing data and all PacBio sequencing data. Heterozygous regions in the genome were removed using purge_haplotigs v1.0.2+ (https://bitbucket.org/mroachawri/purge_haplotigs/src/master/). To construct a chromosome-level reference genome, high-throughput chromosome conformation capture (Hi-C) data were used to cluster and anchor contigs onto chromosomes with the ALLHiC pipeline v0.9.13 (ref. 26), and the results were further corrected manually in Juicebox v1.9.8 (ref. 27) based on the intensity of chromosome interactions. To increase the N50 length, Nanopore reads were used for gap filling with TGS-GapCloser v1.1.1 (ref. 28). Further error correction was conducted using racon v1.4.20 (ref. 29). This yielded a chromosome-level reference genome with a long N50. To assess the accuracy and completeness of the genome assembly, we ran BUSCO v5.4.7 (ref. 30) with the Insecta gene set (odb10, containing 1,367 core genes). The final assembled reference genome is 480.99 Mb in length. The contig N50 and scaffold N50 of the genome are 2.68 Mb and 16.01 Mb, respectively (Table 2). The contigs were anchored to 31 pseudo-chromosomes, accounting for 93.83% of the genome size, of which Chr01, at 37.57 Mb, is the longest (Fig. 3; Supplementary Table S1). The BUSCO completeness was 95.6% (Table 2). In addition, we compared the genome sizes of seven species of Gelechioidea and found that they vary greatly, from only 302.93 Mb in the acer sober Anarsia innoxiella to 869.73 Mb in the dotted grey groundling Athrips mouffetella (Supplementary Table S2). This large difference in genome size may be related to their lifestyles31.
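For readers less familiar with the contiguity statistics quoted above, the following Python sketch shows how an N50 value and an anchoring rate are conventionally computed. The function names are illustrative; the only values taken from this study are the anchored length (451.31 Mb, reported under Technical Validation) and the total assembly length (480.99 Mb).

```python
# Illustrative sketch of two standard assembly statistics.

def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


def anchored_percent(anchored_mb: float, total_mb: float) -> float:
    """Percentage of the assembly placed on pseudo-chromosomes."""
    return 100.0 * anchored_mb / total_mb


# Values reported in this study: 451.31 Mb anchored out of 480.99 Mb.
rate = round(anchored_percent(451.31, 480.99), 2)
```

The same `n50` function applies to contig lengths and to scaffold lengths; only the input list differs.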

Table 2 The genome assembly and annotation results of Atrijuglans aristata.

Fig. 3

Heatmap of chromosomes anchored via Hi-C technology.

Genome annotation

Repetitive elements were detected using the EDTA pipeline v2.0.0 (ref. 32) with the main parameters “-species others -sensitive 1 -anno 1”. The resulting non-redundant repeat database was passed to RepeatMasker v4.1.2 (ref. 33) to identify the content and proportion of repeat sequences in the genome. Three lines of evidence were used for gene prediction: ab initio prediction, homologous protein alignment, and transcriptome alignment. For ab initio annotation, SNAP v2013-02-16 (ref. 34) and Augustus v3.4.0 (ref. 35) were used. For homology-based annotation, more than 100,000 protein-coding genes from model insect species (Apis mellifera GCF_003254395.2, Bombyx mori GCF_000151625.1, Drosophila melanogaster GCF_000001215.4, Tribolium castaneum GCF_000002335.3, Helicoverpa armigera GCF_002156985.1, Plutella xylostella GCF_932276165.1 and Spodoptera litura GCF_002706865.1) were downloaded from NCBI. These protein sequences were aligned to the reference genome using GenBlastA v1.0.1 (ref. 36). After extracting high-scoring pairs (HSPs) from the results, each aligned region was extended by 2 kb on both sides and gene prediction was performed using GeneWise v2.4.1 (ref. 37). For transcriptome-based gene prediction, we used the transcriptome data sequenced in this study (SRR23462686 and SRR31891377-SRR31891379; ref. 38) and the open-access transcriptome data in SRA under SRR10321778 and SRR10242502-SRR10242507 (ref. 39). These data were assembled using StringTie v2.2.1 (ref. 40) and the genome-guided mode of Trinity v2.1.1 (ref. 41), respectively. PASApipeline v2.5.2 (ref. 42) was used to align the transcripts to the reference genome and obtain gene predictions based on the transcriptome data. EVidenceModeler v1.1.1 (ref. 43) was used to integrate the three types of evidence and obtain the final annotations. To verify the completeness of the annotations, a BUSCO assessment was conducted on the protein-coding genes. In addition, Circos v0.69 (ref. 44) was used to visualize the GC content and gene distribution on each chromosome of the genome.

Nearly half of the genome (48.17%; 231.68 Mb) consists of repetitive sequences (Table 2). DNA transposons are the most abundant repeats, accounting for 26.03% of the genome, followed by retroelements (15.70%; 75.49 Mb). A total of 22,542 protein-coding genes were predicted in the genome. The BUSCO assessment of the protein-coding genes showed 94.3% completeness, indicating a well-annotated genome. Overall, both repeat density and gene density are relatively evenly distributed across the chromosomes (Fig. 4).

Fig. 4

Circular diagram of the genomic features on each chromosome of Atrijuglans aristata. Each chromosome was split into 200 kb windows and the following statistics were collected, displayed from the outer to the inner ring: (a) chromosome ideograms; (b) gene density; (c) GC content, where lighter to darker green in the heat map represents increasing GC content; (d) repeat density.

For functional annotation, we searched the predicted genes against the eggNOG 5.0 database using eggNOG-mapper v2 (ref. 45). InterProScan v5.52–86.0 (ref. 46) was also used for annotation based on its own databases. In total, 14,966 genes were annotated with eggNOG and 18,128 genes with InterProScan, giving 19,167 annotated genes when the two methods are combined (Table 2).
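Combining the two annotation sources amounts to taking the union of two gene sets. The sketch below illustrates this in Python; the function names and the toy gene identifiers are hypothetical, and only the three counts passed to `implied_overlap` come from the text above. Under inclusion-exclusion, those counts imply the number of genes annotated by both tools.

```python
# Illustrative sketch: gene identifiers and function names are hypothetical.

def combine_annotations(eggnog_ids: set[str], interpro_ids: set[str]) -> dict[str, int]:
    """Summarize how many genes each source annotates, alone and combined."""
    return {
        "eggnog": len(eggnog_ids),
        "interproscan": len(interpro_ids),
        "both": len(eggnog_ids & interpro_ids),    # annotated by both sources
        "either": len(eggnog_ids | interpro_ids),  # combined (union) total
    }


def implied_overlap(n_a: int, n_b: int, n_union: int) -> int:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return n_a + n_b - n_union


# Toy example with hypothetical gene IDs.
summary = combine_annotations({"g1", "g2", "g3"}, {"g2", "g3", "g4"})

# Counts reported in the text: 14,966 (eggNOG), 18,128 (InterProScan), 19,167 (combined).
shared = implied_overlap(14966, 18128, 19167)
```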

Whole-genome collinearity and sex chromosomes identification

We performed whole-genome collinearity analysis for three species: A. aristata, the rice leaf-folder Cnaphalocrocis medinalis and the tobacco cutworm S. litura. The latter two species are well studied and have high-quality reference genomes47,48. After the amino acid sequences of the three species were aligned with blastp (E-value ≤ 1e-5), collinearity analysis between each pair of species was performed with MCScanX49, and collinear blocks containing more than five gene pairs were plotted using MCscan Python v1.1.12 (ref. 50). To characterize the sex chromosomes of A. aristata, we reused the sequencing reads generated for genome assembly. In brief, we first randomly selected 1,000,000 of the PacBio CLR reads. These PacBio CLR reads and all Nanopore reads were then mapped to the reference genome of A. aristata using Minimap2 v2.17-r941 (ref. 51). Finally, the coverage of each chromosome was calculated using the flagstat function of SAMtools v1.16.1 (ref. 52). In general, the W chromosome has female-biased coverage, the coverage of the Z chromosome in females is about half of that in males, and the coverage of autosomes is roughly the same in both sexes. This difference is often used to identify the sex chromosomes of a species48,53,54.

The collinearity analysis indicated that the chromosomes of A. aristata show substantial collinearity with those of the rice leaf-folder C. medinalis and the tobacco cutworm S. litura. We observed a chromosome fusion event on Chr01, which was formed by the fusion of a Z chromosome and an autosome (Fig. 5a). The W chromosome of C. medinalis has strong collinearity with Chr31 of A. aristata, although only 1.23 Mb of sequence was assigned to this chromosome (Supplementary Table S1). Moreover, the coverage of Chr01 based on PacBio CLR reads (15.78×) is lower than the average coverage of autosomes (25.70×), whereas that of Chr31 (58.56×) is higher, and the Nanopore reads produced a similar result (Fig. 5b). Together, these results indicate that Chr01 and Chr31 of A. aristata may represent the Z chromosome and a portion of the W chromosome, respectively.
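The coverage comparison above can be expressed as a ratio of each chromosome's coverage to the autosomal average. The Python sketch below uses the three PacBio CLR coverage values reported in the text; the entry for Chr02, the function names and the 0.25 tolerance are hypothetical and serve only to illustrate the idea, not to reproduce the analysis performed in this study.

```python
# Illustrative sketch: Chr02 and the tolerance value are hypothetical.

def coverage_ratios(chrom_cov: dict[str, float], autosome_mean: float) -> dict[str, float]:
    """Coverage of each chromosome relative to the autosomal average."""
    return {name: cov / autosome_mean for name, cov in chrom_cov.items()}


def flag_outliers(ratios: dict[str, float], tolerance: float = 0.25) -> dict[str, str]:
    """Label chromosomes whose ratio departs from 1 by more than `tolerance`."""
    flags = {}
    for name, ratio in ratios.items():
        if ratio < 1 - tolerance:
            flags[name] = "low"   # candidate Z-linked sequence in a female sample
        elif ratio > 1 + tolerance:
            flags[name] = "high"  # deviates upward, as reported for Chr31
    return flags


# PacBio CLR coverage from the text; Chr02 is a made-up autosome at the mean.
coverage = {"Chr01": 15.78, "Chr02": 25.70, "Chr31": 58.56}
ratios = coverage_ratios(coverage, autosome_mean=25.70)
flags = flag_outliers(ratios)
```

Because Chr01 is a fusion of the Z chromosome with an autosome, a ratio between one half and one, rather than exactly one half, is what this kind of comparison would be expected to show for it.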

Fig. 5

Chromosome analysis of Atrijuglans aristata. (a) Collinearity among three species (Cnaphalocrocis medinalis, Atrijuglans aristata, and Spodoptera litura). Each long bar represents a chromosome. Regions that align between the genomes of two species are connected by gray lines. (b) 1,000,000 PacBio CLR reads and all Nanopore reads were mapped separately to the reference genome of A. aristata, and the coverage of each chromosome was calculated.

Data Records

The raw genome and transcriptome sequencing reads of A. aristata have been deposited under BioProject PRJNA933378 (ref. 38). The corresponding SRA accession numbers for the genomic sequencing reads are SRR23818966, SRR23796504, SRR23795132, SRR23693017 and SRR23622219. The corresponding SRA accession numbers for the transcriptomic sequencing reads are SRR23462686 and SRR31891377-SRR31891379. The genome assembly of A. aristata has been released in NCBI GenBank under the accession number JAQYUT000000000 (ref. 55). Moreover, all the genome assembly and annotation results have been made available in the Figshare repository56.

Technical Validation

Although haplotype duplication is inevitable, we minimized genomic heterozygosity by collecting samples with consistent genetic backgrounds and by removing heterozygous regions during assembly. The genome size predicted by k-mer analysis was 436.67 Mb, which is 44.32 Mb smaller than the final assembly. In total, 93.83% (451.31 Mb) of the contig sequence was anchored to chromosomes. Clean PacBio HiFi and Nanopore reads were mapped to the reference genome using Minimap2 v2.17-r941 (ref. 51), with mapping rates of 99.92% and 94.53%, respectively. A. aristata showed substantial collinearity with both C. medinalis and S. litura. BUSCO completeness was 95.6% for the genome and 94.3% for the protein-coding genes, supporting the completeness and accuracy of the genome assembly and annotation.