Background & Summary

The tea tussock moth, Euproctis pseudoconspersa, is an extremely destructive chewing pest found in tea plantations in China, Japan, and Korea1. This species has high fecundity, and its voracious larvae heavily consume the leaves and shoots of tea plants, causing severe losses in both yield and quality. Furthermore, its urticating setae cause serious allergic skin reactions in humans2. Currently, chemical control remains the primary method for managing tea tussock moth infestations. However, the extensive use of chemical pesticides poses serious problems, including ecological and environmental damage, threats to the health of tea drinkers, and insecticide resistance. Sex pheromone-based pest management, which is eco-friendly and more specific to the target pest, is used to monitor and control insect pests3. Elucidating the molecular mechanisms and specificity underlying sex pheromone communication in pests can provide a theoretical foundation for the development of sex attractants4,5.

In addition to their economic importance, sex pheromones play key roles in reproduction and are associated with reproductive isolation and species differentiation6. Based on their chemical structures, moth sex pheromones can be classified into four types. Among these, Type III sex pheromones are rare and have evolved independently multiple times in several lepidopteran families7,8. Owing to the lack of experimental materials, no draft genome of a moth producing Type III sex pheromones has been available. The tea tussock moth, which produces Type III sex pheromones (10,14-dimethyl-pentadecyl isobutyrate and 14-methylpentadecyl isobutyrate), is an ideal candidate for the evolutionary analysis of moths that produce Type III sex pheromones1.

Whole-genome sequencing is a fundamental tool for studying evolution and provides a complete set of gene resources, contributing to the development of management strategies9,10. In the present study, we assembled a chromosome-level genome of E. pseudoconspersa using a combination of PacBio HiFi sequencing and high-throughput chromosome conformation capture (Hi-C) technology. This high-quality chromosome-level genome provides a valuable resource for research on the biology, behavior, and genetic evolution of E. pseudoconspersa.

Methods

Insect collection and sequencing

E. pseudoconspersa larvae were obtained from tea plantations of the Fujian Tianhu Tea Industry Co., Ltd. (120.19°E, 27.10°N). Larvae were reared on fresh tea shoots under laboratory conditions at a temperature of 25 ± 1 °C, relative humidity of 65% ± 10%, and under a 14:10 h (light:dark) photoperiod11. Genomic DNA was extracted from a female adult using a QIAamp DNA Mini Kit (Qiagen, Hilden, Germany). The integrity and purity of the extracted DNA were monitored on 1% agarose gels and using a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA), respectively. Genomic DNA was sheared in G-tubes (Covaris, Woburn, MA, USA) and concentrated using AMPure PB magnetic beads.

Two libraries were constructed for genome sequencing. A short-read sequencing library was constructed using a TruSeq Nano DNA HT Sample Preparation Kit (Illumina, San Diego, CA, USA) and sequenced on an Illumina HiSeq X platform at GrandOmics Biosciences Co., Ltd. (Wuhan, China). The PacBio SMRTbell library was constructed using the SMRTbell® prep kit 3.0 (PacBio, Menlo Park, CA, USA) and sequenced on the PacBio Revio system at GrandOmics Biosciences Co., Ltd. (Wuhan, China). In total, 66.38 and 11.92 Gb of clean data were generated from the Illumina paired-end and PacBio libraries, corresponding to 176.43 × and 31.23 × coverage of the genome, respectively (Table 1).

Table 1 Statistics of sequencing data for the E. pseudoconspersa genome.

For Hi-C sequencing, a library was constructed following a standard library preparation protocol using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, Beijing, China). The purified DNA was digested with DpnII. Biotinylated nucleotides were used to repair the tails and ligated DNA was sheared to a length of 400 bp. The Hi-C library was sequenced on the MGISEQ-T7 platform. Finally, a total of 56.04 Gb Hi-C clean reads were obtained with approximately 149.67 × coverage of the genome (Table 1).
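The coverage values quoted for the three libraries follow from dividing the clean data volume by a genome size. The sketch below illustrates that arithmetic only. It assumes the 374.29 Mb assembly length as the denominator, which the text does not confirm, so the results are close to, but not identical with, the reported figures. The function and variable names are illustrative.

```python
def sequencing_depth(data_gb: float, genome_mb: float) -> float:
    """Return mean sequencing depth (x) for data_gb gigabases over a genome of genome_mb megabases."""
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb * 1000.0 / genome_mb


# Clean data volumes reported in the text (Gb).
clean_data_gb = {"Illumina": 66.38, "PacBio HiFi": 11.92, "Hi-C": 56.04}

# ASSUMPTION: the 374.29 Mb assembly length is used as the genome size;
# the source does not state which size underlies its coverage figures.
ASSUMED_GENOME_MB = 374.29

depths = {
    name: sequencing_depth(gb, ASSUMED_GENOME_MB)
    for name, gb in clean_data_gb.items()
}

for name, depth in depths.items():
    print(f"{name}: {depth:.2f}x")
```

Swapping in the k-mer-based genome size estimate instead of the assembly length shifts each depth by a few percent, so such figures are best read as approximate.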

Genome survey and assembly

Before genome assembly, k-mer (K = 17) analysis was performed using Illumina DNA data to estimate the genome size and heterozygosity12. The estimated genome size was 368.76 Mb with a heterozygosity of 0.70% (Fig. 1). For de novo genome assembly, raw sequencing data produced by the Pacific Biosciences Sequel were processed using SMRT Link (v8.0) with default parameters. The high-quality region finder algorithm was employed to identify the longest regions with single-molecule enzymatic activity. Low-quality regions were filtered out based on the signal-to-noise ratio. After quality control13, the total data volume was 11.92 Gb, comprising 678 097 reads with a read N50 of 18.03 kb. The filtered PacBio HiFi reads were used to produce a preliminary assembly using Hifiasm (v0.19)14,15,16. To discard potentially redundant contigs and generate a draft genome assembly, the contigs were polished with NextPolish (v1.2.4) using Illumina short reads with default parameters17.

Fig. 1 Overview of the 17-mer frequency distribution in the Euproctis pseudoconspersa genome.
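A k-mer frequency distribution such as the one in Fig. 1 is commonly turned into a genome size estimate by dividing the total number of k-mer occurrences by the depth of the main peak. The sketch below shows that generic calculation on a toy histogram. It is not necessarily the procedure of the cited method, the histogram values are invented for illustration, and in a heterozygous genome the main peak would need to be chosen with more care than a simple maximum.

```python
def estimate_genome_size(histogram: dict, min_depth: int = 3) -> tuple:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Returns (peak depth, estimated genome size in bases).
    """
    # Ignore low-depth k-mers, which are dominated by sequencing errors.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers is taken as the main peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct count.
    total_kmers = sum(d * n for d, n in usable.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (hypothetical values): an error tail at depths 1-2
# and a symmetric main peak centred on depth 20.
toy_histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}

peak, size = estimate_genome_size(toy_histogram)
print(peak, size)
```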

Karyotyping and Hi-C scaffolding

The chromosome number was determined using Giemsa staining of testicular samples from male adults at Hangzhou Kayou Taipu Biotechnology Co., Ltd. (Hangzhou, China). A total of 22 chromosomes were identified in these samples (Fig. 2a,b). Because genomic sequencing was performed using a female adult, which carries both Z and W sex chromosomes, the chromosome number was set to 23 for Hi-C scaffolding to enable assembly of the female-specific W chromosome. Draft genome sequences were assembled at the chromosome level using Hi-C scaffolding13,18,19,20. Uniquely mapped paired-end reads were further processed using HiC-Pro (v2.8.1)21 to filter invalid read pairs, including dangling-end, self-cycle, religation, and dumped pairs. In total, 91 347 701 valid interaction pairs were clustered, and the scaffolds were ordered and oriented onto chromosomes using LACHESIS with the following parameters: CLUSTER_MIN_RE_SITES = 100, CLUSTER_MAX_LINK_DENSITY = 2.5, CLUSTER_NONINFORMATIVE_RATIO = 1.4, ORDER_MIN_N_RES_IN_TRUNK = 60, and ORDER_MIN_N_RES_IN_SHREDS = 60. Finally, 41 scaffolds were anchored onto 23 chromosomes, with a total assembly length of 374.29 Mb, representing 99.94% of the draft genome assembly (Fig. 2c,d). The scaffold/contig N50 was 16.84/15.29 Mb (Tables 2, 3).

Fig. 2 Karyotype and overview of the genomic landscape of the Euproctis pseudoconspersa genome. (a,b) Chromosomes identified in sperm cells. (c) Heatmap of chromosome interactions in the E. pseudoconspersa genome (resolution = 100 kb). The colour indicates the intensity of interaction from white (low) to red (high). (d) Circos plot of the E. pseudoconspersa genome, in which circles I-VII indicate GC content, protein-coding gene count, density of DNA transposons, density of long interspersed nuclear elements (LINEs), density of short interspersed nuclear elements (SINEs), density of long terminal repeat (LTR) elements, and density of simple repeats on each chromosome, respectively.

Table 2 Statistics for the chromosome-level genome of E. pseudoconspersa.
Table 3 Statistics of Hi-C assembly results.
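The N50 values reported for reads, contigs, and scaffolds share one definition: the length L such that sequences of length L or longer account for at least half of the total bases. A minimal sketch of that calculation follows. The input lengths are toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together cover at least half of the summed length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy example: total = 27, half = 13.5; 10 + 8 = 18 first reaches the half.
toy_lengths = [2, 3, 4, 8, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Applied to the contig or scaffold lengths in a FASTA index, the same function would yield the contig or scaffold N50, respectively.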

Genomic repeat annotation

First, tandem repeats (TRs) were annotated using GMATA (v2.2)22 to identify simple repeat sequences, while Tandem Repeats Finder (TRF) (v4.07b)23 was used to recognize all tandem repeat elements across the genome. To identify transposable elements (TEs), an ab initio repeat library for E. pseudoconspersa was predicted using MITE-Hunter (-n 20 -P 0.2 -c 3)24 and RepeatModeler (v1.0.11)25,26,27,28 with default parameters. Subsequently, the predicted repeats were classified via alignment using TEclass and Repbase (http://www.girinst.org/repbase)29. RepeatMasker (v1.331)28 was used to identify TEs via homology searching against the de novo repeat and Repbase TE libraries. In total, 126.73 Mb of TE sequences were identified, accounting for 33.86% of the genome assembly, including LTR (4.51%), LINE (12.16%), SINE (1.95%), DNA (8.5%), RC (6.4%), and MITE (0.34%) elements (Table 4).

Table 4 Statistics of repeat elements in the E. pseudoconspersa genome.
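The reported TE figures can be cross-checked in two ways: the per-class percentages should sum to the overall TE fraction, and the TE length divided by the assembly length should give the same fraction. The short sketch below performs both checks with the values quoted in the text. It assumes that the 374.29 Mb assembly length is the denominator used for the percentages.

```python
# Per-class TE percentages as quoted in the text.
te_percent = {
    "LTR": 4.51,
    "LINE": 12.16,
    "SINE": 1.95,
    "DNA": 8.5,
    "RC": 6.4,
    "MITE": 0.34,
}

# Check 1: the classes should add up to the overall TE fraction.
total_percent = round(sum(te_percent.values()), 2)

# Check 2: TE length over assembly length.
# ASSUMPTION: 374.29 Mb is the assembly length used as the denominator.
te_mb = 126.73
assembly_mb = 374.29
fraction_from_lengths = round(te_mb / assembly_mb * 100, 2)

print(total_percent, fraction_from_lengths)
```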

Gene prediction

Gene prediction was performed using three strategies: homolog-based, transcriptome-based, and ab initio prediction. Homolog-based prediction was conducted using GeMoMa (v1.6.1)30 to align homologous peptides from five lepidopteran insects (Bombyx mori31, Spodoptera litura32, Plutella xylostella33, Papilio xuthus34, and Danaus plexippus35) and Drosophila melanogaster. For transcriptome-based prediction, RNA sequencing data were derived from the antennae, pheromone glands, heads, and bodies of 2-d-old mature virgin female adults and from the antennae, heads, and bodies of 2-d-old mature virgin male adults (accession numbers: SRR31971029-SRR31971049)11. The filtered RNA-seq reads were aligned to the reference genome using STAR (v2.7.3a)36. The transcripts were then assembled using StringTie (v1.3.4d), and open reading frames were predicted using PASA (v2.3.3)30. Augustus (v3.3.1) with default parameters was used for ab initio gene prediction37. Finally, a unified gene model was predicted by integrating the gene sets obtained from the three methods using EVidenceModeler (v1.1.1)30. In total, 12 371 protein-coding genes were identified in the E. pseudoconspersa genome (Table 5).

Table 5 Statistics of functional annotation of the E. pseudoconspersa genome.

Functional annotation of gene models

To annotate the functions of the protein-coding genes, the predicted genes were aligned against various databases, including SwissProt, NR, Kyoto Encyclopedia of Genes and Genomes, EuKaryotic Orthologous Groups, and Gene Ontology (GO). Putative domains and GO terms of the genes were identified using the InterProScan program with default parameters. BLASTp was used to search the remaining four databases. The results of the five database searches were concatenated (Table 5).
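Combining per-database search results into one annotation table amounts to a union keyed on gene ID. The sketch below shows one way this could be done. It is an illustration under stated assumptions rather than the authors' pipeline: the gene IDs, annotation strings, and function names are hypothetical, and real input would come from parsed BLASTp and InterProScan output.

```python
from collections import defaultdict


def merge_annotations(per_database):
    """Merge {database: {gene_id: annotation}} into {gene_id: {database: annotation}}."""
    merged = defaultdict(dict)
    for database, hits in per_database.items():
        for gene_id, annotation in hits.items():
            merged[gene_id][database] = annotation
    return dict(merged)


def annotated_fraction(merged, all_gene_ids):
    """Fraction of genes with a hit in at least one database."""
    annotated = sum(1 for gene_id in all_gene_ids if gene_id in merged)
    return annotated / len(all_gene_ids)


# Hypothetical search results for three predicted genes.
per_database = {
    "SwissProt": {"gene1": "fatty acyl desaturase", "gene2": "odorant receptor"},
    "KOG": {"gene2": "KOG0001"},
    "GO": {"gene1": "GO:0006629"},
}
all_genes = ["gene1", "gene2", "gene3"]

merged = merge_annotations(per_database)
fraction = annotated_fraction(merged, all_genes)
print(merged["gene1"], round(fraction, 3))
```

Keeping the database name as a key preserves the source of each annotation, which makes per-database counts easy to derive from the merged table.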

Data Records

The raw NGS data of the E. pseudoconspersa genome were deposited in the NCBI Sequence Read Archive under the BioProject accession number PRJNA1299575. The RNA-Seq data are available under BioProject PRJNA1198692. The final assembled E. pseudoconspersa genome was deposited in the China National GeneBank DataBase under the accession number CNA0509021. Genome assembly and annotation files were deposited in the Figshare database.

Technical Validation

The completeness of the assembly was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO) v4.0.538 (parameters: -l lepidoptera_odb10 -g genome) and Core Eukaryotic Gene Mapping Approach (CEGMA) v239,40,41,42. The analysis identified 98.69% complete BUSCOs (98.07% single-copy BUSCOs and 0.62% duplicated BUSCOs) and 96.77% core genes. To evaluate the accuracy of the assembly, all Illumina paired-end reads were mapped to the draft assembly using the Burrows-Wheeler Aligner (BWA) v0.7.1243, and the mapping rate and genome coverage were assessed using SAMtools v0.1.444. The base accuracy of the assembly was calculated using BCFtools v1.8.045. The results showed that 99.9% of Illumina paired-end reads mapped to the draft assembly. These results suggest that the assembled genome is highly complete and accurate.
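BUSCO percentages are derived from four counts (single-copy, duplicated, fragmented, and missing), with the complete category being the sum of single-copy and duplicated hits. The sketch below shows that relationship. The counts are hypothetical and chosen for round numbers; they are not the counts behind the values reported here.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the total.

    Complete (C) is defined as single-copy (S) plus duplicated (D).
    """
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one BUSCO must be counted")

    def pct(count):
        return round(100.0 * count / total, 2)

    return {
        "C": pct(single + duplicated),
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
        "n": total,
    }


# Hypothetical counts for illustration only.
summary = busco_summary(single=960, duplicated=20, fragmented=10, missing=10)
print(summary)
```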

The Hi-C heatmap showed that the interaction intensity at diagonal positions was higher than that at off-diagonal positions (Fig. 2c), indirectly confirming the accuracy of the chromosome assembly.